
Acefit extension #46

Merged
merged 22 commits into main from acefit-extension on Oct 4, 2023

Conversation

tjjarvinen
Collaborator

@cortner This adds training support for AtomsBase input. For now it only overloads the assembly function in ACEfit; later on we can add support for the ACEpotentials-specific functions.

There was one issue while implementing this. I stumbled on a bug that caused assemble to give random results (it can be triggered by changing pmap in assemble to Folds.map). This is probably a bug in ACE data allocation, but it could also be in Julia itself - I don't understand how it works. I am quite concerned about it, as it could appear somewhere else, and future changes in Julia could make it surface in the existing "working" code.

Here is an example and a comparison with the old way:

using ACEmd
using ACEfit
using ACEpotentials

model = acemodel(
    elements = [:Ti, :Al],
    order = 3,
    totaldegree = 6,
    rcut = 5.5,
    Eref = [:Ti => -1586.0195, :Al => -105.5954]
)
basis = model.basis

data_j, _, meta = ACEpotentials.example_dataset("TiAl_tutorial")
weights = Dict( "default" => Dict("E" => 5.0, "F" => 1.0 , "V" => 1.0 ) )
datakeys = (energy_key = "energy", force_key = "force", virial_key = "virial")
train_old = [ACEpotentials.AtomsData(t; weights=weights, v_ref=model.Vref, datakeys...) for t in data_j[1:5:end]]
train_new = [  FlexibleSystem(x.atoms) for x in train_old  ]

# Old style
A, Y, W = ACEfit.assemble(train_old, basis);

# New style
a, y, w = ACEfit.assemble(train_new, basis; energy_default_weight=5, energy_ref=model.Vref)

# Compare the results - all three differences should be zero up to floating-point noise
maximum(abs2, A - a)
maximum(abs2, Y - y)
maximum(abs2, W - w)

The new way is much faster than the old one, by a factor of several tens to several hundreds (if you don't use parallel processing). There is also a way to make it even faster, but I will skip that for now. The main point is that assembly is now faster than solving the linear system, which changes the dynamics of fitting potentials completely.

@cortner
Member

cortner commented Sep 16, 2023

Hi Teemu - thank you for exploring this. I will need to find time to go through the code very carefully. In the meantime, can you please explain a few things?

  • How does the fitting data enter the new assembly? I don't see this. Does FlexibleSystem keep the metadata like reference energies and forces?
  • Similarly, how do the fitting weights enter the new assembly?
  • Can you explain briefly where the speedup comes from?

Regarding the bug: I have some doubts about there being a bug in the ACE1 allocations - it has been in use for many years now - but you never know; as you say, changes in Julia or elsewhere might bring it out. It could also be some incorrect assumption about how temporary storage is pre-allocated and how outputs are allowed to be used. Either way, I'd be grateful if you could open a separate issue or PR with a reproducible example so we can investigate.

CC @wcwitt - would you mind following this discussion as well?

@cortner
Member

cortner commented Sep 16, 2023

Another question - why add the example training data to the repo when you can just get it via ACEpotentials.example_dataset?

@tjjarvinen
Collaborator Author

  • How does the fitting data enter the new assembly? I don't see this. Does FlexibleSystem keep the metadata like reference energies and forces?

Yes, AtomsBase is defined so that you can store arbitrary data in the structures. The documentation page of this PR has the details. In short, if data is an AtomsBase structure, it should carry the training data behind the keys data[:energy], data[:force] and data[:virial]. ExtXYZ currently has a bug that makes it fail to read some of this data, like forces, when reading into an AtomsBase structure (it works for JuLIP), so I need to submit a PR there to get the AtomsBase interface working fully.

Fitting weights are also explained in the documentation page. In short, general weights are given as keyword arguments, as in the example above. In addition, you can give each structure a separate weight that is used only for that structure.
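For concreteness, here is a minimal sketch of how such data could be read and attached on the AtomsBase side. The FlexibleSystem update-constructor usage and the :energy_weight key name are illustrative assumptions, not necessarily the exact convention documented in this PR.

using AtomsBase
using Unitful: @u_str

# Take one of the converted structures from the example above.
sys = train_new[1]

# System-level properties are read through the AtomsBase key interface;
# this is where the new assembly looks for training data.
haskey(sys, :energy) && sys[:energy]
haskey(sys, :force)  && sys[:force]
haskey(sys, :virial) && sys[:virial]

# A hypothetical way to attach data and a per-structure weight yourself
# (assumption: extra keyword arguments of the FlexibleSystem update constructor
# are stored as system properties; :energy_weight is an illustrative key name,
# not confirmed by this PR).
sys_with_data = FlexibleSystem(sys; energy = -1586.0u"eV", energy_weight = 10.0)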

  • Can you explain briefly where the speedup comes from?

I am not completely sure. I suspect it has something to do with the garbage collection issue. Here is a benchmark comparison with the example posted above:

# old implementation
julia> @benchmark ACEfit.assemble($train_old, $basis)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 14.657 s (98.42% GC) to evaluate,
 with a memory estimate of 445.05 MiB, over 5253010 allocations.

# new implementation
julia> @benchmark ACEfit.assemble($train_new, $basis; energy_default_weight=$5, energy_ref=$model.Vref) 
BenchmarkTools.Trial: 32 samples with 1 evaluation.
 Range (min … max):  136.020 ms … 180.292 ms  ┊ GC (min … max): 18.63% … 16.11%
 Time  (median):     156.335 ms               ┊ GC (median):    23.22%
 Time  (mean ± σ):   157.097 ms ±   9.635 ms  ┊ GC (mean ± σ):  22.56% ±  2.36%

                     █ ▃▃▃             █   █
  ▇▁▁▁▁▁▁▁▁▇▁▁▇▁▇▁▇▁▁█▇███▁▁▁▇▇▇▁▁▇▇▇▁▇█▁▁▁█▇▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇ ▁
  136 ms           Histogram: frequency by time          180 ms <

 Memory estimate: 1.03 GiB, allocs estimate: 5453582.

The use of multiprocessing in the old implementation makes it hard to see where the difference comes from, but it is probably the garbage collection.

why add the example training data to the repo when you can just get it via ACEpotentials.example_dataset?

It is the first 3 structures from the ACEpotentials training data. Its purpose is to test the correctness of the assembly. I just thought it is so little data that it was better to add a copy here rather than load it over the network every time the tests run.

@wcwitt

wcwitt commented Sep 17, 2023

Can someone explain the context here? Is the aim to replace JuLIP?

The new way is much faster than the old one, by a factor of several tens to several hundreds

I am fully prepared to believe this is possible, but we should agree on some larger benchmarks. This could become one: ACEsuit/ACEfit.jl#54 (comment).

The assembly, particularly in parallel, has felt confusingly sluggish to me for some time. I've been attributing it to something in ACE1/JuLIP (perhaps involving GC), which I've been reluctant to dig into deeply, but I'm not sure.

@cortner
Member

cortner commented Sep 17, 2023

Yes we want to kill JuLIP by moving small pieces into ACE packages or into community packages.

@cortner
Member

cortner commented Sep 17, 2023

I also had the impression that Teemu's weights are incompatible with ours, so this may need to be fixed. But that's a minor issue.

@cortner
Member

cortner commented Sep 18, 2023

It's really funny actually: the old assembly uses less memory but spends the majority of its time in GC:

  9.282455 seconds (5.27 M allocations: 450.554 MiB, 96.17% gc time)
  0.952607 seconds (5.57 M allocations: 1.058 GiB, 4.73% gc time)

Reducing the GC share from about 96% to 5% more or less explains the factor-10 speedup.

@cortner
Member

cortner commented Sep 18, 2023

@tjjarvinen -- can you also please open a separate issue for the bug you found? If we trace it to ACE1 then I need to look into it.

@cortner
Member

cortner commented Sep 18, 2023

One small remark, to make the tests entirely fair one should remove the cached neighbourlist after each assembly,

rm_nlist(at::Atoms) = delete!(at.data, "nlist:default")   # drop the cached neighbourlist
rm_nlist.(data_j)

but this doesn't actually seem to make a measurable difference.

@wcwitt

wcwitt commented Sep 18, 2023

If this speedup holds, I'm excited, because it likely also means better strong scaling. Currently, we are at least a factor of 5 worse in that dimension than we should be, perhaps more.

@cortner
Member

cortner commented Sep 18, 2023

The numbers change a bit when you make the model bigger - down to about a factor of 3. Still a very, very nice speedup.

# length(basis) = 4676,
# length(data) = 65
 14.784368 seconds (14.64 M allocations: 8.669 GiB, 64.56% gc time)
  3.377460 seconds (18.59 M allocations: 30.495 GiB, 25.45% gc time)

# length(basis) = 12888
# length(data) = 65
 26.681512 seconds (25.48 M allocations: 23.483 GiB, 38.05% gc time)
  9.563755 seconds (37.73 M allocations: 83.549 GiB, 23.07% gc time)

# length(basis) = 12888
# length(data) = 329
119.029084 seconds (119.29 M allocations: 105.846 GiB, 43.04% gc time)
 40.032188 seconds (148.74 M allocations: 339.525 GiB, 20.97% gc time)

@wcwitt

wcwitt commented Sep 18, 2023

I tried it on some other datasets and hit errors (e.g., it still needs support for custom energy/force/virial keys). But encouraging nonetheless.

@wcwitt

wcwitt commented Sep 18, 2023

Hm, am I correct that the new assembly routine does not force garbage collection after each structure, unlike the old one? That is likely a factor - perhaps the main factor - particularly for small tests.

We know forcing GC hurts performance, but it has been the only way to avoid crashes (for both distributed and threaded) during larger assemblies.

@cortner
Member

cortner commented Sep 18, 2023

What if we add a counter and GC every once in a while?
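A rough sketch of that idea (assemble_block is a hypothetical stand-in for the per-structure work, not the actual ACEfit code):

function assemble_with_periodic_gc(data, basis; gc_every = 10)
    blocks = Vector{Any}(undef, length(data))
    for (i, d) in enumerate(data)
        blocks[i] = assemble_block(d, basis)   # hypothetical per-structure assembly
        i % gc_every == 0 && GC.gc()           # force a full collection only occasionally
    end
    return blocks
end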

@wcwitt

wcwitt commented Sep 18, 2023

Yeah, perhaps. Or I will merge in one of the batching ideas and we can GC after batches.

Still holding out hope this PR fixes the underlying issue somehow - let's see.

@wcwitt

wcwitt commented Sep 18, 2023

Just tried the small benchmark again (from the top of this thread):

# old (looks insane, but I triple checked)
20.635631 seconds (5.49 M allocations: 786.066 MiB, 97.01% gc time)
# old, without forcing GC
0.605159 seconds (5.49 M allocations: 786.014 MiB, 17.42% gc time)
# new
0.747427 seconds (5.61 M allocations: 1.094 GiB, 17.34% gc time)

So for the tiny model the forced GC explains everything.

Same trend, but less definitive for the bigger model:

length(basis) = 12888
length(train_old) = 66

# old
82.048485 seconds (31.27 M allocations: 41.684 GiB, 29.60% gc time)
# old, without forcing GC
62.509240 seconds (31.27 M allocations: 41.684 GiB, 3.69% gc time)
# new
49.417405 seconds (45.56 M allocations: 81.406 GiB, 2.45% gc time)

@tjjarvinen
Collaborator Author

That suggests the issue might be with the communication. For bigger systems more time is spent on the ACE evaluations, which overshadows the other parts of the code. The new code still has unused optimizations, so we should get more out of it once they are added.

The issue with the old version is that there is a Distributed array to which all processes write at the same time. By definition this requires some kind of locking and broadcasting of the changes made to the array, and that is probably where the issue comes from. However, that would only matter when there is more than one process, yet the issue is present even with a single process.

One point to note is that you could not implement the old assemble version in Rust, because in Rust only the owner can mutate the data, so the compiler would refuse to compile it. That is generally a huge red flag that there are possible issues in the code, and the pattern should be avoided.
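For illustration, here is a sketch of the ownership-friendly pattern, where each worker builds and returns its own block and only the driver writes into the final arrays (assemble_block is again a hypothetical stand-in, not the actual ACEfit code):

using Distributed

function assemble_local_blocks(data, basis)
    # Each worker computes its block independently; no process writes into shared memory.
    blocks = pmap(d -> assemble_block(d, basis), data)
    # Only the driver concatenates the results.
    return reduce(vcat, blocks)
end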

@tjjarvinen
Collaborator Author

I added support for customizing the data keys:

ACEfit.assemble(data, basis;
    energy_key=:energy,
    force_key=:force,
    virial_key=:virial
)

Changing those should allow you to use the other datasets.
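For example, for a dataset that stores its labels under different names (the key names below are just illustrative):

A, Y, W = ACEfit.assemble(data, basis;
    energy_key = :dft_energy,
    force_key  = :dft_force,
    virial_key = :dft_virial
)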

@tjjarvinen
Collaborator Author

I added a small performance improvement that kicks in when both the force and the virial are calculated. The idea is that both need an evaluate_d call, so instead of calling it twice, the force and virial are collected from the same call. This nearly eliminates the time needed for the virial calculation - about one third off the total assembly time, so not huge, but it was very easy to do.
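Roughly, the idea looks like this (the helper names are illustrative, not the actual ACE1/ACEfit API):

function force_and_virial_blocks(basis, at)
    dB = evaluate_d(basis, at)      # the expensive gradient evaluation, done once
    F  = forces_block(dB)           # hypothetical helper: force rows of the design matrix
    V  = virial_block(dB, at)       # hypothetical helper: virial rows from the same gradient
    return F, V
end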

@wcwitt

wcwitt commented Sep 19, 2023

The idea is that both need an evaluate_d call, so instead of calling it twice, the force and virial are collected from the same call.

Smart, thanks.

The issue with the old version is that there is a Distributed array to which all processes write at the same time.

I understand the reasoning, but I don't think it's the main bottleneck (yet). In the past, I've tested by eliminating the write step, such that the assembly blocks are created and then immediately discarded, but the times don't improve significantly. There is some discussion of that here (but in the threading context): ACEsuit/ACEfit.jl#49 (comment)

@cortner
Member

cortner commented Sep 19, 2023

Yes - caching virials and forces is a long-standing issue :) - thanks for solving this here. Though we should really have this at a more general level. Until then this is a great improvement.

@cortner
Member

cortner commented Sep 19, 2023

Regardless of understanding the origin of the improvement, can we please agree on a few tests that need to pass before we make these changes the default for ACEfit and ACEpotentials?

@tjjarvinen
Collaborator Author

We should not hurry to make this the default yet. The best way is to add it as an alternative first, for a testing period. During that period we tell people to try it, and once we are comfortable we can make it the default.

I also need to fix the issue with ExtXYZ before we can go for the full AtomsBase workflow. I should be able to fix that this week, after which James needs to release a new version, but that should not be an issue.

I understand the reasoning, but I don't think it's the main bottleneck (yet). In the past, I've tested by eliminating the write step, such that the assembly blocks are created and then immediately discarded, but the times don't improve significantly. There is some discussion of that here (but in the threading context): ACEsuit/ACEfit.jl#49 (comment)

Yes, I know - there is something here that I don't understand either. But it is somehow related to the way the data is moved/transformed...

@cortner
Member

cortner commented Sep 19, 2023

We should not hurry to make this the default yet.

Fair enough, but it still needs equivalent functionality and identical results on a range of tests.

@tjjarvinen merged commit 7dad29d into main on Oct 4, 2023
4 checks passed
@tjjarvinen deleted the acefit-extension branch on October 17, 2023 at 06:44