
Acefit extension #46

Merged
merged 22 commits into main from acefit-extension on Oct 4, 2023

Conversation

tjjarvinen
Collaborator

@cortner This adds training support for AtomsBase input. For now it only overloads the assembly function in ACEfit; later on we can add support for the ACEpotentials-specific functions.

There was one issue while implementing this. I stumbled on a bug that caused assemble to give random results (it can be triggered by changing pmap in assemble to Folds.map). This is probably a bug in ACE data allocation, but it could also be in Julia itself - I don't understand how it works. I am quite concerned about it, as it could appear somewhere else, and future changes in Julia could make it surface in the existing "working" code.

Here is an example and a comparison with the old way:

using ACEmd
using ACEfit
using ACEpotentials

model = acemodel(
    elements = [:Ti, :Al],
    order = 3,
    totaldegree = 6,
    rcut = 5.5,
    Eref = [:Ti => -1586.0195, :Al => -105.5954]
)
basis = model.basis

data_j, _, meta = ACEpotentials.example_dataset("TiAl_tutorial")
weights = Dict( "default" => Dict("E" => 5.0, "F" => 1.0 , "V" => 1.0 ) )
datakeys = (energy_key = "energy", force_key = "force", virial_key = "virial")
train_old = [ACEpotentials.AtomsData(t; weights=weights, v_ref=model.Vref, datakeys...) for t in data_j[1:5:end]]
train_new = [  FlexibleSystem(x.atoms) for x in train_old  ]

# Old style
A, Y, W = ACEfit.assemble(train_old, basis);

# New style
a, y, w = ACEfit.assemble(train_new, basis; energy_default_weight=5, energy_ref=model.Vref)

# Compare the results - all three differences should be zero up to floating-point noise
maximum(abs2, A - a)
maximum(abs2, Y - y)
maximum(abs2, W - w)

The new way is much faster than the old one, by a factor of several tens to several hundreds (if you don't use parallel processing). There is also a way to make it even faster, but I will skip that for now. The main point is that assembly is now faster than solving the linear system, which changes the dynamics of fitting potentials completely.

@cortner
Member

cortner commented Sep 16, 2023

Hi Teemu - thank you for exploring this. I will need to find time to go through the code very carefully. In the meantime, can you please explain a few things?

  • How does the fitting data enter the new assembly? I don't see this. Does FlexibleSystem keep the metadata like reference energies and forces?
  • Similarly, how do the fitting weights enter the new assembly?
  • Can you explain briefly where the speedup comes from?

Regarding the bug: I have some doubts about there being a bug in the ACE1 allocations - it has been in use for many years now - but you never know; as you say, changes in Julia or elsewhere might bring it out. It could also be some incorrect assumption about how temporary storage is pre-allocated and how outputs are allowed to be used. Either way, I'd be grateful if you could open a separate issue or PR with a reproducible example so we can investigate.

CC @wcwitt - would you mind following this discussion as well?

@cortner
Member

cortner commented Sep 16, 2023

Another question - why add the example training data to the repo when you can just get it via ACEpotentials.example_dataset?

@tjjarvinen
Collaborator Author

  • How does the fitting data enter the new assembly? I don't see this. Does FlexibleSystem keep the metadata like reference energies and forces?

Yes, AtomsBase is defined so that you can store arbitrary data in the structures. The documentation page of this PR has the details. In short, if data is an AtomsBase structure, it should carry the training data behind the keys data[:energy], data[:force] and data[:virial]. ExtXYZ currently has a bug that makes it fail to read some of this data, like forces, when reading into an AtomsBase structure (it works for JuLIP), so I need to submit a PR there to get the AtomsBase interface working fully.

Fitting weights are also explained in the documentation page. In short, general weights are given as keyword arguments, as in the example above. In addition, you can give each structure a separate weight that is used only for that structure.
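For concreteness, here is a minimal sketch of how such data could be read and attached on the AtomsBase side. The FlexibleSystem update-constructor usage and the :energy_weight key name are illustrative assumptions, not necessarily the exact convention documented in this PR.

using AtomsBase
using Unitful: @u_str

# Take one of the converted structures from the example above.
sys = train_new[1]

# System-level properties are read through the AtomsBase key interface;
# this is where the new assembly looks for training data.
haskey(sys, :energy) && sys[:energy]
haskey(sys, :force)  && sys[:force]
haskey(sys, :virial) && sys[:virial]

# A hypothetical way to attach data and a per-structure weight yourself
# (assumption: extra keyword arguments of the FlexibleSystem update constructor
# are stored as system properties; :energy_weight is an illustrative key name,
# not confirmed by this PR).
sys_with_data = FlexibleSystem(sys; energy = -1586.0u"eV", energy_weight = 10.0)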

  • Can you explain briefly where the speedup comes from?

I am not completely sure. I suspect it has something to do with the garbage collection issue. Here is a benchmark comparison with the example posted above:

# old implementation
julia> @benchmark ACEfit.assemble($train_old, $basis)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 14.657 s (98.42% GC) to evaluate,
 with a memory estimate of 445.05 MiB, over 5253010 allocations.

# new implementation
julia> @benchmark ACEfit.assemble($train_new, $basis; energy_default_weight=$5, energy_ref=$model.Vref) 
BenchmarkTools.Trial: 32 samples with 1 evaluation.
 Range (min … max):  136.020 ms … 180.292 ms  ┊ GC (min … max): 18.63% … 16.11%
 Time  (median):     156.335 ms               ┊ GC (median):    23.22%
 Time  (mean ± σ):   157.097 ms ±   9.635 ms  ┊ GC (mean ± σ):  22.56% ±  2.36%

                     █ ▃▃▃             █   █
  ▇▁▁▁▁▁▁▁▁▇▁▁▇▁▇▁▇▁▁█▇███▁▁▁▇▇▇▁▁▇▇▇▁▇█▁▁▁█▇▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇ ▁
  136 ms           Histogram: frequency by time          180 ms <

 Memory estimate: 1.03 GiB, allocs estimate: 5453582.

The use of multiprocessing in the old implementation makes it hard to see where the difference comes from, but it is probably the garbage collection.

why add the example training data to the repo when you can just get it via ACEpotentials.example_dataset?

It is the first 3 structures from the ACEpotentials training data. Its purpose is to test the correctness of the assembly. I just thought it is so little data that it was better to add a copy here rather than load it over the network every time the tests run.

@wcwitt

wcwitt commented Sep 17, 2023

Can someone explain the context here? Is the aim to replace JuLIP?

The new way is much faster than the old one, by a factor of several tens to several hundreds

I am fully prepared to believe this is possible, but we should agree on some larger benchmarks. This could become one: ACEsuit/ACEfit.jl#54 (comment).

The assembly, particularly in parallel, has felt confusingly sluggish to me for some time. I've been attributing it to something in ACE1/JuLIP (perhaps involving GC), which I've been reluctant to dig into deeply, but I'm not sure.

@cortner
Member

cortner commented Sep 17, 2023

Yes we want to kill JuLIP by moving small pieces into ACE packages or into community packages.

@cortner
Member

cortner commented Sep 17, 2023

I also had the impression that Teemu's weights are incompatible with ours, so this may need to be fixed. But that's a minor issue.

@cortner
Member

cortner commented Sep 18, 2023

It's really funny actually: the old assembly uses less memory but spends the majority of its time in GC:

  9.282455 seconds (5.27 M allocations: 450.554 MiB, 96.17% gc time)
  0.952607 seconds (5.57 M allocations: 1.058 GiB, 4.73% gc time)

Reducing the GC share from about 96% to 5% more or less explains the factor-10 speedup.

@cortner
Member

cortner commented Sep 18, 2023

@tjjarvinen -- can you also please open a separate issue for the bug you found? If we trace it to ACE1 then I need to look into it.

@cortner
Member

cortner commented Sep 18, 2023

One small remark, to make the tests entirely fair one should remove the cached neighbourlist after each assembly,

rm_nlist(at::Atoms) = delete!(at.data, "nlist:default")   # drop the cached neighbourlist
rm_nlist.(data_j)

but this doesn't actually seem to make a measurable difference.

@wcwitt

wcwitt commented Sep 18, 2023

If this speedup holds, I'm excited, because it likely also means better strong scaling. Currently, we are at least a factor of 5 worse in that dimension than we should be, perhaps more.

@cortner
Member

cortner commented Sep 18, 2023

The numbers change a bit when you make the model bigger - down to about a factor of 3. Still a very, very nice speedup.

# length(basis) = 4676,
# length(data) = 65
 14.784368 seconds (14.64 M allocations: 8.669 GiB, 64.56% gc time)
  3.377460 seconds (18.59 M allocations: 30.495 GiB, 25.45% gc time)

# length(basis) = 12888
# length(data) = 65
 26.681512 seconds (25.48 M allocations: 23.483 GiB, 38.05% gc time)
  9.563755 seconds (37.73 M allocations: 83.549 GiB, 23.07% gc time)

# length(basis) = 12888
# length(data) = 329
119.029084 seconds (119.29 M allocations: 105.846 GiB, 43.04% gc time)
 40.032188 seconds (148.74 M allocations: 339.525 GiB, 20.97% gc time)

@wcwitt

wcwitt commented Sep 18, 2023

I tried it on some other datasets and hit errors (e.g., it still needs support for custom energy/force/virial keys). But encouraging nonetheless.

@wcwitt

wcwitt commented Sep 18, 2023

Hm, am I correct that the new assembly routine does not force garbage collection after each structure, unlike the old one? That is likely a factor - perhaps the main factor - particularly for small tests.

We know forcing GC hurts performance, but it has been the only way to avoid crashes (for both distributed and threaded) during larger assemblies.

@cortner
Member

cortner commented Sep 18, 2023

What if we add a counter and GC every once in a while?
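A rough sketch of that idea (assemble_block is a hypothetical stand-in for the per-structure work, not the actual ACEfit code):

function assemble_with_periodic_gc(data, basis; gc_every = 10)
    blocks = Vector{Any}(undef, length(data))
    for (i, d) in enumerate(data)
        blocks[i] = assemble_block(d, basis)   # hypothetical per-structure assembly
        i % gc_every == 0 && GC.gc()           # force a full collection only occasionally
    end
    return blocks
end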

@wcwitt

wcwitt commented Sep 18, 2023

Yeah, perhaps. Or I will merge in one of the batching ideas and we can GC after batches.

Still holding out hope this PR fixes the underlying issue somehow - let's see.

@wcwitt

wcwitt commented Sep 18, 2023

Just tried the small benchmark again (from the top of this thread):

# old (looks insane, but I triple checked)
20.635631 seconds (5.49 M allocations: 786.066 MiB, 97.01% gc time)
# old, without forcing GC
0.605159 seconds (5.49 M allocations: 786.014 MiB, 17.42% gc time)
# new
0.747427 seconds (5.61 M allocations: 1.094 GiB, 17.34% gc time)

So for the tiny model the forced GC explains everything.

Same trend, but less definitive for the bigger model:

length(basis) = 12888
length(train_old) = 66

# old
82.048485 seconds (31.27 M allocations: 41.684 GiB, 29.60% gc time)
# old, without forcing GC
62.509240 seconds (31.27 M allocations: 41.684 GiB, 3.69% gc time)
# new
49.417405 seconds (45.56 M allocations: 81.406 GiB, 2.45% gc time)

@tjjarvinen
Collaborator Author

That suggests the issue might be with the communication. For bigger systems more time is spent on the ACE evaluations, which overshadows the other parts of the code. The new code still has unused optimizations, so we should get more out of it once they are added.

The issue with the old version is that there is a Distributed array to which all processes write at the same time. By definition this requires some kind of locking and broadcasting of the changes made to the array, and that is probably where the issue comes from. However, that would only matter when there is more than one process, yet the issue is present even with a single process.

One point to note is that you could not implement the old assemble version in Rust, because in Rust only the owner can mutate the data, so the compiler would refuse to compile it. That is generally a huge red flag that there are possible issues in the code, and the pattern should be avoided.
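For illustration, here is a sketch of the ownership-friendly pattern, where each worker builds and returns its own block and only the driver writes into the final arrays (assemble_block is again a hypothetical stand-in, not the actual ACEfit code):

using Distributed

function assemble_local_blocks(data, basis)
    # Each worker computes its block independently; no process writes into shared memory.
    blocks = pmap(d -> assemble_block(d, basis), data)
    # Only the driver concatenates the results.
    return reduce(vcat, blocks)
end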

@tjjarvinen
Collaborator Author

I added support for customizing the data keys:

ACEfit.assemble(data, basis;
    energy_key=:energy,
    force_key=:force,
    virial_key=:virial
)

Changing those should allow you to use the other datasets.
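For example, for a dataset that stores its labels under different names (the key names below are just illustrative):

A, Y, W = ACEfit.assemble(data, basis;
    energy_key = :dft_energy,
    force_key  = :dft_force,
    virial_key = :dft_virial
)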

@tjjarvinen
Collaborator Author

I added a small performance improvement that kicks in when both the force and the virial are calculated. The idea is that both need an evaluate_d call, so instead of calling it twice, the force and virial are collected from the same call. This nearly eliminates the time needed for the virial calculation - about one third off the total assembly time, so not huge, but it was very easy to do.
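Roughly, the idea looks like this (the helper names are illustrative, not the actual ACE1/ACEfit API):

function force_and_virial_blocks(basis, at)
    dB = evaluate_d(basis, at)      # the expensive gradient evaluation, done once
    F  = forces_block(dB)           # hypothetical helper: force rows of the design matrix
    V  = virial_block(dB, at)       # hypothetical helper: virial rows from the same gradient
    return F, V
end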

@wcwitt

wcwitt commented Sep 19, 2023

The idea is that both need an evaluate_d call, so instead of calling it twice, the force and virial are collected from the same call.

Smart, thanks.

The issue with the old version is that there is a Distributed array to which all processes write at the same time.

I understand the reasoning, but I don't think it's the main bottleneck (yet). In the past, I've tested by eliminating the write step, such that the assembly blocks are created and then immediately discarded, but the times don't improve significantly. There is some discussion of that here (but in the threading context): ACEsuit/ACEfit.jl#49 (comment)

@cortner
Member

cortner commented Sep 19, 2023

Yes - caching virials and forces is a long-standing issue :) - thanks for solving this here. Though we should really have this at a more general level. Until then this is a great improvement.

@cortner
Member

cortner commented Sep 19, 2023

Regardless of understanding the origin of the improvement, can we please agree on a few tests that need to pass before we make these changes the default for ACEfit and ACEpotentials?

@tjjarvinen
Collaborator Author

We should not hurry to make this the default yet. The best way is to add it as an alternative first, for a testing period. During that period we tell people to try it, and once we are comfortable we can make it the default.

I also need to fix the issue with ExtXYZ before we can go for the full AtomsBase workflow. I should be able to fix that this week, after which James needs to release a new version, but that should not be an issue.

I understand the reasoning, but I don't think it's the main bottleneck (yet). In the past, I've tested by eliminating the write step, such that the assembly blocks are created and then immediately discarded, but the times don't improve significantly. There is some discussion of that here (but in the threading context): ACEsuit/ACEfit.jl#49 (comment)

Yes, I know - there is something here that I don't understand either. But it is somehow related to the way the data is moved/transformed...

@cortner
Member

cortner commented Sep 19, 2023

We should not hurry to make this the default yet.

Fair enough, but it still needs equivalent functionality and identical results on a range of tests.

@tjjarvinen merged commit 7dad29d into main on Oct 4, 2023
4 checks passed
@tjjarvinen deleted the acefit-extension branch on October 17, 2023 at 06:44