acesuit / ACEfit.jl
Generic Codes for Fitting ACE models
License: MIT License
So far we have been using direct methods for computing the logdet and related terms required for maximizing the Bayesian evidence. However, for very large matrices it could be advantageous to try something else that would be more compatible with SGD-type optimizers. Some useful references:
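One standard SGD-friendly ingredient for this is stochastic trace estimation: the gradient of logdet(A) with respect to a hyperparameter involves tr(A^-1 dA), and tr(A^-1) can be estimated with random probe vectors instead of a full factorization of the trace. The sketch below (names and structure are mine, not ACEfit code) shows a Hutchinson estimator with Rademacher probes; it still uses a Cholesky solve for clarity, but each probe only needs a linear solve, which could itself be replaced by an iterative method.

```julia
using LinearAlgebra, Random

# Hutchinson estimator for tr(inv(A)) with Rademacher probes (illustrative
# sketch, not the ACEfit implementation). tr(inv(A)) appears in the gradient
# of logdet(A) w.r.t. hyperparameters, so a stochastic estimate of it is one
# route to SGD-compatible evidence maximization.
function hutchinson_trinv(A::AbstractMatrix; nprobes::Int = 100,
                          rng = Random.default_rng())
    n = size(A, 1)
    F = cholesky(Symmetric(A))          # assumes A is symmetric positive definite
    est = 0.0
    for _ in 1:nprobes
        z = rand(rng, (-1.0, 1.0), n)   # Rademacher probe vector
        est += dot(z, F \ z)            # z' * inv(A) * z
    end
    return est / nprobes
end

M = randn(20, 20)
A = Symmetric(M * M' + 20I)             # a well-conditioned SPD test matrix
est = hutchinson_trinv(A; nprobes = 500)
```

With enough probes the estimate concentrates around the exact trace; in an SGD setting one would instead use a handful of probes per step and rely on averaging over steps.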
Probably a good time to fix a notation for ACEfit. I view this as not completely trivial since it'd be best for the (eventual) documentation to match the code. Some considerations: my proposals are in bold below, and I've quickly sketched out some alternatives and reasoning.

Design matrix:
- **A**: as in Ax = b. Used by ACEfit currently.
- X: simple, seems standard in some contexts.
- \Phi: another standard choice, used in the Deringer et al. Chem Review.
- \Psi: used by IPFitting currently.

Coefficient vector:
- x: as in Ax = b.
- **c**: probably the best choice?
- w: used by some GP literature ("weight-space view"). Seems best to avoid because we use "weights" for something else.
- \theta

Target vector:
- **y**: best for case to match that of c?
- Y: used currently by both IPFitting and ACEfit.
- b
- t
I wonder whether ACEfit needs the JuLIP dependency. The observation classes that need JuLIP could just be moved into ACE1.jl or ACE1pack.jl. Is there any other reason?
I would really like it if ACEfit could be entirely abstract (just as HAL should be).
Reported by @CheukHinHoJerry in ACEsuit/ACE1x.jl#7.
I got this error multiple times with the ACEfit.assemble function with multiple workers for a large lsq system, and I remember there was an issue about this, so I think it's better to post it here. It happens when I am in the middle of assembling the design matrix. This is the full error log:
Worker 18 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
[1] (::Base.var"#wait_locked#715")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
@ Base ./stream.jl:947
[2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
@ Base ./stream.jl:955
[3] unsafe_read
@ ./io.jl:761 [inlined]
[4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
@ Base ./io.jl:760
[5] read!
@ ./io.jl:762 [inlined]
[6] deserialize_hdr_raw
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/messages.jl:167 [inlined]
[7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:172
[8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:133
[9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed ./task.jl:514
Progress: 21%|████████████████████████▌ | ETA: 0:52:08ERROR: Lo18Progress: 21%|████████████████████████▌ | ETA: 0:51:57)
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:920
[2] wait()
@ Base ./task.jl:984
[3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
@ Base ./condition.jl:130
[4] wait
@ ./condition.jl:125 [inlined]
[5] take_buffered(c::Channel{Any})
@ Base ./channels.jl:456
[6] take!(c::Channel{Any})
@ Base ./channels.jl:450
[7] take!(::Distributed.RemoteValue)
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:726
[8] remotecall_fetch(f::Function, w::Distributed.Worker, args::ACEfit.DataPacket{AtomsData}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:461
[9] remotecall_fetch(f::Function, w::Distributed.Worker, args::ACEfit.DataPacket{AtomsData})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
[10] #remotecall_fetch#162
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
[11] remotecall_fetch(f::Function, id::Int64, args::ACEfit.DataPacket{AtomsData})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492
[12] remotecall_pool(rc_f::Function, f::Function, pool::WorkerPool, args::ACEfit.DataPacket{AtomsData}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:126
[13] remotecall_pool
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:123 [inlined]
[14] #remotecall_fetch#200
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:232 [inlined]
[15] remotecall_fetch
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:232 [inlined]
[16] #208#209
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:288 [inlined]
[17] #208
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:288 [inlined]
[18] (::Base.var"#978#983"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, ProgressMeter.var"#56#59"{RemoteChannel{Channel{Bool}}, ACEfit.var"#3#4"{JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, SharedArrays.SharedVector{Float64}, SharedArrays.SharedVector{Float64}, SharedArrays.SharedMatrix{Float64}}}}}})(r::Base.RefValue{Any}, args::Tuple{ACEfit.DataPacket{AtomsData}})
@ Base ./asyncmap.jl:100
[19] macro expansion
@ ./asyncmap.jl:234 [inlined]
[20] (::Base.var"#994#995"{Base.var"#978#983"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, ProgressMeter.var"#56#59"{RemoteChannel{Channel{Bool}}, ACEfit.var"#3#4"{JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, SharedArrays.SharedVector{Float64}, SharedArrays.SharedVector{Float64}, SharedArrays.SharedMatrix{Float64}}}}}}, Channel{Any}, Nothing})()
@ Base ./task.jl:514
Stacktrace:
[1] (::Base.var"#988#990")(x::Task)
@ Base ./asyncmap.jl:177
[2] foreach(f::Base.var"#988#990", itr::Vector{Any})
@ Base ./abstractarray.jl:3073
[3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{ACEfit.DataPacket{AtomsData}})
@ Base ./asyncmap.jl:177
[4] wrap_n_exec_twice
@ ./asyncmap.jl:153 [inlined]
[5] #async_usemap#973
@ ./asyncmap.jl:103 [inlined]
[6] async_usemap
@ ./asyncmap.jl:84 [inlined]
[7] #asyncmap#972
@ ./asyncmap.jl:81 [inlined]
[8] asyncmap
@ ./asyncmap.jl:80 [inlined]
[9] pmap(f::Function, p::WorkerPool, c::Vector{ACEfit.DataPacket{AtomsData}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/pmap.jl:126
[10] pmap(f::Function, p::WorkerPool, c::Vector{ACEfit.DataPacket{AtomsData}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/pmap.jl:99
[11] pmap(f::Function, c::Vector{ACEfit.DataPacket{AtomsData}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/pmap.jl:156
[12] pmap(f::Function, c::Vector{ACEfit.DataPacket{AtomsData}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/pmap.jl:156
[13] macro expansion
@ ~/.julia/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1015 [inlined]
[14] macro expansion
@ ./task.jl:476 [inlined]
[15] macro expansion
@ ~/.julia/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1014 [inlined]
[16] macro expansion
@ ./task.jl:476 [inlined]
[17] progress_map(::Function, ::Vararg{Any}; mapfun::Function, progress::ProgressMeter.Progress, channel_bufflen::Int64, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ ProgressMeter ~/.julia/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1007
[18] assemble(data::Vector{AtomsData}, basis::JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis})
@ ACEfit ~/.julia/packages/ACEfit/ID48n/src/assemble.jl:31
[19] make_train(model::ACE1x.ACE1Model)
@ Main ~/julia_ws/ACEworkflows/Fe_pure_jerry/asm_all_lsq.jl:54
[20] top-level scope
@ ~/julia_ws/ACEworkflows/Fe_pure_jerry/asm_all_lsq.jl:91
[21] include(fname::String)
@ Base.MainInclude ./client.jl:478
[22] top-level scope
@ REPL[2]:1
in expression starting at /zfs/users/jerryho528/jerryho528/julia_ws/ACEworkflows/Fe_pure_jerry/asm_all_lsq.jl:71
[e3f9bc04] ACE1 v0.11.12
[8c4e8d19] ACE1pack v0.4.1
[5cc4c08c] ACE1x v0.1.4
[ad31a8ef] ACEfit v0.1.1
[f67ccb44] HDF5 v0.16.15
[682c06a0] JSON v0.21.4
[898213cb] LowRankApprox v0.5.3
[91a5bcdd] Plots v1.38.16
[08abe8d2] PrettyTables v2.2.5
[de0858da] Printf
Additionally:
It happens every time, so it stops me from assembling a large lsq system.
Need to bring LsqDB back.
Currently, only virials are supported.
From Slack: "Don't forget the negative sign (assuming you're using ASE's sign convention for the stress) and be careful about Voigt 6-vector vs. 3x3 matrix"
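The conversion the Slack quote warns about can be sketched as follows (an illustrative helper, not ACEfit code): ASE stores stress as a Voigt 6-vector in the order (xx, yy, zz, yz, xz, xy), and with ASE's sign convention stress = -virial / volume, so recovering the 3x3 virial needs both the symmetric reshape and the negative sign.

```julia
# Sketch (not the ACEfit implementation): convert an ASE-convention stress,
# given as a Voigt 6-vector (xx, yy, zz, yz, xz, xy), into the 3x3 virial.
# ASE convention: stress = -virial / volume, hence the negative sign below.
function voigt_to_virial(stress6::AbstractVector, volume::Real)
    σ = [stress6[1] stress6[6] stress6[5];
         stress6[6] stress6[2] stress6[4];
         stress6[5] stress6[4] stress6[3]]
    return -volume * σ
end
```

Getting either the ordering or the sign wrong produces a fit that silently trains on the wrong targets, so this is worth a unit test.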
Everyone agrees that the coefficients need to be regularised (or 'small'), and this is incorporated into the BRR/ARD prior. It's also possible to incorporate other 'expert' information into the prior. One idea floating around, which @WillBaldwin0 is looking into I think, is to fit to dimer data first, incorporate it into the prior, and then do the full solve. We'd need an interface allowing us to provide this prior c-vector before fitting. This procedure is quite simple and turns out to be really only a change of variables.
Probably worth it to wait and see what Will thinks about this idea first before implementing it properly.
... should be allowed to be scalars, vectors (diagonal?!), or matrices. Maybe enforce them to always be matrices? Scalars can just be represented as w * I.
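In Julia the "scalars can just be represented as w * I" option is essentially free, because `w * I` produces a `UniformScaling` that participates in matrix algebra without ever materializing a dense matrix:

```julia
using LinearAlgebra

# A scalar weight w acts like the matrix w * I without allocating it:
w = 3.0
W = w * I                       # UniformScaling; composes with any size
r = [1.0, 2.0, 3.0]
@assert W * r == 3.0 .* r       # same result as Diagonal(fill(w, 3)) * r
@assert W * Diagonal([1.0, 2.0, 3.0]) == Diagonal([3.0, 6.0, 9.0])
```

So an interface that "always takes a matrix" could still accept scalars and vectors cheaply by normalizing them to `UniformScaling` and `Diagonal` internally.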
@showprogress map(f, 1:length(data))
This line counts how many structures have been assembled. What it should count is the total number of atoms left. In a dataset with vastly varying structure sizes, the progress meter can be very deceiving.
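A minimal sketch of the atom-weighted alternative (plain Julia, with a hypothetical `natoms` per structure; the real fix would feed these steps into ProgressMeter rather than computing fractions by hand):

```julia
# Illustrative sketch: advance progress by atom count rather than by
# structure count, so large structures move the bar proportionally.
# `natoms` would come from the dataset, e.g. length of each configuration.
function atom_weighted_progress(natoms::Vector{Int})
    total = sum(natoms)
    done = 0
    fracs = Float64[]
    for n in natoms
        # ... assemble the design-matrix block for this structure here ...
        done += n
        push!(fracs, done / total)   # report this, not i / length(natoms)
    end
    return fracs
end

fracs = atom_weighted_progress([8, 8, 256, 8, 512])
```

With the sizes above, finishing the first two structures reports only ~2% done instead of the misleading 40% a structure count would give.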
After I added Distributed and DistributedArrays as dependencies, the docs part of the CI stopped working, and I can't figure out how to fix it. Sorry @cortner - is there something obvious I'm missing?
Hi @cortner, would you mind tagging a new version? I just fixed one of the warnings that was causing problems in ACE1pack.
The unit tests are failing on MacOS for reasons I don't understand. I have commented the relevant line from the CI for now.
There's a strong connection between the GAP sigmas and ACE weights. In principle the weights are the inverse square roots of the sigmas, but I don't think it's quite that simple. One can dial the weights up for a very simple ACE model, but this will not lead to good training errors because the ACE model is too constrained to fit the underlying data. Using ARD/BRR this would become apparent because the associated noise term would be large. I think it'd be nice to propagate the noise term through the weights matrix and display the "optimised sigmas" after an ACE fit. GAP users should relate to this quite well, I think.
TODO @cortner
Does the showprogress line
@showprogress pmap(packets) do p
correctly represent progress? I.e. does it account for the fact that some structures are much bigger than others and more expensive to assemble?
Note in #54 I do this manually.
My evidence for this is that my distributed assembly starts with > 1.5h, then drops to ca 1h for a while and then completes in around 25 minutes total.
... using threadpool_info()
Checklist for useful output given by IPFitting that isn't yet in ACEfit
How much - or not - could ACEfit.jl leverage MLUtils.jl?
https://github.com/JuliaLinearAlgebra/MKL.jl
By default, Julia uses OpenBLAS.
Distributed has a lot of overhead. I think we should return to also providing multi-threaded assembly of linear systems.
Note I've decided in the end to start this package from scratch. IPFitting is too messy to fork from. Instead my proposal is to maintain IPFitting purely for ACE v0.8 but focus all work for the latest ACE version on ACEfit.jl.
This issue is to explain the design philosophy I propose, get feedback, and ask opinions on a few questions that this leaves open. None of the following is set in stone and all comments and criticism are welcome!
LSQR has a maxiter parameter but the Bayesian solvers do not. For some problems they just seem not to converge (cf. Slack discussion). They should all get this parameter, and they should then fail with a nice user-friendly message, something along the lines of: "even when the solver hasn't converged the quality of the solution may be good, please test this before changing solver parameters".
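A hypothetical pattern for this (the function name, `step!` contract, and message are illustrative, not the ACEfit API): cap the iteration count, report non-convergence with the suggested friendly wording, and still return the current solution so the user can test it.

```julia
# Hypothetical sketch of a maxiter-guarded iterative solver loop.
# `step!` performs one hyperparameter update and returns the change magnitude.
function solve_with_maxiter(step!, state; maxiter::Int = 1000, tol = 1e-8)
    for _ in 1:maxiter
        Δ = step!(state)
        Δ < tol && return (state, true)      # converged
    end
    @warn string("Solver reached maxiter = ", maxiter, " without converging. ",
                 "Even when the solver hasn't converged the quality of the ",
                 "solution may be good; please test this before changing ",
                 "solver parameters.")
    return (state, false)                     # return best-so-far anyway
end

s = Ref(1.0)
_, converged = solve_with_maxiter(st -> (st[] *= 0.1), s)
```

Returning the partial solution (rather than throwing) matches the spirit of the requested message: let the user judge the fit quality first.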
where will the weight hooks go in ACEfit.jl
?
https://github.com/JuliaGaussianProcesses/BayesianLinearRegressors.jl
A student of mine just found this and asked me about it. Should compare functionality and performance?
@cortner would you please tag a new version. thanks!
For parallelising loops over configurations or observations we need cost estimates. It is not clear how this should be implemented when we no longer have concrete Atoms objects and just the three basic E, F, V observations. It also needs some testing how important this really is in practice. Cf. ACEfit.jl, data.jl, function cost.
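One simple sketch of how such cost estimates could be used, assuming each observation can at least report its number of design-matrix rows (everything here, including the packet fields, is illustrative): estimate cost per packet and dispatch the most expensive packets first, which is the classic longest-processing-time heuristic for pmap-style load balancing.

```julia
# Illustrative cost heuristic for scheduling assembly work. If a packet's
# assembly cost scales roughly with (rows it contributes) x (basis size),
# dispatching expensive packets first reduces tail latency in pmap.
cost(nrows::Int, nbasis::Int) = nrows * nbasis

packets = [(id = 1, nrows = 10),      # e.g. one energy observation
           (id = 2, nrows = 400),     # a large forces block
           (id = 3, nrows = 50)]
nbasis = 1000
order = sortperm([cost(p.nrows, nbasis) for p in packets]; rev = true)
scheduled = [packets[i].id for i in order]
```

Even a crude row count would avoid the worst case where the largest structure is dispatched last and every other worker sits idle waiting for it.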
As a step towards a distributed ACEfit for linear models, we thought it would be useful---just with toy problems to start---to investigate multiple options for distributed linear algebra in Julia.
Some todos:
@cortner, would you please tag and publish 0.0.2? Thanks!
implement a toy example to test the serial and multithreaded versions, and to help develop the parallel one
I spent a week hand-tuning a LSQ fit. To do so, I manually managed what IPFitting used to provide within LsqDB. I'd like us to re-introduce such functionality, but maybe go a step further and make this a lazy datastructure that assembles the design matrix "as needed".
The "standard usage" would remain mostly unaffected by this, I think, or it could even become an option that need not be used by most users.
For now this is just a note - we can discuss it before doing anything.
@wcwitt -- I'm currently trying to implement a fitting script for a new project and noticed for the first time how much the structure of ACEfit has changed. The new AtomsData is now very restrictive and moreover seems to require far more code overhead than the old code that was inspired by IPFitting. I'm guessing there were good reasons for those changes, but I don't remember the discussion. Can you remind me please?
Depending on this, I may bring the old datastructures back. As far as I can tell they can easily live side-by-side with your new framework.
Distributed versions of:
Should give speedups and memory savings
Along with training interatomic potentials, "physics-inspired atomic descriptors can be used to rationalize the relationship between atom configurations and material properties" Physics-Inspired Structural Representations for Molecules and Materials
Ideally, like the Dscribe Python library, where given an ASE Atoms object containing positional and atom-type information, a per-atom ACE description can be generated.
The latest linear models in ACE.jl can be parameterised by different c-vectors, and I think ACEfit.jl should have a function to sample these c-vectors from the posterior after having optimised the hyperparameters.
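For the standard Bayesian ridge setting this sampling is straightforward; the sketch below uses textbook BRR (my notation and function names, not the ACEfit or ACE.jl API): with prior precision α and noise precision β, the posterior over c is Gaussian with covariance Σ = inv(αI + βAᵀA) and mean μ = βΣAᵀy, and samples are drawn via a Cholesky factor of Σ.

```julia
using LinearAlgebra, Random

# Textbook BRR posterior sampling (illustrative sketch):
#   Σ = inv(α I + β A'A),   μ = β Σ A' y,   c ~ N(μ, Σ).
function sample_posterior(A, y; α = 1.0, β = 1.0, nsamples = 5,
                          rng = Random.default_rng())
    Σ = inv(Symmetric(α * I + β * (A' * A)))
    μ = β * Σ * (A' * y)
    L = cholesky(Symmetric(Σ)).L          # Σ = L * L'
    return [μ + L * randn(rng, length(μ)) for _ in 1:nsamples]
end

A = randn(50, 4)
y = A * [1.0, -2.0, 0.5, 3.0] + 0.01 * randn(50)
cs = sample_posterior(A, y; β = 1e4)
```

Each sampled c-vector parameterises one model from the posterior ensemble, which is exactly what committee-style uncertainty estimates need.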
This is a copy of ACE1pack.jl#127.
It's not clear to me where the issue should be.
Hi @cortner, there are quite a few changes here. Would you please tag as 0.1.0. Thanks!
I think we have two usages for Iteratively reweighted least squares (IRLS) in mind, the first is to optimise any p-norm which @cortner will know a lot about. I think it'd be quite interesting to use IRLS to try and "even out" the relative error on the force components in the training database. After optimising with IRLS we'd have say 10% relative error on both large liquid and small vibrational forces hopefully resulting in both a good liquid rdf and phonon spectrum, without having to specify the weights.
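A minimal version of the first usage (p-norm optimisation) can be sketched in a few lines; this is the textbook IRLS iteration with a small damping term to avoid division by zero, not anything ACEfit-specific:

```julia
using LinearAlgebra

# Textbook IRLS sketch for min ||A c - y||_p, p in (1, 2]:
# each iteration solves a weighted least squares with w_i = |r_i|^(p-2),
# so large residuals get down-weighted for p < 2.
function irls(A, y; p = 1.5, iters = 30, damp = 1e-6)
    c = A \ y                                  # start from plain least squares
    for _ in 1:iters
        r = A * c - y
        w = (abs.(r) .+ damp) .^ (p - 2)       # damp guards |r| ≈ 0
        W = Diagonal(w)
        c = (A' * W * A) \ (A' * W * y)
    end
    return c
end
```

The second usage (evening out relative force errors) would follow the same loop but with weights derived from relative rather than absolute residuals.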
It would be useful to be able to fit energy differences, or perhaps other more complex functions of "raw" fitting targets. A simple but still useful subset of this would be combinations like E(config_1) - E(config_bulk) * N_atom_1 / N_atom_bulk, or maybe E(config_1) - E(config_bulk) * arb_factor.
The trickiest aspect is probably to come up with a syntax for this that isn't super cumbersome. That's the motivation to suggest the simpler forms above, which would help for things like defect energies. Those would still cancel out most of the bulk energy and bring the fitting target much closer to the defect energy, without necessarily having to come up with a syntax to precisely specify all the messy chemical-potential reference-structure details. The user could get away with specifying only the "bulk" config, and perhaps the arbitrary factor for cases where the N_atom_* ratio isn't appropriate.
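Because the model energy is linear in the coefficients, such combined targets need no new solver machinery; the design-matrix row of a difference observation is the same linear combination of the constituent rows. A sketch (names illustrative):

```julia
using LinearAlgebra

# For a linear model E(config) = dot(row_config, c), the target
#   E(cfg1) - factor * E(bulk)
# has design-matrix row  row1 - factor * rowbulk,  so difference
# observations are just extra rows in the same least-squares system.
function difference_row(row1::AbstractVector, rowbulk::AbstractVector,
                        n1::Int, nbulk::Int; factor = n1 / nbulk)
    return row1 .- factor .* rowbulk
end

c = [1.0, 2.0]                     # toy coefficient vector
row1 = [3.0, 1.0]
rowbulk = [2.0, 4.0]
lhs = dot(difference_row(row1, rowbulk, 4, 8), c)
```

This is why the syntax question dominates: the linear algebra side is a one-liner.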
@cortner, would you please bump ACEfit to v0.0.4? There aren't many new changes from v0.0.3, but @gelzinyte and I are trying to bring ACE1pack back to a state where the tests pass, and this is a prerequisite.
The line assemble.jl#L25
(nprocs() > 1) && sendto(workers(), basis = basis)
fails for some not-entirely-standard models.
A while back we discussed serializing models to JSON, and then transferring those to the processes.
Again we may want input from a Julia expert here on how this is best done instead of hacking something together.
CC @tjjarvinen
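One possible shape for the serialization route (a sketch of the idea, not a settled design; the real version might use JSON as discussed, whereas this uses the stdlib Serialization module): turn the basis into raw bytes once on the driver, ship the bytes, and reconstruct on each worker. Plain byte vectors sidestep the kind of closure-capture problems that can break `sendto` for non-standard models.

```julia
using Serialization

# Sketch: round-trip a model through bytes so workers can reconstruct it
# locally instead of receiving it via sendto.
function to_bytes(x)
    io = IOBuffer()
    serialize(io, x)
    return take!(io)
end

from_bytes(bytes) = deserialize(IOBuffer(bytes))

# Driver:       bytes = to_bytes(basis)
# Each worker:  basis = from_bytes(bytes)   # e.g. inside a remotecall
model = (name = "toy_basis", params = [0.1, 0.2, 0.3])
roundtrip = from_bytes(to_bytes(model))
```

Note that stdlib serialization is not stable across Julia versions, which is one argument for the JSON route when models must outlive a session.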
Python dependencies are ok but should not be required. Is this currently guaranteed?
@andresrossb I've moved and slightly edited the iteration interface from IPFitting to ACEfit. Would you be willing to put your draft for the distributed iteration in here as well? We will iterate on it a bit, so please don't push to main directly but make a PR. But you do have push access to ACEfit, so you can create a branch in this repo.
(NB -- I'm thinking we might want to design the nonlinear solvers interface first in ACEfit, since for linear problems people have IPfitting anyhow... We will want to discuss what should be here, vs in ACEflux)
I've experimented with a few ways of implementing Bayesian ridge (for example), and the proliferation of functions has gotten a bit confusing. Now that more people are starting to use them, it's crucial to reorganize and document them.
This is to split off the alternative assembly suggestion of @tjjarvinen in #55 from the sendto issue.