acesuit / ACEfit.jl
Generic Codes for Fitting ACE models
License: MIT License
So far we have been using direct methods for computing the logdet and related terms required for maximizing the Bayesian evidence. However, for very large matrices it could be advantageous to try something else that would be more compatible with SGD-type optimizers. Some useful references:
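One standard SGD-friendly ingredient for this is stochastic trace estimation: the gradient of logdet(A) with respect to a hyperparameter involves tr(A^-1 dA), and tr(A^-1) can be estimated with random probe vectors instead of a full factorization of the trace. The sketch below (names and structure are mine, not ACEfit code) shows a Hutchinson estimator with Rademacher probes; it still uses a Cholesky solve for clarity, but each probe only needs a linear solve, which could itself be replaced by an iterative method.

```julia
using LinearAlgebra, Random

# Hutchinson estimator for tr(inv(A)) with Rademacher probes (illustrative
# sketch, not the ACEfit implementation). tr(inv(A)) appears in the gradient
# of logdet(A) w.r.t. hyperparameters, so a stochastic estimate of it is one
# route to SGD-compatible evidence maximization.
function hutchinson_trinv(A::AbstractMatrix; nprobes::Int = 100,
                          rng = Random.default_rng())
    n = size(A, 1)
    F = cholesky(Symmetric(A))          # assumes A is symmetric positive definite
    est = 0.0
    for _ in 1:nprobes
        z = rand(rng, (-1.0, 1.0), n)   # Rademacher probe vector
        est += dot(z, F \ z)            # z' * inv(A) * z
    end
    return est / nprobes
end

M = randn(20, 20)
A = Symmetric(M * M' + 20I)             # a well-conditioned SPD test matrix
est = hutchinson_trinv(A; nprobes = 500)
```

With enough probes the estimate concentrates around the exact trace; in an SGD setting one would instead use a handful of probes per step and rely on averaging over steps.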
Probably a good time to fix a notation for ACEfit. I view this as not completely trivial since it'd be best for the (eventual) documentation to match the code. Some considerations: my proposals are in bold below, and I've quickly sketched out some alternatives and reasoning.

Design matrix:
- **A**: as in Ax = b. Used by ACEfit currently.
- X: simple, seems standard in some contexts.
- \Phi: another standard choice, used in the Deringer et al. Chem Review.
- \Psi: used by IPFitting currently.

Coefficient vector:
- x: as in Ax = b.
- **c**: probably the best choice?
- w: used by some GP literature ("weight-space view"). Seems best to avoid because we use "weights" for something else.
- \theta

Target vector:
- **y**: best for case to match that of c?
- Y: used currently by both IPFitting and ACEfit.
- b
- t
I wonder whether ACEfit needs the JuLIP dependency. The observation classes that need JuLIP could just be moved into ACE1.jl or ACE1pack.jl. Is there any other reason?
I would really like it if ACEfit could be entirely abstract (just as HAL should be).
Reported by @CheukHinHoJerry in ACEsuit/ACE1x.jl#7.
I got this error multiple times with the ACEfit.assemble function with multiple workers for a large lsq system, and I remember there was an issue about this, so I think it's better to post it here. It happens when I am in the middle of assembling the design matrix. This is the full error log:
Worker 18 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
[1] (::Base.var"#wait_locked#715")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
@ Base ./stream.jl:947
[2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
@ Base ./stream.jl:955
[3] unsafe_read
@ ./io.jl:761 [inlined]
[4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
@ Base ./io.jl:760
[5] read!
@ ./io.jl:762 [inlined]
[6] deserialize_hdr_raw
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/messages.jl:167 [inlined]
[7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:172
[8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:133
[9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed ./task.jl:514
Progress: 21%|████████████████████████▌ | ETA: 0:52:08ERROR: Lo18Progress: 21%|████████████████████████▌ | ETA: 0:51:57)
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:920
[2] wait()
@ Base ./task.jl:984
[3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
@ Base ./condition.jl:130
[4] wait
@ ./condition.jl:125 [inlined]
[5] take_buffered(c::Channel{Any})
@ Base ./channels.jl:456
[6] take!(c::Channel{Any})
@ Base ./channels.jl:450
[7] take!(::Distributed.RemoteValue)
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:726
[8] remotecall_fetch(f::Function, w::Distributed.Worker, args::ACEfit.DataPacket{AtomsData}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:461
[9] remotecall_fetch(f::Function, w::Distributed.Worker, args::ACEfit.DataPacket{AtomsData})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
[10] #remotecall_fetch#162
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
[11] remotecall_fetch(f::Function, id::Int64, args::ACEfit.DataPacket{AtomsData})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492
[12] remotecall_pool(rc_f::Function, f::Function, pool::WorkerPool, args::ACEfit.DataPacket{AtomsData}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:126
[13] remotecall_pool
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:123 [inlined]
[14] #remotecall_fetch#200
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:232 [inlined]
[15] remotecall_fetch
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:232 [inlined]
[16] #208#209
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:288 [inlined]
[17] #208
@ ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/workerpool.jl:288 [inlined]
[18] (::Base.var"#978#983"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, ProgressMeter.var"#56#59"{RemoteChannel{Channel{Bool}}, ACEfit.var"#3#4"{JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, SharedArrays.SharedVector{Float64}, SharedArrays.SharedVector{Float64}, SharedArrays.SharedMatrix{Float64}}}}}})(r::Base.RefValue{Any}, args::Tuple{ACEfit.DataPacket{AtomsData}})
@ Base ./asyncmap.jl:100
[19] macro expansion
@ ./asyncmap.jl:234 [inlined]
[20] (::Base.var"#994#995"{Base.var"#978#983"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, ProgressMeter.var"#56#59"{RemoteChannel{Channel{Bool}}, ACEfit.var"#3#4"{JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, SharedArrays.SharedVector{Float64}, SharedArrays.SharedVector{Float64}, SharedArrays.SharedMatrix{Float64}}}}}}, Channel{Any}, Nothing})()
@ Base ./task.jl:514
Stacktrace:
[1] (::Base.var"#988#990")(x::Task)
@ Base ./asyncmap.jl:177
[2] foreach(f::Base.var"#988#990", itr::Vector{Any})
@ Base ./abstractarray.jl:3073
[3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{ACEfit.DataPacket{AtomsData}})
@ Base ./asyncmap.jl:177
[4] wrap_n_exec_twice
@ ./asyncmap.jl:153 [inlined]
[5] #async_usemap#973
@ ./asyncmap.jl:103 [inlined]
[6] async_usemap
@ ./asyncmap.jl:84 [inlined]
[7] #asyncmap#972
@ ./asyncmap.jl:81 [inlined]
[8] asyncmap
@ ./asyncmap.jl:80 [inlined]
[9] pmap(f::Function, p::WorkerPool, c::Vector{ACEfit.DataPacket{AtomsData}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/pmap.jl:126
[10] pmap(f::Function, p::WorkerPool, c::Vector{ACEfit.DataPacket{AtomsData}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/pmap.jl:99
[11] pmap(f::Function, c::Vector{ACEfit.DataPacket{AtomsData}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/pmap.jl:156
[12] pmap(f::Function, c::Vector{ACEfit.DataPacket{AtomsData}})
@ Distributed ~/julia_ws/julia-1.9.0/share/julia/stdlib/v1.9/Distributed/src/pmap.jl:156
[13] macro expansion
@ ~/.julia/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1015 [inlined]
[14] macro expansion
@ ./task.jl:476 [inlined]
[15] macro expansion
@ ~/.julia/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1014 [inlined]
[16] macro expansion
@ ./task.jl:476 [inlined]
[17] progress_map(::Function, ::Vararg{Any}; mapfun::Function, progress::ProgressMeter.Progress, channel_bufflen::Int64, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ ProgressMeter ~/.julia/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1007
[18] assemble(data::Vector{AtomsData}, basis::JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis})
@ ACEfit ~/.julia/packages/ACEfit/ID48n/src/assemble.jl:31
[19] make_train(model::ACE1x.ACE1Model)
@ Main ~/julia_ws/ACEworkflows/Fe_pure_jerry/asm_all_lsq.jl:54
[20] top-level scope
@ ~/julia_ws/ACEworkflows/Fe_pure_jerry/asm_all_lsq.jl:91
[21] include(fname::String)
@ Base.MainInclude ./client.jl:478
[22] top-level scope
@ REPL[2]:1
in expression starting at /zfs/users/jerryho528/jerryho528/julia_ws/ACEworkflows/Fe_pure_jerry/asm_all_lsq.jl:71
[e3f9bc04] ACE1 v0.11.12
[8c4e8d19] ACE1pack v0.4.1
[5cc4c08c] ACE1x v0.1.4
[ad31a8ef] ACEfit v0.1.1
[f67ccb44] HDF5 v0.16.15
[682c06a0] JSON v0.21.4
[898213cb] LowRankApprox v0.5.3
[91a5bcdd] Plots v1.38.16
[08abe8d2] PrettyTables v2.2.5
[de0858da] Printf
Additionally:
It happens every time, so it stops me from assembling a large lsq system.
Need to bring LsqDB back.
Currently, only virials are supported.
From Slack: "Don't forget the negative sign (assuming you're using ASE's sign convention for the stress) and be careful about Voigt 6-vector vs. 3x3 matrix"
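The conversion the Slack quote warns about can be sketched as follows (an illustrative helper, not ACEfit code): ASE stores stress as a Voigt 6-vector in the order (xx, yy, zz, yz, xz, xy), and with ASE's sign convention stress = -virial / volume, so recovering the 3x3 virial needs both the symmetric reshape and the negative sign.

```julia
# Sketch (not the ACEfit implementation): convert an ASE-convention stress,
# given as a Voigt 6-vector (xx, yy, zz, yz, xz, xy), into the 3x3 virial.
# ASE convention: stress = -virial / volume, hence the negative sign below.
function voigt_to_virial(stress6::AbstractVector, volume::Real)
    σ = [stress6[1] stress6[6] stress6[5];
         stress6[6] stress6[2] stress6[4];
         stress6[5] stress6[4] stress6[3]]
    return -volume * σ
end
```

Getting either the ordering or the sign wrong produces a fit that silently trains on the wrong targets, so this is worth a unit test.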
Everyone agrees that the coefficients need to be regularised (or 'small'), and this is incorporated into the BRR/ARD prior. It's also possible to incorporate other 'expert' information into the prior. One idea floating around, which @WillBaldwin0 is looking into I think, is to fit to dimer data first, incorporate it into the prior, and then do the full solve. We'd need an interface allowing us to provide this prior c-vector before fitting. This procedure is quite simple and turns out to be really only a change of variables.
Probably worth it to wait and see what Will thinks about this idea first before implementing it properly.
... should be allowed to be scalars, vectors (diagonal?!), or matrices. Maybe enforce them to always be matrices? Scalars can just be represented as w * I.
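In Julia the "scalars can just be represented as w * I" option is essentially free, because `w * I` produces a `UniformScaling` that participates in matrix algebra without ever materializing a dense matrix:

```julia
using LinearAlgebra

# A scalar weight w acts like the matrix w * I without allocating it:
w = 3.0
W = w * I                       # UniformScaling; composes with any size
r = [1.0, 2.0, 3.0]
@assert W * r == 3.0 .* r       # same result as Diagonal(fill(w, 3)) * r
@assert W * Diagonal([1.0, 2.0, 3.0]) == Diagonal([3.0, 6.0, 9.0])
```

So an interface that "always takes a matrix" could still accept scalars and vectors cheaply by normalizing them to `UniformScaling` and `Diagonal` internally.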
@showprogress map(f, 1:length(data))
This line counts how many structures have been assembled. What it should count is the total number of atoms left. In a dataset with vastly varying structure sizes, the progress meter can be very deceiving.
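A minimal sketch of the atom-weighted alternative (plain Julia, with a hypothetical `natoms` per structure; the real fix would feed these steps into ProgressMeter rather than computing fractions by hand):

```julia
# Illustrative sketch: advance progress by atom count rather than by
# structure count, so large structures move the bar proportionally.
# `natoms` would come from the dataset, e.g. length of each configuration.
function atom_weighted_progress(natoms::Vector{Int})
    total = sum(natoms)
    done = 0
    fracs = Float64[]
    for n in natoms
        # ... assemble the design-matrix block for this structure here ...
        done += n
        push!(fracs, done / total)   # report this, not i / length(natoms)
    end
    return fracs
end

fracs = atom_weighted_progress([8, 8, 256, 8, 512])
```

With the sizes above, finishing the first two structures reports only ~2% done instead of the misleading 40% a structure count would give.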
After I added Distributed and DistributedArrays as dependencies, the docs part of the CI stopped working, and I can't figure out how to fix it. Sorry @cortner - is there something obvious I'm missing?
Hi @cortner, would you mind tagging a new version? I just fixed one of the warnings that was causing problems in ACE1pack.
The unit tests are failing on MacOS for reasons I don't understand. I have commented the relevant line from the CI for now.
There's a strong connection between the GAP sigmas and ACE weights. In principle the weights are the inverse square roots of the sigmas, but I don't think it's quite that simple. One can dial the weights up for a very simple ACE model, but this will not lead to good training errors because the ACE model is too constrained to fit the underlying data. Using ARD/BRR this would become apparent because the associated noise term would be large. I think it'd be nice to propagate the noise term through the weights matrix and display the "optimised sigmas" after an ACE fit. GAP users should relate to this quite well, I think.
TODO @cortner
Does the showprogress line
@showprogress pmap(packets) do p
correctly represent progress? I.e. does it account for the fact that some structures are much bigger than others and more expensive to assemble?
Note in #54 I do this manually.
My evidence for this is that my distributed assembly starts with > 1.5h, then drops to ca 1h for a while and then completes in around 25 minutes total.
... using threadpool_info()
Checklist for useful output given by IPFitting that isn't yet in ACEfit
How much - or not - could ACEfit.jl leverage MLUtils.jl?
https://github.com/JuliaLinearAlgebra/MKL.jl
By default, Julia uses OpenBLAS.
Distributed has a lot of overhead. I think we should return to also providing multi-threaded assembly of linear systems.
Note I've decided in the end to start this package from scratch. IPFitting is too messy to fork from. Instead my proposal is to maintain IPFitting purely for ACE v0.8 but focus all work for the latest ACE version on ACEfit.jl.
This issue is to explain the design philosophy I propose, get feedback, and ask opinions on a few questions that this leaves open. None of the following is set in stone and all comments and criticism are welcome!
LSQR has a maxiter parameter but the Bayesian solvers do not. For some problems they just seem not to converge (cf. Slack discussion). They should all get this parameter, and they should then fail with a nice user-friendly message, something along the lines of: "even when the solver hasn't converged the quality of the solution may be good, please test this before changing solver parameters".
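A hypothetical pattern for this (the function name, `step!` contract, and message are illustrative, not the ACEfit API): cap the iteration count, report non-convergence with the suggested friendly wording, and still return the current solution so the user can test it.

```julia
# Hypothetical sketch of a maxiter-guarded iterative solver loop.
# `step!` performs one hyperparameter update and returns the change magnitude.
function solve_with_maxiter(step!, state; maxiter::Int = 1000, tol = 1e-8)
    for _ in 1:maxiter
        Δ = step!(state)
        Δ < tol && return (state, true)      # converged
    end
    @warn string("Solver reached maxiter = ", maxiter, " without converging. ",
                 "Even when the solver hasn't converged the quality of the ",
                 "solution may be good; please test this before changing ",
                 "solver parameters.")
    return (state, false)                     # return best-so-far anyway
end

s = Ref(1.0)
_, converged = solve_with_maxiter(st -> (st[] *= 0.1), s)
```

Returning the partial solution (rather than throwing) matches the spirit of the requested message: let the user judge the fit quality first.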
where will the weight hooks go in ACEfit.jl
?
https://github.com/JuliaGaussianProcesses/BayesianLinearRegressors.jl
A student of mine just found this and asked me about it. Should compare functionality and performance?
@cortner would you please tag a new version. thanks!
For parallelising loops over configurations or observations we need cost estimates. It is not clear how this should be implemented when we no longer have concrete Atoms objects and just the three basic E, F, V observations. It also needs some testing how important this really is in practice. Cf. ACEfit.jl, data.jl, function cost.
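One simple sketch of how such cost estimates could be used, assuming each observation can at least report its number of design-matrix rows (everything here, including the packet fields, is illustrative): estimate cost per packet and dispatch the most expensive packets first, which is the classic longest-processing-time heuristic for pmap-style load balancing.

```julia
# Illustrative cost heuristic for scheduling assembly work. If a packet's
# assembly cost scales roughly with (rows it contributes) x (basis size),
# dispatching expensive packets first reduces tail latency in pmap.
cost(nrows::Int, nbasis::Int) = nrows * nbasis

packets = [(id = 1, nrows = 10),      # e.g. one energy observation
           (id = 2, nrows = 400),     # a large forces block
           (id = 3, nrows = 50)]
nbasis = 1000
order = sortperm([cost(p.nrows, nbasis) for p in packets]; rev = true)
scheduled = [packets[i].id for i in order]
```

Even a crude row count would avoid the worst case where the largest structure is dispatched last and every other worker sits idle waiting for it.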
As a step towards a distributed ACEfit for linear models, we thought it would be useful---just with toy problems to start---to investigate multiple options for distributed linear algebra in Julia.
Some todos:
@cortner, would you please tag and publish 0.0.2? Thanks!
implement a toy example to test the serial and multithreaded versions, and to help develop the parallel one
I spent a week hand-tuning a LSQ fit. To do so, I manually managed what IPFitting used to provide within LsqDB. I'd like us to re-introduce such functionality, but maybe go a step further and make this a lazy datastructure that assembles the design matrix "as needed".
The "standard usage" would remain mostly unaffected by this, I think, or it could even become an option that need not be used by most users.
For now this is just a note - we can discuss it before doing anything.
@wcwitt -- I'm currently trying to implement a fitting script for a new project and noticed for the first time how much the structure of ACEfit has changed. The new AtomsData is now very restrictive and moreover seems to require far more code overhead than the old code that was inspired by IPFitting. I'm guessing there were good reasons for those changes, but I don't remember the discussion. Can you remind me please?
Depending on this, I may bring the old datastructures back. As far as I can tell they can easily live side-by-side with your new framework.
Distributed versions of:
Should give speedups and memory savings
Along with training interatomic potentials, "physics-inspired atomic descriptors can be used to rationalize the relationship between atom configurations and material properties" Physics-Inspired Structural Representations for Molecules and Materials
Ideally, like the Dscribe Python library, where given an ASE Atoms object containing positional and atom-type information, a per-atom ACE description can be generated.
The latest linear models in ACE.jl can be parameterised by different c-vectors, and I think ACEfit.jl should have a function to sample these c-vectors from the posterior after having optimised the hyperparameters.
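For the standard Bayesian ridge setting this sampling is straightforward; the sketch below uses textbook BRR (my notation and function names, not the ACEfit or ACE.jl API): with prior precision α and noise precision β, the posterior over c is Gaussian with covariance Σ = inv(αI + βAᵀA) and mean μ = βΣAᵀy, and samples are drawn via a Cholesky factor of Σ.

```julia
using LinearAlgebra, Random

# Textbook BRR posterior sampling (illustrative sketch):
#   Σ = inv(α I + β A'A),   μ = β Σ A' y,   c ~ N(μ, Σ).
function sample_posterior(A, y; α = 1.0, β = 1.0, nsamples = 5,
                          rng = Random.default_rng())
    Σ = inv(Symmetric(α * I + β * (A' * A)))
    μ = β * Σ * (A' * y)
    L = cholesky(Symmetric(Σ)).L          # Σ = L * L'
    return [μ + L * randn(rng, length(μ)) for _ in 1:nsamples]
end

A = randn(50, 4)
y = A * [1.0, -2.0, 0.5, 3.0] + 0.01 * randn(50)
cs = sample_posterior(A, y; β = 1e4)
```

Each sampled c-vector parameterises one model from the posterior ensemble, which is exactly what committee-style uncertainty estimates need.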
This is a copy of ACE1pack.jl#127.
It's not clear to me where the issue should be.
Hi @cortner, there are quite a few changes here. Would you please tag as 0.1.0. Thanks!
I think we have two usages for Iteratively reweighted least squares (IRLS) in mind, the first is to optimise any p-norm which @cortner will know a lot about. I think it'd be quite interesting to use IRLS to try and "even out" the relative error on the force components in the training database. After optimising with IRLS we'd have say 10% relative error on both large liquid and small vibrational forces hopefully resulting in both a good liquid rdf and phonon spectrum, without having to specify the weights.
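A minimal version of the first usage (p-norm optimisation) can be sketched in a few lines; this is the textbook IRLS iteration with a small damping term to avoid division by zero, not anything ACEfit-specific:

```julia
using LinearAlgebra

# Textbook IRLS sketch for min ||A c - y||_p, p in (1, 2]:
# each iteration solves a weighted least squares with w_i = |r_i|^(p-2),
# so large residuals get down-weighted for p < 2.
function irls(A, y; p = 1.5, iters = 30, damp = 1e-6)
    c = A \ y                                  # start from plain least squares
    for _ in 1:iters
        r = A * c - y
        w = (abs.(r) .+ damp) .^ (p - 2)       # damp guards |r| ≈ 0
        W = Diagonal(w)
        c = (A' * W * A) \ (A' * W * y)
    end
    return c
end
```

The second usage (evening out relative force errors) would follow the same loop but with weights derived from relative rather than absolute residuals.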
It would be useful to be able to fit energy differences, or perhaps other more complex functions of "raw" fitting targets. A simple but still useful subset of this would be combinations like E(config_1) - E(config_bulk) * N_atom_1 / N_atom_bulk, or maybe E(config_1) - E(config_bulk) * arb_factor.
The trickiest aspect is probably to come up with a syntax for this that isn't super cumbersome. That's the motivation to suggest the simpler forms above, which would help for things like defect energies. Those would still cancel out most of the bulk energy and bring the fitting target much closer to the defect energy, without necessarily having to come up with a syntax to precisely specify all the messy chemical-potential reference-structure details. The user could get away with specifying only the "bulk" config, and perhaps the arbitrary factor for cases where the N_atom_* ratio isn't appropriate.
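Because the model energy is linear in the coefficients, such combined targets need no new solver machinery; the design-matrix row of a difference observation is the same linear combination of the constituent rows. A sketch (names illustrative):

```julia
using LinearAlgebra

# For a linear model E(config) = dot(row_config, c), the target
#   E(cfg1) - factor * E(bulk)
# has design-matrix row  row1 - factor * rowbulk,  so difference
# observations are just extra rows in the same least-squares system.
function difference_row(row1::AbstractVector, rowbulk::AbstractVector,
                        n1::Int, nbulk::Int; factor = n1 / nbulk)
    return row1 .- factor .* rowbulk
end

c = [1.0, 2.0]                     # toy coefficient vector
row1 = [3.0, 1.0]
rowbulk = [2.0, 4.0]
lhs = dot(difference_row(row1, rowbulk, 4, 8), c)
```

This is why the syntax question dominates: the linear algebra side is a one-liner.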
@cortner, would you please bump ACEfit to v0.0.4? There aren't many new changes from v0.0.3, but @gelzinyte and I are trying to bring ACE1pack back to a state where the tests pass, and this is a prerequisite.
The line assemble.jl#L25
(nprocs() > 1) && sendto(workers(), basis = basis)
fails for some not-entirely-standard models.
A while back we discussed serializing models to JSON, and then transferring those to the processes.
Again we may want input from a Julia expert here on how this is best done instead of hacking something together.
CC @tjjarvinen
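One possible shape for the serialization route (a sketch of the idea, not a settled design; the real version might use JSON as discussed, whereas this uses the stdlib Serialization module): turn the basis into raw bytes once on the driver, ship the bytes, and reconstruct on each worker. Plain byte vectors sidestep the kind of closure-capture problems that can break `sendto` for non-standard models.

```julia
using Serialization

# Sketch: round-trip a model through bytes so workers can reconstruct it
# locally instead of receiving it via sendto.
function to_bytes(x)
    io = IOBuffer()
    serialize(io, x)
    return take!(io)
end

from_bytes(bytes) = deserialize(IOBuffer(bytes))

# Driver:       bytes = to_bytes(basis)
# Each worker:  basis = from_bytes(bytes)   # e.g. inside a remotecall
model = (name = "toy_basis", params = [0.1, 0.2, 0.3])
roundtrip = from_bytes(to_bytes(model))
```

Note that stdlib serialization is not stable across Julia versions, which is one argument for the JSON route when models must outlive a session.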
Python dependencies are ok but should not be required. Is this currently guaranteed?
@andresrossb I've moved and slightly edited the iteration interface from IPFitting to ACEfit. Would you be willing to put your draft for the distributed iteration in here as well? We will iterate on it a bit, so please don't push to main directly but make a PR. But you do have push access to ACEfit, so you can create a branch in this repo.
(NB -- I'm thinking we might want to design the nonlinear solvers interface first in ACEfit, since for linear problems people have IPfitting anyhow... We will want to discuss what should be here, vs in ACEflux)
I've experimented with a few ways of implementing Bayesian ridge (for example), and the proliferation of functions has gotten a bit confusing. Now that more people are starting to use them, it's crucial to reorganize and document them.
This is to split off the alternative assembly suggestion of @tjjarvinen in #55 from the sendto issue.