MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

mace's Introduction

MACE

About MACE

MACE provides fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

This repository contains the MACE reference implementation developed by Ilyes Batatia, Gregor Simm, and David Kovacs.

Also available:

  • MACE in JAX, currently about 2x faster at evaluation, but training in PyTorch is recommended for optimal performance.
  • MACE layers for constructing higher order equivariant graph neural networks for arbitrary 3D point clouds.

Documentation

Partial documentation is available at: https://mace-docs.readthedocs.io

Installation

Requirements:

  • Python >= 3.7
  • PyTorch >= 1.12 (training with float64 is not supported with PyTorch 2.1).

(for OpenMM, use Python 3.9)

pip installation

To install via pip, follow the steps below:

pip install --upgrade pip
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install mace-torch

For CPU or MPS (Apple Silicon) installation, use pip install torch torchvision torchaudio instead.
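
To verify the installation, a minimal sketch (assumes the mace-torch package imports as mace and that your accelerator is visible to PyTorch):

import torch
import mace  # provided by the mace-torch package

print("CUDA available:", torch.cuda.is_available())
print("MACE version:", getattr(mace, "__version__", "unknown"))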

conda installation

If you do not have CUDA pre-installed, it is recommended to follow the conda installation process:

# Create a virtual environment and activate it
conda create --name mace_env
conda activate mace_env

# Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia

# (optional) Install MACE's dependencies from Conda as well
conda install numpy scipy matplotlib ase opt_einsum prettytable pandas e3nn

# Clone and install MACE (and all required packages)
git clone https://github.com/ACEsuit/mace.git
pip install ./mace

pip installation from source

To install via pip, follow the steps below:

# Create a virtual environment and activate it
python -m venv mace-venv
source mace-venv/bin/activate

# Install PyTorch (for example, for CUDA 11.8 [cu118])
pip3 install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

# Clone and install MACE (and all required packages)
git clone https://github.com/ACEsuit/mace.git
pip install ./mace

Note: The homonymous package on PyPI has nothing to do with this one.

Usage

Training

To train a MACE model, you can use the mace_run_train script, which should be in the usual place that pip places binaries (or you can explicitly run python3 <path_to_cloned_dir>/mace/cli/run_train.py).

mace_run_train \
    --name="MACE_model" \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="test.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --E0s='{1:-13.663181292231226, 6:-1029.2809654211628, 7:-1484.1187695035828, 8:-2042.0330099956639}' \
    --model="MACE" \
    --hidden_irreps='128x0e + 128x1o' \
    --r_max=5.0 \
    --batch_size=10 \
    --max_num_epochs=1500 \
    --swa \
    --start_swa=1200 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --restart_latest \
    --device=cuda

To give a specific validation set, use the argument --valid_file. To set a larger batch size for evaluating the validation set, specify --valid_batch_size.

To control the model's size, you need to change --hidden_irreps. For most applications, the recommended default model size is --hidden_irreps='256x0e' (meaning 256 invariant messages) or --hidden_irreps='128x0e + 128x1o'. If the model is not accurate enough, you can include higher order features, e.g., 128x0e + 128x1o + 128x2e, or increase the number of channels to 256.
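
For orientation, --hidden_irreps is an e3nn irreps string; a minimal sketch (using the e3nn library that MACE is built on) of how such a string decomposes into channels and angular orders:

from e3nn import o3

irreps = o3.Irreps("128x0e + 128x1o")
print(irreps.dim)         # total feature dimension: 128*1 + 128*3 = 512
print(irreps.num_irreps)  # total number of channels summed over l: 256
for mul, ir in irreps:
    print(mul, ir.l, ir.p)  # multiplicity, angular order l, parity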

It is usually preferred to add the isolated atoms to the training set, rather than reading in their energies through the command line like in the example above. To label them in the training set, set config_type=IsolatedAtom in their info fields. If you prefer not to use or do not know the energies of the isolated atoms, you can use the option --E0s="average" which estimates the atomic energies using least squares regression.
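
As an illustration, a hedged sketch of labelling an isolated atom with ASE so that its energy is read from the training file (the E0 value and file name are placeholders; key names follow the defaults described above):

import ase
from ase.io import write

# A single isolated hydrogen atom; the energy value below is a placeholder E0.
atom = ase.Atoms("H", positions=[[0.0, 0.0, 0.0]])
atom.info["config_type"] = "IsolatedAtom"
atom.info["energy"] = -13.663181292231226  # eV
write("train.xyz", atom, append=True)  # append to an existing training file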

If the keyword --swa is enabled, the energy weight of the loss is increased for the last ~20% of the training epochs (from --start_swa epochs). This setting usually helps lower the energy errors.

The precision can be changed using the keyword --default_dtype; the default is float64, but float32 gives a significant speed-up (usually a factor of ~2 in training).

The keywords --batch_size and --max_num_epochs should be adapted to the size of the training set: as the number of training configurations grows, the batch size should be increased and the number of epochs decreased. A heuristic for initial settings is to keep the total number of gradient updates roughly constant at 200,000, which can be computed as $\text{max-num-epochs}\times\frac{\text{num-configs-training}}{\text{batch-size}}$.
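
A minimal sketch of this heuristic (the target of 200,000 gradient updates comes from the paragraph above; the training-set size is a placeholder):

# Choose max_num_epochs so that the total number of gradient updates
# stays roughly constant at ~200,000.
target_updates = 200_000
num_configs_training = 5_000  # placeholder: size of your training set
batch_size = 10

updates_per_epoch = num_configs_training / batch_size
max_num_epochs = round(target_updates / updates_per_epoch)
print(max_num_epochs)  # 400 for this example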

The code can handle training sets with heterogeneous labels, for example containing both bulk structures with stress and isolated molecules. In this example, to make the code ignore stress on the molecules, set config_stress_weight = 0.0 in the info fields of the molecular configurations.
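
A hedged sketch of attaching such a per-configuration weight with ASE before writing the training file (file names are placeholders):

from ase.io import read, write

# Switch off the stress contribution of molecular configurations by setting
# their per-configuration stress weight to zero.
molecules = read("molecules.xyz", index=":")
for atoms in molecules:
    atoms.info["config_stress_weight"] = 0.0
write("molecules_weighted.xyz", molecules)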

To use Apple Silicon GPU acceleration make sure to install the latest PyTorch version and specify --device=mps.

Evaluation

To evaluate your MACE model on an XYZ file, run the mace_eval_configs script:

mace_eval_configs \
    --configs="your_configs.xyz" \
    --model="your_model.model" \
    --output="./your_output.xyz"

Tutorial

You can run our Colab tutorial to quickly get started with MACE.

We also have a more detailed user and developer tutorial at https://github.com/ilyes319/mace-tutorials

Weights and Biases for experiment tracking

If you would like to use MACE with Weights and Biases to log your experiments, simply install with

pip install ./mace[wandb]

and specify the necessary keyword arguments (--wandb, --wandb_project, --wandb_entity, --wandb_name, --wandb_log_hypers).

Pretrained Foundation Models

MACE-MP: Materials Project Force Fields

We have collaborated with the Materials Project (MP) to train a universal MACE potential covering 89 elements on 1.6 M bulk crystals in the MPTrj dataset, selected from MP relaxation trajectories. The models are released on GitHub at https://github.com/ACEsuit/mace-mp. If you use them, please cite our paper, which also contains a large range of example applications and benchmarks.

Example usage in ASE

from mace.calculators import mace_mp
from ase import build

atoms = build.molecule('H2O')
calc = mace_mp(model="medium", dispersion=False, default_dtype="float32", device='cuda')
atoms.calc = calc
print(atoms.get_potential_energy())
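
The returned object is a standard ASE calculator, so it can also drive molecular dynamics; a short sketch (the structure, temperature, timestep and step count are illustrative choices, not recommendations):

from ase import build, units
from ase.md.langevin import Langevin
from mace.calculators import mace_mp

atoms = build.bulk("Cu", "fcc", a=3.6, cubic=True)
atoms.calc = mace_mp(model="medium", default_dtype="float32", device="cuda")

# Short NVT run at 300 K with a 1 fs timestep.
dyn = Langevin(atoms, timestep=1.0 * units.fs, temperature_K=300, friction=0.01 / units.fs)
dyn.run(100)
print(atoms.get_potential_energy())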

MACE-OFF: Transferable Organic Force Fields

There is a series (small, medium, large) of transferable organic force fields. These can be used for the simulation of organic molecules, crystals and molecular liquids, or as a starting point for fine-tuning on a new dataset. The models are released under the ASL license and are available on GitHub at https://github.com/ACEsuit/mace-off. If you use them, please cite our paper, which also contains detailed benchmarks and example applications.

Example usage in ASE

from mace.calculators import mace_off
from ase import build

atoms = build.molecule('H2O')
calc = mace_off(model="medium", device='cuda')
atoms.calc = calc
print(atoms.get_potential_energy())
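
As with MACE-MP, the calculator plugs into the rest of ASE; a minimal sketch of a geometry optimisation (molecule and convergence threshold are illustrative):

from ase import build
from ase.optimize import BFGS
from mace.calculators import mace_off

atoms = build.molecule("CH3CH2OH")  # ethanol from ASE's built-in G2 set
atoms.calc = mace_off(model="medium", device="cuda")

opt = BFGS(atoms)
opt.run(fmax=0.01)  # relax until the maximum force falls below 0.01 eV/A
print(atoms.get_potential_energy())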

Development

We use black, isort, pylint, and mypy. Run the following to format and check your code:

bash ./scripts/run_checks.sh

We have CI set up to check this, but we highly recommend that you run those commands before you commit (and push) to avoid accidentally committing bad code.

We are happy to accept pull requests under an MIT license. Please copy/paste the license text as a comment into your pull request.

References

If you use this code, please cite our papers:

@inproceedings{Batatia2022mace,
  title={{MACE}: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields},
  author={Ilyes Batatia and David Peter Kovacs and Gregor N. C. Simm and Christoph Ortner and Gabor Csanyi},
  booktitle={Advances in Neural Information Processing Systems},
  editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
  year={2022},
  url={https://openreview.net/forum?id=YPpSngE-ZU}
}

@misc{Batatia2022Design,
  title = {The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials},
  author = {Batatia, Ilyes and Batzner, Simon and Kov{\'a}cs, D{\'a}vid P{\'e}ter and Musaelian, Albert and Simm, Gregor N. C. and Drautz, Ralf and Ortner, Christoph and Kozinsky, Boris and Cs{\'a}nyi, G{\'a}bor},
  year = {2022},
  number = {arXiv:2205.06643},
  eprint = {2205.06643},
  eprinttype = {arxiv},
  doi = {10.48550/arXiv.2205.06643},
  archiveprefix = {arXiv}
 }

Contact

If you have any questions, please contact us at [email protected].

For bugs or feature requests, please use GitHub Issues.

License

MACE is published and distributed under the MIT License.

mace's People

Contributors

bernstei, bigd4, chaitjo, davkovacs, felixmusil, ilyes319, janosh, jharrymoore, larsschaaf, mariogeiger, sandipde, sivonxay, stenczelt, wcwitt

mace's Issues

Out of memory halfway through

I am using an A100 (40 GB) to fit MACE. The program didn't report any errors at the beginning, but ran out of memory at the 170th epoch. The error message is as follows:
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.75 GiB (GPU 0; 39.41 GiB total capacity; 29.35 GiB already allocated; 2.65 GiB free; 35.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

Why would it run out of memory halfway through? Is there any solution to this problem?

Change name of swa to loss schedule

Is your feature request related to a problem? Please describe.
The current name of the loss scheduler (swa) is confusing.

Describe the solution you'd like
Change all occurrences of it to a proper name. We first need to invent an alias that is both non-confusing and usable. Ideas @gabor1 @davkovacs ?

Describe alternatives you've considered
None

Additional context
@gabor1 request

LAMMPS implementation terminates unexpectedly

Describe the bug
I have installed the LAMMPS implementation of MACE on Archer2 and compiled the potential as per the instructions and I receive the following error when I run a small LAMMPS script:

Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.4


The following have been reloaded with a version change:
  1) cray-dsmml/0.1.4 => cray-dsmml/0.2.2
  2) cray-libsci/21.04.1.1 => cray-libsci/21.08.1.2
  3) cray-mpich/8.1.4 => cray-mpich/8.1.15
  4) craype/2.7.6 => craype/2.7.15
  5) gcc/10.2.0 => gcc/11.2.0


The following have been reloaded with a version change:
  1) cray-dsmml/0.2.2 => cray-dsmml/0.2.1        4) craype/2.7.15 => craype/2.7.10
  2) cray-fftw/3.3.8.9 => cray-fftw/3.3.8.11     5) gcc/11.2.0 => gcc/9.3.0
  3) cray-mpich/8.1.15 => cray-mpich/8.1.9
  
Currently Loaded Modules:
  1) craype-x86-rome         3) craype-network-ofi   5) epcc-setup-env     7) PrgEnv-cray/8.1.0   9) cray-dsmml/0.2.1       11) cray-mpich/8.1.9  13) cpe-cuda/21.09
  2) libfabric/1.11.0.4.71   4) bolt/0.8             6) load-epcc-module   8) cce/12.0.3         10) cray-libsci/21.08.1.2  12) craype/2.7.10
 
LAMMPS (22 Dec 2022)
terminate called after throwing an instance of 'c10::NotImplementedError'
  what():  Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, Meta, QuantizedCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at aten/src/ATen/RegisterCPU.cpp:30798 [kernel]
Meta: registered at aten/src/ATen/RegisterMeta.cpp:26815 [kernel]
QuantizedCPU: registered at aten/src/ATen/RegisterQuantizedCPU.cpp:929 [kernel]
BackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:726 [kernel]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:140 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:488 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:291 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: fallthrough registered at ../aten/src/ATen/ConjugateFallback.cpp:22 [kernel]
Negative: fallthrough registered at ../aten/src/ATen/native/NegateFallback.cpp:22 [kernel]
ZeroTensor: fallthrough registered at ../aten/src/ATen/ZeroTensorFallback.cpp:90 [kernel]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradHIP: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradMPS: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradIPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradVE: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradMeta: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:16890 [kernel]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:482 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:324 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:743 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1064 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:189 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:148 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:484 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]

Exception raised from reportError at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:511 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x2b7cd821156e in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x11bc130 (0x2b7cde4de130 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x2476823 (0x2b7cdf798823 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #3: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0x90 (0x2b7cdf9c2510 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x29b2dbe (0x2b7cdfcd4dbe in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #5: at::_ops::empty_strided::call(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0x1bb (0x2b7cdfa0582b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1c19cd7 (0x2b7cdef3bcd7 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x177b (0x2b7cdf2a32cb in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2b6d60b (0x2b7cdfe8f60b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x2b7cdf6dce35 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x29b2b33 (0x2b7cdfcd4b33 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x2b7cdf6dce35 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x3d73aab (0x2b7ce1095aab in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x3d73f1e (0x2b7ce1095f1e in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x2b7cdf75d399 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0xc7 (0x2b7cdf29b6a7 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2d2ba69 (0x2b7ce004da69 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_device::call(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0x1ba (0x2b7cdf8c48aa in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #18: torch::jit::Unpickler::readInstruction() + 0x1af0 (0x2b7ce20995d0 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #19: torch::jit::Unpickler::run() + 0x90 (0x2b7ce209a3b0 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #20: torch::jit::Unpickler::parse_ivalue() + 0x18 (0x2b7ce209a508 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #21: torch::jit::readArchiveAndTensors(std::string const&, std::string const&, std::string const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::string const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) + 0x45a (0x2b7ce205799a in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x4d20507 (0x2b7ce2042507 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0x4d23362 (0x2b7ce2045362 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #24: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x3a2 (0x2b7ce2046c82 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #25: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>) + 0x7b (0x2b7ce204739b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #26: torch::jit::load(std::string const&, c10::optional<c10::Device>) + 0xa5 (0x2b7ce2047475 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #27: /work/e89/e89/zem/lammps/build/lmp() [0x7bc928]
frame #28: /work/e89/e89/zem/lammps/build/lmp() [0x31a2dd]
frame #29: /work/e89/e89/zem/lammps/build/lmp() [0x3117c4]
frame #30: /work/e89/e89/zem/lammps/build/lmp() [0x31ceaf]
frame #31: /work/e89/e89/zem/lammps/build/lmp() [0x2fe6ad]
frame #32: __libc_start_main + 0xea (0x2b7cfc4e534a in /lib64/libc.so.6)
frame #33: /work/e89/e89/zem/lammps/build/lmp() [0x2fe5ca]

srun: error: nid002072: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=3216084.0

To Reproduce
Steps to reproduce the behavior:

  1. I followed the install instructions found here
  2. My CMake settings are as follows:
(mace-venv) zem@ln03:/work/e89/e89/zem/lammps/build> cmake -DCMAKE_INSTALL_PREFIX=$(pwd) \
>       -DBUILD_MPI=ON \
>       -DBUILD_OMP=ON \
>       -DPKG_OPENMP=ON \
>       -DPKG_ML-MACE=ON \
>       -DCMAKE_PREFIX_PATH=$(pwd)/../../libtorch \
>       ../cmake
-- The CXX compiler identification is Clang 11.0.0
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.6/bin/CC
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.6/bin/CC -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.26.2") 
-- Running check for auto-generated files from make-based build system
-- Running in virtual environment: /mnt/lustre/a2fs-work3/work/e89/e89/zem/mace-venv
   Setting Python interpreter to: /mnt/lustre/a2fs-work3/work/e89/e89/zem/mace-venv/bin/python
-- Found MPI_CXX: /opt/cray/pe/craype/2.7.6/bin/CC (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Looking for C++ include omp.h
-- Looking for C++ include omp.h - found
-- Found OpenMP_CXX: -fopenmp  
-- Found OpenMP: TRUE  found components:  CXX 
-- Found JPEG: /usr/lib64/libjpeg.so  
-- Found GZIP: /usr/bin/gzip  
-- Could NOT find FFMPEG (missing: FFMPEG_EXECUTABLE) 
Hello from ML-MACE.cmake.
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Found Torch: /work/e89/e89/zem/libtorch/lib/libtorch.so  
-- Looking for C++ include cmath
-- Looking for C++ include cmath - found
-- Generating style headers...
-- Generating package headers...
-- Generating lmpinstalledpkgs.h...
-- Could NOT find ClangFormat: Found unsuitable version "0.0", but required is at least "8.0" (found /opt/cray/pe/cce/11.0.4/cce-clang/x86_64/bin/clang-format)
-- The following tools and libraries have been found and configured:
 * Git
 * MPI
 * OpenMP
 * JPEG
 * Threads
 * Caffe2
 * Torch

-- <<< Build configuration >>>
   LAMMPS Version:   20221222
   Operating System: Linux SLES 15.1
   CMake Version:    3.10.2
   Build type:       RelWithDebInfo
   Install path:     /work/e89/e89/zem/lammps/build
   Generator:        Unix Makefiles using /usr/bin/gmake
-- Enabled packages: ML-MACE;OPENMP
-- <<< Compilers and Flags: >>>
-- C++ Compiler:     /opt/cray/pe/craype/2.7.6/bin/CC
      Type:          Clang
      Version:       11.0.0
      C++ Flags:     -O2 -g -DNDEBUG
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_JPEG;LAMMPS_GZIP;LMP_OPENMP
-- <<< Linker flags: >>>
-- Executable name:  lmp
-- Static library flags:    
-- <<< MPI flags >>>
-- MPI_defines:      MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes:     
-- MPI libraries:    ;
-- Configuring done
-- Generating done
-- Build files have been written to: /work/e89/e89/zem/lammps/build

Cheers,
Zak

Multi gpu support for training

Hi,
You mentioned at Psi-K that a PL version with multi-gpu support already existed. It would be great if you could merge it or make the branch public.
best and thank you very much,
Jonathan

hidden_irreps with different number of channels for each degree

Hi @ilyes319 and co., thank you for the amazing work and clean codebase!

I have been playing around with MACE and seem to encounter an issue when setting the hidden_irreps to not have the same number of channels for each degree (l). E.g. I cannot set hidden_irreps to o3.Irreps("128x0e + 64x1o + 32x2e"). On the other hand, setting the same number of channels for each degree works well. E.g. o3.Irreps("128x0e + 128x1o + 128x2e") is okay, as the README recommends.

Is this expected behaviour?

Unable to access colab notebook

I was unable to access the colab tutorial notebook.

Could you put the ipynb file on github so that I can view it or download it?

overly restrictive data file suffix requirements

Is there a particular reason why load_from_xyz() restricts the file suffix to ".xyz" and passes a format explicitly, rather than just letting ASE do what it can when reading? This breaks files that are named ".extxyz", which ase.io.read() handles without a problem.
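
For reference, a minimal sketch of the more permissive behaviour being asked for, letting ASE infer the format itself (this is not the current load_from_xyz implementation):

from ase.io import read

# ASE picks the reader from the file contents/extension, so .xyz and .extxyz
# (and other ASE-supported formats) are all handled.
frames = read("train.extxyz", index=":")
print(len(frames))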

crash predicting for a configuration with all atoms outside of cutoff

Trying to predict energy/forces/virial on a configuration with two atoms that are farther than the cutoff crashes (see below). I'm guessing that the code that fills zeros for the gradient needs to take the true shape of the positions array into account?

Traceback (most recent call last):
  File "/home/cluster2/bernstei/src/work/GeSbTe/GAP_ACE_MACE/./evaluate_potentials.py", line 86, in <module>
    plot_dimers(calc_use, dimer_configs, pot_file_name)
  File "/home/cluster2/bernstei/src/work/GeSbTe/GAP_ACE_MACE/./evaluate_potentials.py", line 52, in plot_dimers
    Es.append(at.get_potential_energy())
  File "/home/Software/python/system/extra/lib/python3.9/site-packages/ase/atoms.py", line 730, in get_potential_energy
    energy = self._calc.get_potential_energy(self)
  File "/home/Software/python/system/extra/lib/python3.9/site-packages/ase/calculators/abc.py", line 24, in get_potential_energy
    return self.get_property(name, atoms)
  File "/home/Software/python/system/extra/lib/python3.9/site-packages/ase/calculators/calculator.py", line 501, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/home/cluster2/bernstei/.local/lib/python3.9/site-packages/mace/calculators/mace.py", line 70, in calculate
    out = self.model(batch, compute_stress=True)
  File "/home/Software/python/system/extra/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cluster2/bernstei/.local/lib/python3.9/site-packages/mace/modules/models.py", line 304, in forward
    forces, virials, stress = get_outputs(
  File "/home/cluster2/bernstei/.local/lib/python3.9/site-packages/mace/modules/utils.py", line 135, in get_outputs
    forces, virials, stress = compute_forces_virials(
  File "/home/cluster2/bernstei/.local/lib/python3.9/site-packages/mace/modules/utils.py", line 72, in compute_forces_virials
    torch.zeros_like(positions).expand(1, 1, 3),
RuntimeError: The expanded size of the tensor (1) must match the existing size (2) at non-singleton dimension 1.  Target sizes: [1, 1, 3].  Tensor sizes: [2, 3]

Change the config_from_atoms function such that you can also run .traj files and other trajectory file types

Hi
I would like to use MACE with .traj files instead of xyz files. I think this can be made possible if the forces, stress, energies and charges are extracted from the Atoms objects using the getter methods (get_forces(), get_potential_energy(), etc.) instead of the arrays and info dictionaries.
I suggest changing the function to this:

def config_from_atoms(
    atoms: ase.Atoms,
    energy_key="energy",
    forces_key="forces",
    stress_key="stress",
    virials_key="virials",
    dipole_key="dipole",
    charges_key="charges",
    config_type_weights: Dict[str, float] = None,
) -> Configuration:
    """Convert ase.Atoms to Configuration"""
    if config_type_weights is None:
        config_type_weights = DEFAULT_CONFIG_TYPE_WEIGHTS
    energy = atoms.info.get(energy_key, atoms.get_potential_energy())  # eV
    forces = atoms.arrays.get(forces_key, atoms.get_forces())  # eV / Ang
    stress = atoms.info.get(stress_key, atoms.get_stress())  # eV / Ang
    virials = atoms.info.get(virials_key, None)
    dipole = atoms.info.get(dipole_key, atoms.get_dipole_moment())  # Debye
    # Charges default to 0 instead of None if not found
    try:
        charges = atoms.arrays.get(charges_key, atoms.get_charges())  # atomic unit
    except:
        charges = np.zeros(len(atoms))
    atomic_numbers = np.array(
        [ase.data.atomic_numbers[symbol] for symbol in atoms.symbols]
    )
    pbc = tuple(atoms.get_pbc())
    cell = np.array(atoms.get_cell())
    config_type = atoms.info.get("config_type", "Default")
    weight = atoms.info.get("config_weight", 1.0) * config_type_weights.get(
        config_type, 1.0
    )
    energy_weight = atoms.info.get("config_energy_weight", 1.0)
    forces_weight = atoms.info.get("config_forces_weight", 1.0)
    stress_weight = atoms.info.get("config_stress_weight", 1.0)
    virials_weight = atoms.info.get("config_virials_weight", 1.0)
    # fill in missing quantities but set their weight to 0.0
    if energy is None:
        energy = 0.0
        energy_weight = 0.0
    if forces is None:
        forces = np.zeros(np.shape(atoms.positions))
        forces_weight = 0.0
    if stress is None:
        stress = np.zeros(6)
        stress_weight = 0.0
    if virials is None:
        virials = np.zeros((3, 3))
        virials_weight = 0.0
    return Configuration(
        atomic_numbers=atomic_numbers,
        positions=atoms.get_positions(),
        energy=energy,
        forces=forces,
        stress=stress,
        virials=virials,
        dipole=dipole,
        charges=charges,
        weight=weight,
        energy_weight=energy_weight,
        forces_weight=forces_weight,
        stress_weight=stress_weight,
        virials_weight=virials_weight,
        config_type=config_type,
        pbc=pbc,
        cell=cell,
    )

Support for dipole moments? (and transition dipoles)

Hi,

Inspired by excellent talks on MACE at PsiK I gave it a go on the train home - it's a testament to how easy it is to use that I had a model trained on existing data for chromophores in solvent by the time I got back - excellent job!

To make it useful for the applications I have in mind in theoretical spectroscopy, it would need to be able to predict dipole moments, and ideally transition dipoles associated with excited states. My understanding from the theory behind MACE is that this should not be particularly hard to do, but I don't see anything to suggest it exists as a capability right now - is it in your plans?

Best wishes

Nick Hine (University of Warwick)

mismatch of 0e & 1o feature length breaks the model

Describe the bug
Training of a model with different length of 0e and 1o features breaks the network after construction. Happens on the current development branch.

Should this be prevented by validating the inputs to the training script (i.e. if there is a conceptual limitation against it), or should it just work?

I have tested with the following options:

  • 17x0e + 13x1o - two primes
  • 17x0e + 51x1o - x + 5*x
  • 17x0e + 34x1o - x + 2*x
  • 17x0e + 169x1o - x + x * x
  • 16x0e + 32x1o - 16 + 16 * 2

To Reproduce
Steps to reproduce the behavior:

  1. try training a model with --hidden_irreps="Ax0e + Bx1o" having A != B

Expected behavior
One of the following

  • exception raised upon validation of training script input / model construction
  • the model just working, as it does with matching channel counts

Try by running the following:

#!/bin/bash

python ../scripts/run_train.py --name="MACE-prime-sizes" \
  --train_file="structures/train.xyz" \
  --valid_fraction=0.05 \
  --test_file="structures/test.xyz" \
  --E0s="average" \
  --model="ScaleShiftMACE" \
  --hidden_irreps="17x0e + 13x1o" \
  --r_max=4.0 \
  --batch_size=5 \
  --max_num_epochs=10 \
  --ema \
  --ema_decay=0.99 \
  --amsgrad \
  --default_dtype="float64" \
  --seed=0

Exception:

2023-01-02 18:44:27.343 INFO: Started training
Traceback (most recent call last):
  File "/Users/tks32/work/mace/scripts/run_train.py", line 488, in <module>
    main()
  File "/Users/tks32/work/mace/scripts/run_train.py", line 427, in main
    tools.train(
  File "/Users/tks32/work/mace/mace/tools/train.py", line 69, in train
    _, opt_metrics = take_step(
  File "/Users/tks32/work/mace/mace/tools/train.py", line 211, in take_step
    output = model(
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/tks32/work/mace/mace/modules/models.py", line 281, in forward
    node_feats = product(
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/tks32/work/mace/mace/modules/blocks.py", line 181, in forward
    return self.linear(node_feats) + sc
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/e3nn/o3/_linear.py", line 276, in forward
    return self._compiled_main(features, weight, bias)
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<eval_with_key>.45", line 8, in forward
    getitem = getattr_1[slice(None, -1, None)];  getattr_1 = None
    add = getitem + (56,);  getitem = None
    reshape = x.reshape(-1, 56);  x = None
              ~~~~~~~~~ <--- HERE
    getattr_2 = reshape.shape
    getitem_1 = getattr_2[0];  getattr_2 = None
RuntimeError: shape '[-1, 56]' is invalid for input of size 21080

Full log

/Users/tks32/opt/miniconda3/envs/mace/bin/python /Users/tks32/work/mace/scripts/run_train.py --name=MACE-prime-sizes --train_file=structures/train.xyz --valid_fraction=0.05 --test_file=structures/test.xyz --E0s=average --model=ScaleShiftMACE --hidden_irreps=17x0e + 13x1o --r_max=4.0 --batch_size=5 --max_num_epochs=10 --ema --ema_decay=0.99 --amsgrad --default_dtype=float64 --seed=3 
2023-01-02 18:44:25.298 INFO: MACE version: 0.1.0
2023-01-02 18:44:25.298 INFO: Configuration: Namespace(name='MACE-prime-sizes', seed=3, log_dir='logs', model_dir='.', checkpoints_dir='checkpoints', results_dir='results', downloads_dir='downloads', device='cpu', default_dtype='float64', log_level='INFO', error_table='PerAtomRMSE', model='ScaleShiftMACE', r_max=4.0, num_radial_basis=8, num_cutoff_basis=5, interaction='RealAgnosticResidualInteractionBlock', interaction_first='RealAgnosticResidualInteractionBlock', max_ell=3, correlation=3, num_interactions=2, MLP_irreps='16x0e', hidden_irreps='17x0e + 13x1o', gate='silu', scaling='rms_forces_scaling', avg_num_neighbors=1, compute_avg_num_neighbors=True, compute_stress=False, compute_forces=True, train_file='structures/train.xyz', valid_file=None, valid_fraction=0.05, test_file='structures/test.xyz', E0s='average', energy_key='energy', forces_key='forces', virials_key='virials', stress_key='stress', dipole_key='dipole', charges_key='charges', loss='weighted', forces_weight=10.0, swa_forces_weight=1.0, energy_weight=1.0, swa_energy_weight=1000.0, virials_weight=1.0, stress_weight=1.0, dipole_weight=1.0, swa_dipole_weight=1.0, config_type_weights='{"Default":1.0}', optimizer='adam', batch_size=5, valid_batch_size=10, lr=0.01, swa_lr=0.001, weight_decay=5e-07, amsgrad=True, scheduler='ReduceLROnPlateau', lr_factor=0.8, scheduler_patience=50, lr_scheduler_gamma=0.9993, swa=False, start_swa=None, ema=True, ema_decay=0.99, max_num_epochs=10, patience=2048, eval_interval=2, keep_checkpoints=False, restart_latest=False, save_cpu=False, clip_grad=10.0)
2023-01-02 18:44:25.299 INFO: Using CPU
2023-01-02 18:44:25.344 INFO: Loaded 20 training configurations from 'structures/train.xyz'
2023-01-02 18:44:25.344 INFO: Using random 5.0% of training set for validation
2023-01-02 18:44:25.440 INFO: Loaded 137 test configurations from 'structures/test.xyz'
2023-01-02 18:44:25.440 INFO: Total number of configurations: train=19, valid=1, tests=[Default: 137]
2023-01-02 18:44:25.441 INFO: AtomicNumberTable: (1, 26)
2023-01-02 18:44:25.441 INFO: Atomic Energies not in training file, using command line argument E0s
2023-01-02 18:44:25.441 INFO: Computing average Atomic Energies using least squares regression
2023-01-02 18:44:25.442 INFO: Atomic energies: [-233.11162700313903, -799.2398640107622]
2023-01-02 18:44:25.561 INFO: WeightedEnergyForcesLoss(energy_weight=1.000, forces_weight=10.000)
2023-01-02 18:44:25.570 INFO: Average number of neighbors: 18.110
2023-01-02 18:44:25.570 INFO: Selected the following outputs: {'energy': True, 'forces': True, 'virials': False, 'stress': False, 'dipoles': False}
2023-01-02 18:44:25.570 INFO: Building model
2023-01-02 18:44:27.340 INFO: ScaleShiftMACE(
  (node_embedding): LinearNodeEmbeddingBlock(
    (linear): Linear(2x0e -> 17x0e | 34 weights)
  )
  (radial_embedding): RadialEmbeddingBlock(
    (bessel_fn): BesselBasis(r_max=4.0, num_basis=8, trainable=False)
    (cutoff_fn): PolynomialCutoff(p=5.0, r_max=4.0)
  )
  (spherical_harmonics): SphericalHarmonics()
  (atomic_energies_fn): AtomicEnergiesBlock(energies=[-233.1116, -799.2399])
  (interactions): ModuleList(
    (0): RealAgnosticResidualInteractionBlock(
      (linear_up): Linear(17x0e -> 17x0e | 289 weights)
      (conv_tp): TensorProduct(17x0e x 1x0e+1x1o+1x2e+1x3o -> 17x0e+17x1o+17x2e+17x3o | 68 paths | 68 weights)
      (conv_tp_weights): FullyConnectedNet[8, 64, 64, 64, 68]
      (linear): Linear(17x0e+17x1o+17x2e+17x3o -> 17x0e+17x1o+17x2e+17x3o | 1156 weights)
      (skip_tp): FullyConnectedTensorProduct(17x0e x 2x0e -> 17x0e+13x1o | 578 paths | 578 weights)
      (reshape): reshape_irreps()
    )
    (1): RealAgnosticResidualInteractionBlock(
      (linear_up): Linear(17x0e+13x1o -> 17x0e+13x1o | 458 weights)
      (conv_tp): TensorProduct(17x0e+13x1o x 1x0e+1x1o+1x2e+1x3o -> 30x0e+43x1o+43x2e+30x3o | 146 paths | 146 weights)
      (conv_tp_weights): FullyConnectedNet[8, 64, 64, 64, 146]
      (linear): Linear(30x0e+43x1o+43x2e+30x3o -> 17x0e+17x1o+17x2e+17x3o | 2482 weights)
      (skip_tp): FullyConnectedTensorProduct(17x0e+13x1o x 2x0e -> 17x0e | 578 paths | 578 weights)
      (reshape): reshape_irreps()
    )
  )
  (products): ModuleList(
    (0): EquivariantProductBasisBlock(
      (symmetric_contractions): SymmetricContraction(
        (contractions): ModuleDict(
          (17x0e): Contraction(
            (weights): ParameterDict(
                (1): Parameter containing: [torch.DoubleTensor of size 2x1x17]
                (2): Parameter containing: [torch.DoubleTensor of size 2x4x17]
                (3): Parameter containing: [torch.DoubleTensor of size 2x23x17]
            )
          )
          (13x1o): Contraction(
            (weights): ParameterDict(
                (1): Parameter containing: [torch.DoubleTensor of size 2x1x17]
                (2): Parameter containing: [torch.DoubleTensor of size 2x6x17]
                (3): Parameter containing: [torch.DoubleTensor of size 2x51x17]
            )
          )
        )
      )
      (linear): Linear(17x0e+13x1o -> 17x0e+13x1o | 458 weights)
    )
    (1): EquivariantProductBasisBlock(
      (symmetric_contractions): SymmetricContraction(
        (contractions): ModuleDict(
          (17x0e): Contraction(
            (weights): ParameterDict(
                (1): Parameter containing: [torch.DoubleTensor of size 2x1x17]
                (2): Parameter containing: [torch.DoubleTensor of size 2x4x17]
                (3): Parameter containing: [torch.DoubleTensor of size 2x23x17]
            )
          )
        )
      )
      (linear): Linear(17x0e -> 17x0e | 289 weights)
    )
  )
  (readouts): ModuleList(
    (0): LinearReadoutBlock(
      (linear): Linear(17x0e+13x1o -> 1x0e | 17 weights)
    )
    (1): NonLinearReadoutBlock(
      (linear_1): Linear(17x0e -> 16x0e | 272 weights)
      (non_linearity): Activation [x] (16x0e -> 16x0e)
      (linear_2): Linear(16x0e -> 1x0e | 16 weights)
    )
  )
  (scale_shift): ScaleShiftBlock(scale=0.239862, shift=-0.002303)
)
2023-01-02 18:44:27.342 INFO: Number of parameters: 41607
2023-01-02 18:44:27.342 INFO: Optimizer: Adam (
Parameter Group 0
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.01
    maximize: False
    name: embedding
    weight_decay: 0.0

Parameter Group 1
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.01
    maximize: False
    name: interactions_decay
    weight_decay: 5e-07

Parameter Group 2
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.01
    maximize: False
    name: interactions_no_decay
    weight_decay: 0.0

Parameter Group 3
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.01
    maximize: False
    name: products
    weight_decay: 5e-07

Parameter Group 4
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.01
    maximize: False
    name: readouts
    weight_decay: 0.0
)
2023-01-02 18:44:27.343 INFO: Using gradient clipping with tolerance=10.000
2023-01-02 18:44:27.343 INFO: Started training
Traceback (most recent call last):
  File "/Users/tks32/work/mace/scripts/run_train.py", line 488, in <module>
    main()
  File "/Users/tks32/work/mace/scripts/run_train.py", line 427, in main
    tools.train(
  File "/Users/tks32/work/mace/mace/tools/train.py", line 69, in train
    _, opt_metrics = take_step(
  File "/Users/tks32/work/mace/mace/tools/train.py", line 211, in take_step
    output = model(
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/tks32/work/mace/mace/modules/models.py", line 281, in forward
    node_feats = product(
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/tks32/work/mace/mace/modules/blocks.py", line 181, in forward
    return self.linear(node_feats) + sc
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/e3nn/o3/_linear.py", line 276, in forward
    return self._compiled_main(features, weight, bias)
  File "/Users/tks32/opt/miniconda3/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<eval_with_key>.45", line 8, in forward
    getitem = getattr_1[slice(None, -1, None)];  getattr_1 = None
    add = getitem + (56,);  getitem = None
    reshape = x.reshape(-1, 56);  x = None
              ~~~~~~~~~ <--- HERE
    getattr_2 = reshape.shape
    getitem_1 = getattr_2[0];  getattr_2 = None
RuntimeError: shape '[-1, 56]' is invalid for input of size 21080

RuntimeError: expected scalar type Double but found Float

I recently updated my local copy of MACE, pulled yesterday (Feb. 20) and ran into the following problem when calling an evaluation with ASE:

Traceback (most recent call last):
  File "/home/ly1/MCASE/bin/mcase_table_half", line 66, in <module>
    energy = walker.get_potential_energy()
  File "/home/ly1/.conda/envs/mace/lib/python3.9/site-packages/ase/atoms.py", line 731, in get_potential_energy
    energy = self._calc.get_potential_energy(self)
  File "/home/ly1/.conda/envs/mace/lib/python3.9/site-packages/ase/calculators/calculator.py", line 709, in get_potential_energy
    energy = self.get_property('energy', atoms)
  File "/home/ly1/.conda/envs/mace/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/home/ly1/.conda/envs/mace/lib/python3.9/site-packages/mace/calculators/mace.py", line 71, in calculate
    out = self.model(batch.to_dict(), compute_stress=True)
  File "/home/ly1/.conda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ly1/.conda/envs/mace/lib/python3.9/site-packages/mace/modules/models.py", line 295, in forward
    node_e0 = self.atomic_energies_fn(data["node_attrs"])
  File "/home/ly1/.conda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ly1/.conda/envs/mace/lib/python3.9/site-packages/mace/modules/blocks.py", line 137, in forward
    return torch.matmul(x, self.atomic_energies)
RuntimeError: expected scalar type Double but found Float

I have verified that training works with this version, as I trained a new model overnight, and I actually tried using that new model above. Below is the training script:

$ cat input.sh
python /home/ly1/mace/scripts/run_train.py \
	--name="hydrogen" \
	--train_file="/home/ly1/H/metallic/subtracted_data/qmc/training.xyz" \
	--valid_fraction=0.05 \
	--test_file="/home/ly1/H/metallic/subtracted_data/qmc/testing.xyz" \
	--config_type_weights='{"Default":1.0}' \
	--E0s="average" \
	--model="MACE" \
	--hidden_irreps='64x0e + 64x1o' \
	--r_max=3.0 \
	--default_dtype='float32' \
	--batch_size=3 \
	--valid_batch_size=3 \
	--max_num_epochs=500 \
	--ema \
	--ema_decay=0.99 \
	--amsgrad \
	--restart_latest \
	--device=cuda

out of memory even with multi-card training

Describe the bug
My attempt to train MACE with multiple 4090 cards was unsuccessful.

Expected behavior
I wish to train MACE in parallel using multiple graphics cards.

Screenshots: (image attached)

Error due to missing virials on some configurations

Using the stress_develop with the following input script and the attached train.xyz and test.xyz (renamed to add .txt extensions) leads to an error.

test.xyz.txt
train.xyz.txt

python -i ../mace/scripts/run_train.py \
    --name="MACE_model" \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="test.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --model="MACE" \
    --hidden_irreps='128x0e + 128x1o' \
    --r_max=5.0 \
    --batch_size=10 \
    --max_num_epochs=1500 \
    --swa \
    --start_swa=1200 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --device=cuda \
    --forces_key=dft_force \
    --energy_key=dft_energy \
    --virials_key=dft_virial \
    --loss=virials

The traceback is as follows; interactive debugging shows that it is ref["virials"] which is None.

2022-09-05 10:51:12.044 INFO: Started training
Traceback (most recent call last):
  File "../mace/scripts/run_train.py", line 405, in <module>
    main()
  File "../mace/scripts/run_train.py", line 346, in main
    tools.train(
  File "/gpfs/home/e/essswb/mace-venv/lib/python3.8/site-packages/mace/tools/train.py", line 69, in train
    _, opt_metrics = take_step(
  File "/gpfs/home/e/essswb/mace-venv/lib/python3.8/site-packages/mace/tools/train.py", line 206, in take_step
    loss = loss_fn(pred=output, ref=batch)
  File "/sulis/easybuild/software/PyTorch/1.9.0-fosscuda-2020b/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/e/essswb/mace-venv/lib/python3.8/site-packages/mace/modules/loss.py", line 172, in forward
    + self.virials_weight * weighted_mean_squared_virials(ref, pred)
  File "/gpfs/home/e/essswb/mace-venv/lib/python3.8/site-packages/mace/modules/loss.py", line 41, in weighted_mean_squared_virials
    configs_weight * torch.square((ref["virials"] - pred["virials"]) / num_atoms)
  File "/sulis/easybuild/software/PyTorch/1.9.0-fosscuda-2020b/lib/python3.8/site-packages/torch/_tensor.py", line 544, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
TypeError: rsub() received an invalid combination of arguments - got (Tensor, NoneType), but expected one of:
 * (Tensor input, Tensor other, *, Number alpha)
 * (Tensor input, Number other, Number alpha)

Question about Message Construction

Hi all! Thanks for your work on Multi-ACE.

The key component of MACE is the construction of the A-basis (a sum over neighbours) and the B-basis (products of A-basis features); the equation screenshots from the paper are omitted here.

But I didn't find any scatter sum operation for the basis function in the code (the summation operation to construct A-basis), or the multiplication of A-basis (to construct the B-basis). The implementation is the direct tensor product between node_feats and edge_feats, which are augmented by spherical harmonics.
Am I missing something else? Looking forward to your reply. Thanks!

Error Running LAMMPs on compiled variant

Hi! I was able to train a model successfully, serialize it, and compile the lammps variant that has the mace pair style. The problem I run into is whenever I even try to use the lmp command (with or without kokkos) I get the following error:

lmp: symbol lookup error: /data/santiago/lammps-mace/build/lmp: undefined symbol: _ZTIN3fmt6v9_lmp12format_errorE

I attached the cmake command, environment, and cmake output from compiling the patched LAMMPS variant.

Support for larger than memory datasets

Hi,
You mentioned at PSI-K that the only bottleneck for larger than memory datasets was the normalization of the energies and forces. It would be great if you could add an option to explicitly enter the statistics to avoid this issue.
best and thank you very much,
Jonathan

Unable to reproduce results

Hi all, I'm trying to reproduce the results on 3BPA from the paper for L=2 but am getting significantly worse results so far. I've tried a number of different setups already (see below), based on parameters that I saw in the code but that were never reported in the paper, so I thought they might be the culprit, but so far that doesn't seem to be the reason. What am I doing wrong?

The paper reports an error in the F-RMSE of 8.8 meV/A on T=300K on 3BPA. I trained the following 5 setups but was only able to get a best error of about 10.9 meV/A. Here's what I ran

python $~/harvard_internal_path/mace/scripts/run_train.py \
    --name="MACE_model" \
    --train_file="~/harvard_internal_path/3bpa/dataset_3BPA/train_300K.xyz" \
    --valid_fraction=0.1 \
    --test_file="~/harvard_internal_path/3bpa/dataset_3BPA/test_300K.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --E0s='{1:-13.663181292231226, 6:-1029.2809654211628, 7:-1484.1187695035828, 8:-2042.0330099956639}' \
    --model="MACE" \
    --hidden_irreps='256x0e + 256x1o + 256x2e' \
    --r_max=5.0 \
    --batch_size=5 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --restart_latest \
    --device=cuda \
    --clip_grad=None

which gives an error of ~11.6 meV/A:

+-------------+---------------------+------------------+-------------------+
| config_type | RMSE E / meV / atom | RMSE F / meV / A | relative F RMSE % |
+-------------+---------------------+------------------+-------------------+
|    train    |         0.1         |       3.8        |        0.40       |
|    valid    |         0.1         |       11.5       |        1.19       |
|   Default   |         0.1         |       11.6       |        1.20       |
+-------------+---------------------+------------------+-------------------+

And here are the modifications I tried:

  • add SWA, train longer: 13.2 meV/A
  • add gradient clipping, train longer: 11.5 meV/A
  • add SWA + gradient clipping, train longer: 13.7 meV/A
  • use ScaleShiftMACE, no clip, no SWA, default epochs: 10.9 meV/A

Since the last one is a pretty significant change, here the full config:

python $harvard_internal_path/mace/scripts/run_train.py \
    --name="MACE_model" \
    --train_file="harvard_internal_path//3bpa/dataset_3BPA/train_300K.xyz" \
    --valid_fraction=0.1 \
    --test_file="harvard_internal_path//3bpa/dataset_3BPA/test_300K.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --E0s='{1:-13.663181292231226, 6:-1029.2809654211628, 7:-1484.1187695035828, 8:-2042.0330099956639}' \
    --model="ScaleShiftMACE" \
    --hidden_irreps='256x0e + 256x1o + 256x2e' \
    --r_max=5.0 \
    --batch_size=5 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --restart_latest \
    --device=cuda \
    --clip_grad=None
+-------------+---------------------+------------------+-------------------+
| config_type | RMSE E / meV / atom | RMSE F / meV / A | relative F RMSE % |
+-------------+---------------------+------------------+-------------------+
|    train    |         0.2         |       5.0        |        0.53       |
|    valid    |         0.2         |       10.7       |        1.11       |
|   Default   |         0.2         |       10.9       |        1.13       |
+-------------+---------------------+------------------+-------------------+

These are all "Default" errors, which I assume correspond to the test set file I pass?

I must be missing something major? Can you help out?

Related to this: there also seem to be options that are by default on in the code that are never reported, e.g. SWA together with all its parameters and gradient clipping. Were they used in the paper or not? I cannot find a mention in the paper which seems like a clear no to me, but they seem to be on by default in the code. I also didn't see a git commit reported in the paper or a version for the code to reproduce? What was actually run for the paper? Can you share complete input + data files + code versions to reproduce?

Finally, I'd also like to compute test errors separately using the eval script (as opposed to at the end of training), but get this error:

Traceback (most recent call last):
  File "~/harvard_internal_path/mace/scripts/eval_configs.py", line 113, in <module>
    main()
  File "~/harvard_internal_path/mace/scripts/eval_configs.py", line 59, in main
    z_table = tools.AtomicNumberTable([int(z) for z in model.atomic_numbers])
AttributeError: 'dict' object has no attribute 'atomic_numbers'

Thanks!

Test error reporting

When creating the error table at the end of training:

  • Report per atom energy RMSE
  • Report relative force RMSE
  • Have the option to print MAE instead of RMSE
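
For reference, a rough sketch of the three requested metrics as a standalone helper (a hypothetical function, not the MACE implementation; inputs are plain arrays):

import numpy as np

def error_metrics(e_ref, e_pred, n_atoms, f_ref, f_pred, use_mae=False):
    # e_ref, e_pred: total energies per configuration; n_atoms: atoms per configuration
    # f_ref, f_pred: force components, flattened over all atoms and configurations
    agg = (lambda x: np.mean(np.abs(x))) if use_mae else (lambda x: np.sqrt(np.mean(x ** 2)))
    e_err_per_atom = agg((np.asarray(e_pred) - np.asarray(e_ref)) / np.asarray(n_atoms))
    f_err = agg(np.asarray(f_pred) - np.asarray(f_ref))
    rel_f_percent = 100.0 * f_err / np.sqrt(np.mean(np.asarray(f_ref) ** 2))
    return e_err_per_atom, f_err, rel_f_percent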

`ValueError: Ellipses lengths do not match` for `max_ell=0`

I am still working on understanding this package, so correct me if I'm wrong, but I believe there's a bug when using max_ell=0.

Specifically, on this line:

self.U_tensors(self.correlation),

with max_ell=0 the resulting tensor has shape (1,), whereas it needs shape (1, 1) to match the contraction indices provided by self.equation_main.

Changing line 154 to be

torch.atleast_2d(self.U_tensors(self.correlation)),

makes the code runnable, but I wasn't sure if this is correct or not.
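
For what it's worth, a minimal check of what torch.atleast_2d does to the shapes involved (plain PyTorch, not the MACE code):

import torch

u0 = torch.ones(1)                  # shape (1,), as produced when max_ell=0
print(torch.atleast_2d(u0).shape)   # torch.Size([1, 1]) -- matches the two contraction indices
u3 = torch.ones(4, 4, 4)            # higher-rank U tensors
print(torch.atleast_2d(u3).shape)   # torch.Size([4, 4, 4]) -- left unchanged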

error while doing the evaluation of the trained model

Does anyone have an idea about this error? I encountered it while using a trained potential to predict a dataset.

python ~/mace/scripts/eval_configs.py \
    --configs="test.extxyz" \
    --model="MACE_model_run-123.pt" \
    --output="output.extxyz"

Traceback (most recent call last):
  File "/Users/yuanbinliu/software/mace/scripts/eval_configs.py", line 151, in <module>
    main()
  File "/Users/yuanbinliu/software/mace/scripts/eval_configs.py", line 88, in main
    z_table = utils.AtomicNumberTable([int(z) for z in model.atomic_numbers])
AttributeError: 'dict' object has no attribute 'atomic_numbers'
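
One possible explanation (an assumption, not verified here): the file passed via --model is a training checkpoint, which torch.load deserialises as a plain dict, whereas the eval script expects the pickled model object saved at the end of training. A quick way to check:

import torch

obj = torch.load("MACE_model_run-123.pt", map_location="cpu")
print(type(obj))
if isinstance(obj, dict):
    print(obj.keys())          # a checkpoint dict has no .atomic_numbers attribute
else:
    print(obj.atomic_numbers)  # a saved model object exposes it directly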

SWA doesn't work when patience triggers loop exit

If the initial optimization is stopped by patience, the SWA phase does not run. Instead, it fails with the error message

Traceback (most recent call last):
  File "/p/home/noamb/.local/bin/mace_fit", line 8, in <module>
    sys.exit(main())
  File "/p/home/noamb/.local/lib/python3.8/site-packages/mace/cli/run_train.py", line 485, in main
    epoch = checkpoint_handler.load_latest(
  File "/p/home/noamb/.local/lib/python3.8/site-packages/mace/tools/checkpoint.py", line 207, in load_latest
    result = self.io.load_latest(swa=swa, device=device)
  File "/p/home/noamb/.local/lib/python3.8/site-packages/mace/tools/checkpoint.py", line 168, in load_latest
    path = self._get_latest_checkpoint_path(swa=swa)
  File "/p/home/noamb/.local/lib/python3.8/site-packages/mace/tools/checkpoint.py", line 142, in _get_latest_checkpoint_path
    latest_checkpoint_info = max(

stdout file is attached
mace_fit_stdout.txt

Handle missing labels

If there are no energy or forces keys defined, it should fail with an error saying which keys are missing.

Eventually we should implement handling of missing keys as well.
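
A minimal pre-flight check along these lines (a sketch using ASE; the key names follow the default --energy_key / --forces_key values and are assumptions about where the labels are stored):

from ase.io import read

for i, atoms in enumerate(read("train.xyz", index=":")):
    missing = [key for key, store in (("energy", atoms.info), ("forces", atoms.arrays)) if key not in store]
    if missing:
        raise KeyError(f"Configuration {i} is missing labels: {missing}")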

Question related to Model generated after training complete

Hi
I am new to MACE. I am encountering a problem: after training completes, why is my MACE model generated from the epoch with the lowest loss value, rather than the last epoch with the lowest RMSE for energy and forces? Is this due to my training data set? (I am using oxides.)
In addition, after "changing the loss based on SWA", the loss doesn't get any lower (it stays above 0.5). Is there any possible solution for this?

The following are part of my output:
......
2022-09-20 10:29:03.008 INFO: Epoch 172: loss=0.2121, RMSE_E_per_atom=86.6 meV, RMSE_F=144.2 meV / A
2022-09-20 10:29:30.293 INFO: Epoch 174: loss=0.2073, RMSE_E_per_atom=84.6 meV, RMSE_F=142.7 meV / A
2022-09-20 10:29:56.333 INFO: Epoch 176: loss=0.2055, RMSE_E_per_atom=85.2 meV, RMSE_F=142.0 meV / A
2022-09-20 10:30:23.293 INFO: Epoch 178: loss=0.2055, RMSE_E_per_atom=84.9 meV, RMSE_F=141.9 meV / A
2022-09-20 10:30:49.278 INFO: Epoch 180: loss=0.2064, RMSE_E_per_atom=85.1 meV, RMSE_F=142.3 meV / A
2022-09-20 10:31:15.227 INFO: Epoch 182: loss=0.2080, RMSE_E_per_atom=85.4 meV, RMSE_F=142.9 meV / A
......
2022-09-20 12:44:32.619 INFO: Epoch 798: loss=0.2939, RMSE_E_per_atom=79.2 meV, RMSE_F=172.0 meV / A
2022-09-20 12:44:58.490 INFO: Epoch 800: loss=0.2940, RMSE_E_per_atom=79.3 meV, RMSE_F=172.0 meV / A
2022-09-20 12:44:58.490 INFO: Changing loss based on SWA
2022-09-20 12:45:24.412 INFO: Epoch 802: loss=4.4866, RMSE_E_per_atom=67.2 meV, RMSE_F=217.0 meV / A
2022-09-20 12:45:50.332 INFO: Epoch 804: loss=3.2585, RMSE_E_per_atom=56.8 meV, RMSE_F=290.4 meV / A
2022-09-20 12:46:16.266 INFO: Epoch 806: loss=2.5655, RMSE_E_per_atom=50.1 meV, RMSE_F=340.0 meV / A
......
2022-09-20 13:26:00.144 INFO: Epoch 990: loss=0.7132, RMSE_E_per_atom=23.5 meV, RMSE_F=363.9 meV / A
2022-09-20 13:26:25.996 INFO: Epoch 992: loss=0.7140, RMSE_E_per_atom=23.4 meV, RMSE_F=363.2 meV / A
2022-09-20 13:26:51.794 INFO: Epoch 994: loss=0.7081, RMSE_E_per_atom=23.3 meV, RMSE_F=362.3 meV / A
2022-09-20 13:27:17.722 INFO: Epoch 996: loss=0.7218, RMSE_E_per_atom=23.6 meV, RMSE_F=361.9 meV / A
2022-09-20 13:27:43.711 INFO: Epoch 998: loss=0.7198, RMSE_E_per_atom=23.5 meV, RMSE_F=362.8 meV / A
2022-09-20 13:27:56.566 INFO: Training complete
2022-09-20 13:27:56.570 INFO: Loading checkpoint: checkpoints/MACE_model_run-123_epoch-176.pt
2022-09-20 13:27:56.624 INFO: Loaded model from epoch 176
2022-09-20 13:27:56.624 INFO: Computing metrics for training, validation, and test sets
2022-09-20 13:28:11.828 INFO: Evaluating train ...
2022-09-20 13:28:28.318 INFO: Evaluating valid ...
2022-09-20 13:28:31.363 INFO: Evaluating Default ...
2022-09-20 13:28:33.182 INFO: Evaluating slab_MD ...
2022-09-20 13:28:33.662 INFO:
+-------------+---------------------+------------------+-------------------+
| config_type | RMSE E / meV / atom | RMSE F / meV / A | relative F RMSE % |
+-------------+---------------------+------------------+-------------------+
|    train    |         70.5        |       59.9       |        6.33       |
|    valid    |         85.2        |      142.0       |       15.79       |
|   Default   |         78.6        |       71.1       |      2364.92      |
|   slab_MD   |         41.8        |      166.3       |       13.01       |
+-------------+---------------------+------------------+-------------------+
2022-09-20 13:28:33.662 INFO: Saving model to checkpoints/MACE_model_run-123.model

Error for hidden irreps with l=4

I'm trying to run a variant of the MACE model with hidden irreps of order l=4 (20x0e + 20x1o + 20x2e + 20x3o + 20x4e), and I get an error in the symmetric contraction:

    prod = EquivariantProductBasisBlock(
  File "~/mace/modules/blocks.py", line 114, in __init__
    self.symmetric_contractions = SymmetricContraction(
  File "~/mace/modules/symmetric_contraction.py", line 71, in __init__
    self.contractions[str(irrep_out)] = Contraction(
  File "~/mace/modules/symmetric_contraction.py", line 107, in __init__
    U_matrix = U_matrix_real(
  File "~/mace/tools/cg.py", line 132, in U_matrix_real
    out += [last_ir, stack]
UnboundLocalError: local variable 'last_ir' referenced before assignment

Should I be able to run the model with l=4 hidden irreps?

Thanks!

make the training code importable

Is your feature request related to a problem? Please describe.
I would like to embed MACE training into Python logic I am working on, but the training code cannot readily be installed and imported; it is only exposed through running a script.

Describe the solution you'd like
Refactor the main() function of the run_train.py script into the library itself, import it from there in run_train.py, and thereby allow other packages to use MACE as a dependency for training (a rough sketch of the intended usage is given below).

Describe alternatives you've considered
Calling the training code through subprocess. This is possible, but not ideal in my case.
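
A rough sketch of the intended usage, assuming a mace.cli.run_train.main() entry point that still parses its options from sys.argv (file names below are placeholders):

import sys
from mace.cli.run_train import main as train_mace

sys.argv = [
    "run_train",
    "--name=MACE_model",
    "--train_file=train.xyz",
    "--valid_fraction=0.05",
    "--model=MACE",
    "--hidden_irreps=128x0e + 128x1o",
    "--device=cpu",
]
train_mace()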

hard limit on the `max_ell` parameter

Hi,

it seems that we can't use max_ell > 3 in the MACE model because of a limitation in the computation of the Wigner 3j (W3j) matrix elements (e3nn supports l_max>=11) used to compute U_matrix_real. Since these coefficients don't need to be computed with PyTorch (no backprop, and they are computed only once), would you mind using the implementation from sympy?

This code will fail:

import numpy as np
import torch
from e3nn import o3

from mace import modules
from mace.modules import MACE

model = MACE(
    r_max=10,
    num_bessel=10,
    num_polynomial_cutoff=5,
    max_ell=6,
    interaction_cls=modules.interaction_classes[
                    "RealAgnosticResidualInteractionBlock"
                ],
    interaction_cls_first=modules.interaction_classes[
                    "RealAgnosticInteractionBlock"
                ],
    num_interactions=1,
    num_elements=46,
    hidden_irreps=o3.Irreps('256x0e'),
    MLP_irreps=o3.Irreps("16x0e"),
    atomic_energies=np.zeros(46),
    avg_num_neighbors=9,
    atomic_numbers=np.arange(1,47),
    correlation=4,
    gate=torch.nn.SiLU(),
    device='cpu',
)
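
For reference, a minimal sketch of the sympy route suggested above; sympy returns exact symbolic values that could be converted to floats once at model-construction time:

from sympy.physics.wigner import wigner_3j

# Exact Wigner 3j symbol for l values beyond the current limit; these
# coefficients are fixed at construction time, so no autograd is needed.
print(float(wigner_3j(6, 6, 6, 0, 0, 0)))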

Standalone Model outside this Repository

Is your feature request related to a problem? Please describe.
I want to use this model to train on energy and forces on the Spice dataset. Instead of converting my data into the xyz format and using it here, I want to incorporate it in my codebase in a way that allows me to load it and use it on any dataset. One of the reasons is also benchmarking it against other networks implemented in my repository. Is there a way to do this?

Describe the solution you'd like
I would like something like,

from mace import MACE

I know that the above happens here as well, but I see a lot of stuff happening in the train script which makes it convoluted.

Symmetric contractions which are element-dependent vs. not

class SymmetricContraction(CodeGenMixin, torch.nn.Module):

In the symmetric contraction block, which implements Eqs. 10 and 11 from the MACE paper, one of the key hyperparameters is element_dependent, which is set to True by default.

Based on my understanding, if we set this parameter to True, the contraction step will additionally consider the node_attrs (https://github.com/ACEsuit/mace/blob/main/mace/modules/blocks.py#L127), which are passed as the optional input y (https://github.com/ACEsuit/mace/blob/main/mace/modules/symmetric_contraction.py#L150).
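
For concreteness, a toy illustration of what "element dependent" appears to mean here (hypothetical shapes, not the actual Contraction code): the contraction weights carry an extra element index that is selected per node via the one-hot node_attrs.

import torch

num_elements, num_weights, num_channels = 3, 4, 8
weights = torch.randn(num_elements, num_weights, num_channels)          # one weight set per element
node_attrs = torch.nn.functional.one_hot(torch.tensor([0, 2, 1, 0]), num_elements).float()
per_node_weights = torch.einsum("ne,ekc->nkc", node_attrs, weights)     # (n_nodes, num_weights, num_channels)
print(per_node_weights.shape)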

My questions are:

  • Does the paper's Eq. 10 describe the element dependent or non-dependent version of the symmetric contraction?
  • Assuming that the initial node feature node_feat in the overall model (https://github.com/ACEsuit/mace/blob/main/mace/modules/models.py#L148) would already be incorporating information about the element type, why do we need the element dependent symmetric contraction?

document --model option

I am hearing of lots of confusion among users about the --model option. Can we please briefly document the various values?

Make U matrices not persistent to reduce `state_dict` size.

Is your feature request related to a problem? Please describe.
We are using MACE with maximum angular momentum 4 and correlation 3, and the checkpoint files are huge because the U matrices of the Contraction class are stored in them (~400 MB, while the model parameters only occupy ~5 MB).

Describe the solution you'd like
We would like that U matrices weren't stored in the checkpoint file.

Describe alternatives you've considered
We think that passing persistent=False in this line:

self.register_buffer(f"U_matrix_{nu}", U_matrix)
should solve the problem without causing any harm to the model. Alternatively, allowing the user to choose whether the buffers are persistent would also be nice.
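
A minimal illustration of the behaviour being relied on (plain PyTorch, not the MACE classes): buffers registered with persistent=False still move with .to()/.cuda() but are excluded from state_dict(), so they would not end up in the checkpoints.

import torch

class Demo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("u_persistent", torch.zeros(3))
        self.register_buffer("u_transient", torch.zeros(3), persistent=False)

print(Demo().state_dict().keys())  # only 'u_persistent' is saved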

Device error when evaluating MACE.

Hello,
First of all, thanks for making this wonderful tool public :)
I wanted to evaluate some structures with a fitted MACE model using the eval_configs.py script.
The model was trained on GPU.
While trying to evaluate some structures, I received the following error.

(/u/hjung/conda-envs/mace_env) hjung@raven02:~/4_Free_energy_sampling/1_MACE_test> python ~/Softwares/mace/scripts/eval_configs.py --configs="new_training_set_iter7.xyz" --model="checkpoints/MACE_model_run-123.model" --output="./output.xyz" --device="cpu"
Traceback (most recent call last):
  File "/u/hjung/Softwares/mace/scripts/eval_configs.py", line 151, in <module>
    main()
  File "/u/hjung/Softwares/mace/scripts/eval_configs.py", line 82, in main
    model = torch.load(f=args.model, map_location=args.device)
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/site-packages/e3nn/util/codegen/_mixin.py", line 109, in __setstate__
    smod = torch.jit.load(buffer)
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/site-packages/torch/jit/_serialization.py", line 164, in load
    cpp_module = torch._C.import_ir_module_from_buffer(
RuntimeError: No CUDA GPUs are available

Having seen the complaint about CUDA GPUs, I submitted an evaluation job with the same GPU configuration as in training, and it succeeded.
Is it true that if I have trained a MACE model on GPU, then evaluation also has to be carried out on GPU? (i.e., if it's trained on GPU, it cannot be evaluated on CPU?)
I just want to clarify this issue.
Many thanks in advance!

Best regards,
Hyunwook
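
One possible workaround (an assumption, not verified here): on a machine that still has a GPU, load the model, move it to CPU, and save a CPU copy that can then be used on GPU-less nodes; the --save_cpu training option may serve the same purpose, if available in the version used.

import torch

model = torch.load("checkpoints/MACE_model_run-123.model")    # on a node with a GPU
torch.save(model.to("cpu"), "MACE_model_run-123_cpu.model")   # CPU copy usable without CUDA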

Return learned local embedding

It could be a useful addition to add a keyword to the model which would return the learned local embedding of the model.

This could be just the invariant part of it, which is a tensor of shape [N_atoms x (num_interactions x N_channels)].

To implement we just need to return the invariant features after each interaction that are passed to the readout blocks.

This would probably only give a good descriptor with a trained model because of normalisation.
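
A rough sketch of the extraction described above (hypothetical names and shapes, not the actual MACE forward pass):

import torch

def local_invariant_embedding(node_feats_per_interaction, num_channels):
    # node_feats_per_interaction: one [n_atoms, feature_dim] tensor per interaction,
    # whose first num_channels columns are the invariant (0e) channels fed to the readouts
    scalars = [feats[:, :num_channels] for feats in node_feats_per_interaction]
    return torch.cat(scalars, dim=-1)  # [n_atoms, num_interactions * num_channels]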

document --reuse_latest

Please document what the --reuse_latest option does. Presumably it continues training from a previous model? Does that previous model just need to exist in the current working directory? Can one specify the file name if it's different?

fewer loss functions

Would it be possible to simplify the internals of MACE by reducing the number of loss functions? At least drop one of stress or virial, perhaps doing an automatic conversion of the input data when it's read in?
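
For the stress/virial part, a sketch of the conversion that could be applied when the data is read in (the sign convention is an assumption and should be checked against the rest of the code):

import numpy as np

def virials_to_stress(virials, cell):
    # stress = -virials / volume (sign convention assumed; volume from the cell vectors)
    volume = abs(np.linalg.det(np.asarray(cell)))
    return -np.asarray(virials) / volume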

Colab tutorial no longer accessible

I get the following error when I click the link in the README

There was an error loading this notebook. Ensure that the file is accessible and try again.
Invalid Credentials
https://drive.google.com/drive/?action=locate&id=1D6EtMUjQPey_GkuxUAbPgld6_9ibIa-V&authuser=1

installation instructions do not work out of the box

Describe the bug
A clear and concise description of what the bug is.

typing command from installation instructions results in error

To Reproduce
Steps to reproduce the behavior:

  1. type "git clone git@github.com:ACEsuit/mace.git"
  2. See error
gabor@Gabors-MacBook-Pro Downloads % git clone git@github.com:ACEsuit/mace.git
Cloning into 'mace'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
gabor@Gabors-MacBook-Pro Downloads % 

Expected behavior
A clear and concise description of what you expected to happen.

mace to be cloned by git

Desktop (please complete the following information):

  • OS: Mac OS Ventura

Recompute statistics for new training set when loading a checkpoint

Currently the way training set related statistics (E0s, avg num neighbours, scale, shift) are handled when loading a checkpoint and using a new dataset is really opaque.

I propose that we should NOT recompute and update any of the statistics by default if there is a checkpoint.

We should also have a new --recompute_statistics flag which makes the code recompute everything with the new training set and updates the model even if there was a checkpoint.

training with MPS - ARM64 Mac

Describe the bug
Training of the model cannot be initialised on an M2 Mac using MPS acceleration. Apple GPUs don't support 64-bit floats, so one needs to set default_dtype=float32, which is likely the issue.
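
A quick standalone check of the float64 limitation (assumes a Mac where the MPS backend is available):

import torch

if torch.backends.mps.is_available():
    torch.zeros(1, dtype=torch.float32, device="mps")  # fine
    torch.zeros(1, dtype=torch.float64, device="mps")  # raises: MPS does not support float64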

To Reproduce
Steps to reproduce the behavior:

  1. Try training a model with --device=mps --default_dtype=float32

Expected behavior
The training should "just work" like elsewhere or on CPU.

Desktop (please complete the following information):

  • OS: MacOS, 13.2.1
  • M2 chip
  • Torch 2.0.0,
  • Python 3.9

Additional context

training args used:

python ../scripts/run_train.py \
    --name="MACE_model" \
    --train_file="Al2O3_train.xyz" \
    --valid_fraction=0.05 \
    --test_file="Al2O3_test.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --model="MACE" \
    --hidden_irreps='16x0e + 16x1o' \
    --r_max=5.0 \
    --batch_size=10 \
    --max_num_epochs=1500 \
    --swa \
    --start_swa=1200 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --restart_latest \
    --device=mps \
    --default_dtype=float32

output:

2023-03-26 08:04:10.863 INFO: MACE version: 0.2.0
2023-03-26 08:04:10.863 INFO: Configuration: Namespace(name='MACE_model', seed=123, log_dir='logs', model_dir='.', checkpoints_dir='checkpoints', results_dir='results', downloads_dir='downloads', device='mps', default_dtype='float32', log_level='INFO', error_table='PerAtomRMSE', model='MACE', r_max=5.0, num_radial_basis=8, num_cutoff_basis=5, interaction='RealAgnosticResidualInteractionBlock', interaction_first='RealAgnosticResidualInteractionBlock', max_ell=3, correlation=3, num_interactions=2, MLP_irreps='16x0e', radial_MLP='[64, 64, 64]', hidden_irreps='16x0e + 16x1o', num_channels=None, max_L=None, gate='silu', scaling='rms_forces_scaling', avg_num_neighbors=1, compute_avg_num_neighbors=True, compute_stress=False, compute_forces=True, train_file='Al2O3_train.xyz', valid_file=None, valid_fraction=0.05, test_file='Al2O3_test.xyz', E0s=None, energy_key='energy', forces_key='forces', virials_key='virials', stress_key='stress', dipole_key='dipole', charges_key='charges', loss='weighted', forces_weight=100.0, swa_forces_weight=100.0, energy_weight=1.0, swa_energy_weight=1000.0, virials_weight=1.0, swa_virials_weight=10.0, stress_weight=1.0, swa_stress_weight=10.0, dipole_weight=1.0, swa_dipole_weight=1.0, config_type_weights='{"Default":1.0}', optimizer='adam', batch_size=10, valid_batch_size=10, lr=0.01, swa_lr=0.001, weight_decay=5e-07, amsgrad=True, scheduler='ReduceLROnPlateau', lr_factor=0.8, scheduler_patience=50, lr_scheduler_gamma=0.9993, swa=True, start_swa=1200, ema=True, ema_decay=0.99, max_num_epochs=1500, patience=2048, eval_interval=2, keep_checkpoints=False, restart_latest=True, save_cpu=False, clip_grad=10.0, wandb=False, wandb_project='', wandb_entity='', wandb_name='', wandb_log_hypers=['num_channels', 'max_L', 'correlation', 'lr', 'swa_lr', 'weight_decay', 'batch_size', 'max_num_epochs', 'start_swa', 'energy_weight', 'forces_weight'])
2023-03-26 08:04:10.883 INFO: Using MPS GPU acceleration
2023-03-26 08:04:10.908 INFO: Using isolated atom energies from training file
2023-03-26 08:04:10.909 INFO: Loaded 45 training configurations from 'Al2O3_train.xyz'
2023-03-26 08:04:10.909 INFO: Using random 5.0% of training set for validation
2023-03-26 08:04:10.918 INFO: Loaded 11 test configurations from 'Al2O3_test.xyz'
2023-03-26 08:04:10.918 INFO: Total number of configurations: train=43, valid=2, tests=[Default: 11]
2023-03-26 08:04:10.919 INFO: AtomicNumberTable: (8, 13)
2023-03-26 08:04:10.919 INFO: Atomic energies: [-422.9243, -105.9163]
2023-03-26 08:04:13.990 INFO: WeightedEnergyForcesLoss(energy_weight=1.000, forces_weight=100.000)
2023-03-26 08:04:13.999 INFO: Average number of neighbors: 58.728389739990234
2023-03-26 08:04:13.999 INFO: Selected the following outputs: {'energy': True, 'forces': True, 'virials': False, 'stress': False, 'dipoles': False}
2023-03-26 08:04:13.999 INFO: Building model
2023-03-26 08:04:13.999 INFO: Hidden irreps: 16x0e + 16x1o
/Users/tks32/research/mace-tmp/venv/lib/python3.9/site-packages/torch/jit/_check.py:172: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "
Traceback (most recent call last):
  File "/Users/tks32/research/mace-tmp/Al2O3/../scripts/run_train.py", line 563, in <module>
    main()
  File "/Users/tks32/research/mace-tmp/Al2O3/../scripts/run_train.py", line 324, in main
    model.to(device)
  File "/Users/tks32/research/mace-tmp/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/Users/tks32/research/mace-tmp/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 844, in _apply
    self._buffers[key] = fn(buf)
  File "/Users/tks32/research/mace-tmp/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

Torchscript support and OpenMM and LAMMPS

Being able to compile MACE models is a high priority:

  • Compile with torchscript the internal functionalities of MACE.
  • Write test for compiled modules.
  • Resolve incompatibilities between torchscript and AtomicData.
  • Compile the full model with torchscript.
  • Create deployed model with metadata (r_cut, species).
  • Load model in C++ using libtorch.
  • Compiled version of the neighbor list.
  • Create interface to OpenMM.
  • Create interface to LAMMPS with a pair potential.

About eval_configs

Describe the bug
Hi, thanks for this great code!
I got an error during the evaluation process, RuntimeError: expected scalar type Double but found Float. However, I believe the training process worked very well, since I got RMSE/MAE both in total and for each config_type individually, so I am not sure why this happens.

The full report

Traceback (most recent call last):
  File "/u/mncui/software/mace_v2/scripts/eval_configs.py", line 113, in <module>
    main()
  File "/u/mncui/software/mace_v2/scripts/eval_configs.py", line 78, in main
    output = model(batch, training=False)
  File "/u/mncui/software/anaconda3/envs/mace_env2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/u/mncui/software/anaconda3/envs/mace_env2/lib/python3.7/site-packages/mace/modules/models.py", line 203, in forward
    node_e0 = self.atomic_energies_fn(data.node_attrs)
  File "/u/mncui/software/anaconda3/envs/mace_env2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/u/mncui/software/anaconda3/envs/mace_env2/lib/python3.7/site-packages/mace/modules/blocks.py", line 77, in forward
    return torch.matmul(x, self.atomic_energies)
RuntimeError: expected scalar type Double but found Float
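
For context, a minimal illustration of how this kind of mismatch can arise (hypothetical tensors, not the MACE code): the saved model holds float64 buffers while the evaluation inputs are built as float32; the exact error text may vary with the PyTorch version.

import torch

atomic_energies = torch.zeros(2, dtype=torch.float64)   # model buffer in double precision
node_attrs = torch.zeros(4, 2, dtype=torch.float32)     # input built in single precision
torch.matmul(node_attrs, atomic_energies)                # raises a Double-vs-Float dtype error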

To Reproduce
In order to quickly reproduce this case, I have attached the output model, the submission scripts for both training and evaluation, and the log file.
Thanks so much for any information about it!
attachments

Lammps with GPU acceleration not working

Describe the bug
Running a LAMMPS simulation with a potential fitted for testing fails with the following error when trying to run the GPU-accelerated version:

[W parser.cpp:3777] Warning: operator() sees varying value in profiling, ignoring and this should be handled by GUARD logic (function operator())
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [10,0,0], thread: [96,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed. (Line occurs multiple times)
cudaStreamSynchronize(stream) error( cudaErrorAssert): device-side assert triggered lammps_mace/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:172

I tried to compile LAMMPS against both the mace and mace-dev branches, on a cluster and on a local machine. The CPU version worked fine.

To Reproduce
Example cmake command (I tried a few and had to adjust it for different machines):

cmake -C ../lammps/cmake/presets/kokkos-cuda.cmake -C ../lammps/cmake/presets/gcc.cmake -D CMAKE_BUILD_TYPE=Release -D BUILD_MPI=yes -D Kokkos_ARCH_PASCAL60=no -D BUILD_OMP=no -D BUILD_SHARED_LIBS=yes -D LAMMPS_EXCEPTIONS=yes -D Kokkos_ARCH_ZEN2=yes -D Kokkos_ARCH_AMPERE80=yes -D Kokkos_ENABLE_OPENMP=no -D Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=no -D Kokkos_ENABLE_CUDA_UVM=no -D PKG_ML-MACE=yes -D CMAKE_PREFIX_PATH=/work/groups/da_mm/apps/lammps_mace/libtorch-gpu -D CMAKE_CXX_STANDARD=17 ../lammps/cmake/

why symmetric contraction doesn't work for some hidden irreps?

Discussed in #42

Originally posted by Fadelis98 November 8, 2022
Although already mentioned in #36 that all degrees should have the same number of channels, I find it doesn't work for some certain irreps like 0o,1e. The error occurs in cg.py: line 117 current_ir = wigners[0][0], list out of range. Or in line 132 out += [last_ir, stack], last_ir not defined.

In the README guidance, it seems that these irreps are avoided on purpose, but I didn't see the reason in the MACE paper. Could anyone give me some insight?
