janosh / matbench-discovery

An evaluation framework for machine learning models simulating high-throughput materials discovery.

Home Page: https://matbench-discovery.materialsproject.org

License: MIT License

Python 86.11% HTML 0.42% Svelte 10.16% CSS 1.39% TypeScript 0.97% JavaScript 0.95%

bayesian-optimization convex-hull machine-learning materials-discovery high-throughput-search interatomic-potential

matbench-discovery's Introduction

Matbench Discovery

TL;DR: We benchmark ML models on crystal stability prediction from unrelaxed structures, finding universal interatomic potentials (UIPs) like MACE, CHGNet and M3GNet to be highly accurate, robust across chemistries and ready for production use in high-throughput materials discovery.

Matbench Discovery is an interactive leaderboard and associated PyPI package which together make it easy to rank ML energy models on a task designed to simulate a high-throughput discovery campaign for new stable inorganic crystals.

We've tested models covering multiple methodologies ranging from random forests with structure fingerprints to graph neural networks, from one-shot predictors to iterative Bayesian optimizers and interatomic potential relaxers.

Our results show that ML models have become robust enough to deploy as triaging steps that allocate compute more effectively in high-throughput DFT relaxations. This work provides valuable insights for anyone looking to build large-scale materials databases.

We welcome contributions that add new models to the leaderboard through GitHub PRs. See the contributing guide for details.

If you're interested in joining this work, please reach out via GitHub discussion or email.

For detailed results and analysis, check out the preprint.

matbench-discovery's Issues

Pytorch module and virtual environment usage

matbench-discovery/models/mace/train_mace.py

Line 14 in 7d80413

subprocess.run([". ~/.venv/py311/bin/activate"], check=True)

If I remember correctly, once the pytorch module on Perlmutter is loaded, all packages are installed by the pip provided by Perlmutter's pytorch module, not by the virtual env. That is, all packages will be installed into the module's conda environment under $PYTHONUSERBASE so that they can be accessed correctly (NERSC Pytorch Doc).

If we force all new packages to be installed in the virtual env through the virtual env's pip, some C kernels will not be found when we try to import torch.
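
A side note on the referenced line: a child process cannot change the environment of its parent, so activating a virtual environment through subprocess.run has no effect on the Python process that made the call. A minimal sketch of that behavior, using a made-up variable name (MBD_DEMO_VAR) purely for illustration:

```python
import os
import subprocess
import sys


def child_env_change_leaks(var: str = "MBD_DEMO_VAR") -> bool:
    """Return True if a variable set inside a child process shows up in the parent."""
    os.environ.pop(var, None)  # start from a clean slate
    child_code = f"import os; os.environ[{var!r}] = '1'"
    # the child sets the variable in its own environment and then exits
    subprocess.run([sys.executable, "-c", child_code], check=True)
    return var in os.environ


print(child_env_change_leaks())  # prints False
```

Activating the environment in the shell or job script before Python starts avoids this, independent of the module question raised above.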

compute_struct_fingerprints.py: cannot insert material_id, already exists

Seems that fetch_process_wbm_dataset.py requires me to run compute_struct_fingerprints.py first. However, I get the following error:

> python scripts/compute_struct_fingerprints.py
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/scripts/compute_struct_fingerprints.py", line 133, in <module>
    df_out.reset_index().to_json(f"{out_dir}/site-stats.json.gz")
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/core/frame.py", line 6219, in reset_index
    new_obj.insert(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/core/frame.py", line 4782, in insert
    raise ValueError(f"cannot insert {column}, already exists")
ValueError: cannot insert material_id, already exists

load_train_test(): UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Running the following script fails:

>>> from matbench_discovery.data import load_train_test
>>> load_train_test('mp_computed_structure_entries')
Downloading 'mp_computed_structure_entries' from https://figshare.com/ndownloader/files/40344436
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pbenner/Source/tmp/matbench-discovery/matbench_discovery/data.py", line 95, in load_train_test
    df = reader(url)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 733, in read_json
    json_reader = JsonReader(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 819, in __init__
    self.data = self._preprocess_data(data)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 831, in _preprocess_data
    data = data.read()
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The only files for which the file download works are 'wbm_summary' and 'mp_energies'.
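
Byte 0x8b in position 1 matches the second byte of the gzip magic number (0x1f 0x8b), which suggests that a gzip-compressed payload was decoded as plain UTF-8 text. A minimal sketch of sniffing the magic bytes before parsing; read_json_bytes is a hypothetical helper and not part of matbench_discovery:

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_json_bytes(raw: bytes):
    """Parse JSON from raw bytes, decompressing first if they look gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# illustrative payload, not real benchmark data
payload = gzip.compress(json.dumps({"material_id": "mp-1"}).encode("utf-8"))
print(read_json_bytes(payload))
```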

Missing `"direct_url.json"` causes `JSONDecodeError`: Expecting value: line 1 column 1 (char 0)

Discussed in #76

Originally posted by Claudia-Hello January 28, 2024
I ran this repository, but there was no "direct_url.json" file in the "matbench-discovery" folder. Could you please tell me how to solve this problem? The source code is shown below. Thank you so much!

pkg_name = "matbench-discovery"
direct_url = Distribution.from_name(pkg_name).read_text("direct_url.json") or ""
pkg_is_editable = json.loads(direct_url).get("dir_info", {}).get("editable", False)

Full Stacktrace

---> 17 from matbench_discovery import (
     18     PDF_FIGS,
     19     SCRIPTS,
     20     SITE_FIGS,
     21     Key,
     22     ModelType,
     23     Targets,
     24     Task,
     25 )
     26 from matbench_discovery.data import DATA_FILES, df_wbm
     27 from matbench_discovery.metrics import stable_metrics

File ~/.venv/py311/lib/python3.11/site-packages/matbench_discovery/__init__.py:17
     15 pkg_name = "matbench-discovery"
     16 direct_url = Distribution.from_name(pkg_name).read_text("direct_url.json") or ""
---> 17 pkg_is_editable = json.loads(direct_url).get("dir_info", {}).get("editable", False)
     19 PKG_DIR = os.path.dirname(__file__)
     20 # repo root directory if editable install, else the pkg directory

File /opt/homebrew/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
...
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
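
The traceback follows from the `or ""` fallback: when direct_url.json is absent, read_text returns None, the expression falls back to an empty string, and json.loads("") raises JSONDecodeError. A sketch of one possible guard that falls back to an empty JSON object instead; is_editable_install is a hypothetical helper name, and this is not necessarily the fix adopted upstream:

```python
import json


def is_editable_install(direct_url_text):
    """Interpret the contents of direct_url.json; None or "" means not editable."""
    info = json.loads(direct_url_text or "{}")
    return info.get("dir_info", {}).get("editable", False)


print(is_editable_install(None))
print(is_editable_install('{"dir_info": {"editable": true}}'))
```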

df_summary.index contains nan values

> python data/wbm/fetch_process_wbm_dataset.py
[...]
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 331, in <module>
    df_summary.index = df_summary.index.map(increment_wbm_material_id)  # format IDs
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6158, in map
    new_values = self._map_values(mapper, na_action=na_action)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/core/base.py", line 924, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2834, in pandas._libs.lib.map_infer
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 147, in increment_wbm_material_id
    prefix, step_num, material_num = wbm_id.split("_")
AttributeError: 'float' object has no attribute 'split'

caused by nan values at positions 185450, 185451, 185473, 185474, 185476, 185477
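
pandas represents missing index labels as float nan, which has no split method, hence the AttributeError. A sketch of a defensive wrapper that skips non-string IDs; parse_wbm_id is hypothetical, and the "step_1_0" style of raw ID is an assumption inferred only from the three-way split on "_" in the traceback:

```python
import math


def parse_wbm_id(wbm_id):
    """Split a raw ID into (prefix, step, number); return None for missing values."""
    if not isinstance(wbm_id, str):
        return None  # e.g. float("nan") from a missing index label
    prefix, step_num, material_num = wbm_id.split("_")
    return prefix, int(step_num), int(material_num)


print(parse_wbm_id("step_1_0"))
print(parse_wbm_id(math.nan))
```

Dropping or logging the rows that return None before mapping the index would make the affected positions visible instead of crashing mid-run.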

Simplified user interface

What about providing a simplified user interface for training and testing models? This would be a simple example:

import pandas as pd

from matbench_discovery.data import DATA_FILES, df_wbm
from pymatgen.core import Structure
from sklearn.metrics import r2_score

class MatbenchDiscovery:
    def __init__(self, task_type = "IS2RE"):
        if task_type not in ['IS2RE', 'RS2RE']:
            raise ValueError(f'Invalid task_type {task_type}')
        self.task_type = task_type
    
    def get_test_data(self):
        id_col = "material_id"
        input_col = {"IS2RE": "initial_structure", "RS2RE": "relaxed_structure"}[self.task_type]
        target_col = "e_form_per_atom_mp2020_corrected"

        data_path = {
            "IS2RE": DATA_FILES.wbm_initial_structures,
            "RS2RE": DATA_FILES.wbm_computed_structure_entries,
        }[self.task_type]

        df_in = pd.read_json(data_path).set_index(id_col)

        X = pd.Series([Structure.from_dict(x) for x in df_in[input_col]], index = df_in.index)
        y = pd.Series(df_wbm[target_col])

        return X[y.index], y

    def get_train_data(self):
        assert self.task_type == "IS2RE", "TODO"

        target_col = "formation_energy_per_atom"
        input_col = "structure"
        id_col = "material_id"

        df_cse = pd.read_json(DATA_FILES.mp_computed_structure_entries).set_index(id_col)
        df_eng = pd.read_csv(DATA_FILES.mp_energies).set_index(id_col)

        X = pd.Series([ Structure.from_dict(cse[input_col]) for cse in df_cse.entry ], index = df_cse.index)
        y = pd.Series(df_eng[target_col], index = df_eng.index)

        return X[y.index], y

    def evaluate_predictions(self, y_pred, apply_correction = False):

        assert isinstance(y_pred, pd.Series)

        target_col = "e_form_per_atom_mp2020_corrected"

        y_pred = y_pred.dropna()
        y_true = df_wbm[target_col][y_pred.index]

        if apply_correction:
            # swap legacy MP corrections for MP2020 ones, aligned to the predicted IDs
            y_pred = y_pred - df_wbm.e_correction_per_atom_mp_legacy[y_pred.index]
            y_pred = y_pred + df_wbm.e_correction_per_atom_mp2020[y_pred.index]

        mae = (y_true - y_pred).abs().mean()
        r2 = r2_score(y_true, y_pred)

        return {'mae': mae, 'r2': r2, 'y_true': y_true, 'y_pred': y_pred}

Reference: Critical examination of robustness and generalizability

Maybe another reference for the paper to motivate this work:

Li, Kangming, et al. A critical examination of robustness and generalizability of machine learning prediction of materials properties. npj Computational Materials 9.1 (2023): 55.

Figshare json missing

With a fresh conda environment and the cache cleaned, running fetch_process_wbm_dataset.py leads to the following error:

> python fetch_process_wbm_dataset.py 
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 24, in <module>
    from matbench_discovery.data import DATA_FILES
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py", line 27, in <module>
    f"{os.path.expanduser('~/.cache/matbench-discovery')}/{figshare_versions[-1]}",
IndexError: list index out of range

The error can be bypassed by copying figshare/1.0.0.json to [...]/site-packages/data/figshare.

Since this is matbench-discovery related data, I would suggest using [...]/site-packages/data/matbench-discovery/figshare instead.

add ChargE3Net to leaderboard

https://github.com/AIforGreatGood/charge3net

Obtain E_above_hull predictions

Thanks for the previous reply, I am able to obtain the ground truth of e_above_hull. But how can I get the predicted e_above_hull from formation energies? It is using the ComputedStructureEntry as input.

matbench-discovery/data/wbm/fetch_process_wbm_dataset.py

Line 559 in 297251c

e_above_hull = ppd_mp.get_e_above_hull(cse, allow_negative=True)

Sorry to bother you. I am developing a new model and trying to test it on different benchmarks. It would be great to know the performance with different hyperparameters before contributing to MBD.
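
In the repository the hull distance comes from ppd_mp.get_e_above_hull as quoted above. The sketch below is not that method; it is a toy stand-in for a binary system that only illustrates the underlying idea: build the lower convex hull of reference formation energies and measure how far a predicted formation energy lies above it. All function names and numbers are made up for illustration.

```python
def lower_hull(points):
    """Lower convex hull of (x, energy) points via a monotone-chain scan."""
    hull = []
    for p in sorted(set(points)):
        # pop the last vertex while it lies on or above the line hull[-2] -> p
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross > 0:
                break
            hull.pop()
        hull.append(p)
    return hull


def e_above_hull(x, e_form, reference_points):
    """Distance of (x, e_form) from the hull; negative means below the hull.

    Assumes the references span at least two distinct compositions.
    """
    hull = lower_hull(reference_points)
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            e_hull = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
            return e_form - e_hull
    raise ValueError("composition outside the range covered by the references")


# x = fraction of element B; elemental references sit at zero formation energy
refs = [(0.0, 0.0), (0.5, -1.0), (1.0, 0.0)]
print(e_above_hull(0.25, -0.3, refs))  # predicted phase above the hull
print(e_above_hull(0.5, -1.2, refs))   # predicted phase below the current hull
```

Returning negative values for phases below the hull mirrors the allow_negative=True argument in the quoted line.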

Fetching `2023-02-07-ppd-mp.pkl.gz` still fails with UnicodeDecodeError

fetch_process_wbm_dataset.py now hangs here:

Warning: '/home/pbenner/.cache/matbench-discovery/1.0.0/mp/2023-02-07-ppd-mp.pkl.gz' associated with key='mp_patched_phase_diagram' does not exist. Would you like to download it now using matbench_discovery.data.load_train_test('mp_patched_phase_diagram'). This will cache the file for future use. [y/n] y
Downloading 'mp_patched_phase_diagram' from https://figshare.com/ndownloader/files/40344451


variable dump:
file='mp/2023-02-07-ppd-mp.pkl.gz',
url='https://figshare.com/ndownloader/files/40344451',
reader=<function read_json at 0x7f31b4681120>,
kwargs={'compression': 'gzip'}
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 538, in <module>
    with gzip.open(DATA_FILES.mp_patched_phase_diagram, "rb") as zip_file:
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py", line 217, in __getattribute__
    self._on_not_found(key, msg)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py", line 239, in _on_not_found
    load_train_test(key)  # download and cache data file
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py", line 111, in load_train_test
    df = reader(url, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 733, in read_json
    json_reader = JsonReader(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 819, in __init__
    self.data = self._preprocess_data(data)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 831, in _preprocess_data
    data = data.read()
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

However, manual download seems to work.
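
The variable dump shows a .pkl.gz file being handed to read_json. After gzip decompression, a pickle stream (protocol 2 and later) begins with byte 0x80, which is not valid UTF-8 and matches the error above. A sketch of choosing the loader from the file suffix; load_cached is a hypothetical helper, not the package's API:

```python
import gzip
import json
import pickle
import tempfile
from pathlib import Path


def load_cached(path):
    """Load a gzipped file with pickle or json depending on its suffix."""
    path = Path(path)
    with gzip.open(path, "rb") as fh:
        if path.name.endswith(".pkl.gz"):
            # only unpickle files from sources you trust
            return pickle.load(fh)
        return json.loads(fh.read().decode("utf-8"))


# round-trip a small stand-in object through a temporary .pkl.gz file
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "demo.pkl.gz"
    with gzip.open(pkl_path, "wb") as fh:
        pickle.dump({"n_entries": 3}, fh)
    loaded = load_cached(pkl_path)

print(loaded)
```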

Mismatching fingerprint paths

It seems that fetch_process_wbm_dataset.py and compute_struct_fingerprints.py expect different fingerprint paths:

> python fetch_process_wbm_dataset.py
[...]
fingerprints_path='[...]/data/wbm/site-stats.json.gz' not found, run scripts/compute_struct_fingerprints.py to generate

> python compute_struct_fingerprints.py
Loading 'wbm_summary' from cached file at '~/.cache/matbench-discovery/1.0.0/wbm/2022-10-19-wbm-summary.csv.gz'
out_path='~/.local/opt/anaconda3/envs/discovery/lib/python3.10/site-packages/data/wbm/site-stats-000.json.gz' already exists, exciting early

test_plots.py and test_preds.py failing

> pytest 
===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.10.8, pytest-7.1.2, pluggy-1.0.0
rootdir: /home/pbenner/Source/tmp/matbench-discovery, configfile: pyproject.toml, testpaths: tests
collected 34 items / 2 errors / 2 deselected / 32 selected                                                                                                                                    

=========================================================================================== ERRORS ============================================================================================
____________________________________________________________________________ ERROR collecting tests/test_plots.py _____________________________________________________________________________
tests/test_plots.py:18: in <module>
    from matbench_discovery.preds import load_df_wbm_with_preds
../../../.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/preds.py:127: in <module>
    df_preds = load_df_wbm_with_preds().round(3)
../../../.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/preds.py:96: in load_df_wbm_with_preds
    df = glob_to_df(PRED_FILES[model_name], pbar=False, **kwargs).set_index(id_col)
../../../.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py:170: in glob_to_df
    raise FileNotFoundError(f"No files matching glob {pattern=}")
E   FileNotFoundError: No files matching glob pattern='/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/models//bowsr/2023-01-23-bowsr-megnet-wbm-IS2RE.csv'
--------------------------------------------------------------------------------------- Captured stderr ---------------------------------------------------------------------------------------
Loading preds:   0%|          | 0/8 [00:00<?, ?it/s, BOWSR + MEGNet]
____________________________________________________________________________ ERROR collecting tests/test_preds.py _____________________________________________________________________________
tests/test_preds.py:6: in <module>
    from matbench_discovery.preds import (
../../../.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/preds.py:127: in <module>
    df_preds = load_df_wbm_with_preds().round(3)
../../../.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/preds.py:96: in load_df_wbm_with_preds
    df = glob_to_df(PRED_FILES[model_name], pbar=False, **kwargs).set_index(id_col)
../../../.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py:170: in glob_to_df
    raise FileNotFoundError(f"No files matching glob {pattern=}")
E   FileNotFoundError: No files matching glob pattern='/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/models//bowsr/2023-01-23-bowsr-megnet-wbm-IS2RE.csv'
--------------------------------------------------------------------------------------- Captured stderr ---------------------------------------------------------------------------------------
Loading preds:   0%|          | 0/8 [00:00<?, ?it/s, BOWSR + MEGNet]
=================================================================================== short test summary info ===================================================================================
ERROR tests/test_plots.py - FileNotFoundError: No files matching glob pattern='/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/models//bowsr/2023-01-23-bowsr-...
ERROR tests/test_preds.py - FileNotFoundError: No files matching glob pattern='/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/models//bowsr/2023-01-23-bowsr-...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Importing CSV with pd.read_json()

The file format of DATA_FILES.mp_energies is now CSV:

matbench-discovery/models/cgcnn/train_cgcnn.py

Line 65 in 4a90dee

df_in = pd.read_json(data_path).set_index(id_col)

Also, the structure column was removed, which is required for training.

2022-09-19-mp-elemental-reference-entries.json missing

> python fetch_process_wbm_dataset.py 
Downloading 'wbm_summary' from https://figshare.com/ndownloader/files/40344475
Cached 'wbm_summary' to '/home/pbenner/.cache/matbench-discovery/1.0.0/wbm/2022-10-19-wbm-summary.csv'
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 25, in <module>
    from matbench_discovery.energy import get_e_form_per_atom
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/energy.py", line 66, in <module>
    pd.read_json(DATA_FILES.mp_elemental_ref_entries, typ="series")
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 733, in read_json
    json_reader = JsonReader(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 818, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 874, in _get_data_from_filepath
    raise FileNotFoundError(f"File {filepath_or_buffer} does not exist")
FileNotFoundError: File /home/pbenner/.cache/matbench-discovery/1.0.0/mp/2022-09-19-mp-elemental-reference-entries.json does not exist

Solved this issue using:

> mkdir /home/pbenner/.cache/matbench-discovery/1.0.0/mp/
> cp ../mp/2022-09-19-mp-elemental-reference-entries.json /home/pbenner/.cache/matbench-discovery/1.0.0/mp/

Location of site-stats.json.gz

scripts/compute_struct_fingerprints.py saves result to data/wbm/structure-fingerprints/site-stats.json.gz, while fetch_process_wbm_dataset.py expects data/wbm/site-stats.json.gz. Would be nice to have it directly saved to data/wbm/site-stats.json.gz

fetch_process_wbm_dataset.py: AssertionError: mat_id='wbm-1-9': e_form=-0.31117 != e_form_ppd - correction=-0.32358

Executing fetch_process_wbm_dataset.py results in the following error:

Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 577, in <module>
    abs(e_form - (e_form_ppd - correction)) < 1e-4
AssertionError: mat_id='wbm-1-9': e_form=-0.31117 != e_form_ppd - correction=-0.32358

fetch_process_wbm_dataset.py: Generating Aflow labels raised exception=KeyError('wyckoff_spglib')

> python data/wbm/fetch_process_wbm_dataset.py
[...]
  0%|                                                                                                                                              | 0/256963 [00:00<?, ?it/s]
Generating Aflow labels raised exception=KeyError('wyckoff_spglib')

fetch_process_wbm_dataset.py: data/wbm/2022-10-19-wbm-init-structs.json.bz2 does not exist

Thanks for all the bugfixes! The latest error with a fresh start is the following:

Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 597, in <module>
    df_init_struct = pd.read_json(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 760, in read_json
    json_reader = JsonReader(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 861, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 917, in _get_data_from_filepath
    raise FileNotFoundError(f"File {filepath_or_buffer} does not exist")
FileNotFoundError: File /home/pbenner/Source/tmp/matbench-discovery/data/wbm/2022-10-19-wbm-init-structs.json.bz2 does not exist

df_wbm has wrong index column name type for wandb.Table

Running test_mace.py currently fails with the following error:

Relaxing: 100%|██████████| 257/257 [13:10:57<00:00, 184.66s/it]  
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery-pbenner/models/mace/test_mace.py", line 168, in <module>
    table = wandb.Table(
            ^^^^^^^^^^^^
  File "/home/pbenner/.local/opt/anaconda3/envs/mace/lib/python3.11/site-packages/wandb/data_types.py", line 281, in __init__
    self._init_from_dataframe(dataframe, columns, optional, dtype)
  File "/home/pbenner/.local/opt/anaconda3/envs/mace/lib/python3.11/site-packages/wandb/data_types.py", line 334, in _init_from_dataframe
    self._assert_valid_columns(columns)
  File "/home/pbenner/.local/opt/anaconda3/envs/mace/lib/python3.11/site-packages/wandb/data_types.py", line 304, in _assert_valid_columns
    assert len(columns) == 0 or all(
           ^^^^^^^^^^^^^^^^^
AssertionError: columns argument expects list of strings or ints

matbench-discovery/models/mace/test_mace.py

Line 167 in 3118330

 dataframe=df_wbm.dropna()[[Key.dft_energy, e_pred_col, Key.formula]].reset_index() 

A simple bugfix is the following:

df_wbm.index.name = str(df_wbm.index.name)

i.e. the index column name has type enum 'Key', which is not accepted by wandb.

Package uses non-standard site-package paths for resources

The package seems to use the following non-standard paths:

[...]/site-packages/tmp
[...]/site-packages/data

Please use pkg_resources for determining package resource paths.
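
As a sketch of what resolving resources relative to an installed package can look like, the example below uses importlib.resources from the standard library (Python 3.9+) rather than pkg_resources, which setuptools has since deprecated. It looks up a file inside the stdlib json package only so that it runs anywhere; a real package would pass its own import name.

```python
from importlib import resources


def resource_exists(package: str, filename: str) -> bool:
    """Check whether a file ships inside an importable package."""
    return resources.files(package).joinpath(filename).is_file()


print(resource_exists("json", "__init__.py"))
print(resource_exists("json", "no-such-file.txt"))
```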

dead link in contributing

I'm interested in predicting ground truth energies. You mention a correction from MP20 and provide a link on this page.

e_form_per_atom_mp2020_corrected: Matbench Discovery takes these as ground truth for the formation energy. The result of applying the MP2020 energy corrections (latest correction scheme at time of release) to e_form_per_atom_uncorrected.

e_correction_per_atom_mp2020: MaterialsProject2020Compatibility energy corrections in eV/atom.

e_correction_per_atom_mp_legacy: Legacy MaterialsProjectCompatibility energy corrections in eV/atom. Having both old and new corrections allows updating predictions from older models like MEGNet that were trained on MP formation energies treated with the old correction scheme.

However, that link is a 404.

Benchmark design questions

I have a concern regarding the MBD benchmark. The protocol suggests training on the MP dataset and testing on the WBM dataset. Could you please confirm whether "energy above hull" is the correct training label and whether "energy_above_hull_mp2020_corrected_ppd_mp" is the appropriate ground truth label for the test set? I've observed a noticeable difference in distribution between the training and test sets (there are no negative values in the training set). Moreover, each dataset seems to be imbalanced between the "stable" and "unstable" classes. Do you think mixing the two datasets and then performing 5-fold cross-validation would be better, similar to the process used in Matbench?

Missing `site/src/figs` directory and `2023-02-07-ppd-mp.pkl.gz` file

This particular problem seems to be solved. However, there are further issues:

n_too_stable = 502
n_too_unstable = 22
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 473, in <module>
    save_fig(fig, f"{img_path}.svelte")
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pymatviz/utils.py", line 308, in save_fig
    fig.write_html(path, **defaults)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/plotly/basedatatypes.py", line 3708, in write_html
    return pio.write_html(self, *args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/plotly/io/_html.py", line 536, in write_html
    path.write_text(html_str)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/pathlib.py", line 1154, in write_text
    with self.open(mode='w', encoding=encoding, errors=errors, newline=newline) as f:
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/site/src/figs/hist-wbm-e-form-per-atom.svelte'

After manually creating the [...]/src/figs directory, the next issue is the following:

Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 538, in <module>
    with gzip.open(DATA_FILES.mp_patched_phase_diagram, "rb") as zip_file:
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/gzip.py", line 174, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/data/mp/2023-02-07-ppd-mp.pkl.gz'

The [...]/data/mp directory exists, but the pkl file is missing.

Originally posted by @pbenner in #12 (comment)

Inconsistency in GNoME's F1 Scores on Matbench

I noticed that the F1 scores for GNoME listed on two different web pages within the Matbench Discovery section appear to be inconsistent:

https://matbench-discovery.materialsproject.org/models
and
https://matbench-discovery.materialsproject.org/

Could someone please clarify why there is a difference in the reported F1 scores for GNoME? Thanks!

Package import fails

Package import fails because 2022-10-19-wbm-summary.csv is not shipped with the pip package:

matbench-discovery/matbench_discovery/data.py

Line 17 in 69dc756

df_wbm = pd.read_csv(f"{ROOT}/data/wbm/2022-10-19-wbm-summary.csv")

Import of cached CSV is inconsistent

Imported data frames from CSV differ between downloaded and cached version. To fix this issue, please add index = False to:

matbench-discovery/matbench_discovery/data.py

Line 150 in 69dc756

df.to_csv(cache_path)

Fetching data fails

Fetching the data currently fails:

Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 184, in <module>
    urllib.request.urlretrieve(f"{mat_cloud_url}&{filename=}", file_path)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: NOT FOUND

matbench-discovery/data/wbm/fetch_process_wbm_dataset.py

Line 184 in 69dc756

urllib.request.urlretrieve(f"{mat_cloud_url}&{filename=}", file_path)
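
One thing worth checking in the quoted line: the `=` specifier in an f-string renders the variable name followed by the repr of its value, so a string value ends up wrapped in quotes inside the URL. Whether that contributed to the 404 is not established in this issue. A sketch comparing it with urllib.parse.urlencode, using a placeholder base URL and file name (both hypothetical):

```python
from urllib.parse import urlencode

base_url = "https://example.org/download?id=123"  # placeholder, not the real endpoint
filename = "summary.txt.bz2"  # placeholder file name

# f-string debug specifier: inserts "filename='summary.txt.bz2'" including the quotes
debug_style = f"{base_url}&{filename=}"
# explicit query encoding: inserts "filename=summary.txt.bz2"
encoded = f"{base_url}&{urlencode({'filename': filename})}"

print(debug_style)
print(encoded)
```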

Different training size for benchmarking

Hi,

I heard about this project at MRS Fall last week and I am looking at the matbench discovery page. Shouldn't training size (e.g. 1.6 million vs 133 k) be the same for a fair comparison, else the model might be considered overfitting? This is machine learning 101.

2023-02-07-ppd-mp.pkl.gz missing

Running fetch_process_wbm_dataset.py leads to the following error:

[...]
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 537, in <module>
    with gzip.open(DATA_FILES.mp_patched_phase_diagram, "rb") as zip_file:
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/gzip.py", line 174, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/pbenner/.cache/matbench-discovery/1.0.0/mp/2023-02-07-ppd-mp.pkl.gz'

This file is not part of the repository and cannot be easily fixed.

How to calculate the MAE of the submitted files

Hi,
I am new to this benchmark. I am trying to figure out the correct way to calculate the MAE of my predictions. I am not sure the following steps are correct (sorry, I could not find documentation anywhere on the website):
To get the ground truth structures and the energy per atom, I need to download both files:

wbm_initial_structures, where I can get the initial structures of all unrelaxed crystals. I saw another file, wbm_computed_structure_entries, but I am not sure what it is for.
wbm_summary, where I look at the column e_correction_per_atom_mp2020 to get the ground truth energy per atom.

I took my prediction and calculated the MAE between my prediction and the given ground-truth energy above.

Could you please tell me whether that is the correct way to validate the MAE of my model?

Since I am not sure what the correct way is, I looked into https://github.com/janosh/matbench-discovery/tree/main/models/chgnet, which has some submitted prediction files. I downloaded one of them, 2023-12-21-chgnet-0.3.0-wbm-IS2RE.csv.gz. That file has three columns:

material_id,chgnet_energy,e_form_per_atom_chgnet
wbm-1-1,-42.8388,0.5589
wbm-1-2,-25.9892,0.0934
wbm-1-3,-35.0625,0.7332
wbm-1-4,-9.1859,0.0142
wbm-1-5,-13.3448,-0.2062
wbm-1-6,-16.3522,-0.2686
wbm-1-7,-8.398,-0.2713
wbm-1-8,-16.8369,-0.2781
wbm-1-9,-15.2442,-0.4043

Which column is the actual prediction?
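
For reference, the field descriptions quoted in the "dead link in contributing" issue name e_form_per_atom_mp2020_corrected as the ground truth for formation energy, whereas e_correction_per_atom_mp2020 is only the correction term. A sketch of an ID-aligned MAE in plain Python, assuming e_form_per_atom_chgnet is the formation-energy prediction; the predictions are the first three rows quoted above, while the ground-truth numbers are invented for illustration:

```python
def mean_absolute_error(y_true: dict, y_pred: dict) -> float:
    """MAE over the material IDs present in both mappings."""
    shared = sorted(y_true.keys() & y_pred.keys())
    if not shared:
        raise ValueError("no overlapping material IDs")
    return sum(abs(y_true[k] - y_pred[k]) for k in shared) / len(shared)


# e_form_per_atom_chgnet values from the rows quoted above
e_form_pred = {"wbm-1-1": 0.5589, "wbm-1-2": 0.0934, "wbm-1-3": 0.7332}
# invented stand-ins for e_form_per_atom_mp2020_corrected
e_form_true = {"wbm-1-1": 0.5, "wbm-1-2": 0.1, "wbm-1-3": 0.7}

mae = mean_absolute_error(e_form_true, e_form_pred)
print(round(mae, 4))
```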

fetch_process_wbm_dataset.py: bad JSON file checksum

matbench-discovery/data/wbm > python fetch_process_wbm_dataset.py
[...]
From: https://drive.google.com/u/0/uc?id=1639IFUG7poaDE2uB6aISUOi65ooBwCIg
To: /home/pbenner/Source/tmp/matbench-discovery-pbenner/data/wbm/raw/wbm-summary.txt
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14.3M/14.3M [00:00<00:00, 118MB/s]
step=1

  File "/home/pbenner/Source/tmp/matbench-discovery-pbenner/data/wbm/fetch_process_wbm_dataset.py", line 113, in <module>
    assert checksum == wbm_struct_json_checksums[step - 1], f"bad JSON file checksum, expected {wbm_struct_json_checksums[step - 1]} but got {checksum}"
AssertionError: bad JSON file checksum, expected -7815922250032563359 but got 10630821823676988257
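
The two numbers in the assertion differ by exactly 2**64, so they are the same 64-bit value read once as a signed and once as an unsigned integer. The issue does not say where the difference comes from; a plausible guess is a hash function or library version whose return type changed. A sketch of normalizing both sides before comparing (as_unsigned64 is a hypothetical helper):

```python
def as_unsigned64(value: int) -> int:
    """Map a signed or unsigned 64-bit integer onto the range [0, 2**64)."""
    return value % 2**64


expected = -7815922250032563359  # value stored in the script
got = 10630821823676988257       # value computed on this machine

print(as_unsigned64(expected) == as_unsigned64(got))  # prints True
```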