
mast-ml's Introduction

Materials Simulation Toolkit for Machine Learning (MAST-ML)

MAST-ML is an open-source Python package designed to broaden and accelerate the use of machine learning in materials science research.

Badges: GitHub release (latest by date) · PyPI Downloads · Documentation Status

Run example notebooks in Google Colab:

  • Tutorial 1: Getting Started with MAST-ML

  • Tutorial 2: Data Import and Cleaning

  • Tutorial 3: Feature Engineering

  • Tutorial 4: Models and Data Splitting Tests

  • Tutorial 5: Left out data, nested cross validation, and optimized models

  • Tutorial 6: Model error analysis and uncertainty quantification

  • Tutorial 7: Model predictions with guide rails

Contributors

University of Wisconsin-Madison Computational Materials Group:

  • Prof. Dane Morgan
  • Dr. Ryan Jacobs
  • Lane Schultz
  • Dr. Tam Mayeshiba
  • Dr. Ben Afflerbach
  • Dr. Henry Wu

University of Kentucky contributors:

  • Luke Harold Miles
  • Robert Max Williams
  • Matthew Turner
  • Prof. Raphael Finkel

MAST-ML documentation

  • An overview of code documentation and tutorials for getting started with MAST-ML can be found here:
https://mastmldocs.readthedocs.io/en/latest/

Funding

This work was funded by the National Science Foundation (NSF) SI2 award number 1148011

This work was funded by the National Science Foundation (NSF) DMREF award number DMR-1332851

This work was funded by the National Science Foundation (NSF) CSSI award number 1931298

Citing MAST-ML

If you find MAST-ML useful, please cite the following publication:

Jacobs, R., Mayeshiba, T., Afflerbach, B., Miles, L., Williams, M., Turner, M., Finkel, R., Morgan, D., "The Materials Simulation Toolkit for Machine Learning (MAST-ML): An automated open source toolkit to accelerate data-driven materials research", Computational Materials Science 175 (2020), 109544. https://doi.org/10.1016/j.commatsci.2020.109544

If you find the uncertainty quantification (error bar) approaches useful, please cite the following publication:

Palmer, G., Du, S., Politowicz, A., Emory, J. P., Yang, X., Gautam, A., Gupta, G., Li, Z., Jacobs, R., Morgan, D., "Calibration after bootstrap for accurate uncertainty quantification in regression models", npj Computational Materials 8 115 (2022). https://doi.org/10.1038/s41524-022-00794-8

Installation

MAST-ML can be installed via pip:

pip install mastml

Or clone from GitHub:

git clone https://github.com/uw-cmg/MAST-ML

Changelog

MAST-ML version 3.2.x Major Updates from April 2024

  • Integration of domain of applicability approach using kernel density estimates based on MADML package: https://github.com/leschultz/materials_application_domain_machine_learning

  • Refinement of tutorials, addition of new tutorial for domains of applicability

  • Updates to plotting routines for error bar analysis

  • Many small bug fixes and updates to conform to updated versions of package dependencies

MAST-ML version 3.1.x Major Updates from July 2022

  • Refinement of tutorials, addition of Tutorial 7, and Colab links added as badges for easier use.

  • mastml_predictor module added to help streamline making predictions (with option to include error bars) on new test data.

  • Basic parallelization added, which is especially useful for speeding up nested CV runs with many inner splits.

  • EnsembleModel now handles ensembles of GPR and XGBoost models.

  • Numerous improvements to plotting, including new plots (QQ plot), better axis handling and error bars (RvE plot), and plotting and stats separated per group when groups are specified.

  • Improvements to feature selection methods: EnsembleModelFeatureSelector now includes dummy feature references, and a SHAP-based selector was added.

  • Added baseline test assessments, such as comparing model metrics against predicting the data average or against a permuted-data test.

  • Many miscellaneous bug fixes.

MAST-ML version 3.0.x Major Updates from July 2021

  • MAST-ML no longer uses an input file. The core functionality and workflow of MAST-ML has been rewritten to be more conducive to use in a Jupyter notebook environment. This major change has made the code more modular and transparent, and we believe more intuitive and easier to use in a research setting. The last version of MAST-ML to have input file support was version 2.0.20 on PyPI.

  • Each component of MAST-ML can be run in a Jupyter notebook environment, either locally or through a cloud-based service like Google Colab. As a result, we have completely reworked our use-case tutorials and examples. All of these MAST-ML tutorials are in the form of Jupyter notebooks and can be found in the mastml/examples folder on GitHub.

  • An active part of improving MAST-ML is providing automated, quantitative analysis of model domain assessment and model prediction uncertainty quantification (UQ). Version 3.x of MAST-ML includes a more detailed implementation of model UQ using new and established techniques.

mast-ml's Issues

favorites or summary directory

A directory where favorite plots are pulled out. It could be difficult to pre-determine which plots those are; maybe we would want some kind of input file section, or we determine which plots count as "favorite" plots. It should also hold key statistics and a readme file.

consider class structure for tests

Each test typically requires:

  • data read-in
    * which datasets to use
    * which features to use for fitting
    * which feature to predict on
  • data division
    * into test/train groups
    * also sometimes by additional grouping, such as by category
    * also sometimes filtering out certain test and/or train data according to separate criteria
  • fits
    * how many fits and predictions may depend on the data division
    * which model
    * which hyperparameters
    * sometimes additional features need to be calculated on the fly that weren't in the original data
  • predictions
    * how many predictions may depend on the data division
  • statistics collected on predictions
    * sometimes validation (RMSE, R2)
    * sometimes the testing set has no measured data to validate with
  • printing
    * printing of essential statistics and model outputs
  • plotting
    * can be complex
    * often requires specific annotations
    * plots of measured versus predicted
    * plots of predictions versus some numeric data (not necessarily a fitting data column), split out by groups
    * plots with certain data filtered out, even though that data WAS included in training and/or in testing and testing statistics

Tests could inherit from some basic class (see the sketch below). A class structure would prevent having to pass a lot of variables, and it may encourage modular programming and uniform extension.
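As a rough illustration of this proposal (the class and method names below are hypothetical, not MAST-ML's actual API), a shared base class could own the data, feature columns, and model, and subclasses would override only the splitting, statistics, and plotting steps:

import pandas as pd

class BaseTest:
    """Holds the pieces every test needs: data, feature columns, and a model."""

    def __init__(self, data: pd.DataFrame, x_features, y_feature, model):
        self.data = data
        self.x_features = x_features
        self.y_feature = y_feature
        self.model = model
        self.statistics = {}

    def split(self):
        """Override: yield (train_idx, test_idx) pairs, possibly grouped or filtered."""
        raise NotImplementedError

    def fit_and_predict(self, train_idx, test_idx):
        X, y = self.data[self.x_features], self.data[self.y_feature]
        self.model.fit(X.iloc[train_idx], y.iloc[train_idx])
        return self.model.predict(X.iloc[test_idx])

    def run(self):
        for train_idx, test_idx in self.split():
            predictions = self.fit_and_predict(train_idx, test_idx)
            self.collect_statistics(test_idx, predictions)
        self.print_output()
        self.plot()

    # Subclasses override only the pieces that differ between tests.
    def collect_statistics(self, test_idx, predictions): ...
    def print_output(self): ...
    def plot(self): ...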

look at how we are passing information through the workflow

For example, hyperparameter optimizations are not passed through to other tests.
Maybe have a controlling class that can run a hyperparameter test and then subsequent tests; the controlling class would hold the optimized parameter info (which could then be updated by subsequent hyperparameter optimizations). A rough sketch is below.
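A minimal sketch, with hypothetical names (WorkflowController, optimized_params), of a controlling class that keeps the latest optimized hyperparameters and hands them to each subsequent test:

class WorkflowController:
    def __init__(self, tests):
        self.tests = tests
        self.best_params = {}  # updated after each hyperparameter optimization

    def run(self):
        for test in self.tests:
            # Every test sees the latest optimized parameters.
            result = test.run(params=self.best_params)
            # If this test performed a hyperparameter optimization, keep its result.
            optimized = getattr(result, "optimized_params", None)
            if optimized:
                self.best_params.update(optimized)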

ipython and docker?

Encapsulation ability: look into Docker for the environment and IPython (Jupyter) notebooks for display.

fix extrapolation plotting

Give the user the ability to modify colors, choose between line/no line when there are lots of points, modify legend labels, and plot an arbitrary number of test lines (no more topredict vs. standard).

doc note multiprocessing and ParamOptGA

Make sure virtual memory (and regular memory) are large enough for ParamOptGA.

Note that morgan3 apparently works without setting the pvmem tag (otherwise it actually limits pvmem to 1000 MB, which is not enough), while morgan2 with the pvmem tag gives unlimited pvmem.

Use qstat -f | grep used to see resource usage.

Add check that all test_cases exist in [Test Parameters] before running tests

I've had a few runs where I add a new test or change an existing test name but forget to change the name in [models and tests to run], test_cases. Currently it doesn't look like there's a check that verifies that each test_case has a corresponding [Test Parameters] section, which causes a crash when the code finally tries to run the test and can't look up the parameters.

Having such a check would make this error show up sooner and prevent wasting time running tests when the workflow won't end up completing. A rough sketch of the check is below.
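A minimal sketch, assuming the input file has already been parsed into nested dictionaries keyed by section name (the function and key names here are illustrative, not the actual MAST-ML parser):

def check_test_cases(config):
    """Fail fast if any listed test_case lacks a [Test Parameters] entry."""
    test_cases = config["models and tests to run"]["test_cases"]
    defined = set(config.get("Test Parameters", {}))
    missing = [name for name in test_cases if name not in defined]
    if missing:
        raise ValueError(
            "test_cases missing from [Test Parameters]: " + ", ".join(missing)
        )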

GA

Report the mean and standard deviation if multiple GA runs are done.
Multiple GA runs should just run in serial, one after another, the way Josh had it; that way MASTML can control them. A rough sketch is below.
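A minimal sketch (run_ga is a hypothetical stand-in for a single GA optimization) of running the GA serially several times and summarizing the best scores:

import statistics

def repeat_ga(run_ga, n_runs=5):
    """Run the GA n_runs times in serial and summarize the best scores."""
    best_scores = [run_ga() for _ in range(n_runs)]  # each call returns one best score
    return statistics.mean(best_scores), statistics.stdev(best_scores)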

X features, y features

If X features and the y feature are not set for a particular test's parameters, then look in the Data Setup section's X features and y feature; if they are not there either, error out. A rough sketch of this fallback is below.
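A minimal sketch of the fallback lookup, assuming the parsed input file sections are available as dictionaries (the key names here are illustrative):

def resolve_features(test_params, data_setup):
    """Prefer the test's own X/y settings, fall back to Data Setup, else error out."""
    x_features = test_params.get("X features") or data_setup.get("X features")
    y_feature = test_params.get("y feature") or data_setup.get("y feature")
    if not x_features or not y_feature:
        raise ValueError("X features / y feature not set in the test parameters or in Data Setup")
    return x_features, y_feature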

dataset dictionary in MASTML

Store the data as a dictionary, with CSV names (or some other identifiers) as keys.
Add some additional input parameter keys for each test_case so that different datasets can be used for different test cases, but do the X, y feature setup and the data parsing ahead of time, in MASTML, for each dataset, the way Ryan has it. A rough sketch is below.
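A minimal sketch, with hypothetical helper names, of parsing every dataset up front and letting each test_case name the one it wants:

import pandas as pd

def build_dataset_dict(csv_paths, x_features, y_feature):
    """Parse every CSV once, keyed by file path, with X and y already separated."""
    datasets = {}
    for path in csv_paths:
        df = pd.read_csv(path)
        datasets[path] = {"X": df[x_features], "y": df[y_feature]}
    return datasets

Each test_case would then carry an extra key (for example, an illustrative dataset: "my_data.csv") pointing into this dictionary.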

array to dataframe reindexing causes other issues

@rjacobs914
Question for Ryan:
Line 322 in DataParser.py

def _array_to_dataframe(cls, array):
    # Note: the index is explicitly set to start at 1 instead of pandas' default of 0.
    dataframe = pd.DataFrame(data=array, index=range(1, len(array)+1))
    return dataframe

So if I read in an array from a CSV, the indexing defaults to starting at 0.
Then if I normalize the dataframe's x_features with FeatureNormalization, the indexing is set to start at 1 in line 322.
So when I try to recombine the normalized x_features with the previous array (which might have other columns in addition to the x_features), the indexing is all off. Reindexing doesn't work, because index 0 of the normalized data is just null.

Was there a compelling reason to start the indexing explicitly at 1? Would it work to not set the index to anything and let pandas index automatically?
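For reference, a small generic pandas example (not MAST-ML code) showing why the recombination fails: pandas aligns on the index, so a 0-based frame and a 1-based frame no longer line up.

import pandas as pd

raw = pd.DataFrame({"x": [1.0, 2.0, 3.0], "extra": ["a", "b", "c"]})    # default 0-based index
normalized = pd.DataFrame({"x": [0.0, 0.5, 1.0]}, index=range(1, 4))    # index starts at 1

combined = pd.concat([raw["extra"], normalized["x"]], axis=1)
print(combined)
# Row 0 has NaN for x and row 3 has NaN for extra -- the rows no longer match up.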

Add keyword for input features to read-in from csv using pandas?

Right now the user manually enters the names of the features. We could add a keyword, e.g. "Auto", that is passed to a (future) modified data_parser class that uses pandas to automatically extract all features based on their column names, excluding whatever column is listed as the y_values. A rough sketch of that behavior is below.
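A minimal sketch (generic pandas, hypothetical function name) of what the "Auto" keyword could do:

import pandas as pd

def auto_x_features(csv_path, y_feature):
    """Treat every column except the target column as an input feature."""
    df = pd.read_csv(csv_path)
    x_features = [col for col in df.columns if col != y_feature]
    return df[x_features], df[y_feature]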

Input file model list

Move the list of all available models and their input parameters out of the input file and into the documentation.

Permission error when save path not set as current directory

If save_path is set to "./anything", I see a permission error when writing MASTMLlog.log:

File "C:\ProgramData\Anaconda2\envs\ML\lib\shutil.py", line 544, in move
os.rename(src, real_dst)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another pro
cess: 'C:\Users\Ben\Documents\GitHub\mast-ml-data\DATA_solute_diffusion_class_update\MASTMLlog.log
' -> 'C:\Users\Ben\Documents\GitHub\mast-ml-data\DATA_solute_diffusion_class_update\test_paper_nor
malization\MASTMLlog.log'

This may be related to the fact that the code appears to write MASTMLlog.log to both the specified save directory and the working directory the code is run from. One possible mitigation is sketched below.
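A minimal sketch of a possible mitigation, assuming the log is written through Python's logging module (the names here are generic, not MAST-ML's actual logging setup): close the file handler before moving the log so Windows releases its lock on the file.

import logging
import shutil

def move_log(logger_name, src, dst):
    """Close any open file handlers on the logger, then move the log file."""
    logger = logging.getLogger(logger_name)
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            handler.close()
            logger.removeHandler(handler)
    shutil.move(src, dst)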
