
mast-ml's Introduction

Materials Simulation Toolkit for Machine Learning (MAST-ML)

MAST-ML is an open-source Python package designed to broaden and accelerate the use of machine learning in materials science research.

Badges: GitHub release (latest by date) · PyPI Downloads · Documentation Status

Run example notebooks in Google Colab:

  • Tutorial 1: Getting Started with MAST-ML

  • Tutorial 2: Data Import and Cleaning

  • Tutorial 3: Feature Engineering

  • Tutorial 4: Models and Data Splitting Tests

  • Tutorial 5: Left out data, nested cross validation, and optimized models

  • Tutorial 6: Model error analysis and uncertainty quantification

  • Tutorial 7: Model predictions with guide rails

Contributors

University of Wisconsin-Madison Computational Materials Group:

  • Prof. Dane Morgan
  • Dr. Ryan Jacobs
  • Lane Schultz
  • Dr. Tam Mayeshiba
  • Dr. Ben Afflerbach
  • Dr. Henry Wu

University of Kentucky contributors:

  • Luke Harold Miles
  • Robert Max Williams
  • Matthew Turner
  • Prof. Raphael Finkel

MAST-ML documentation

  • An overview of code documentation and tutorials for getting started with MAST-ML can be found here:
https://mastmldocs.readthedocs.io/en/latest/

Funding

This work was funded by the National Science Foundation (NSF) SI2 award number 1148011

This work was funded by the National Science Foundation (NSF) DMREF award number DMR-1332851

This work was funded by the National Science Foundation (NSF) CSSI award number 1931298

Citing MAST-ML

If you find MAST-ML useful, please cite the following publication:

Jacobs, R., Mayeshiba, T., Afflerbach, B., Miles, L., Williams, M., Turner, M., Finkel, R., Morgan, D., "The Materials Simulation Toolkit for Machine Learning (MAST-ML): An automated open source toolkit to accelerate data-driven materials research", Computational Materials Science 175 (2020), 109544. https://doi.org/10.1016/j.commatsci.2020.109544

If you find the uncertainty quantification (error bar) approaches useful, please cite the following publication:

Palmer, G., Du, S., Politowicz, A., Emory, J. P., Yang, X., Gautam, A., Gupta, G., Li, Z., Jacobs, R., Morgan, D., "Calibration after bootstrap for accurate uncertainty quantification in regression models", npj Computational Materials 8 115 (2022). https://doi.org/10.1038/s41524-022-00794-8

Installation

MAST-ML can be installed via pip:

pip install mastml

Or clone from GitHub:

git clone https://github.com/uw-cmg/MAST-ML

Changelog

MAST-ML version 3.2.x Major Updates from April 2024

  • Integration of domain of applicability approach using kernel density estimates based on MADML package: https://github.com/leschultz/materials_application_domain_machine_learning

  • Refinement of tutorials, addition of new tutorial for domains of applicability

  • Updates to plotting routines for error bar analysis

  • Many small bug fixes and updates to conform to updated versions of package dependencies

MAST-ML version 3.1.x Major Updates from July 2022

  • Refinement of tutorials, addition of Tutorial 7, and Colab links added as badges for easier use.

  • mastml_predictor module added to help streamline making predictions (with option to include error bars) on new test data.

  • Basic parallelization added, which is especially useful for speeding up nested CV runs with many inner splits.

  • EnsembleModel now handles ensembles of GPR and XGBoost models.

  • Numerous improvements to plotting, including new plots (QQ plot), better axis handling and error bars (RvE plot), and plotting and stats separated per group when groups are specified.

  • Improvements to feature selection methods: EnsembleModelFeatureSelector now includes dummy feature references, and a SHAP-based selector was added.

  • Added baseline test assessments, such as comparing model metrics against predicting the data average or against a permuted-data test.

  • Many miscellaneous bug fixes.

MAST-ML version 3.0.x Major Updates from July 2021

  • MAST-ML no longer uses an input file. The core functionality and workflow of MAST-ML has been rewritten to be more conducive to use in a Jupyter notebook environment. This major change has made the code more modular and transparent, and we believe more intuitive and easier to use in a research setting. The last version of MAST-ML to have input file support was version 2.0.20 on PyPI.

  • Each component of MAST-ML can be run in a Jupyter notebook environment, either locally or through a cloud-based service like Google Colab. As a result, we have completely reworked our use-case tutorials and examples. All of these MAST-ML tutorials are in the form of Jupyter notebooks and can be found in the mastml/examples folder on GitHub.

  • An active part of improving MAST-ML is providing automated, quantitative analysis of model domain assessment and model prediction uncertainty quantification (UQ). Version 3.x of MAST-ML includes a more detailed implementation of model UQ using new and established techniques.

mast-ml's Issues

favorites or summary directory

A directory where favorite plots are pulled out. It could be difficult to pre-determine which plots those are; maybe we would want some kind of input file section, or we determine which plots count as "favorite" plots. It should also hold key statistics and a readme file.

consider class structure for tests

Each test typically requires:

  • data read-in
    * which datasets to use
    * which features to use for fitting
    * which feature to predict on
  • data division
    * into test/train groups
    * also sometimes by additional grouping, such as by category
    * also sometimes filtering out certain test and/or train data according to separate criteria
  • fits
    * how many fits and predictions may depend on the data division
    * which model
    * which hyperparameters
    * sometimes additional features need to be calculated on the fly that weren't in the original data
  • predictions
    * how many predictions may depend on the data division
  • statistics collected on predictions
    * sometimes validation (RMSE, R2)
    * sometimes the testing set has no measured data to validate with
  • printing
    * printing of essential statistics and model outputs
  • plotting
    * can be complex
    * often requires specific annotations
    * plots of measured versus predicted
    * plots of predictions versus some numeric data (not necessarily a fitting data column), split out by groups
    * plots with certain data filtered out, even though that data WAS included in training and/or in testing and testing statistics

Tests could inherit from some basic class (see the sketch below). A class structure would prevent having to pass a lot of variables, and it may encourage modular programming and uniform extension.
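As a rough illustration of this proposal (the class and method names below are hypothetical, not MAST-ML's actual API), a shared base class could own the data, feature columns, and model, and subclasses would override only the splitting, statistics, and plotting steps:

import pandas as pd

class BaseTest:
    """Holds the pieces every test needs: data, feature columns, and a model."""

    def __init__(self, data: pd.DataFrame, x_features, y_feature, model):
        self.data = data
        self.x_features = x_features
        self.y_feature = y_feature
        self.model = model
        self.statistics = {}

    def split(self):
        """Override: yield (train_idx, test_idx) pairs, possibly grouped or filtered."""
        raise NotImplementedError

    def fit_and_predict(self, train_idx, test_idx):
        X, y = self.data[self.x_features], self.data[self.y_feature]
        self.model.fit(X.iloc[train_idx], y.iloc[train_idx])
        return self.model.predict(X.iloc[test_idx])

    def run(self):
        for train_idx, test_idx in self.split():
            predictions = self.fit_and_predict(train_idx, test_idx)
            self.collect_statistics(test_idx, predictions)
        self.print_output()
        self.plot()

    # Subclasses override only the pieces that differ between tests.
    def collect_statistics(self, test_idx, predictions): ...
    def print_output(self): ...
    def plot(self): ...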

look at how we are passing information through the workflow

For example, hyperparameter optimizations are not passed through to other tests.
Maybe have a controlling class that can run a hyperparameter test and then subsequent tests; the controlling class would hold the optimized parameter info (which could then be updated by subsequent hyperparameter optimizations). A rough sketch is below.
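A minimal sketch, with hypothetical names (WorkflowController, optimized_params), of a controlling class that keeps the latest optimized hyperparameters and hands them to each subsequent test:

class WorkflowController:
    def __init__(self, tests):
        self.tests = tests
        self.best_params = {}  # updated after each hyperparameter optimization

    def run(self):
        for test in self.tests:
            # Every test sees the latest optimized parameters.
            result = test.run(params=self.best_params)
            # If this test performed a hyperparameter optimization, keep its result.
            optimized = getattr(result, "optimized_params", None)
            if optimized:
                self.best_params.update(optimized)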

ipython and docker?

Encapsulation ability: look into Docker for the environment and IPython (Jupyter) notebooks for display.

fix extrapolation plotting

Give the user the ability to modify colors, choose between line/no line when there are lots of points, modify legend labels, and plot an arbitrary number of test lines (no more topredict vs. standard).

doc note multiprocessing and ParamOptGA

Make sure virtual memory (and regular memory) are large enough for ParamOptGA.

Note that morgan3 apparently works without setting the pvmem tag (otherwise it actually limits pvmem to 1000 MB, which is not enough), while morgan2 with the pvmem tag gives unlimited pvmem.

Use qstat -f | grep used to see resource usage.

Add check that all test_cases exist in [Test Parameters] before running tests

I've had a few runs where I add a new test or change an existing test name but forget to change the name in [models and tests to run], test_cases. Currently it doesn't look like there's a check that verifies that each test_case has a corresponding [Test Parameters] section, which causes a crash when the code finally tries to run the test and can't look up the parameters.

Having such a check would make this error show up sooner and prevent wasting time running tests when the workflow won't end up completing. A rough sketch of the check is below.
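A minimal sketch, assuming the input file has already been parsed into nested dictionaries keyed by section name (the function and key names here are illustrative, not the actual MAST-ML parser):

def check_test_cases(config):
    """Fail fast if any listed test_case lacks a [Test Parameters] entry."""
    test_cases = config["models and tests to run"]["test_cases"]
    defined = set(config.get("Test Parameters", {}))
    missing = [name for name in test_cases if name not in defined]
    if missing:
        raise ValueError(
            "test_cases missing from [Test Parameters]: " + ", ".join(missing)
        )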

GA

Report the mean and standard deviation if multiple GA runs are done.
Multiple GA runs should just run in serial, one after another, the way Josh had it; that way MASTML can control them. A rough sketch is below.
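A minimal sketch (run_ga is a hypothetical stand-in for a single GA optimization) of running the GA serially several times and summarizing the best scores:

import statistics

def repeat_ga(run_ga, n_runs=5):
    """Run the GA n_runs times in serial and summarize the best scores."""
    best_scores = [run_ga() for _ in range(n_runs)]  # each call returns one best score
    return statistics.mean(best_scores), statistics.stdev(best_scores)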

X features, y features

If X features and the y feature are not set for a particular test's parameters, then look in the Data Setup section's X features and y feature; if they are not there either, error out. A rough sketch of this fallback is below.
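A minimal sketch of the fallback lookup, assuming the parsed input file sections are available as dictionaries (the key names here are illustrative):

def resolve_features(test_params, data_setup):
    """Prefer the test's own X/y settings, fall back to Data Setup, else error out."""
    x_features = test_params.get("X features") or data_setup.get("X features")
    y_feature = test_params.get("y feature") or data_setup.get("y feature")
    if not x_features or not y_feature:
        raise ValueError("X features / y feature not set in the test parameters or in Data Setup")
    return x_features, y_feature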

dataset dictionary in MASTML

Store the data as a dictionary, with CSV names (or some other identifiers) as keys.
Add some additional input parameter keys for each test_case so that different datasets can be used for different test cases, but do the X, y feature setup and the data parsing ahead of time, in MASTML, for each dataset, the way Ryan has it. A rough sketch is below.
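A minimal sketch, with hypothetical helper names, of parsing every dataset up front and letting each test_case name the one it wants:

import pandas as pd

def build_dataset_dict(csv_paths, x_features, y_feature):
    """Parse every CSV once, keyed by file path, with X and y already separated."""
    datasets = {}
    for path in csv_paths:
        df = pd.read_csv(path)
        datasets[path] = {"X": df[x_features], "y": df[y_feature]}
    return datasets

Each test_case would then carry an extra key (for example, an illustrative dataset: "my_data.csv") pointing into this dictionary.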

array to dataframe reindexing causes other issues

@rjacobs914
Question for Ryan:
Line 322 in DataParser.py

def _array_to_dataframe(cls, array):
    # Note: the index is explicitly set to start at 1 instead of pandas' default of 0.
    dataframe = pd.DataFrame(data=array, index=range(1, len(array)+1))
    return dataframe

So if I read in an array from a CSV, the indexing defaults to starting at 0.
Then if I normalize the dataframe's x_features with FeatureNormalization, the indexing is set to start at 1 in line 322.
So when I try to recombine the normalized x_features with the previous array (which might have other columns in addition to the x_features), the indexing is all off. Reindexing doesn't work, because index 0 of the normalized data is just null.

Was there a compelling reason to start the indexing explicitly at 1? Would it work to not set the index to anything and let pandas index automatically?
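For reference, a small generic pandas example (not MAST-ML code) showing why the recombination fails: pandas aligns on the index, so a 0-based frame and a 1-based frame no longer line up.

import pandas as pd

raw = pd.DataFrame({"x": [1.0, 2.0, 3.0], "extra": ["a", "b", "c"]})    # default 0-based index
normalized = pd.DataFrame({"x": [0.0, 0.5, 1.0]}, index=range(1, 4))    # index starts at 1

combined = pd.concat([raw["extra"], normalized["x"]], axis=1)
print(combined)
# Row 0 has NaN for x and row 3 has NaN for extra -- the rows no longer match up.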

Add keyword for input features to read-in from csv using pandas?

Right now the user manually enters the names of the features. We could add a keyword, e.g. "Auto", that is passed to a (future) modified data_parser class that uses pandas to automatically extract all features based on their column names, excluding whatever column is listed as the y_values. A rough sketch of that behavior is below.
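A minimal sketch (generic pandas, hypothetical function name) of what the "Auto" keyword could do:

import pandas as pd

def auto_x_features(csv_path, y_feature):
    """Treat every column except the target column as an input feature."""
    df = pd.read_csv(csv_path)
    x_features = [col for col in df.columns if col != y_feature]
    return df[x_features], df[y_feature]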

Input file model list

Move the list of all available models and their input parameters out of the input file and into the documentation.

Permission error when save path not set as current directory

If save_path is set to "./anything", I see a permission error when writing MASTMLlog.log:

File "C:\ProgramData\Anaconda2\envs\ML\lib\shutil.py", line 544, in move
os.rename(src, real_dst)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another pro
cess: 'C:\Users\Ben\Documents\GitHub\mast-ml-data\DATA_solute_diffusion_class_update\MASTMLlog.log
' -> 'C:\Users\Ben\Documents\GitHub\mast-ml-data\DATA_solute_diffusion_class_update\test_paper_nor
malization\MASTMLlog.log'

This may be related to the fact that the code appears to write MASTMLlog.log to both the specified save directory and the working directory the code is run from. One possible mitigation is sketched below.
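A minimal sketch of a possible mitigation, assuming the log is written through Python's logging module (the names here are generic, not MAST-ML's actual logging setup): close the file handler before moving the log so Windows releases its lock on the file.

import logging
import shutil

def move_log(logger_name, src, dst):
    """Close any open file handlers on the logger, then move the log file."""
    logger = logging.getLogger(logger_name)
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            handler.close()
            logger.removeHandler(handler)
    shutil.move(src, dst)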
