sp8rks / materialsinformatics

MSE5540/6640 Materials Informatics course at the University of Utah

License: MIT License

Jupyter Notebook 99.93% Python 0.07%

materialsinformatics's Introduction

MaterialsInformatics

MSE5540/6640 Materials Informatics course at the University of Utah

This GitHub repo contains coursework content such as class slides, code notebooks, homework assignments, literature, and more for MSE 5540/6640 "Materials Informatics", taught at the University of Utah in the Materials Science & Engineering department.

Below you'll find the approximate calendar for Spring 2024. Videos of the lectures are being posted to the following YouTube playlist: https://youtube.com/playlist?list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0

Month	Day	Subject to cover	Assignment	Link
Jan	9	Syllabus. What is machine learning? How are materials discovered?	Install software packages together in class
Jan	11	Machine Learning vs Materials Informatics. In-class example of fitting Hall-Petch data with a linear model	Read 5 High Impact Research Areas in ML for MSE (paper1). Read ISLP Chapter 3, especially Section 3.1	paper1, ISLP
Jan	16	Materials data repositories, get pymatgen running for everybody, examples of MP API, MDF, NOMAD, others	Create a new env and make sure you can get the notebooks in the "worked examples/MP_API_example" and "worked examples/foundry" folders running.	Materials Project API
Jan	18	Machine Learning Tasks and Types, Featurization in ML, Composition-based feature vector	Read Is domain knowledge necessary for MI (paper1). Make sure you can get the CBFV_example notebook running in the "worked examples/CBFV_example" folder	paper1
Jan	23	Classification and cross-validation	Read ISLP Sections 4.1-4.5 and Section 5.1. Run through classification notebook	ISLP
Jan	25	Structure-based feature vector, crystal graph networks, SMILES vs SELFIES, 2pt statistics	Read SELFIES (paper1), two-point statistics (paper2), and intro to graph networks (blog1)	paper1, paper2, blog1
Jan	30	Simple linear/nonlinear models. test/train/validation/metrics	Read linear vs non-linear (blog1), read best practices (paper1), benchmark dataset (paper2), and loco-cv (paper3).	blog1, paper1, paper2, paper3
Feb	1	In-class examples of featurization	Run through 2pt statistics, GridRDF, CBFV notebooks	HW1 due!
Feb	6	Ensemble models, ensemble learning	Read ensemble (blog1) and ensemble learning (paper1)	blog1, paper1
Feb	8	Extrapolation, support vector machines, clustering	Read extrapolation to extraordinary materials (paper1), clustering (blog1), SVMs (blog2)	paper1, blog1, blog2
Feb	13	Artificial neural networks	Read the introduction to neural networks (blog1, blog2)	blog1, blog2
Feb	15	Advanced deep learning (CNNs, RNNs)	HW2 due. Read…	blog1, blog2
Feb	20	Transformers	Read the introduction to transformers (blog1, blog2)	blog1, blog2
Feb	22	Generative ML: Generative Adversarial Networks and variational autoencoders	Read about VAEs (blog1, blog2, repo1) and GANS ()	blog1, blog2, repo1
Feb	27	Diffusion models and Image segmentation	Read U-net (paper1) and nuclear forensics (paper2)	CrysTens repo
Feb	29	Image segmentation part 2 and in-class coding examples	Download CrysTens GitHub repo, read Segment Anything Model (paper3)	paper1, paper2, paper3
Mar	5	NO CLASS, spring break
Mar	7	NO CLASS, spring break
Mar	12	Bayesian Inference	Read the introduction to Bayesian (blog1), go through Naive Bayes notebook	blog1
Mar	14	Gaussian Processes and Bayesian Optimization
Mar	19	Case study: Superhard materials, structure prediction	Read superhard (paper1), and structure prediction papers (paper2)	paper1, paper2
Mar	21	Case study: CGCNN vs MEGNet vs SchNet	Read CGCNN (paper1), MEGNet (paper2), SchNet (paper3)	paper1, paper2, paper3
Mar	26	Case study: CrabNet vs Roost	Read CrabNet (paper1) and Roost (paper2)	paper1, paper2
Mar	28	Case study: Cococrab, BRDA	HW4 due. Read Cococrab (paper1) and BRDA (paper2)	paper1, paper2
Apr	2	Large Language Models part 1	TBD	TBD
Apr	4	Large Language Models part 2	TBD	TBD
Apr	9	Case study: Element Mover’s Distance, Mat2Vec	Read Element mover’s distance (paper1) and Mat2Vec (paper2)	paper1, paper2
Apr	11	Case study: Discover algorithm, Robocrystallographer	TBD	TBD
Apr	16	Final project presentation day 1	Final Project due
Apr	18	Final project presentation day 2	Final Project due

I recommend the book An Introduction to Statistical Learning (the "ISLP" readings in the calendar above), available at https://www.statlearning.com/

materialsinformatics's Issues

Add HW2 and HW3 PDFs

Nice distill resource for Bayesian optimization

Not one that was covered before IIRC:

https://distill.pub/2020/bayesian-optimization/

HW2, task 2: supposed to say "using the arbitrary cut-off value of 10^-2 Ω cm for electrical resistivity?"

Says conductivity

(b) task 2. Using a support vector machine classifier and the composition-based feature vector (magpie descriptor set), construct a model that will categorize materials as metals or insulators using the arbitrary cut-off value of 10^-2 Ω cm for electrical conductivity. The model should take chemical formula as an input and take into account temperature as a feature (300, 400, 700, and 1000K).
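
The distinction matters numerically: Ω cm is a unit of resistivity, and a resistivity cut-off of 10^-2 Ω cm corresponds to a conductivity of 10^2 S/cm. Below is a minimal sketch of the labeling step only. It assumes that materials below the resistivity cut-off are labeled metals and everything else insulators; the function names, the boundary handling, and the example values are illustrative assumptions, not taken from the assignment.

```python
RESISTIVITY_CUTOFF_OHM_CM = 1e-2  # arbitrary cut-off quoted in the task text


def resistivity_from_conductivity(conductivity_s_per_cm: float) -> float:
    """Convert electrical conductivity (S/cm) to resistivity (ohm cm)."""
    return 1.0 / conductivity_s_per_cm


def label_material(resistivity_ohm_cm: float) -> str:
    """Assumed convention: below the cut-off is a metal, otherwise an insulator."""
    if resistivity_ohm_cm < RESISTIVITY_CUTOFF_OHM_CM:
        return "metal"
    return "insulator"


# Made-up conductivities (S/cm), purely for illustration.
good_conductor = label_material(resistivity_from_conductivity(5.0e3))
poor_conductor = label_material(resistivity_from_conductivity(1.0e-3))
print(good_conductor, poor_conductor)
```

These labels would then serve as the classification target, with the composition-based features and temperature as inputs.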

hw1, pb 2, task 5: group by chemical formula hint/example

See groupby_formula and https://stackoverflow.com/a/49216427/13697228

HW2 suggestion - comparison with non-group-CV results

For next time, it might be good to have students run an apples-to-apples comparison to see how much worse the results are when group-CV is used.
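
One way such a comparison could look is sketched below using scikit-learn's `KFold` and `GroupKFold`. Everything about the data is an assumption made for illustration: it is synthetic, each "formula" (group) appears several times with nearly identical features, and the target is a per-group value unrelated to the features. Any skill under plain K-fold therefore comes only from seeing the same group in the training set. The 1-nearest-neighbor model is a stand-in, not the model used in the homework.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_groups, rows_per_group = 40, 5

# Synthetic stand-in data: one group per "formula", several rows per group
# (e.g. the same formula measured at different temperatures).
group_features = rng.normal(size=(n_groups, 3))
group_targets = rng.normal(size=n_groups)  # unrelated to the features by design
groups = np.repeat(np.arange(n_groups), rows_per_group)
X = group_features[groups] + 0.01 * rng.normal(size=(groups.size, 3))
y = group_targets[groups] + 0.01 * rng.normal(size=groups.size)

model = KNeighborsRegressor(n_neighbors=1)

# Plain K-fold: rows from the same group can land in both train and test.
plain_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2"
)

# Group K-fold: all rows of a group stay on the same side of the split.
group_scores = cross_val_score(
    model, X, y, cv=GroupKFold(n_splits=5), groups=groups, scoring="r2"
)

print("plain K-fold mean R2:", plain_scores.mean())
print("group K-fold mean R2:", group_scores.mean())
```

Reporting both numbers side by side for the same model and features makes the effect of group leakage visible.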

Suggestion for hw 1, pb 1 next time course is taught: use of Zotero shared group folder and shared annotations for literature extraction

To help make the literature data extraction FAIR, a Zotero group library where PDF files, annotations, etc. are shared between group members can help by:

keeping all the references in one place
highlights/annotations are shared and are easily made searchable, plus "click link to go directly to annotation" (makes curation easier)
easy to make the list of references public, and the copyrighted files along with the annotations can be easily shared upon request

(planning to update later with some examples/images)

HW 1 general suggestions - `webplotdigitizer` and `MPRester` tips

Problem 1

when you have variables in the chemical formula, pay extra attention to which formulas correspond to which values of x. For example, in some plots, x=0 --> x=1.0 might go from the top to the bottom, whereas in others x=0 starts from the bottom.
pay attention to units, e.g. Kelvin vs. Celsius, 10^4 S/m vs. S/m vs. S/cm; make sure units are converted correctly based on what's listed on the spreadsheet, e.g. electrical conductivity: S*cm^-1.
I found it useful to add all images (where the image includes figure caption) to a single session in webplotdigitizer, and use "Point Groups" corresponding to each of the chemical formulas if grabbing multiple traces from an image. Additionally, if multiple types of data were in the same figure, I made copies of each figure and named them e.g. fig5-electrical-conductivity.png, fig5-thermal-conductivity.png ... even though they're the exact same figure. This makes it easier to retain the caption and separate calibrations for each dataset. Rename your dataset appropriately, e.g. electrical-conductivity to make it easier to keep track and so that when you export the CSV it auto-populates the name.
a trick to using "Point Groups" (not the crystallographic kind) is to add a group for each composition (e.g. Cu0.98GaTe2, Cu0.985GaTe2, Cu0.99GaTe2, CuGaTe2) and then select the points in order for a given temperature. For example, click points in the following order:

1. Cu0.98GaTe2@300K
2. Cu0.985GaTe2@300K
3. Cu0.99GaTe2@300K
4. CuGaTe2@300K
5. Cu0.98GaTe2@400K
6. Cu0.985GaTe2@400K
...
...
16. CuGaTe2@800K

Then click on your dataset, View Data, and Sort By --> Groups (dropdown). You can also export to CSV from this interface.
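
The click order above can also be checked programmatically after export. This is a minimal sketch, assuming clicks cycle through the compositions in the listed order and that 16 clicks over 4 compositions means 4 temperature steps. The helper name `label_for_click` is hypothetical and not part of webplotdigitizer.

```python
compositions = ["Cu0.98GaTe2", "Cu0.985GaTe2", "Cu0.99GaTe2", "CuGaTe2"]


def label_for_click(click_number, compositions):
    """Map a 1-based click number to (composition, 0-based temperature step)."""
    index = click_number - 1
    return compositions[index % len(compositions)], index // len(compositions)


# Clicks 1-16 as in the list above: four compositions at each temperature step.
labels = [label_for_click(n, compositions) for n in range(1, 17)]
for number, (composition, step) in enumerate(labels, start=1):
    print(number, composition, step)
```

Comparing these expected labels against the exported group column is a quick way to catch a skipped or doubled click.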

I suggest saving your images, raw CSV data, and your webplotdigitizer project (JSON and TAR format) data organized into folders based on the article, or at least save a copy of your data somewhere other than Google Sheets (e.g. your local computer) for data redundancy.
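
For the unit bookkeeping mentioned in the Problem 1 tips, small helper functions make the conversions explicit. This is a sketch with hypothetical function names; it assumes the spreadsheet expects Kelvin and S/cm, and that an axis labeled "10^4 S/m" means each digitized value must be multiplied by 10^4 to get S/m.

```python
def celsius_to_kelvin(temperature_c):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return temperature_c + 273.15


def s_per_m_to_s_per_cm(conductivity_s_per_m):
    """Convert conductivity from S/m to S/cm (1 S/m = 0.01 S/cm)."""
    return conductivity_s_per_m / 100.0


def axis_value_to_s_per_cm(axis_value, axis_multiplier_s_per_m):
    """Convert a digitized value from an axis labeled e.g. '10^4 S/m' to S/cm."""
    return s_per_m_to_s_per_cm(axis_value * axis_multiplier_s_per_m)


# A value read as 2.5 on an axis labeled "10^4 S/m" is 2.5e4 S/m.
print(axis_value_to_s_per_cm(2.5, 1e4))
print(celsius_to_kelvin(27.0))
```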

Problem 2

One of the best resources for getting an intro to MPRester is via the Materials Project workshop tutorial.

On YouTube, there is Taylor's prerecorded lecture and (what I'm pretty sure is) the corresponding video for the workshop tutorial mentioned above.

In addition to the customized examples given by Taylor in this repository, here are some additional examples "in practice" at RoboCrab (archived repo) and mat_discover.

Task 5

See https://matsci.org/t/how-to-distinguish-experimental-or-theoretical-structure-entries/2036/4
To keep track of the MPIDs that map back to a single composition (i.e. repeat chemical formulas), I suggest using df.set_index() to replace the normal indices (0, 1, 2, ...) with your mpids. Then follow the advice at https://stackoverflow.com/a/49216427/13697228. See also groupby_formula for an example that is close to what is asked.
Bonus (but not actually extra credit): assuming you added a "count" column as in the example above, you can see the number of repeats for a given chemical formula via:

grp_df.hist("count", bins=100, log=True)

In this case, the large majority of compounds have fewer than 20 polymorphs, but there is one chemical formula with 200 repeats!?
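
The grouping step described in Task 5 can be sketched with pandas as follows. This is not the `groupby_formula` example from the repository, only an approximation of the idea. The column names, the material IDs, and the property values are placeholders invented for illustration and are not real Materials Project entries.

```python
import pandas as pd

# Placeholder records; IDs and values are made up for illustration.
df = pd.DataFrame(
    {
        "material_id": ["mp-x1", "mp-x2", "mp-x3", "mp-x4"],
        "formula": ["SiO2", "SiO2", "TiO2", "SiO2"],
        "band_gap": [5.0, 6.0, 3.0, 5.5],
    }
).set_index("material_id")  # use the mpids as the row index

# Collapse repeated formulas, keeping the list of mpids that map to each one,
# a "count" of repeats, and the mean of the (placeholder) property.
grp_df = (
    df.reset_index()
    .groupby("formula")
    .agg(
        mpids=("material_id", list),
        count=("material_id", "size"),
        band_gap_mean=("band_gap", "mean"),
    )
)
print(grp_df)
```

With the "count" column in place, the histogram call shown above works directly on `grp_df`.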
