
materialsintelligence / mat2vec

Supplementary Materials for Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).

License: MIT License

Python 100.00%

mat2vec's People

Contributors: ardunn, computron, jdagdelen, johannesebke, vtshitoyan


mat2vec's Issues

Training data for original model

I was wondering, is the training data used for the original model available anywhere, since the Matscholar API is currently not available?

Trained my model using phrase2vec.py but now I want to test using that model. How?

I trained on my own corpus using the following command from the README, replacing corpus_example with my own corpus. It created a new model called model_example in the same folder as the pre-trained embeddings:
python phrase2vec.py --corpus=data/corpus_example --model_name=model_example


I then want to use this new model for testing. How do I do that? I thought I would run:

from gensim.models import Word2Vec
w2v_model = Word2Vec.load("mat2vec/training/models/model_example")

but I got an error saying:

No such file or directory: '/Users/monicapuerto/Desktop/Github/mat2vec/mat2vec/processing/models/phraser.pkl'

even though that file does indeed exist. Why does the new model depend on this file, and how do I use the model I just trained?
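
For reference, a minimal loading sketch, assuming the script is run from the repository root so that the package-relative files the loader looks for (such as phraser.pkl) resolve:

# A minimal sketch, assuming it is run from the repository root.
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/model_example")
print(w2v_model.wv.most_similar("thermoelectric", topn=5))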

Request for a step by step document on how to run the code

Would you be kind enough to first share a step-by-step document on how to run the code from https://github.com/materialsintelligence/mat2vec using Jupyter Notebook in an Anaconda3 environment, on a laptop with CPU only (i.e., without a GPU)? I am running Python 3.7.3 on Jupyter Notebook 6.0.2 in Anaconda3, with TensorFlow version 1.15.0 and Keras version 2.2.4.
(i) I have installed all the packages mentioned in requirements.txt, including "ChemDataExtractor", but I am running into issues installing "molsets". Any guidance for that?
(ii) I'd like to know which .py file(s) exactly to run, versus a sequence of .py files to run, and any other tips. For example, there are these .py files: setup.py, process.py, test-process.py, phrase2vec.py, etc. Assuming I simply want to run the model and get the output using Jupyter Notebook in an Anaconda3 environment, what exactly do I have to run, and in what order?
Once I am able to run it on my laptop, I will attempt to run it in Colab. Basically, I would appreciate knowing: assuming I use Jupyter Notebook in an Anaconda3 environment, what exactly should the steps be? Thanks.
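
For what it's worth, a minimal sketch of the usual sequence based on the repo README: install the requirements, run python setup.py install, then exercise the processing and embedding APIs from a notebook cell. The paths below assume the pretrained embeddings are in place.

# A minimal sketch based on the README; assumes mat2vec is installed and
# the pretrained embeddings exist at the path below.
from gensim.models import Word2Vec
from mat2vec.processing import MaterialsTextProcessor

text_processor = MaterialsTextProcessor()
print(text_processor.process("LiCoO2 is a common cathode material."))

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
print(w2v_model.wv.most_similar("thermoelectric", topn=5))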

TypeError: __init__() got an unexpected keyword argument 'common_terms'

Hi,
I tried to train a model and I got this error:

Traceback (most recent call last):
  File "C:\Users\T\desktop\backup\mat2vec-master\mat2vec\training\phrase2vec.py", line 164, in <module>
    sentences, phraser = wordgrams(processed_sentences,
  File "C:\Users\T\desktop\backup\mat2vec-master\mat2vec\training\phrase2vec.py", line 43, in wordgrams
    phrases = Phrases(
TypeError: __init__() got an unexpected keyword argument 'common_terms'
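
This error typically means the installed gensim is 4.x, where the Phrases keyword common_terms was renamed to connector_words. A hedged sketch of the two ways out: pin the gensim version listed in requirements.txt, or adapt the call for gensim 4.x:

# A hedged sketch for gensim >= 4.0, where `common_terms` became
# `connector_words`; older gensim versions accept the original keyword.
from gensim.models.phrases import ENGLISH_CONNECTOR_WORDS, Phrases

sentences = [["thin", "films", "of", "LiCoO2"], ["band", "gap", "of", "TiO2"]]
phrases = Phrases(sentences, min_count=1, connector_words=ENGLISH_CONNECTOR_WORDS)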

Model missing from the training folder

I was looking for the word2vec model specifically. According to the code on the repository's front page, the model is supposed to be at "mat2vec/training/models/pretrained_embeddings", but I'm unable to find it.

Training my own word embeddings

Hi,
I am trying to train my own word embeddings on my own corpus using your model. Could you please walk me through, step by step, how to do that?
I ran the model according to the instructions in the README file and it works.
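
For reference, training on a custom corpus follows the same pattern as the README command, pointing --corpus at your own file (the paths here are placeholders):

python phrase2vec.py --corpus=data/my_corpus --model_name=my_model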

Prediction of (new) thermoelectric materials

First of all thank you so much for sharing all this! I found the paper and the associated results very exciting!

I tried to reproduce Fig. 2a using your pre-trained model. I first printed the tokens most similar to "thermoelectric" (highest cosine similarity). Then I used one of your processing functions (in the process script) to keep only "simple chemical formulae". And finally, as you mention in the paper, I removed the formulae appearing fewer than 3 times in the corpus.

However, I ended up with a lot of noise in my list compared to yours. I got the same first 2 predictions, but then I also had formulae like Bi20Se3Te27 or SSe4Sn5 in the top 10. Just to give you an idea of the amount of noise: PbTe, which is 3rd in your list, is 92nd in mine.

So what am I missing?

Thank you in advance!
Anita
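
For context, a rough sketch of the reproduction attempt described above, not the authors' exact pipeline. is_simple_formula is assumed to be the processing method used to filter formulae, and the frequency filter from the paper is omitted.

# A rough sketch, not the authors' exact pipeline. is_simple_formula is
# assumed to be the MaterialsTextProcessor method for filtering formulae;
# the "appears >= 3 times" filter from the paper is omitted for brevity.
from gensim.models import Word2Vec
from mat2vec.processing import MaterialsTextProcessor

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
processor = MaterialsTextProcessor()

formulae = [word for word, score in w2v_model.wv.most_similar("thermoelectric", topn=2000)
            if processor.is_simple_formula(word)]
print(formulae[:10])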

Drug repurposing for COVID

Hello @jdagdelen. You mentioned in an earlier thread that Mat2Vec could be used for drug repurposing. Experiments are currently underway to run simulations on 8,000 FDA-approved drugs, and 77 of those compounds have been selected as likely candidates for COVID treatment. https://onezero.medium.com/the-worlds-most-powerful-supercomputer-has-entered-the-fight-against-coronavirus-3e98c4d67459

If we use Mat2vec for the same drug repurposing task, the intersection of our work and theirs might yield a shortened list of candidate compounds. What do you think about getting started on this? I can use the program but I don’t have a deep enough understanding of it to figure out how to discover new knowledge from it. Would you like to collaborate?

Other applications

How can this software be applied to other research areas? Space travel/propulsion, physics, history, etc? Thank you!

Setup requirements error

Had an issue (from within a fresh conda environment) running:

pip install -r requirements.txt

After successfully compiling packages, installation ultimately failed with:

...
Installing collected packages: monty, regex, urllib3, requests, unidecode, ruamel.yaml, pydispatcher, tabulate, spglib, palettable, pymatgen, pycryptodome, pdfminer.six, python-crfsuite, cssselect, appdirs, DAWG, chemdataextractor, jmespath, botocore, s3transfer, boto3, smart-open, gensim, tqdm
  Found existing installation: urllib3 1.12
ERROR: Cannot remove entries from nonexistent file /xxx/.conda/envs/tshitoyan/lib/python3.6/site-packages/easy-install.pth

The solution was to re-run pip with the --ignore-installed option:

pip install --ignore-installed -r requirements.txt

Beyond that, everything installed and ran as expected.
Hope this saves some headaches for others.

Great work!

Doug

Script to fetch cleaned abstracts

I noticed you've nicely provided the DOIs, but a simple pull will fetch the article as raw HTML. Might you have a recommendation for grabbing the cleaned abstracts the way you did? There are also the several dataset splits that you mentioned being quite influential on the final result.
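
A hedged sketch of the cleaning half of this (fetching the text behind a DOI is publisher-specific and not shown): once raw abstract text is in hand, the repo's MaterialsTextProcessor applies the same cleaning used for the training corpus.

# A hedged sketch; process() is assumed to return (tokens, materials).
from mat2vec.processing import MaterialsTextProcessor

processor = MaterialsTextProcessor()
raw_abstract = "We report a Seebeck coefficient of 185 uV/K for PbTe at 300 K."
tokens, materials = processor.process(raw_abstract)
print(tokens)
print(materials)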

No module named 'helpers' error when loading newly trained embeddings

Dear all,
I've trained my own embeddings and I'm now trying to open them with your mat2vec tool using

 w2v_model = Word2Vec.load(....)

however, I get a strange error about a module named helpers. The same problem does not happen with the pre-trained models you provide:

>>> w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
>>> w2v_model = Word2Vec.load("mat2vec/training/models/test_model")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/word2vec.py", line 975, in load
    return super(Word2Vec, cls).load(*args, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 629, in load
    model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 278, in load
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/utils.py", line 425, in load
    obj = unpickle(fname)
  File "/anaconda3/envs/mat2vec/lib/python3.6/site-packages/gensim/utils.py", line 1332, in unpickle
    return _pickle.load(f, encoding='latin1')
  File "/Users/lfoppiano/development/github/mat2vec/mat2vec/training/__init__.py", line 1, in <module>
    from helpers import utils
ModuleNotFoundError: No module named 'helpers'
>>> 

Any suggestion?
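
One hedged workaround sketch: unpickling a model trained by phrase2vec.py executes mat2vec/training/__init__.py, whose `from helpers import utils` only resolves when that directory is on sys.path, so adding it before loading may help (assuming the repository root is the working directory):

# A hedged workaround sketch; assumes the repository root is the working
# directory, so mat2vec/training can be added to sys.path for `helpers`.
import sys
sys.path.insert(0, "mat2vec/training")

from gensim.models import Word2Vec
w2v_model = Word2Vec.load("mat2vec/training/models/test_model")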

Another question: after training, I have only the accuracies, loss, and phraser files in my model output directory. I copied the files containing the vectors, the trainables, and the actual model from the tmp directory (I took the ones from epoch 29). Was this the correct way?

.rw-r--r-- lfoppiano staff 286.4 MB Mon Aug  5 11:33:58 2019   test_model
.rw-r--r-- lfoppiano staff   3.2 GB Mon Aug  5 11:33:24 2019   test_model.trainables.syn1neg.npy
.rw-r--r-- lfoppiano staff   3.2 GB Mon Aug  5 11:39:55 2019   test_model.wv.vectors.npy
.rw-r--r-- lfoppiano staff   7.5 KB Mon Aug  5 11:26:21 2019   test_model_accuracies.pkl
.rw-r--r-- lfoppiano staff   389 B  Mon Aug  5 11:27:07 2019   test_model_loss.pkl
.rw-r--r-- lfoppiano staff    53 MB Mon Aug  5 11:40:00 2019   test_model_phraser.pkl

Some questions about the use of chemdataextractor tool

Hello, I found your article very interesting, and I have tried some similar work myself; I have some questions I would like your advice on. When using the ChemDataExtractor tool, extracting chemical formulae from about 200,000 abstracts is very slow. May I ask how you carried out the chemical formula labeling in the abstracts? Or is there a way to speed up the process? Thank you very much!
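
A generic speed-up sketch (not necessarily the authors' pipeline): ChemDataExtractor tagging is CPU-bound, so the abstracts can be processed in parallel.

# A generic parallelization sketch using ChemDataExtractor's Document API.
from multiprocessing import Pool

from chemdataextractor.doc import Document

def extract_cems(abstract):
    # Chemical entity mentions found in one abstract.
    return [span.text for span in Document(abstract).cems]

if __name__ == "__main__":
    abstracts = ["We synthesized LiFePO4 thin films.",
                 "TiO2 nanoparticles show enhanced photocatalytic activity."]
    with Pool() as pool:
        print(pool.map(extract_cems, abstracts))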

Formatting Abstracts

Is there any special text formatting that needs to be applied to abstracts before training? I noticed the corpus example has % and <nUm> in various places. Just wondering whether formatting matters at all, or whether you can dump the plain text of abstracts into a corpus file.
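
As a hedged illustration, the repo's MaterialsTextProcessor performs the preprocessing visible in the corpus example, including the <nUm> substitution, so raw abstract text would normally go through it first:

# A hedged illustration; the number 3.2 is expected to come out as <nUm>.
from mat2vec.processing import MaterialsTextProcessor

processor = MaterialsTextProcessor()
tokens, materials = processor.process("The band gap of TiO2 is 3.2 eV.")
print(tokens)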

Question about target and context words

I have a question about your research approach communicated in Nature. You use the phrases "target word" and "context word" there. Normally, in the skip-gram model, the embedding for the "target word" (input layer) is different from the embedding for the "context word" (output layer). In gensim, if you use model.wv.most_similar you are effectively searching for similar words using embeddings from the input layer. You can also access "context word" embeddings via model.syn1neg. Were you using both embeddings when analyzing, e.g., the relation between a chemical compound and "thermoelectric"?
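
For reference, a hedged sketch of reaching both embedding matrices in gensim 3.x (the API generation the repo targets):

# A hedged sketch for gensim 3.x: model.wv.vectors holds input ("target")
# embeddings; model.trainables.syn1neg holds output ("context") embeddings.
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
idx = model.wv.vocab["thermoelectric"].index
target_vec = model.wv.vectors[idx]
context_vec = model.trainables.syn1neg[idx]
print(np.dot(target_vec, context_vec))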

Problems training the model

Dear community,

I'm having a problem running the command to train the model on the corpus example. The following error messages are printed.

:228: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
2022-06-10 00:06:51,569 : INFO : Basic min_count trim rule for formula.
2022-06-10 00:06:51,569 : INFO : Not including extra phrases, option not specified.
Traceback (most recent call last):
  File "/home/lucas_bandeira/Documents/mat2vec/mat2vec/training/phrase2vec.py", line 165, in <module>
    sentences, phraser = wordgrams(processed_sentences,
  File "/home/lucas_bandeira/Documents/mat2vec/mat2vec/training/phrase2vec.py", line 44, in wordgrams
    phrases = Phrases(
TypeError: __init__() got an unexpected keyword argument 'common_terms'

Could somebody help me to solve this problem?

Sincerely yours,

Question about the outputs

I have successfully installed mat2vec in a conda environment (python=3.6) and tried to reproduce your work. However, when I followed the directions under Processing and Pretrained Embeddings, no output appeared (and no error message).

For example, I ran the example on test.txt in the root folder of this repository; after several seconds of running, there were no outputs.


I am new to machine learning and hope to get your guidance. Thank you!
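
A hedged guess at the cause: the README snippets are written for an interactive session, which echoes return values automatically; a script shows nothing unless you print explicitly.

# A hedged sketch; in a script, output only appears with an explicit print.
from mat2vec.processing import MaterialsTextProcessor

processor = MaterialsTextProcessor()
result = processor.process("Thin films of SiO2 were deposited.")
print(result)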

About the final word embeddings.

I just trained a model on my own corpus. The corpus contains space group numbers, which I replaced with 'Xx1, Xx2, ..., Xx229, Xx230' to avoid overlap with element names. But when I try to get the final embeddings from the model, it says some space group numbers (Xx105, Xx139, etc.) are not in the vocabulary, regardless of their frequency. Why is this happening? I've tried to look through the code and couldn't figure it out.
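
A hedged diagnostic sketch (the model path below is a placeholder): check whether the preprocessing changed the tokens before training, e.g. lowercasing or phrase-merging, which would remove the original form from the vocabulary even for frequent tokens.

# A hedged diagnostic sketch; the model path below is a placeholder.
from gensim.models import Word2Vec

model = Word2Vec.load("mat2vec/training/models/model_example")
for token in ["Xx105", "Xx139"]:
    print(token, token in model.wv.vocab)
    print(token.lower(), token.lower() in model.wv.vocab)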
