Comments (13)
Hi Anita, sorry for the late response. Your method sounds right, except we used output embeddings for materials. One way you could achieve this is to get the word embedding vector for "thermoelectric" and find the most similar words to this vector among the output embeddings. This link might be useful:
https://stackoverflow.com/questions/42554289/how-can-i-access-output-embeddingoutput-vector-in-gensim-word2vec
Also make sure you are using the normalized output embeddings.
Hope this helps!
from mat2vec.
Hi Vahe,
Thank you for your response!
I finally figured out how to get the same list. The link was very useful!
My first mistake was to use the word "thermoelectric" instead of its output embedding in the most_similar function. After that, I was also keeping the elements with an occurrence of exactly 3, but it turns out I should only have kept the ones whose occurrence was greater than 3.
Thank you again for your help!
Anita
Hello!
I am doing a similar study for my master's project. I used output embeddings, but there is still some noise, and I have not applied any occurrence filter. I am very new to this, so I would appreciate help reproducing the exact list.
Also, I wanted to learn how two keywords ("key1" + "key2") can be used, where key1 = thermoelectrics and key2 = "any specific structure".
Hi Lokesh,
I believe you can just not supply a negative word, like so:
w2v_model.wv.most_similar(
    positive=["thermoelectric", "perovskite"],
    topn=1,
)
Hi John, thank you so much for the response. I have done the same, but there is still noise. Also, I wanted to know how the list was filtered using the occurrence threshold of three that Vahe mentioned.
Can you clarify what you mean by noisy data? You may also want to refer to the Gensim documentation on the different methods for finding similar sets of words, as there might be better functions for your needs.
Hi John,
1st issue:
I am not able to reproduce the exact list that is in the paper; there are many other chemical formulas in between the actual formulas mentioned in the paper.
2nd issue (two keywords):
Using this code,
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
mylist = w2v_model.wv.most_similar(positive=["thermoelectric", "perovskite"], topn=10)
mylist
I am getting this output,
[('thermoelectric_properties', 0.7222813367843628),
('perovskites', 0.7098876237869263),
('Ca100Co131O312', 0.6918851137161255),
('Mo150Ni3Sb270Te80', 0.6901201009750366),
('thermoelectrics', 0.6884835362434387),
('MoO6Sr2Ti', 0.6838506460189819),
('CoMoO12Sr4Ti2', 0.682638943195343),
('La4Mn5O15Tb', 0.6811503171920776),
('Ba4InO12YbZr2', 0.6792004108428955),
('Mg2(Si,Sn)', 0.6788942813873291)]
I want to remove the non-formula results (e.g. 'thermoelectric_properties', 'perovskites', 'thermoelectrics') so that I get only the chemical formulas, like in the paper.
1st issue: Are you using the provided pretrained word embeddings or are you training on your own corpus? Can you provide code examples of how you are doing the search so we can help you debug?
2nd issue: To filter out non-material embeddings, we compare the embeddings to our list of materials built using Named Entity Recognition and a rule-based parsing tool. However, it would probably not be too hard to build and train a classifier that filters out non-material embeddings using the word embeddings as input. I'm sorry we haven't made the entire pipeline of tools available to the public yet; Olga Kononova will be publishing a paper soon on the rule-based parser, and at that point we can make it public. (Sorry, I was confused: this is how we're doing it now, but for the study that this repo supports we just used a simple parser based on the pymatgen Composition object.)
You can also use process.is_simple_formula.
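To illustrate what such a filter looks like, here is a rough, self-contained stand-in (not the actual process.is_simple_formula, just a sketch with a deliberately tiny element list) that keeps only tokens which decompose entirely into known element symbols with optional counts:

```python
import re

# Deliberately tiny element set for illustration; a real check would use all symbols.
ELEMENTS = {"H", "O", "Bi", "Te", "Sr", "Ti", "Mo", "Co", "Ca", "Ni", "Sb",
            "La", "Mn", "Tb", "Ba", "In", "Yb", "Zr", "Mg", "Si", "Sn"}

# One element symbol (capital letter, optional lowercase) plus an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*\.?\d*)")

def looks_like_formula(word):
    """Rough stand-in for a simple-formula check: the word must decompose
    entirely into known element symbols (optionally followed by counts)."""
    pos = 0
    seen_element = False
    for m in TOKEN.finditer(word):
        if m.start() != pos:          # gap of unparsable characters
            return False
        if m.group(1) not in ELEMENTS:
            return False
        seen_element = True
        pos = m.end()
    return seen_element and pos == len(word)

hits = [("thermoelectrics", 0.84), ("Bi2Te3", 0.75), ("seebeck_coefficient", 0.78)]
print([w for w, s in hits if looks_like_formula(w)])  # -> ['Bi2Te3']
```

Note this naive version rejects anything with parentheses or commas, such as "Mg2(Si,Sn)", which matches the later caveat in this thread that a simple parser misses some material notations.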
Hi John,
1st issue: I am using the pretrained_embeddings downloaded from the README.md link.
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
mylist = w2v_model.wv.most_similar(positive=["thermoelectric"], topn=10)
mylist
Output:
[('thermoelectrics', 0.8435688018798828),
('thermoelectric_properties', 0.8339031934738159),
('thermoelectric_power_generation', 0.7931368947029114),
('thermoelectric_figure_of_merit', 0.7916494607925415),
('seebeck_coefficient', 0.7753844857215881),
('thermoelectric_generators', 0.7641353011131287),
('figure_of_merit_ZT', 0.7587920427322388),
('thermoelectricity', 0.7515754699707031),
('Bi2Te3', 0.7480159997940063),
('thermoelectric_modules', 0.7434878945350647)]
I have the same problem of getting non-materials; I am trying to remove these.
Thank you for the help!
- The vocabulary of a gensim model stores word counts, so you can write a simple piece of code to filter the results by word count. E.g. see this StackOverflow post: https://stackoverflow.com/questions/37190989/how-to-get-vocabulary-word-count-from-gensim-word2vec
- The paper does not use NER to filter out non-materials; it uses a simple function available in process.py (mat2vec/mat2vec/processing/process.py, line 265 in a4ae89d).
Hope this helps
Vahe
Thank You so much Vahe Sir!
@iitklokesh Sorry, I was confused. This is how we're doing it now but for the study that this repo supports we just used a simple parser based on the pymatgen Composition object. Note that the simple parsing method won't work for words/phrases like "lithium chloride" or "LMNCO".
Thank you, John!
I am using normalization for those cases.