Comments (13)
Hi Anita, sorry for the late response. Your method sounds right, except we used output embeddings for materials. One way you could achieve this is to get the word embedding vector for "thermoelectric" and find the most similar words to this vector among the output embeddings. This link might be useful:
https://stackoverflow.com/questions/42554289/how-can-i-access-output-embeddingoutput-vector-in-gensim-word2vec
Also make sure you are using the normalized output embeddings.
Hope this helps!
from mat2vec.
Hi Vahe,
Thank you for your response!
I finally figured out how to get the same list. The link was very useful!
My first mistake was to use the word "thermoelectric" instead of its output embedding in the most_similar function. After that, I was also keeping the elements with an occurrence of exactly 3, but it turns out I should only have kept the ones whose occurrence was greater than 3.
Thank you again for your help!
Anita
Hello!
I am doing a similar study for my master's project. I used output embeddings, but there is still some noise, and I have not applied any occurrence filter. I am very new to this, so I would appreciate help reproducing the exact list.
Also, I wanted to learn how two keywords ("key1" + "key2") can be used, where key1 = thermoelectrics and key2 = "any specific structure".
Hi Lokesh,
I believe you can just not supply a negative word, like so:
w2v_model.wv.most_similar(
    positive=["thermoelectric", "perovskite"],
    topn=1,
)
Hi John, thank you so much for the response. I have done the same, but there is still noise. Also, I wanted to know how the list was filtered using the occurrence threshold of three that Vahe mentioned.
Can you clarify what you mean by noisy data? You may also want to refer to the Gensim documentation on the different methods for finding similar sets of words, as there might be better functions for your needs.
Hi John,
1st issue:
I am not able to reproduce the exact list that is in the paper; there are many other chemical formulas in between the actual formulas mentioned in the paper.
2nd issue (two keywords):
Using this code,
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
mylist = w2v_model.wv.most_similar(positive=["thermoelectric", "perovskite"], topn=10)
mylist
I am getting this output,
[('thermoelectric_properties', 0.7222813367843628),
('perovskites', 0.7098876237869263),
('Ca100Co131O312', 0.6918851137161255),
('Mo150Ni3Sb270Te80', 0.6901201009750366),
('thermoelectrics', 0.6884835362434387),
('MoO6Sr2Ti', 0.6838506460189819),
('CoMoO12Sr4Ti2', 0.682638943195343),
('La4Mn5O15Tb', 0.6811503171920776),
('Ba4InO12YbZr2', 0.6792004108428955),
('Mg2(Si,Sn)', 0.6788942813873291)]
I want to remove the non-formula results (e.g. 'thermoelectric_properties', 'perovskites', 'thermoelectrics') so that I get only the chemical formulas, like in the paper.
1st issue: Are you using the provided pretrained word embeddings or are you training on your own corpus? Can you provide code examples of how you are doing the search so we can help you debug?
2nd issue: To filter out non-material embeddings, we compare the embeddings to our list of materials built using Named Entity Recognition and a rule-based parsing tool. However, it would probably not be too hard to build and train a classifier that filters out non-material embeddings using the word embeddings as input. I'm sorry we haven't made the entire pipeline of tools available to the public yet; Olga Kononova will be publishing a paper soon on the rule-based parser, and at that point we can make it public. (Sorry, I was confused: this is how we're doing it now, but for the study that this repo supports we just used a simple parser based on the pymatgen Composition object.)
You can also use process.is_simple_formula.
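To illustrate what such a filter looks like, here is a rough, self-contained stand-in (not the actual process.is_simple_formula, just a sketch with a deliberately tiny element list) that keeps only tokens which decompose entirely into known element symbols with optional counts:

```python
import re

# Deliberately tiny element set for illustration; a real check would use all symbols.
ELEMENTS = {"H", "O", "Bi", "Te", "Sr", "Ti", "Mo", "Co", "Ca", "Ni", "Sb",
            "La", "Mn", "Tb", "Ba", "In", "Yb", "Zr", "Mg", "Si", "Sn"}

# One element symbol (capital letter, optional lowercase) plus an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*\.?\d*)")

def looks_like_formula(word):
    """Rough stand-in for a simple-formula check: the word must decompose
    entirely into known element symbols (optionally followed by counts)."""
    pos = 0
    seen_element = False
    for m in TOKEN.finditer(word):
        if m.start() != pos:          # gap of unparsable characters
            return False
        if m.group(1) not in ELEMENTS:
            return False
        seen_element = True
        pos = m.end()
    return seen_element and pos == len(word)

hits = [("thermoelectrics", 0.84), ("Bi2Te3", 0.75), ("seebeck_coefficient", 0.78)]
print([w for w, s in hits if looks_like_formula(w)])  # -> ['Bi2Te3']
```

Note this naive version rejects anything with parentheses or commas, such as "Mg2(Si,Sn)", which matches the later caveat in this thread that a simple parser misses some material notations.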
Hi John,
1st issue: I am using the pretrained_embeddings downloaded from the README.md link.
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
mylist = w2v_model.wv.most_similar(positive=["thermoelectric"], topn=10)
mylist
Output:
[('thermoelectrics', 0.8435688018798828),
('thermoelectric_properties', 0.8339031934738159),
('thermoelectric_power_generation', 0.7931368947029114),
('thermoelectric_figure_of_merit', 0.7916494607925415),
('seebeck_coefficient', 0.7753844857215881),
('thermoelectric_generators', 0.7641353011131287),
('figure_of_merit_ZT', 0.7587920427322388),
('thermoelectricity', 0.7515754699707031),
('Bi2Te3', 0.7480159997940063),
('thermoelectric_modules', 0.7434878945350647)]
I have the same problem of getting non-materials; I am trying to remove these.
Thank you for the help!
- The vocabulary of a gensim model stores word counts, so you can write a simple piece of code to filter the results by word count. E.g. see this StackOverflow post: https://stackoverflow.com/questions/37190989/how-to-get-vocabulary-word-count-from-gensim-word2vec
- The paper does not use NER to filter out non-materials; it uses a simple function available in process.py (mat2vec/mat2vec/processing/process.py, line 265 in a4ae89d).
Hope this helps
Vahe
Thank You so much Vahe Sir!
@iitklokesh Sorry, I was confused. This is how we're doing it now but for the study that this repo supports we just used a simple parser based on the pymatgen Composition object. Note that the simple parsing method won't work for words/phrases like "lithium chloride" or "LMNCO".
Thank you, John!
I am using normalization for those cases.