Comments (13)

vtshitoyan commented on September 24, 2024

Hi Anita, sorry for the late response. Your method sounds right, except that we used output embeddings for materials. One way to achieve this is to take the word embedding vector for "thermoelectric" and find the words whose output embeddings are most similar to it. This link might be useful:
https://stackoverflow.com/questions/42554289/how-can-i-access-output-embeddingoutput-vector-in-gensim-word2vec
Also make sure you are using the normalized output embeddings.
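
A minimal sketch of that lookup, written in pure Python so it runs without gensim. The function names and the toy vectors are hypothetical. With a real model, the assumption is that `query_vec` would come from `model.wv["thermoelectric"]`, `output_vectors` from `model.syn1neg` (present only when the model was trained with negative sampling; gensim 3.x also exposes it as `model.trainables.syn1neg`), and `index_to_word` from `model.wv.index_to_key` (`index2word` in gensim 3.x).

```python
import math


def unit(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def most_similar_output(query_vec, output_vectors, index_to_word, topn=10):
    """Rank words by cosine similarity between an input-embedding query
    vector and each word's normalized output embedding."""
    q = unit(query_vec)
    scored = [
        (word, sum(a * b for a, b in zip(q, unit(vec))))
        for word, vec in zip(index_to_word, output_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy stand-ins for a trained model (hypothetical values).
index_to_word = ["Bi2Te3", "PbTe", "steel"]
output_vectors = [[0.9, 0.1], [0.7, 0.3], [-0.2, 1.0]]
query_vec = [1.0, 0.0]  # would be the input vector of "thermoelectric"

ranked = most_similar_output(query_vec, output_vectors, index_to_word, topn=2)
print(ranked)
```

For a full vocabulary the same computation is usually vectorized with NumPy; the loop form here is only meant to make the steps explicit.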
Hope this helps!

from mat2vec.

anita-clmnt commented on September 24, 2024

Hi Vahe,
Thank you for your response!
I finally figured out how to get the same list. The link was very useful!
My first mistake was to use "thermoelectric" instead of its output embedding in the most_similar function. After that, I was also keeping the elements with an occurrence of 3, but it seems that I should have kept only the ones whose occurrence was greater than 3.
Thank you again for your help!
Anita

iitklokesh commented on September 24, 2024

Hello!
I am doing a similar study for my master's project. I used output embeddings, but the results are still noisy. I have not applied any occurrence threshold. I am very new to this; could you kindly help me reproduce the exact list?
Also, how can two keywords ("key1" + "key2") be used together, where key1 = "thermoelectrics" and key2 = any specific structure?

jdagdelen commented on September 24, 2024

Hi Lokesh,

I believe you can just not supply a negative word, like so:

w2v_model.wv.most_similar(
    positive=["thermoelectric", "perovskite"], 
    topn=1)
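
As far as I understand gensim's behavior, passing several words in `positive` averages their unit-normalized vectors and ranks the rest of the vocabulary by cosine similarity to that mean. The pure-Python sketch below illustrates the idea on made-up 2-D vectors; `combined_query`, `most_similar_to_all`, and the embedding values are hypothetical and are not part of gensim or mat2vec.

```python
import math


def unit(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combined_query(vectors):
    """Average the unit-normalized vectors of several keywords."""
    units = [unit(v) for v in vectors]
    mean = [sum(col) / len(units) for col in zip(*units)]
    return unit(mean)


def most_similar_to_all(words, embeddings, topn=10):
    """Rank every other word by cosine similarity to the combined query."""
    query = combined_query([embeddings[w] for w in words])
    scored = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in embeddings.items()
        if word not in words  # the query words themselves are skipped
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical 2-D embeddings, for illustration only.
embeddings = {
    "thermoelectric": [1.0, 0.0],
    "perovskite": [0.0, 1.0],
    "word_a": [1.0, 1.0],
    "word_b": [1.0, 0.1],
}
top = most_similar_to_all(["thermoelectric", "perovskite"], embeddings, topn=1)
print(top)
```

Here "word_a" wins because it lies midway between the two query directions, which is the kind of word a two-keyword query is meant to surface.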

iitklokesh commented on September 24, 2024

Hi John, thank you so much for the response. I did the same, but the results are still noisy. Also, I wanted to know how the list was filtered using the number-of-occurrences cutoff that Vahe mentioned (fewer than three).

jdagdelen commented on September 24, 2024

Can you clarify what you mean by noisy data? You may also want to refer to the Gensim documentation on the different methods for finding similar sets of words, as there might be better functions for your needs.

https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors

iitklokesh commented on September 24, 2024

Hi John,
1st issue:
I am not able to reproduce the list in the paper exactly; many other chemical formulas appear between the formulas mentioned in the paper.

2nd issue (two keywords):
Using this,
from gensim.models import Word2Vec
import csv
w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
mylist=w2v_model.wv.most_similar(positive=['thermoelectric','perovskite'],topn=10)
mylist

I am getting this output,
[('thermoelectric_properties', 0.7222813367843628),
('perovskites', 0.7098876237869263),
('Ca100Co131O312', 0.6918851137161255),
('Mo150Ni3Sb270Te80', 0.6901201009750366),
('thermoelectrics', 0.6884835362434387),
('MoO6Sr2Ti', 0.6838506460189819),
('CoMoO12Sr4Ti2', 0.682638943195343),
('La4Mn5O15Tb', 0.6811503171920776),
('Ba4InO12YbZr2', 0.6792004108428955),
('Mg2(Si,Sn)', 0.6788942813873291)]
I want to remove the results that are not chemical formulas (shown in bold in the original post) and keep only the chemical formulas, as in the paper.

jdagdelen commented on September 24, 2024

1st issue: Are you using the provided pretrained word embeddings, or are you training on your own corpus? Can you provide code examples of how you are doing the search so we can help you debug?

2nd issue: To filter out non-material embeddings, we compare the embeddings to our list of materials, built using Named Entity Recognition and a rule-based parsing tool. However, it would probably not be too hard to build and train a classifier that filters out non-material embeddings using the word embeddings as input. I'm sorry we haven't made the entire pipeline of tools available to the public yet. Olga Kononova will be publishing a paper soon on the rule-based parser, and at that point we can make it public. (Sorry, I was confused. This is how we're doing it now, but for the study that this repo supports we just used a simple parser based on the pymatgen Composition object.)

2nd issue: You can use process.is_simple_formula.
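
A minimal sketch of such a filter, assuming the predicate is passed in as a function. `keep_materials` and `crude_formula_check` are hypothetical names; the regex is only a crude stand-in so the example runs without mat2vec and is not the parser used in the paper. With mat2vec installed, the predicate would presumably be the `is_simple_formula` method mentioned above (the mat2vec README shows a `MaterialsTextProcessor` class; treat the exact import as an assumption).

```python
import re


def keep_materials(results, is_material, topn=10):
    """Keep only (word, score) pairs whose word passes is_material."""
    kept = [(word, score) for word, score in results if is_material(word)]
    return kept[:topn]


def crude_formula_check(word):
    """Crude stand-in: element-like symbols with optional integer counts.
    This is NOT mat2vec's parser; it only lets the sketch run on its own."""
    return re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", word) is not None


# Results copied from the thread above (scores rounded).
results = [
    ("thermoelectric_properties", 0.722),
    ("perovskites", 0.710),
    ("Ca100Co131O312", 0.692),
    ("Mo150Ni3Sb270Te80", 0.690),
    ("thermoelectrics", 0.688),
    ("MoO6Sr2Ti", 0.684),
]
materials = keep_materials(results, crude_formula_check, topn=3)
print(materials)
```

Because many hits are discarded, it helps to request a much larger `topn` from `most_similar` than the number of materials you finally want.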

iitklokesh commented on September 24, 2024

Hi John,

1st issue: I am using the pretrained_embeddings downloaded from the link in README.md:
from gensim.models import Word2Vec
import csv
w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
mylist=w2v_model.wv.most_similar(positive=['thermoelectric'],topn=10)
mylist
Output:
[('thermoelectrics', 0.8435688018798828),
('thermoelectric_properties', 0.8339031934738159),
('thermoelectric_power_generation', 0.7931368947029114),
('thermoelectric_figure_of_merit', 0.7916494607925415),
('seebeck_coefficient', 0.7753844857215881),
('thermoelectric_generators', 0.7641353011131287),
('figure_of_merit_ZT', 0.7587920427322388),
('thermoelectricity', 0.7515754699707031),
('Bi2Te3', 0.7480159997940063),
('thermoelectric_modules', 0.7434878945350647)]
This shows the same problem: the results include non-materials, which I am trying to remove.
Thank you for your help!

vtshitoyan commented on September 24, 2024

@iitklokesh

  1. The vocabulary of a gensim model stores word counts, so you can write a short piece of code to filter the results by word count. For example, see this Stack Overflow post:
    https://stackoverflow.com/questions/37190989/how-to-get-vocabulary-word-count-from-gensim-word2vec
  2. The paper does not use NER to filter out non-materials; it uses a simple function available in process.py:
    def is_simple_formula(self, text):
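
The word-count filter from item 1 could be sketched as follows. `filter_by_count` and the toy counts are hypothetical; the "more than 3" threshold is taken from Anita's comment earlier in this thread, not from the code in the repository. How the count is read from the model depends on the gensim version, so the lookup is passed in as a function.

```python
def filter_by_count(results, count_of, more_than=3):
    """Keep (word, score) pairs whose corpus count exceeds `more_than`."""
    return [(w, s) for w, s in results if count_of(w) > more_than]


# Hypothetical counts standing in for the model vocabulary.
toy_counts = {"Bi2Te3": 5120, "Ca100Co131O312": 3, "MoO6Sr2Ti": 7}
results = [("Bi2Te3", 0.748), ("Ca100Co131O312", 0.692), ("MoO6Sr2Ti", 0.684)]

frequent = filter_by_count(results, lambda w: toy_counts.get(w, 0))
print(frequent)

# With a loaded gensim model, count_of could be one of (depending on version):
#   gensim 3.x: lambda w: w2v_model.wv.vocab[w].count
#   gensim 4.x: lambda w: w2v_model.wv.get_vecattr(w, "count")
```

Applying this after the formula filter keeps the two steps independent, so either threshold can be changed without touching the other.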

    Hope this helps
    Vahe

iitklokesh commented on September 24, 2024

Thank you so much, Vahe Sir!

jdagdelen commented on September 24, 2024

@iitklokesh Sorry, I was confused. The NER-based filtering is how we're doing it now, but for the study that this repo supports we just used a simple parser based on the pymatgen Composition object. Note that the simple parsing method won't work for words or phrases like "lithium chloride" or "LMNCO".

iitklokesh commented on September 24, 2024

Thank you, John!
I am using normalization for those cases.
