conceptnet-numberbatch's People

Contributors

joshua-chin, pawnyy, rspeer, thatandromeda


conceptnet-numberbatch's Issues

Converting sentence into a list of concepts?

When using the following code from the README:

>>> from conceptnet5.nodes import standardized_concept_uri

>>> standardized_concept_uri('en', 'this is an example')
'/c/en/be_example'

What I get instead is

{
  "uri": "/c/en/this_be_example"
}

This, of course, is not found in conceptnet-ensemble-201603-labels.txt.

Is there a "standard" way to turn an arbitrary sentence into a list of concepts?
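In the meantime, a minimal workaround sketch (my own approach, not an official API): map each token to its own concept URI instead of standardizing the whole sentence at once.

from conceptnet5.nodes import standardized_concept_uri

def sentence_to_concepts(text, lang='en'):
    # Naive whitespace tokenization; each token becomes its own concept URI.
    return [standardized_concept_uri(lang, token) for token in text.split()]

print(sentence_to_concepts('this is an example'))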

Embedding for other dimensions: 50, 100 and 200

It's really nice to see concept-enriched embeddings!

It would be even nicer to have embeddings in other dimensions, e.g. 50, 100, and 200, because many models use these smaller dimensions to prevent overfitting.
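Until such releases exist, one possible workaround (a rough sketch, assuming the 300-dimensional vectors are loaded into a NumPy array called vectors) is to reduce the dimensionality yourself, for example with truncated SVD:

import numpy as np
from sklearn.decomposition import TruncatedSVD

def reduce_dimensions(vectors, target_dim=100):
    # Project the 300-d vectors down to target_dim dimensions.
    svd = TruncatedSVD(n_components=target_dim)
    reduced = svd.fit_transform(vectors)
    # Re-normalize rows so cosine similarities remain comparable.
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.maximum(norms, 1e-12)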

Accuracy issues

Hi

I tried the ConceptNet Numberbatch pre-trained embeddings on a CNN classification task and compared the results against GloVe and word2vec.

The results with word2vec and GloVe are still better than with the ConceptNet embeddings. I was expecting better accuracy from Numberbatch.

Any advice? Am I doing anything wrong?

error while getting datasets with "git annex get"

I have completed all the setup and tried to get the datasets with the git annex command, but it says:

get code/source-data/conceptnet5.5.csv
Remote origin not usable by git-annex; setting annex-ignore
(not available)
No other repository is known to contain the file.
failed
get code/source-data/conceptnet5.csv (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.300d.npy (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.600d.npy (not available)
No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.labels (not available)
No other repository is known to contain the file.
failed
get code/source-data/glove.42B.300d.txt (not available)
No other repository is known to contain the file.
failed
get code/source-data/glove12.840B.300d.txt (not available)
No other repository is known to contain the file.
failed
get code/source-data/ppdb-xl-lexical.csv (not available)
No other repository is known to contain the file.
failed
get code/source-data/w2v-google-news.bin.gz (not available)
No other repository is known to contain the file.
failed
git-annex: get: 9 failed

Error when running ninja: shape mismatch in assignment

Hi,

When I launch ninja, I get this error:
[21/127] python3 -m conceptnet_retrofitting.builders.self_loops build...uild-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
FAILED: build-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
python3 -m conceptnet_retrofitting.builders.self_loops build-data/glove.840B.300d.ppdb-xl-lexical-standardized.npz build-data/glove.840B.300d.ppdb-xl-lexical-standardized.self_loops.npz
Traceback (most recent call last):
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 22, in <module>
    main(*sys.argv[1:])
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 16, in main
    assoc = self_loops(assoc)
  File "/media/data/roxane/Desktop/conceptnet-numberbatch/code/conceptnet_retrofitting/builders/self_loops.py", line 8, in self_loops
    assoc[diagonal] = assoc.sum(axis=1).T[0]
  File "/media/data/roxane/anaconda3/envs/numberbatch/lib/python3.7/site-packages/scipy/sparse/_index.py", line 124, in __setitem__
    raise ValueError("shape mismatch in assignment")
ValueError: shape mismatch in assignment

I used the original datasets, and when I debug, I cannot see a mismatch in shape...
Has anybody else had this issue?

Quick questions. Thanks.

Thanks for the great open-source project and embeddings!

I just have a few quick questions:

  1. The GloVe embeddings (the 840B version) have a vocabulary size of 2,196,017, while your embeddings have only 1,453,347 words. Since I am under the impression that your approach combines many resources (word2vec/GloVe/PPDB/ConceptNet), could you please clarify why yours has a much smaller vocabulary (~66%) than GloVe's?

  2. Is this because you combine words together into phrases? I found that there are lots of phrase terms in your vocabulary, like "supreme_court", "washington_dc", "san_francisco", or "natural_gas", and GloVe does not have these phrase terms.

  3. By the way, is there any possibility of releasing your embeddings as plain text files (gzipped, just like GloVe's format) instead of NumPy matrices?

Thanks again!

Can I use the embeddings in a closed-source game (through a REST server)?

I'm a little confused about how the license of your project applies to my case.

I would like to use the embeddings for my game. The embedding data would be used by a simple REST server, which would provide one method: calculating the similarity between two words.

That method would be used by my game, so the game won't use the embeddings directly, nor will I distribute them. My game will just use the calculated similarity from the REST server.
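For context, a minimal sketch of the kind of server I mean (Flask and gensim here are just my own choices for illustration, not part of this project):

from flask import Flask, jsonify, request
from gensim.models import KeyedVectors

app = Flask(__name__)
model = KeyedVectors.load_word2vec_format('numberbatch-en-17.06.txt', binary=False)

@app.route('/similarity')
def similarity():
    # e.g. GET /similarity?word1=coffee&word2=tea
    word1 = request.args.get('word1')
    word2 = request.args.get('word2')
    return jsonify({'similarity': float(model.similarity(word1, word2))})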

So, my question is: how does the ShareAlike license apply to my case? Can my game be closed source? What about my server? Do I need to release it under the CC BY-SA 4.0 license?

Do all versions occupy the same vector space?

Do all versions occupy the same vector space, so that same words between versions have similar coordinates?

This would be extremely useful when upgrading to new versions, as one wouldn't need to vectorize the entire corpus again. It becomes even more important in cases where the original source text isn't available anymore.

Thanks

Numberbatch code

Hi,

I've run the code over the whole weekend, but it still has not completed. You had mentioned it would take about a day to run... do you think something went wrong?

Thanks,

Megh

Format different from Word2Vec's format?

It looks like GloVe and word2vec have slightly different formats for their files, so I think it's a bit confusing to say that the models here are in the same format.

I noticed this when trying to load these embeddings into Gensim. Apparently the same problem exists with GloVe, and this repository offers a solution that also works for the ConceptNet embeddings: https://github.com/manasRK/glove-gensim

Basically, the first line needs to indicate the number of word embeddings in the file and the number of dimensions of the vectors. I think it'd be a good idea to at least mention this in the README.
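For files that lack the header, a minimal conversion sketch (the file paths are placeholders) would be:

def add_word2vec_header(src_path, dst_path):
    # Count the vectors and dimensions, then write a copy with the
    # "<vocab_size> <dimensions>" header that gensim's word2vec loader expects.
    with open(src_path, encoding='utf-8') as f:
        first_line = f.readline()
        dims = len(first_line.split()) - 1
        count = 1 + sum(1 for _ in f)
    with open(src_path, encoding='utf-8') as src, open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write('{} {}\n'.format(count, dims))
        for line in src:
            dst.write(line)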

Cannot get annex files

After everything else is successful, including a Git Annex installation, when I type:

cd code/data
git annex get

Nothing occurs. No downloads start. I tried this with cd code/source-data, and it does not work either. How can I download the data files?

Alternatively, is there a way to obtain the files without git annex?

downloads and term vectors not available

Your links are not working. I'm trying to download any of the following:
  • conceptnet-numberbatch-201609_uris_main.txt.gz (1928481 × 300) contains terms from many languages, specified by their complete ConceptNet URI (the strings starting with /c/en/ in the example above).

  • conceptnet-numberbatch-201609_en_main.txt.gz (426572 × 300) contains only English terms, and strips the /c/en/ prefix to provide just the term text. This form is the most compatible with other systems, as long as you only want English.

  • conceptnet-numberbatch-201609_en_extra.txt.gz (233488 × 300) contains additional single words of English whose vectors could be inferred as the average of their neighbors in ConceptNet.

  • conceptnet-numberbatch-201609.h5 contains the data in its native HDF5 format, which can be loaded with the Python library pandas.

http://conceptnet5.media.mit.edu/downloads/ is also not working.

KeyError: "word 'coffee_pot' not in vocabulary"

Hello,

I am trying to get the similarity between two words. I am using the multilingual (numberbatch-17.06.txt) and the smaller English-only (numberbatch-en-17.06.txt) word vectors.

What I have currently achieved:

model = gensim.models.KeyedVectors.load_word2vec_format('numberbatch-17.06.txt', binary=False)
print(model.vector_size)
print(model.similarity("coffee_pot", "tea_kettle"))

Results:

300
KeyError: "word 'coffee_pot' not in vocabulary"

No matter the word pairs, it never finds any similarities.

Interestingly, when I do the exact same thing with the ConceptNet smaller English-only word vector file, everything works just fine:

model = gensim.models.KeyedVectors.load_word2vec_format('numberbatch-en-17.06.txt', binary=False)
print(model.vector_size)
print(model.similarity("coffee_pot", "tea_kettle"))

Results:

300
0.5312845

For testing purposes, when I iterate over every line of these files, I get the following line counts:

  1. numberbatch-17.06.txt -> 1 917 248
  2. numberbatch-en-17.06.txt -> 417 195

This shows us that the files are just fine and contain data.

Example content of file numberbatch-en-17.06.txt:

417194 300
tea_kettle 0.0387 -0.0292 0.2034 0.0983 -0.0785 -0.0051 -0.0116 -0.1310 0.1573 0.0358 -0.1409 -0.0158 -0.0262 -0.0663 -0.0684 0.1487 0.0211 0.0157 0.0348 -0.1160 -0.0701 -0.0608 -0.0211 0.0731 0.1092 -0.0442 0.0256 0.0136 0.0202 0.0671 0.0546 -0.0398 0.0347 0.1572 0.0104 0.0684 0.0615 0.0011 0.0769 -0.0849 0.1121 -0.0146 0.0206 0.0890 0.0034 0.0998 -0.1155 -0.0272 0.1015 0.0245 -0.0029 0.0695 0.0315 0.0344 -0.1253 -0.0065 0.0318 0.0381 0.0714 0.1117 0.0643 0.0176 -0.0146 0.0323 -0.0121 0.0828 0.1397 0.0657 0.0341 -0.0022 -0.0808 -0.0102 -0.0376 -0.0665 0.0470 -0.0740 0.0475 -0.0439 -0.1397 -0.0080 -0.0162 -0.0080 -0.0090 0.0758 0.0810 0.0960 0.0251 0.0324 0.0364 -0.0174 0.0730 0.0455 0.0726 -0.0408 0.1600 -0.0330 0.0497 0.0386 0.0575 0.0502 0.0282 0.0694 0.0284 0.0106 0.0604 -0.0308 0.1479 0.0419 0.0148 -0.0838 0.0076 0.0850 -0.0081 0.0001 -0.0346 0.0440 0.0194 -0.0662 -0.0037 -0.0127 0.0501 -0.0037 -0.0433 0.0840 0.0849 -0.0227 -0.0348 -0.0678 0.0064 0.0069 -0.0961 0.0382 -0.0234 -0.0157 0.0476 0.0230 0.0274 -0.0948 -0.0189 -0.0320 0.0148 0.0048 0.0111 0.0164 -0.0060 0.0528 -0.0438 -0.0374 0.0483 -0.0509 -0.0621 -0.0944 0.0287 -0.0347 0.0426 0.0072 0.0636 -0.0269 0.0194 0.0125 0.0522 -0.0145 -0.0429 -0.0658 0.0550 -0.0563 0.0634 -0.0271 0.0067 0.0529 0.0446 0.0477 -0.0389 -0.0156 -0.0803 0.0096 -0.0045 0.0738 0.0082 0.1149 0.0426 0.0435 0.1527 0.0145 0.0287 0.0157 0.0240 -0.0163 0.0111 -0.1571 -0.0086 0.0315 0.1189 -0.0286 0.0136 -0.0009 -0.0022 -0.0620 -0.0087 -0.0087 0.0451 -0.0221 0.0440 0.0300 0.0246 -0.0211 0.0015 -0.0988 0.0207 0.0209 -0.0194 0.0085 0.0048 -0.0461 -0.0463 0.0118 0.0319 0.0644 0.0314 -0.0716 0.0013 0.0189 0.0017 -0.0892 -0.0420 -0.0389 0.0255 -0.0115 -0.0180 -0.0208 -0.0679 -0.0670 -0.0114 0.0184 0.0075 -0.0079 0.0893 0.1186 -0.0519 0.0240 0.0709 -0.0012 -0.0427 0.0180 -0.0194 0.0077 0.0242 0.0327 0.0736 -0.1041 0.0360 -0.0107 0.1080 -0.0048 0.0447 -0.0109 -0.0357 0.0029 0.0464 0.0288 0.0930 0.0280 -0.0380 -0.0303 0.0239 -0.0361 0.1058 0.0381 0.0397 0.0503 0.0488 -0.0014 -0.0189 0.0218 0.0538 0.0643 -0.0117 -0.0569 -0.0072 -0.0235 -0.0106 -0.0155 0.0249 0.0790 0.0974 -0.0126 -0.0214 -0.0303 -0.0031 -0.0403 -0.1275 0.0454 -0.0159 -0.0287 -0.0092 -0.0471 -0.0019 0.0183 -0.0509 -0.0412
coffee_pot -0.0230 0.0046 0.0981 0.1118 -0.0274 -0.0430 0.0668 -0.1377 0.1417 -0.0054 -0.1251 0.0249 -0.0319 -0.0386 -0.0870 0.1135 0.0580 0.0420 -0.0394 -0.0855 -0.1048 -0.0423 -0.0198 0.0363 0.0809 -0.0504 -0.0459 0.0026 -0.1134 -0.0098 0.0396 0.0257 0.0578 0.0409 0.1037 0.0127 0.0631 0.0111 0.0341 -0.0565 0.0457 -0.0754 0.0174 0.0017 0.0379 0.0919 0.0048 -0.0303 0.1128 -0.0517 -0.0679 0.0375 0.0068 0.0612 -0.0367 -0.0346 0.0093 0.0608 0.0587 0.0321 0.0465 -0.0551 -0.0880 -0.0569 -0.0324 0.0402 0.0586 0.0173 -0.0797 -0.0163 -0.0103 -0.0142 -0.0537 -0.0697 0.1746 -0.0507 0.0150 -0.0284 -0.1064 -0.0054 -0.0395 -0.0012 0.0224 -0.0276 -0.0227 0.0777 0.0406 0.0460 0.0104 -0.0124 -0.0179 -0.0581 0.0546 0.0230 0.1200 -0.0507 0.1206 0.0995 0.1138 0.1081 0.1309 0.1133 0.0837 0.0106 0.1533 -0.0413 0.0384 0.0320 -0.0448 0.0390 -0.0273 -0.0037 0.0100 0.1070 0.1078 -0.0111 -0.0051 -0.1064 -0.0507 -0.0184 -0.0077 -0.0425 -0.0462 0.0528 0.0964 -0.0050 0.0147 -0.0723 -0.0232 0.0427 -0.1352 0.0433 -0.0277 -0.0064 0.0547 -0.0011 0.0105 0.0018 -0.0281 -0.0369 0.0138 -0.0069 0.0185 0.0368 0.0152 0.0851 -0.0760 0.0149 0.0127 -0.0212 0.0215 -0.0758 -0.0211 -0.0327 0.0059 0.0646 0.0738 -0.0097 0.0307 -0.0074 -0.0192 0.0750 0.0092 -0.0525 0.0939 0.0345 0.0386 -0.0119 -0.0113 0.0230 0.0050 0.0099 0.0856 0.0425 -0.0634 -0.0230 0.0607 -0.0060 -0.0486 0.1053 0.0487 -0.0081 0.0836 -0.0040 0.0138 -0.1171 0.0372 0.0944 0.0219 -0.0437 0.0506 0.0204 0.1172 0.0622 -0.0056 0.0303 -0.0120 -0.0067 0.0493 -0.0059 -0.0535 -0.0646 0.0731 0.0510 -0.0589 0.0143 -0.0261 -0.1250 0.0329 -0.0203 -0.0688 -0.0065 0.0075 0.0406 -0.0259 0.0218 0.0851 0.1140 0.0471 -0.0155 -0.0035 0.0228 0.0486 -0.0672 -0.0486 -0.0427 0.0194 0.1313 -0.0559 0.1879 0.0610 0.0066 -0.0540 0.0240 0.0789 0.0820 -0.0753 0.0255 -0.0801 -0.0039 0.0454 -0.0655 0.0078 -0.0493 -0.0665 -0.0217 0.0398 0.0206 0.0275 -0.1553 0.0141 -0.0150 -0.0216 -0.0092 0.0282 0.0306 0.0238 0.0245 -0.0251 -0.0183 0.0438 0.0267 -0.0379 0.0549 0.0149 -0.0172 -0.0228 0.0316 0.0067 0.0254 0.0174 -0.0269 -0.0616 0.0822 0.0304 -0.0101 0.0323 -0.0698 0.0373 0.0479 -0.0292 0.0060 0.0129 -0.0062 -0.0005 0.0549 -0.0928 0.0237 0.0139 -0.0256 -0.0110 -0.0107 0.0545 -0.0719 -0.0023 -0.0257 -0.0343 0.0371 -0.0116 -0.1188
...etc

I am assuming the file numberbatch-17.06.txt has even more data inside (I cannot open the txt file, as it is too massive).

What might be the issue here? Why can I not get similarities between words? Am I running out of memory?

Versions

Darwin-18.6.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.15.4
SciPy 1.1.0
gensim 3.8.0
FAST_VERSION 1
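
One thing worth checking (just a guess, based on the description earlier on this page that the multilingual files label terms by their full ConceptNet URI): in the multilingual file, English terms may be stored under keys like '/c/en/coffee_pot' rather than 'coffee_pot'.

print('coffee_pot' in model.vocab)         # likely False for numberbatch-17.06.txt
print('/c/en/coffee_pot' in model.vocab)   # likely True
print(model.similarity('/c/en/coffee_pot', '/c/en/tea_kettle'))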


How do I delete this?

Large number of zero vectors

Hello,

This could be an issue with my processing, but it appears that 617,129 out of the 665,494 English vectors are zero vectors: they are defined in the labels, but have all zeros (i.e., there are only 48,365 non-zero vectors for English). I discovered this with the 300-dimension dataset. Might this be an issue with the uploaded dataset, or should I recheck my methodology? If you could confirm this is not an issue on your side using the dataset available for download, I can work on fixing it on my side.

For reference, this is the code I used to count empty vectors:

import numpy as np

# englishVectors is the matrix of English vectors loaded from the 300-d dataset.
empty = np.zeros(300)
count = 0
for each in englishVectors:
    if np.array_equal(each, empty):
        count += 1

I discovered this while trying to figure out the words closest to semi-common words.

For reference, using your code for 'most similar', the words that seem to be representative of the 'zero vectors' are the following:

['adddresse', 'rudat', 'barhydt', 'weeked', 'inovonics', 'alleppey', 'katten', 'georgievski', 'kopinski', 'waxwing', 'irin_plusnews']

Meaning of the number of # characters in subwords?

For continuation words, there are varying numbers of # signs. For example, among the first 5 words we have the following:

  • /c/de/####er
  • /c/de/###er
  • /c/de/##er

For example, if I have a word ending with "er", which one should I use?

Thanks.

Wrong link in README

In README.md:

numberbatch-17.04.txt.gz contains this data in 77 languages.
numberbatch-en-17.04.txt.gz contains just the English subset of the data, with the /c/en/ prefix removed.

Both link to 17.02 instead of 17.04.

Vector Ensembler Code

Hello

Here we see the dataset; where can I find the Numberbatch vector ensembler code?

Thank you very much.

training script for embedding

This is really great work! Now I want to add more knowledge to ConceptNet and train another embedding. I wonder if you could share the training script? Thanks in advance for any help.

Spelling Error in README

ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) than can be used directly as a representation of word meanings or as a starting point for further machine learning.

SHOULD BE:

ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning.

Happens to the best of us!

List of removed stop words

Section 3.1 of the paper (second paragraph) says some stop words were removed during pre-processing. Is there a list of the words that were removed? Some very common stop words appear to still be present, so I just wanted to be sure which ones had been knowingly removed.

Common word subset

Is it possible to download a subset of Numberbatch sorted by common words? In my application, it is computationally infeasible to load all of the words into memory.

However, a 20% subset of the most common words would solve my problems and fit into memory as well.

Please let me know if this is possible!
Nick
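
For anyone with the same constraint, a do-it-yourself sketch (assuming the English text release and the wordfreq package, which is separate from Numberbatch) that keeps only vectors for the most frequent English words:

from wordfreq import top_n_list

keep = set(top_n_list('en', 100000))  # the 100,000 most common English words

with open('numberbatch-en-17.06.txt', encoding='utf-8') as src, \
     open('numberbatch-en-subset.txt', 'w', encoding='utf-8') as dst:
    next(src)  # skip the "<count> <dims>" header line
    for line in src:
        if line.split(' ', 1)[0] in keep:
            dst.write(line)

The resulting file would need its own header line (as discussed in the format issue above) before gensim can load it.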

@paper is not recognized while importing citation

Hello,
The citation you provided in the README threw an error while I was importing it into Zotero.
I couldn't find anything related to a @paper entry; I believe the following is the correct (updated?) form.

@inproceedings{speer2017conceptnet,
  title = {{{ConceptNet}} 5.5: {{An Open Multilingual Graph}} of {{General Knowledge}}},
  url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972},
  eventtitle = {{{AAAI Conference}} on {{Artificial Intelligence}}},
  date = {2017},
  pages = {4444-4451},
  author = {Speer, Robyn and Chin, Joshua and Havasi, Catherine}
}

Using the pretrained term vectors

This is my first time using the pretrained term vectors, and I noticed the vectors are in a text file. The word2vec and Google News pretrained vectors can be loaded as a NumPy array, which in turn can optionally be read from disk with mmap_mode. Given a term, you look up a dictionary or hash table to get an index for the term, and then extract the term vector from the NumPy array using that index. I've used this successfully.

Can Numberbatch be used in a similar way, and if so, how?
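
A minimal sketch of the same approach applied to Numberbatch (the file names here are just examples): convert the text release once into a memory-mappable NumPy matrix plus a term-to-row-index dictionary.

import numpy as np

labels = {}
rows = []
with open('numberbatch-en-17.06.txt', encoding='utf-8') as f:
    n_words, n_dims = map(int, f.readline().split())  # header line, e.g. "417194 300"
    for i, line in enumerate(f):
        parts = line.rstrip().split(' ')
        labels[parts[0]] = i
        rows.append(np.array(parts[1:], dtype=np.float32))

np.save('numberbatch-en.npy', np.vstack(rows))

# Reload lazily with mmap_mode and look up a single term through the index:
vectors = np.load('numberbatch-en.npy', mmap_mode='r')
coffee_vector = vectors[labels['coffee']]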

Lemmatization for SNLI

Hi,

I would like to use your embeddings on the SNLI dataset. However, due to lemmatization, almost half of the words have no embeddings. Therefore I'd like to lemmatize the SNLI dataset.

I am wondering which lemmatization algorithm would be best to produce a dataset similar to ConceptNet Numberbatch.

Sorting by occurrence count

Hi guys! Do you think you could provide the conceptnet-numberbatch embeddings sorted by some kind of word frequency, similarly to what GloVe and FastText do? In my research I'm limiting the vocabulary to the most frequent K words in order not to eat all the GPU memory with the embedding lookup when using pretrained embeddings in my models, and the sort order used by the other embeddings makes this much easier.
