vered1986 / hypenet
Integrated path-based and distributional method for hypernymy detection
License: Other
Hi @vered1986,
when running create_resource_from_corpus.sh at line 55, which converts the textual triplets to triplets of IDs, I get an error in create_resource_from_corpus_2.py line 45: TypeError: String or Integer object expected for key, unicode found.
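For reference, bsddb-style key/value stores under Python 2 accept str (bytes) keys but reject unicode ones, which matches the TypeError above. A common workaround is to encode every key before lookup; a minimal sketch (the function name is hypothetical, written in Python 3 syntax):

```python
def to_db_key(term):
    """Encode a text key to UTF-8 bytes before using it with a
    bsddb/bsddb3-style store, which rejects unicode keys."""
    if isinstance(term, bytes):
        return term  # already encoded, pass through unchanged
    return term.encode("utf-8")

# Both text and already-encoded keys come out as bytes.
print(to_db_key("hypernym"))   # b'hypernym'
print(to_db_key(b"hypernym"))  # b'hypernym'
```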
Thanks. I just used the XML dump, but it failed with: "ValueError: Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].sent_start"
The original corpus in the paper was processed using spacy version 0.99. Using a newer spacy version creates a much larger triplet file (over 11TB, while the original file was ~900GB). For now the possible solutions are:
Use spacy version 0.99 - install using:
pip install spacy==0.99
python -m spacy.en.download all --force
Limit parse_wikipedia.py to a specific vocabulary as in LexNET.
I'm working on figuring out what happens in the newer spacy version, and writing a memory-efficient version of parse_wikipedia.py, in case the older spacy version is the buggy one, and the number of paths should in fact be much larger.
Thanks @christos-c for finding this bug!
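If it helps, the vocabulary-limiting workaround above can be as simple as filtering candidate noun-phrase pairs against a fixed term set before writing triplets, which bounds the size of the output file. A minimal sketch (names are illustrative, not the actual parse_wikipedia.py code):

```python
def load_vocabulary(lines):
    """Build the set of allowed terms, one term per line, lowercased."""
    return {line.strip().lower() for line in lines if line.strip()}

def keep_pair(x, y, vocab):
    """Keep a candidate (x, y) pair only if both terms are in-vocabulary,
    so paths are only extracted for pairs the model will ever see."""
    return x.lower() in vocab and y.lower() in vocab

vocab = load_vocabulary(["cat\n", "animal\n", "dog\n"])
print(keep_pair("cat", "animal", vocab))  # True
print(keep_pair("cat", "novel", vocab))   # False
```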
Hey Vered,
I am very interested in trying your code too, but I don't know the format of the Wikipedia dump file. Is it XML or JSON?
Hi, I got an error like this:
terminate called after throwing an instance of 'std::invalid_argument'
what(): Attempting to define parameters before initializing DyNet. Be sure to call dynet::initialize() before defining your model.
Aborted (core dumped)
What's your DyNet version? How can I fix it?
Hello, upon experimenting with the dataset I came across several examples where a hypernym relationship exists but is labelled as False (mostly novels).
Here are a few examples from the test dataset (lexical split) -
saraswatichandra novel False
pollyanna novel False
jurassic park novel False
makamisa novel False
the hunger games novel False
the secret novel False
...
You mention in the paper that the dataset was created via distant supervision and only the positives are manually audited. Could I state that the dataset is noisy and needs to be cleaned up a bit? Or are these, according to you, truly False annotations?
Thank You
Currently, only the NN parameters are saved (lookup tables, W1, b1, etc.), but the LSTM parameters are not saved.
Hello, I need to reproduce the results on a subset of your dataset, and I ran into several problems: a "pid killed" during parsing, an ASCII error in create_*_1.py, and a KeyError in create_*_2.py. Some of them are the same as @wt123u's in another issue.
I deleted the & before line 40 in create_*.sh to solve the "pid killed" problem, and added sys.setdefaultencoding('utf-8') to solve the ASCII error.
Then I hit the KeyError in create_*_2.py. I tried to solve it by moving x_id, y_id, path_id = term_to_id_db[x], term_to_id_db[y], path_to_id_db.get(path, -1) into the try block, and finally got a db file of nearly 70GB. When I train the model, it reports "Pairs without paths: 1549, all dataset: 20314". Continuing to train could damage the results, so the comparison would be unfair.
I am using the 20181201 version of the wiki dump and spacy 1.9.0. Could the different versions, or the changes above, be the reason for the KeyError? What can I do to get fair results? Thanks!