vered1986 / hypenet
Integrated path-based and distributional method for hypernymy detection
License: Other
Hi @vered1986,
when running create_resource_from_corpus.sh at line 55, which converts the textual triplets to triplets of IDs, I get an error in create_resource_from_corpus_2.py line 45: TypeError: String or Integer object expected for key, unicode found.
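For reference, bsddb-style key/value stores under Python 2 accept str (bytes) keys but reject unicode ones, which matches the TypeError above. A common workaround is to encode every key before lookup; a minimal sketch (the function name is hypothetical, written in Python 3 syntax):

```python
def to_db_key(term):
    """Encode a text key to UTF-8 bytes before using it with a
    bsddb/bsddb3-style store, which rejects unicode keys."""
    if isinstance(term, bytes):
        return term  # already encoded, pass through unchanged
    return term.encode("utf-8")

# Both text and already-encoded keys come out as bytes.
print(to_db_key("hypernym"))   # b'hypernym'
print(to_db_key(b"hypernym"))  # b'hypernym'
```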
Thanks. I just used the XML dump, but it failed with: "ValueError: Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].sent_start"
The original corpus in the paper was processed using spacy version 0.99. Using a newer spacy version creates a much larger triplet file (over 11TB, while the original file was ~900GB). For now the possible solutions are:
Use spacy version 0.99 - install using:
pip install spacy==0.99
python -m spacy.en.download all --force
Limit parse_wikipedia.py to a specific vocabulary as in LexNET.
I'm working on figuring out what happens in the newer spacy version, and writing a memory-efficient version of parse_wikipedia.py, in case the older spacy version is the buggy one, and the number of paths should in fact be much larger.
Thanks @christos-c for finding this bug!
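If it helps, the vocabulary-limiting workaround above can be as simple as filtering candidate noun-phrase pairs against a fixed term set before writing triplets, which bounds the size of the output file. A minimal sketch (names are illustrative, not the actual parse_wikipedia.py code):

```python
def load_vocabulary(lines):
    """Build the set of allowed terms, one term per line, lowercased."""
    return {line.strip().lower() for line in lines if line.strip()}

def keep_pair(x, y, vocab):
    """Keep a candidate (x, y) pair only if both terms are in-vocabulary,
    so paths are only extracted for pairs the model will ever see."""
    return x.lower() in vocab and y.lower() in vocab

vocab = load_vocabulary(["cat\n", "animal\n", "dog\n"])
print(keep_pair("cat", "animal", vocab))  # True
print(keep_pair("cat", "novel", vocab))   # False
```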
Hey Vered,
I am very interested in trying your code too, but I don't know the format of the Wikipedia dump file. Is it XML or JSON?
Hi, I got an error like this:
terminate called after throwing an instance of 'std::invalid_argument'
what(): Attempting to define parameters before initializing DyNet. Be sure to call dynet::initialize() before defining your model.
Aborted (core dumped)
What's your DyNet version? How can I fix it?
Hello, upon experimenting with the dataset I came across several examples where a hypernym relationship exists but is labelled as False (mostly novels).
Here are a few examples from the test dataset (lexical split) -
saraswatichandra novel False
pollyanna novel False
jurassic park novel False
makamisa novel False
the hunger games novel False
the secret novel False
...
You mention in the paper that the dataset was created via distant supervision and only the positives are manually audited. Could I state that the dataset is noisy and needs to be cleaned up a bit? Or are these, according to you, truly False annotations?
Thank You
Currently, only the NN parameters are saved (lookup tables, W1, b1, etc.), but the LSTM parameters are not saved.
Hello, I need to reproduce the results on a subset of your dataset, and I ran into several problems: a "pid killed" during parsing, an ASCII error in create_*_1.py, and a KeyError in create_*_2.py. Some of them are the same as @wt123u's in another issue.
I deleted the & before line 40 in create_*.sh to solve the "pid killed" problem, and added sys.setdefaultencoding('utf-8') to solve the ASCII error.
Then I hit the KeyError in create_*_2.py. I tried to solve it by moving x_id, y_id, path_id = term_to_id_db[x], term_to_id_db[y], path_to_id_db.get(path, -1) into the try block, and finally got a db file of nearly 70GB. When I train the model, it reports "Pairs without paths: 1549, all dataset: 20314". Continuing to train could damage the results, so the comparison would be unfair.
I am using the 20181201 version of the wiki dump and spacy 1.9.0. Could the different versions, or the changes above, be the reason for the KeyError? What can I do to get fair results? Thanks!