
malllabiisc / wordgcn

289 stars · 64 forks · 5.19 MB

ACL 2019: Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks

License: Apache License 2.0

C++ 6.78% Python 93.01% Makefile 0.21%
acl2019 deep-learning-tutorial gcn graph-convolutional-networks natural-language-processing tensorflow word-embeddings

wordgcn's People

Contributors

loginaway · moore3930 · parthatalukdar · punchwes · svjan5


wordgcn's Issues

Cannot compile batch_generator.cpp

Hi,

WordGCN looks interesting, but I can't compile batch_generator.cpp with the command "make" following the Readme. Also, the requirements file cannot be found in the repository.

g++ batch_generator.cpp -o batchGen.so -fPIC -shared -pthread -O3 -march=native -std=c++11
batch_generator.cpp:96:3: error: expected identifier before ‘)’ token
) {
^
makefile:2: recipe for target 'all' failed
make: *** [all] Error 1

Thanks.

"ModuleNotFoundError: No module named 'web"

Hello, I have run into a problem.
When I run "python syngcn.py -name test_embeddings -gpu 0", it fails with
"ModuleNotFoundError: No module named 'web'"
so I ran "pip install web.py", and then got
"ModuleNotFoundError: No module named 'web.embedding'".
I want to know how I can use web.embedding.
Thanks.
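
For anyone hitting this: judging from the tracebacks elsewhere in these issues, the 'web' module here is the word-embeddings-benchmarks package (https://github.com/kudkudak/word-embeddings-benchmarks), not web.py from PyPI. A minimal sketch of its API, assuming that package is installed from a source clone:

# Sketch assuming `web` is the word-embeddings-benchmarks package
# (https://github.com/kudkudak/word-embeddings-benchmarks), installed
# from a source clone, not web.py from PyPI.
from web.embeddings import load_embedding
from web.evaluate import evaluate_on_all

# Load word vectors stored in word2vec text format (path illustrative).
embedding = load_embedding('./embeddings/syngcn_embeddings.txt',
                           format='word2vec', normalize=True)

# Run the similarity/analogy/categorisation benchmarks and print the
# resulting table of scores.
results = evaluate_on_all(embedding)
print(results)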

How to build the syntactic graph

Hi, thank you for your amazing work.
However, this code doesn't seem to contain the part that builds the syntactic graph. Could you share the code for that part?
Thank you very much.
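
Not an official answer, but since the paper builds the syntactic graph from dependency parses, any dependency parser yields the edges. A minimal illustration with spaCy (the paper used Stanford CoreNLP, so this only shows the structure of the graph, not the authors' pipeline):

# Illustration only: build a labelled, directed edge list from a
# dependency parse. The paper used Stanford CoreNLP; spaCy is just a
# convenient stand-in (requires: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Bill went to the bank to withdraw money')

edges = []
for token in doc:
    if token.head is not token:  # skip the root's self-arc
        edges.append((token.head.i, token.dep_, token.i))

# Each (head_index, dependency_label, dependent_index) triple is one
# arc of the sentence's syntactic graph.
print(edges)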

Reproduction Problem

Hi @svjan5 ,

Thanks for your paper, as well as releasing the code.

I followed your current code and default settings, but after several runs it seems hard to reproduce your reported results on the test set.

My results over five runs (mean ± standard deviation) are below:

Analogy task:

 | Google | MSR | SemEval2012_2
ours | 45.16±1.61 | 49.41±0.60 | 16.28±1.73
reported | - | 52.8 | 23.4

Similarity task

 | MEN | WS353 | WS353R | WS353S | SimLex999 | RW | RG65 | MTurk | TR9856
ours | 69.99±0.19 | 58.35±0.52 | 43.51±1.68 | 70.68±0.51 | 47.61±0.29 | 37.91±0.47 | 58.19±1.33 | 59.90±0.86 | 17.23±0.26
reported | - | - | 45.7 | 73.2 | 45.5 | 33.7 | - | - | -

Categorisation task

 | AP | BLESS | Battig | ESSLI_2c | ESSLI_2b | ESSLI_1a
ours | 59.22±1.97 | 69.04±0.86 | 39.50±1.34 | 67.41±3.63 | 77.78±6.20 | 80.67±1.33
reported | 69.3 | 85.2 | 45.2 | - | - | -

As you can see, there is a large gap on tasks like SemEval2012_2 and the categorisation tasks, and the deviations on several tasks are also a little large.

I wonder where I went wrong. Forgive my carelessness; is there anything I missed?

Detailed experimental parameters settings

Hi,
Thank you for your paper, as well as releasing the code.
I followed your source code with the default settings but obtained poor results. The experimental setup is shown below:

2019-09-22 11:04:39,606 - test_embeddings_22_09_2019_11:04:39 - [INFO] - {'embed_loc': None, 'gcn_layer': 1, 'batch_size': 512, 'sample': 0.0001, 'lr': 0.001, 'config_dir': './config/', 'dropout': 1.0, 'max_epochs': 5, 'total_sents': 56974869, 'num_neg': 25, 'log_dir': './log/', 'side_int': 10000, 'log_db': 'aaai_runs', 'emb_dir': './embeddings/', 'opt': 'adam', 'onlyDump': False, 'restore': False, 'l2': 0.0, 'context': False, 'gpu': '0', 'seed': 1234, 'name': 'test_embeddings_22_09_2019_11:04:39', 'embed_dim': 300}

 | WS353S | WS353R | SimLex999 | RW | AP | Battig | BLESS | SemEval2012 | MSR
SynGCN (reported) | 73.2 | 45.7 | 45.5 | 33.7 | 69.3 | 45.2 | 85.2 | 23.4 | 52.8
our impl. | 75.4 | 39.9 | 44.7 | 30.1 | 66.8 | 44.9 | 77.0 | 21.5 | 41.3

Where did I go wrong?

About the stopping criteria

Hi @svjan5 ,

I am curious about the stopping criterion for training that you used in the paper. Is it the same as in the code, i.e. depending on the average score across all of the word similarity/analogy/categorisation tasks? I found that using the average score to save the best model introduces a lot of stochasticity: although the final average scores are similar across multiple runs, the scores on specific tasks can differ hugely. Besides, do you think it is sound to use those intrinsic tasks to select the best model during training, given that you then evaluate the model on the same tasks when comparing with other models?

Best,
Qiwei

Handling of outgoing and incoming arcs

Hello,
Thank you for your paper, as well as releasing the code.

My question is about the processing of edges in the GCN. The original paper differentiated incoming and outgoing arcs by modelling two matrices, one per direction, to avoid over-parametrising the model (when adding reversed edges). I also saw that you compute the self-loop vector, but you did not include it in the update formula.

Could you tell me if I have misunderstood the code?

Thank you

Kindly regards
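
For reference, the update being asked about (separate weight matrices for incoming arcs, outgoing arcs, and the self loop, in the style of Marcheggiani and Titov's syntactic GCN) looks roughly like the sketch below. This is one reading of that formulation, not code from this repository:

# Rough sketch of a directed GCN layer with separate parameters for
# incoming arcs, outgoing arcs, and the self loop. One reading of the
# formulation discussed above, not this repository's code.
import numpy as np

def directed_gcn_layer(H, A, W_in, W_out, W_loop):
    """H: (n, d) node features; A: (n, n) with A[i, j] = 1 iff arc j -> i."""
    msg_in = A @ (H @ W_in)      # aggregate along incoming arcs
    msg_out = A.T @ (H @ W_out)  # aggregate along (reversed) outgoing arcs
    loop = H @ W_loop            # self-loop contribution
    return np.maximum(0.0, msg_in + msg_out + loop)  # ReLU

n, d = 5, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))
A = (rng.random((n, n)) < 0.3).astype(float)
W_in, W_out, W_loop = (rng.normal(size=(d, d)) for _ in range(3))
print(directed_gcn_layer(H, A, W_in, W_out, W_loop).shape)  # (5, 8)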

Can we upload our own dataset?

Do you have scripts available, or any easy way, to convert raw data into your processed dataset files, so that I can test your model on my own dataset?

SSLError: Certificate verify failed

Hi,
Nice paper!

I had a problem while running semgcn.py.
I downloaded the pretrained 300-dimensional SynGCN embeddings from your README.md and tried to fine-tune them using semgcn.py. Here is what I typed:
sudo python3 semgcn.py -embed ./embeddings/syngcn_embeddings.txt -semantic synonyms -embed_dim 300 -name fine_tuned_embeddings -gpu 0

However, after the progress hit 100%, an error occurred:

2019-09-20 19:57:05,665 - [INFO] - E:0 (Sents: 64640/64640 [100.0]): Train Loss0.36636	fine_tuned_embeddings_20_09_2019_19:44:56	0.0

Traceback (most recent call last):
  File "/usr/lib/python3.5/urllib/request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.5/http/client.py", line 1122, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.5/http/client.py", line 1167, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.5/http/client.py", line 1118, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.5/http/client.py", line 944, in _send_output
    self.send(msg)
  File "/usr/lib/python3.5/http/client.py", line 887, in send
    self.connect()
  File "/usr/lib/python3.5/http/client.py", line 1276, in connect
    server_hostname=server_hostname)
  File "/usr/lib/python3.5/ssl.py", line 377, in wrap_socket
    _context=self)
  File "/usr/lib/python3.5/ssl.py", line 752, in __init__
    self.do_handshake()
  File "/usr/lib/python3.5/ssl.py", line 988, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib/python3.5/ssl.py", line 633, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "semgcn.py", line 613, in <module>
    model.fit(sess)
  File "semgcn.py", line 567, in fit
    self.checkpoint(train_loss, epoch, sess)
  File "semgcn.py", line 494, in checkpoint
    results		= evaluate_on_all(embedding)
  File "/usr/local/lib/python3.5/dist-packages/web-0.0.1-py3.5.egg/web/evaluate.py", line 370, in evaluate_on_all
    "TR9856": fetch_TR9856(),
  File "/usr/local/lib/python3.5/dist-packages/web-0.0.1-py3.5.egg/web/datasets/similarity.py", line 335, in fetch_TR9856
    'similarity', uncompress=True, verbose=0),
  File "/usr/local/lib/python3.5/dist-packages/web-0.0.1-py3.5.egg/web/datasets/utils.py", line 741, in _fetch_file
    handlers=handlers)
  File "/usr/local/lib/python3.5/dist-packages/web-0.0.1-py3.5.egg/web/datasets/utils.py", line 648, in _fetch_helper
    data = url_opener.open(request)
  File "/usr/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/usr/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.5/urllib/request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)>

I'm sure that my Ubuntu machine is properly connected to the Internet, so why does this happen?

Thanks!
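
A common, if blunt, workaround when a dataset mirror's certificate fails verification is to disable certificate checking for urllib before the fetch runs. This weakens security, so treat it only as a stopgap for one-off dataset downloads:

# Stopgap, not a proper fix: tell Python's urllib to skip certificate
# verification so the benchmark dataset downloads can proceed. Placing
# these lines near the top of semgcn.py (before the `web` package
# fetches its datasets) should let the download go through.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context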

No embeddings generated?

Hi,

I ran into a problem while using semgcn.py to fine-tune the provided 'syngcn_embeddings.txt':
sudo python3 semgcn.py -embed ./embeddings/syngcn_embeddings.txt -semantic synonyms -embed_dim 300 -name fine_tuned_embeddings -epoch 10 -gpu 0

Everything seemed to go well. However, after training finished (and a success message was printed), I found nothing in the ./embeddings directory except syngcn_embeddings.txt, which I had put there as the input.
In addition, the log files were written to ./log successfully, and the evaluations in them match the numbers given in the paper.

I tried several times, with the same outcome each time.
Could anyone please tell me why this happens? Thanks!

Question about the edge direction in SemGCN

In Figure 2 of your paper, the edge for the hypernym relation points water -> liquid. In the NLTK WordNet API, liquid is the hypernym of water, so why is the edge direction not water <- liquid?

Segmentation fault (core dumped)

Hi, when I run "python syngcn.py -name test_embeddings -gpu 0", I get a Segmentation fault (core dumped). Have you encountered this problem?
Thanks a lot.

urllib.error.URLError

Hello, when I run "python syngcn.py -name test_embeddings -gpu 0 -dump" I get a URL error. I want to ask how to solve this error, and how to make the run use the datasets once they have been downloaded. Thank you very much for your help.


About SemGCN embeddings

I downloaded the pretrained SynGCN embeddings from your WordGCN GitHub and then ran "python semgcn.py -embed ./embeddings/syngcn_embeddings.txt -gpu 0 -epoch 10 -name fine_tuned_embeddings", but after the model was successfully trained, I could not find the fine-tuned SemGCN embeddings. What should I do?
Thanks!

Issue with GetBatches function

I am trying to replicate your results with the same dataset you used for training. My run stops once it enters the run_epoch function in syngcn.py.
The issue seems to be in self.getBatches(shuffle); however, I ran the make command and batchGen.so was created, so I am not sure why the run stops without any error.

What does data.txt mean?

Hi,
I ran into a problem while trying to generate my own data.txt.
Specifically, I found that the shipped data.txt is not in the format you mention in the README.md, which is:
<num_words> <num_dep_rels> tok1 tok2 tok3 ... tokn dep_e1 dep_e2 .... dep_em
The lines are actually organised like this (the first line of the shipped data.txt file):
15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 6932 2293 2 1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 9|8|7 7|9|38 9|10|13 13|11|2 13|12|7 10|13|16 5|14|10 21854 21854 3 15 659 2324 0 2397 0 479 328 4 5905 7965 0
which has four parts. The first part, '15 14 15': I guess these are the counts of the latter three parts? And what do those latter three parts represent?

I re-read 'batch_generator.cpp', and it seems the last part of each line (i.e. the sequence of numbers after the dependency relations) is read but not stored.
Would it therefore work to set the first three numbers to (number of words in the sentence, number of dependency relations, 0) and leave the last part empty?

This problem has confused me for a long time... I tried setting the last part to the same values as the sentence tokens, and it kept producing a segmentation fault...

Could you please give a description of data.txt, and also update the README.md?

Thanks!
@svjan5
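
To make the documented part of the format concrete, below is a sketch of a parser for the README's description, with the undocumented third count and trailing block (exactly the parts this issue asks about) read but ignored, which is what batch_generator.cpp appears to do:

# Sketch of a parser for the documented part of a data.txt line:
#   <num_words> <num_dep_rels> tok1 ... tokn dep_e1 ... dep_em
# with each dependency edge encoded as src|dest|label. The third leading
# number and the trailing block in the shipped file are the undocumented
# parts discussed above, so they are skipped here.
def parse_line(line):
    fields = line.split()
    num_words, num_deps = int(fields[0]), int(fields[1])
    # fields[2] is the undocumented third count in the shipped data.txt.
    tokens = [int(t) for t in fields[3:3 + num_words]]
    edge_fields = fields[3 + num_words:3 + num_words + num_deps]
    edges = [tuple(int(x) for x in e.split('|')) for e in edge_fields]
    # Anything after the edges (the trailing number block) is ignored.
    return tokens, edges

line = ('15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 '
        '6932 2293 2 1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 '
        '9|8|7 7|9|38 9|10|13 13|11|2 13|12|7 10|13|16 5|14|10')
tokens, edges = parse_line(line)
print(len(tokens), len(edges))  # 15 14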

About using own text data for SynGCN and SemGCN

Your WordGCN paper is very exciting and very well written, so I want to use your code in my current work, and I would like to ask you some questions.
For training SynGCN and SemGCN: if I use other text data, such as transcripts from a speech recognition benchmark corpus (AMI), rather than the Wikipedia corpus, to obtain AMI-based SynGCN and SemGCN word embeddings, what is the first step I need to take, i.e. how should I process my own text data?
Thanks!

Shih-Hsuan

How can I replace the ~/web_data folder?

Hi,
To test and validate the embeddings, I need to "replace the original ~/web_data folder with the provided one", but I don't know how to replace the datasets in the word-embeddings-benchmarks project: the evaluation tool automatically downloads its datasets from Google Drive.
Could you provide more detailed instructions?
Thank you very much!

Segmentation fault (core dumped)

I cloned the bug-fixed code and ran it with the maximum length set to 50, 70, 90, and 150; the segmentation fault occurs as before. My TF version is gpu-1.12, running on two Tesla K80s.

Would it be possible for you to release your pre-trained model checkpoint?

Hi,

Thanks very much for your work; it's really impressive. I have managed to run the code with the default settings on the given dataset, which consists of 57 million sentences, on a Titan V, and it takes around 18 hours to get through just one epoch (I noticed that the number of negative samples is set to 100; wouldn't that be too large?). Would it be possible for you to also release a pre-trained checkpoint, and may I ask which GPU you used and what the runtime was?

Many thanks.

I need help!!!

Why do I get this error?

Traceback (most recent call last):
  File "D:\environment\Anaconda\lib\logging\config.py", line 562, in configure
    handler = self.configure_handler(handlers[name])
  File "D:\environment\Anaconda\lib\logging\config.py", line 735, in configure_handler
    result = factory(**kwargs)
  File "D:\environment\Anaconda\lib\logging\__init__.py", line 1087, in __init__
    StreamHandler.__init__(self, self._open())
  File "D:\environment\Anaconda\lib\logging\__init__.py", line 1116, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
OSError: [Errno 22] Invalid argument: 'D:\\workspace\\WordGCN-master\\WordGCN-master\\log\\test_run_15_06_2020_08:32:34'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:/workspace/WordGCN-master/WordGCN-master/syngcn.py", line 592, in <module>
    model = SynGCN(args)
  File "D:/workspace/WordGCN-master/WordGCN-master/syngcn.py", line 439, in __init__
    self.logger = get_logger(self.p.name, self.p.log_dir, self.p.config_dir)
  File "D:\workspace\WordGCN-master\WordGCN-master\helper.py", line 66, in get_logger
    logging.config.dictConfig(config_dict)
  File "D:\environment\Anaconda\lib\logging\config.py", line 799, in dictConfig
    dictConfigClass(config).configure()
  File "D:\environment\Anaconda\lib\logging\config.py", line 570, in configure
    '%r' % name) from e
ValueError: Unable to configure handler 'file_handler'

Process finished with exit code 1
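
The root cause here is almost certainly the colons in the run name: Windows forbids ':' in filenames, so the log path built from a '%H:%M:%S' timestamp cannot be created. A sketch of the kind of sanitisation that avoids this (per the traceback, the name is built before helper.py's get_logger uses it as a file path):

# Windows filenames cannot contain ':' (or < > " / \ | ? *), so a run
# name embedding a '%H:%M:%S' timestamp breaks the log file handler.
# Sanitising the name before it becomes a path avoids the OSError.
import re
from datetime import datetime

def safe_run_name(base):
    stamp = datetime.now().strftime('%d_%m_%Y_%H-%M-%S')  # '-' instead of ':'
    name = '{}_{}'.format(base, stamp)
    return re.sub(r'[<>:"/\\|?*]', '-', name)  # drop other reserved chars

print(safe_run_name('test_run'))  # e.g. test_run_15_06_2020_08-32-34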

Versus BERT

I think your work is really interesting, but does it still have relevance after the BERT model?
Your work seems similar to word2vec or GloVe, i.e. static word embeddings. Can it encode text dynamically, like BERT?

About the gating mechanism

Hi @svjan5 ,
After reading your source code, I found some places that confuse me. In the paper, the formula you give is:
[formula image from the paper omitted]
where:
[formula image from the paper omitted]
while in your code, it looks like this:

with tf.name_scope("in_arcs-%s_name-%s_layer-%d" % (lbl, name, layer)):
	inp_in     = tf.tensordot(gcn_in, w_in, axes=[2,0]) + tf.expand_dims(b_in, axis=0)
	adj_matrix = tf.transpose(adj_mat[lbl], [0,2,1])
	in_t 	   = self.aggregate(inp_in, adj_matrix)							
	if self.p.dropout != 1.0: in_t    = tf.nn.dropout(in_t, keep_prob=self.p.dropout)
	if w_gating:
		inp_gin = tf.tensordot(gcn_in, tf.sigmoid(w_gin), axes=[2,0]) + tf.expand_dims(b_gin, axis=0)
		in_act  = self.aggregate(inp_gin, adj_matrix)
	else:
		in_act   = in_t

It seems to me that the computed in_t (or inp_in) is never used when gating is enabled, which does not align with the formula, where there is a multiplication between them. The weights w_in and w_out would then never be updated in the code. Could you give me some information on how the computed in_t, or w_in and w_out, are used under the gating mechanism in your code?

Many thanks.
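
For comparison, here is one reading of the paper's edge-wise gating in plain numpy: each transformed message (W x + b) is scaled by a scalar gate sigmoid(w_g x + b_g) before aggregation. This shows the behaviour the question expects from the formula, not what the quoted TensorFlow code computes:

# One reading of edge-wise gating: each transformed message is scaled by
# a scalar gate before aggregation over the adjacency matrix. Sketch of
# the questioner's expected behaviour, not this repository's code.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_messages(X, A, W, b, w_g, b_g):
    """X: (n, d) inputs; A: (n, n) adjacency; W: (d, d); w_g: (d,)."""
    msgs = X @ W + b                    # (n, d) transformed inputs
    gates = sigmoid(X @ w_g + b_g)      # (n,) one scalar gate per node
    return A @ (gates[:, None] * msgs)  # gate each message, then aggregate

n, d = 4, 6
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d))
A = np.ones((n, n)) - np.eye(n)
out = gated_messages(X, A, rng.normal(size=(d, d)), np.zeros(d),
                     rng.normal(size=d), 0.0)
print(out.shape)  # (4, 6)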
