This project forked from zhezhaoa/ngram2vec

Four word embedding models implemented in Python, supporting arbitrary context features.

Shell 11.02% Batchfile 1.68% Makefile 0.56% C 36.48% Python 50.26%

ngram2vec

The toolkit implements the ngram2vec model proposed in the EMNLP 2017 paper "Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics" (http://www.aclweb.org/anthology/D17-1023), which aims at learning high-quality word embeddings and ngram (n-gram) embeddings.

The ngram2vec toolkit is a natural extension of word2vec: inspired by the traditional language modeling problem, it introduces ngrams into recent word representation methods. The toolkit can generate state-of-the-art word embeddings and high-quality ngram embeddings. For example, PPMI achieves an accuracy of 85+ on the semantic group of the Google analogy questions.

This toolkit may also be a good starting point for those who want to learn about word representation models. It includes SGNS, GloVe, PPMI, and SVD and organizes them in a pipeline. Arbitrary context features are supported. In terms of efficiency, it lets users build the vocabulary and co-occurrence matrix within a given memory budget. We also optimize many stages to speed up the process and reduce the disk space required.

Requirements

  • Python (both Python2 and 3 are supported)
  • numpy
  • scipy
  • sparsesvd
  • docopt

Example use cases

First, run the following commands to make some files executable:
chmod +x *.sh
chmod +x scripts/clean_corpus.sh
chmod +x word2vecf/word2vecf
chmod +x glovef/build/glove

A corpus should also be prepared. We recommend fetching it from
http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2 , a wiki corpus without XML tags. scripts/clean_corpus.sh is used to clean the corpus in this work.
For example: scripts/clean_corpus.sh WestburyLab.wikicorp.201004.txt > wiki2010.clean
A pre-processed (including segmentation) Chinese wiki corpus is available at https://pan.baidu.com/s/1kURV0rl , which can be used directly as input to this toolkit.

Run ./uni_uni.sh to see the baselines.
Run ./uni_bi.sh; PPMI of the uni_bi type gives state-of-the-art results on the Google semantic questions (85+).
Run ./bi_bi.sh to see the significant improvements achieved when ngrams are introduced into SGNS.
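
To make the script names concrete: in the uni_bi case, center words are single words while contexts include both words and bigrams (see the pairs2vocab note under "Some comments"). The following is a minimal sketch of that idea, not the toolkit's line2features code. The function names, the "_" joiner for bigrams, the window handling, and the choice to allow a context ngram to overlap the center word are all assumptions made for illustration.

```python
def ngrams_at(tokens, start, max_order):
    """Return the n-grams of order 1..max_order that begin at position `start`."""
    grams = []
    for order in range(1, max_order + 1):
        end = start + order
        if end <= len(tokens):
            # "_" as the joiner is an assumption of this sketch.
            grams.append("_".join(tokens[start:end]))
    return grams


def extract_pairs(tokens, window=2, center_order=1, context_order=2):
    """Extract (center, context) pairs from one tokenized line.

    center_order=1, context_order=2 mimics the uni_bi setting:
    unigram centers, unigram and bigram contexts.
    """
    pairs = []
    for i in range(len(tokens)):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for center in ngrams_at(tokens, i, center_order):
            for j in range(lo, hi):
                if j == i:
                    continue  # skip context ngrams starting at the center position
                for context in ngrams_at(tokens, j, context_order):
                    pairs.append((center, context))
    return pairs


if __name__ == "__main__":
    for center, context in extract_pairs("the quick brown fox".split(), window=1):
        print(center, context)
```

Setting context_order=1 in this sketch gives plain word-word pairs, which corresponds to the uni_uni baseline.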

Note that in this toolkit, we remove low-frequency words with a threshold of 100 to speed up training and evaluation. Set thr=10 to reproduce the results reported in the paper.
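
As a rough illustration of what a frequency threshold does, the sketch below counts whitespace-separated tokens and keeps those seen at least thr times. This is not the toolkit's corpus2vocab implementation; the function name build_vocab and the ">= thr" comparison are assumptions, and the toolkit may treat the boundary differently.

```python
from collections import Counter


def build_vocab(lines, thr=100):
    """Count tokens over an iterable of lines and drop the rare ones.

    Assumption of this sketch: a token is kept when its count is >= thr.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return {token: n for token, n in counts.items() if n >= thr}


if __name__ == "__main__":
    toy_corpus = ["a b a", "a c b"]
    print(build_vocab(toy_corpus, thr=2))
```

A lower threshold keeps more rare words, which enlarges the vocabulary and the co-occurrence matrix and therefore slows down the later stages.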

Workflow

Testsets

Besides English word analogy and similarity datasets, we provide several Chinese analogy datasets, which contain comprehensive analogy questions. Some of them were constructed by directly translating English analogy datasets; others are unique to Chinese. We hope they will become useful resources for evaluating Chinese word embeddings. If you have any questions, feel free to contact us. We really appreciate your advice.

Some comments

The source code is in the ngram2vec directory. We also provide a simplified implementation for tutorial purposes in the ngram2vec/simplified directory. Run demo_simplified.sh (Linux/Mac) or demo_simplified.bat (Windows) to see how this toolkit works.
corpus2vocab builds the ngram vocabulary from the corpus.
corpus2pairs extracts ngram (feature) pairs from the corpus (multi-threaded implementation); the pairs are used by the SGNS model.
line2features extracts ngram (feature) pairs from a line and is called by corpus2pairs. Add code to this file if you want to try different contexts.
pairs2vocab generates the center word vocabulary and the context vocabulary, which are used by all models. (Note that the two vocabularies are different. In the uni_bi case, the center word vocabulary contains only words, while the context vocabulary contains both words and bigrams.)
pairs2counts builds the co-occurrence matrix from pairs. We accelerate this stage by using mixed and stripes strategies. So far we have only uploaded a rough version, and we will continue improving this code.
counts2ppmi computes the PPMI matrix from counts.
counts2shuf shuffles the counts.
counts2bin converts counts into the binary format supported by glove.
word2vecf supports arbitrary context features (implemented by Yoav Goldberg) and is used to train the SGNS model. We also re-implement word2vecf in Python, which is much easier to read than the C version. One hundred lines are enough to implement word2vecf in Python (including training in multiple processes, printing detailed information, reading pairs and vocabularies, etc.).
glovef supports arbitrary context features. In the spirit of word2vecf, we implement glovef on top of glove.
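
For readers new to the pipeline, the counts-to-PPMI step can be sketched in a few lines. This is a simplified illustration using plain dictionaries, not the toolkit's counts2ppmi code: it uses the textbook definition PPMI(w, c) = max(0, log(#(w,c) * N / (#(w) * #(c)))) without any smoothing or other options the toolkit may apply, and the function name ppmi_from_pairs is hypothetical.

```python
import math
from collections import Counter


def ppmi_from_pairs(pairs):
    """Compute positive PMI scores from a list of (word, context) pairs.

    Returns a dict holding only the pairs whose PMI is strictly positive,
    i.e. a sparse representation of the PPMI matrix.
    """
    pair_counts = Counter(pairs)
    total = sum(pair_counts.values())

    # Marginal counts for center words and contexts.
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n

    scores = {}
    for (word, context), n in pair_counts.items():
        pmi = math.log(n * total / (word_counts[word] * context_counts[context]))
        if pmi > 0:
            scores[(word, context)] = pmi
    return scores


if __name__ == "__main__":
    toy_pairs = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
    for pair, score in sorted(ppmi_from_pairs(toy_pairs).items()):
        print(pair, round(score, 4))
```

Each row of the resulting sparse matrix can serve directly as a high-dimensional word representation, or be factorized (for example with SVD) to obtain dense vectors.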

References

@inproceedings{DBLP:conf/emnlp/ZhaoLLLD17,
     author = {Zhe Zhao and Tao Liu and Shen Li and Bofang Li and Xiaoyong Du},
     title = {Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics},   
     booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2017, Copenhagen, Denmark, September 9-11, 2017},      
     year = {2017}
 }

Acknowledgments

This toolkit is inspired by Omer Levy's work: http://bitbucket.org/omerlevy/hyperwords
We reuse part of his code in this toolkit. We also thank him for his kind suggestions.
We build glovef on top of glove: https://github.com/stanfordnlp/GloVe
I could not have finished this toolkit without the help of Bofang Li, Shen Li, Jianwei Cui at XiaoMi, and my tutors Tao Liu and Xiaoyong Du.

Contact us

We look forward to receiving your questions and advice about this toolkit. We will reply to you as soon as possible. We will further improve this toolkit in the next few weeks, including reimplementing word2vecf and glovef in Python and opening up the line2features interface to better support adding arbitrary features.

Zhe Zhao, [email protected] , https://zhezhaoa.github.io/
Bofang Li, [email protected]
Shen Li, [email protected]
Renfen Hu, [email protected]
