
alacarte's Introduction

ALaCarte

This repository contains code and transforms to induce your own rare-word/n-gram vectors as well as evaluation code for the A La Carte Embedding paper. An overview is provided in this blog post at OffConvex.

If you find any of this code useful please cite the following:

@inproceedings{khodak2018alacarte,
  title={A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors},
  author={Khodak, Mikhail and Saunshi, Nikunj and Liang, Yingyu and Ma, Tengyu and Stewart, Brandon and Arora, Sanjeev},
  booktitle={Proceedings of the ACL},
  year={2018}
}

Inducing your own à la carte vectors

The following are steps to induce your own vectors for rare words or n-grams in the same semantic space as existing GloVe embeddings. For rare words from the IMDB, PTB-WSJ, SST, and STS tasks you can find vectors induced using Common Crawl / Gigaword+Wikipedia at http://nlp.cs.princeton.edu/ALaCarte/vectors/induced/.

  1. Make a text file containing one word or space-delimited n-gram per line. These are the targets for which vectors are to be induced.
  2. Download source embedding files, which should have the format "word float ... float" on each line. GloVe embeddings can be found here. Choose the matching transform in the transform directory.
  3. If using Common Crawl, download a file of WET paths (e.g. here for the 2014 crawl) and run alacarte.py with that file passed to the --paths argument. Otherwise pass one or more text files to the --corpus argument. Example invocations are sketched after this list.
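
A rough sketch of the two modes (the first positional argument is assumed to name the output dump, the file names are placeholders, and the flags are the ones described above):

# induce vectors for the targets from one or more local text files
python alacarte.py my_dump --source glove.6B.300d.txt --targets targets.txt --corpus corpus.txt --matrix transform/6B.300d.bin

# or induce them from Common Crawl by passing a WET paths file instead of a corpus
python alacarte.py my_dump --source glove.6B.300d.txt --targets targets.txt --paths wet.paths --matrix transform/6B.300d.bin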

Dependencies:

Required: numpy

Optional: h5py (check-pointing), nltk (n-grams), cld2-cffi (checking English), mpi4py (parallelizing using MPI), boto (Common Crawl)
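
All of these packages are available from PyPI, so one way to set up an environment (mpi4py additionally requires an MPI implementation to be installed on the system) is:

pip install numpy h5py nltk cld2-cffi mpi4py boto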

For inducing vectors from Common Crawl on an AWS EC2 instance:

  1. Start an instance. Best to use a memory-optimized (r4.*) Linux instance.
  2. Download and execute install.sh.
  3. Upload your list of target words to the instance and run alacarte.py.

Evaluation code

Note that the code in this directory treats adding up all the embeddings of context words in a corpus as a single matrix operation. This is memory-intensive; more practical implementations should compute context vectors by simple vector addition, as in the sketch below.
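
A minimal, illustrative sketch of that more practical approach follows. The names are hypothetical and not those used in the evaluation code; embeddings is assumed to be a dict mapping words to d-dimensional numpy arrays, A an induction matrix of shape (d, d) loaded from the transform directory, and the normalization a simple average over all context tokens (the exact scheme used by alacarte.py may differ).

import numpy as np

def induce_vector(target, tokens, embeddings, A, window=10):
    # Sum the source embeddings of the words surrounding each occurrence of
    # the target (simple vector addition, not one large matrix operation).
    d = A.shape[1]
    total = np.zeros(d)
    count = 0
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            vec = embeddings.get(tokens[j])
            if vec is not None:
                total += vec
                count += 1
    if count == 0:
        raise ValueError("target does not occur in the corpus")
    # Average the context embeddings, then apply the induction transform.
    return A.dot(total / count)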

Dependencies: nltk, numpy, scipy, text_embedding

Optional: mpi4py (to parallelize cooccurrence matrix construction)


alacarte's Issues

"induction matrix dimension and word embedding dimension must be the same"

Hey,
thanks for this interesting repository!
I'm trying to experiment with the code by inducing vectors for certain words, but I run into the error "induction matrix dimension and word embedding dimension must be the same" when invoking python alacarte.py testdump --source ...glove.6B/glove.6B.300d.txt --targets words.txt --corpus corpus.txt --matrix transform/6B.300d.bin.
The dimension of the embeddings seems fine, but the value of d read from the transform file is 282. A similar problem appears with the files for other dimensions.

So am I using the script wrong? Or is there a problem with the transformation files?

Contextualized Rare Words details

I'm interested in experimenting with the CRW dataset, so I was wondering if I could get more details on how the word2vec model was trained. Was gensim used? Was it a skip-gram or CBOW model? What were the sampling settings, number of epochs, etc.? I want to compare my own model, so I would appreciate these details or, if possible, the training code itself.
