
alacarte's Introduction

ALaCarte

This repository contains code and transforms to induce your own rare-word/n-gram vectors as well as evaluation code for the A La Carte Embedding paper. An overview is provided in this blog post at OffConvex.

If you find any of this code useful please cite the following:

@inproceedings{khodak2018alacarte,
  title={A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors},
  author={Khodak, Mikhail and Saunshi, Nikunj and Liang, Yingyu and Ma, Tengyu and Stewart, Brandon and Arora, Sanjeev},
  booktitle={Proceedings of the ACL},
  year={2018}
}

Inducing your own à la carte vectors

The following are steps to induce your own vectors for rare words or n-grams in the same semantic space as existing GloVe embeddings. For rare words from the IMDB, PTB-WSJ, SST, and STS tasks you can find vectors induced using Common Crawl / Gigaword+Wikipedia at http://nlp.cs.princeton.edu/ALaCarte/vectors/induced/.

  1. Make a text file containing one word or space-delimited n-gram per line. These are the targets for which vectors are to be induced.
  2. Download source embedding files, which should have the format "word float ... float" on each line. GloVe embeddings can be found here. Choose the matching transform in the transform directory.
  3. If using Common Crawl, download a file of WET paths (e.g. here for the 2014 crawl) and run alacarte.py with that file passed to the --paths argument. Otherwise pass one or more text files to the --corpus argument. Example invocations are sketched after this list.
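
A rough sketch of the two modes (the first positional argument is assumed to name the output dump, the file names are placeholders, and the flags are the ones described above):

# induce vectors for the targets from one or more local text files
python alacarte.py my_dump --source glove.6B.300d.txt --targets targets.txt --corpus corpus.txt --matrix transform/6B.300d.bin

# or induce them from Common Crawl by passing a WET paths file instead of a corpus
python alacarte.py my_dump --source glove.6B.300d.txt --targets targets.txt --paths wet.paths --matrix transform/6B.300d.bin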

Dependencies:

Required: numpy

Optional: h5py (check-pointing), nltk (n-grams), cld2-cffi (checking English), mpi4py (parallelizing using MPI), boto (Common Crawl)
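
All of these packages are available from PyPI, so one way to set up an environment (mpi4py additionally requires an MPI implementation to be installed on the system) is:

pip install numpy h5py nltk cld2-cffi mpi4py boto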

For inducing vectors from Common Crawl on an AWS EC2 instance:

  1. Start an instance. Best to use a memory-optimized (r4.*) Linux instance.
  2. Download and execute install.sh.
  3. Upload your list of target words to the instance and run alacarte.py.

Evaluation code

Note that the code in this directory treats adding up all the embeddings of context words in a corpus as a single matrix operation. This is memory-intensive; more practical implementations should compute context vectors by simple vector addition, as in the sketch below.
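
A minimal, illustrative sketch of that more practical approach follows. The names are hypothetical and not those used in the evaluation code; embeddings is assumed to be a dict mapping words to d-dimensional numpy arrays, A an induction matrix of shape (d, d) loaded from the transform directory, and the normalization a simple average over all context tokens (the exact scheme used by alacarte.py may differ).

import numpy as np

def induce_vector(target, tokens, embeddings, A, window=10):
    # Sum the source embeddings of the words surrounding each occurrence of
    # the target (simple vector addition, not one large matrix operation).
    d = A.shape[1]
    total = np.zeros(d)
    count = 0
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            vec = embeddings.get(tokens[j])
            if vec is not None:
                total += vec
                count += 1
    if count == 0:
        raise ValueError("target does not occur in the corpus")
    # Average the context embeddings, then apply the induction transform.
    return A.dot(total / count)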

Dependencies: nltk, numpy, scipy, text_embedding

Optional: mpi4py (to parallelize cooccurrence matrix construction)


alacarte's Issues

"induction matrix dimension and word embedding dimension must be the same"

Hey,
thanks for this interesting repository!
I'm trying to experiment with the code by inducing vectors for certain words, but I run into the error "induction matrix dimension and word embedding dimension must be the same" when invoking python alacarte.py testdump --source ...glove.6B/glove.6B.300d.txt --targets words.txt --corpus corpus.txt --matrix transform/6B.300d.bin.
The dimension of the embeddings seems fine, but the value of d read from the transform file is 282. A similar problem appears with the files for other dimensions.

So am I using the script wrong? Or is there a problem with the transformation files?

Contextualized Rare Words details

I'm interested in experimenting with the CRW dataset, so I was wondering if I could get more details on how the word2vec model was trained. Was gensim used? Was it a skip-gram or CBOW model? What were the sampling settings, number of epochs, etc.? I want to compare my own model, so I would appreciate these details or, if possible, the training code itself.
