
ALaCarte

This repository contains code and transforms to induce your own rare-word/n-gram vectors, as well as evaluation code, for the A La Carte Embedding paper. An overview is provided in this blog post at OffConvex.

If you find any of this code useful, please cite the following:

@inproceedings{khodak2018alacarte,
  title={A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors},
  author={Khodak, Mikhail and Saunshi, Nikunj and Liang, Yingyu and Ma, Tengyu and Stewart, Brandon and Arora, Sanjeev},
  booktitle={Proceedings of the ACL},
  year={2018}
}

Inducing your own à la carte vectors

The following are steps to induce your own vectors for rare words or n-grams in the same semantic space as existing GloVe embeddings. For rare words from the IMDB, PTB-WSJ, SST, and STS tasks you can find vectors induced using Common Crawl / Gigaword+Wikipedia at http://nlp.cs.princeton.edu/ALaCarte/vectors/induced/.

  1. Make a text file containing one word or space-delimited n-gram per line. These are the targets for which vectors are to be induced.
  2. Download source embedding files, in which each line has the format "word float ... float". GloVe embeddings can be found here. Choose the appropriate transform in the transform directory.
  3. If using Common Crawl, download a file of WET paths (e.g. here for the 2014 crawl) and pass it to the --paths argument of alacarte.py. Otherwise pass one or more text files to the --corpus argument.
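At its core, the induction step is linear: average the source embeddings of each target's context words, then apply a learned transform matrix. The following is a minimal numpy sketch of that step; the function name, toy data, and identity transform are illustrative assumptions, not the repository's code:

```python
import numpy as np

def induce_vector(target_contexts, embeddings, A):
    """Induce an a la carte vector for a rare word.

    target_contexts: list of context-word lists, one per occurrence of the target
    embeddings: dict mapping word -> numpy array (source embeddings, e.g. GloVe)
    A: (d, d) transform matrix learned from common words and their contexts
    """
    # average the embeddings of all context words across all occurrences
    ctx = [embeddings[w] for occ in target_contexts for w in occ if w in embeddings]
    u = np.mean(ctx, axis=0)  # naive average context vector
    return A.dot(u)           # apply the learned linear transform

# toy example: random 5-dimensional "embeddings" and an identity transform
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(5) for w in ["the", "movie", "was", "great"]}
A = np.eye(5)
v = induce_vector([["the", "movie"], ["was", "great"]], emb, A)
print(v.shape)  # (5,)
```

With the identity transform, the induced vector is just the average context embedding; the transforms shipped in the transform directory replace `np.eye` with a matrix fit on the source corpus.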

Dependencies:

Required: numpy

Optional: h5py (checkpointing), nltk (n-grams), cld2-cffi (checking English), mpi4py (parallelizing using MPI), boto (Common Crawl)

For inducing vectors from Common Crawl on an AWS EC2 instance:

  1. Start an instance; a memory-optimized (r4.*) Linux instance is best.
  2. Download and execute install.sh.
  3. Upload your list of target words to the instance and run alacarte.py.

Evaluation code

Note that the code in this directory treats summing the embeddings of all context words in a corpus as a single matrix operation. This is memory-intensive; more practical implementations should compute context vectors by simple vector addition.
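The memory-light alternative mentioned above can be sketched as follows: accumulate each word's context vector one neighbor embedding at a time instead of forming a large matrix product. The function and variable names here are illustrative assumptions, not the repository's API:

```python
import numpy as np

def context_vectors(corpus, embeddings, window=5):
    """Accumulate context vectors by simple vector addition.

    corpus: list of tokenized sentences
    embeddings: dict mapping word -> numpy array
    Returns a dict mapping each word to the sum of its window neighbors' embeddings.
    """
    dim = len(next(iter(embeddings.values())))
    ctx = {}
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in embeddings:
                    # add one neighbor embedding at a time:
                    # only O(dim) extra memory per vocabulary word
                    ctx.setdefault(w, np.zeros(dim))
                    ctx[w] += embeddings[sent[j]]
    return ctx

# toy corpus with 2-dimensional embeddings and a window of 1
emb = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
cv = context_vectors([["a", "b", "c"]], emb, window=1)
```

Each word's entry grows incrementally as the corpus is streamed, so no dense word-by-word matrix ever needs to be materialized.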

Dependencies: nltk, numpy, scipy, text_embedding

Optional: mpi4py (to parallelize co-occurrence matrix construction)

Contributors

mkhodak
