utility class for building/evaluating document representations

License: MIT License



text_embedding

This repository contains a fast, scalable, highly parallel Python implementation of the GloVe [1] algorithm for word embeddings (found in solvers.py), as well as code and scripts to recreate the downstream-task results of our paper on unsupervised DisC embeddings. An overview of the latter is provided in a blog post at OffConvex.

If you find this code useful, please cite the following:

@inproceedings{arora2018sensing,
  title={A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs},
  author={Arora, Sanjeev and Khodak, Mikhail and Saunshi, Nikunj and Vodrahalli, Kiran},
  booktitle={Proceedings of the 6th International Conference on Learning Representations (ICLR)},
  year={2018}
}

GloVe implementation

An implementation of the GloVe optimization algorithm (as well as code to build the vocab and cooccurrence files, optimize the related SN objective [2], and optimize a source-regularized objective for domain adaptation) can be found in solvers.py. The code scales to an arbitrary number of processors with virtually no memory or communication overhead. In terms of problem size, its time and memory complexity scale linearly with the number of nonzero entries in the (sparse) cooccurrence matrix.
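As a reference for what the solver optimizes, below is a minimal sketch of the standard GloVe weighted least-squares objective from [1], evaluated over the nonzero cooccurrence entries. This is an illustration, not the repo's code: the function name `glove_loss`, the COO-style arguments, and the default `xmax`/`alpha` values are assumptions (the defaults match the original GloVe paper).

```python
import numpy as np

def glove_loss(W, Wt, b, bt, rows, cols, counts, xmax=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective (illustrative sketch).

    W, Wt : (V, d) word and context embedding matrices.
    b, bt : (V,) word and context biases.
    rows, cols, counts : COO listing of the nonzero cooccurrence entries.
    """
    # down-weight very frequent pairs: f(x) = min(x / xmax, 1) ** alpha
    weights = np.minimum(counts / xmax, 1.0) ** alpha
    # model prediction for each pair: w_i . w~_j + b_i + b~_j
    preds = (W[rows] * Wt[cols]).sum(axis=1) + b[rows] + bt[cols]
    # squared error against log cooccurrence counts
    return float((weights * (preds - np.log(counts)) ** 2).sum())
```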

On a 32-core machine, 25 epochs of AdaGrad run in 3.8 hours on Wikipedia cooccurrences with a vocabulary of ~80K words; the original C implementation runs in 2.8 hours on the same 32 cores. We also implement the option to use regular SGD, which requires about twice as many iterations to reach the same loss but has much lower per-iteration cost: on the same machine, 50 epochs finish in 2.0 hours.
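For intuition about the inner loop, here is a minimal sketch of one AdaGrad step on a single cooccurrence entry (biases omitted for brevity). The function `adagrad_pair_update` and its defaults are hypothetical illustrations of the standard update, not the repo's actual code.

```python
import numpy as np

def adagrad_pair_update(wi, wj, gi, gj, x, lr=0.05, xmax=100.0, alpha=0.75):
    """One AdaGrad step on a single (i, j) cooccurrence with count x.

    wi, wj : the two embedding vectors being updated.
    gi, gj : per-coordinate accumulated squared gradients (AdaGrad history).
    """
    f = min(x / xmax, 1.0) ** alpha          # GloVe weighting function
    err = wi @ wj - np.log(x)                # residual on this pair
    grad_i = 2.0 * f * err * wj
    grad_j = 2.0 * f * err * wi
    gi += grad_i ** 2                        # accumulate squared gradients
    gj += grad_j ** 2
    wi -= lr * grad_i / np.sqrt(gi)          # adaptive per-coordinate step
    wj -= lr * grad_j / np.sqrt(gj)
    return wi, wj, gi, gj
```

Plain SGD would drop the `gi`/`gj` history and use a fixed step size, which is cheaper per iteration but typically needs more epochs, consistent with the timings above.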

Note that our code takes as input an upper-triangular, zero-indexed cooccurrence matrix rather than the full, one-indexed cooccurrence matrix used by the original GloVe code. To convert to our (more disk-efficient) format, you can use the method reformat_coocfile in solvers.py. We also allow direct, parallel computation of the vocab and cooccurrence files.
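To make the format difference concrete, here is a standalone sketch of the conversion using scipy (listed under the optional dependencies). The function name `to_upper_zero_indexed` is hypothetical; the repo's own converter is reformat_coocfile. The sketch assumes the full matrix is symmetric, so its upper triangle carries all the information.

```python
import numpy as np
from scipy.sparse import coo_matrix, triu

def to_upper_zero_indexed(rows, cols, counts, vocab_size):
    """Convert a full, one-indexed COO cooccurrence listing to
    upper-triangular, zero-indexed form (illustrative sketch)."""
    # shift word indices from one-indexing to zero-indexing
    full = coo_matrix((counts, (rows - 1, cols - 1)),
                      shape=(vocab_size, vocab_size))
    # a symmetric matrix is fully described by its upper triangle,
    # roughly halving the entries stored on disk
    upper = triu(full, k=0).tocoo()
    return upper.row, upper.col, upper.data
```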

Dependencies: numpy, numba, SharedArray

Optional: h5py, mpi4py*, scipy, scikit-learn

* required for parallelism; MPI can be easily installed on Linux, macOS, and Windows Subsystem for Linux

DisC embeddings

Scripts to recreate the results in the paper are provided in the directory scripts-AKSV2018. 1600-dimensional GloVe embeddings trained on the Amazon Product Corpus [3] are provided here.
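As a rough illustration of how the provided word vectors compose into document representations, the sketch below reflects one reading of the DisC construction in the paper: for each order k up to n, sum the element-wise products of the word vectors over all k-grams, then concatenate across orders. All names are illustrative, and details such as normalization are omitted; the scripts in scripts-AKSV2018 are authoritative.

```python
import numpy as np

def disc_embedding(tokens, vectors, n=2):
    """Sketch of a DisC-style document embedding (illustrative, unscaled).

    tokens  : list of word strings for one document.
    vectors : dict mapping word -> (d,) numpy embedding.
    n       : maximum n-gram order; output has dimension n * d.
    """
    d = next(iter(vectors.values())).shape[0]
    parts = []
    for k in range(1, n + 1):
        acc = np.zeros(d)
        for i in range(len(tokens) - k + 1):
            gram = tokens[i:i + k]
            if all(t in vectors for t in gram):  # skip OOV n-grams
                prod = np.ones(d)
                for t in gram:                   # element-wise product
                    prod = prod * vectors[t]     # over the k-gram
                acc += prod
        parts.append(acc)
    return np.concatenate(parts)                 # concatenate over orders
```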

Dependencies: nltk, numpy, scipy, scikit-learn

Optional: tensorflow

References:

[1] Pennington et al., "GloVe: Global Vectors for Word Representation," EMNLP, 2014.

[2] Arora et al., "A Latent Variable Model Approach to PMI-based Word Embeddings," TACL, 2016.

[3] McAuley et al., "Inferring networks of substitutable and complementary products," KDD, 2015.


