This project forked from facebookresearch/infersent

Sentence embeddings (InferSent) and training code for NLI.

License: Other

InferSent

InferSent is a sentence embeddings method that provides semantic sentence representations. It is trained on natural language inference data and generalizes well to many different tasks.

We provide our pre-trained sentence encoder for reproducing the results from our paper. See also SentEval for automatic evaluation of the quality of sentence embeddings.

Dependencies

This code is written in Python. The steps below use PyTorch (to load the model), NLTK (for tokenization) and NumPy (for the returned embeddings).

Download datasets

To get GloVe, SNLI and MultiNLI [2GB, 90MB, 216MB], run (in dataset/):

./get_data.bash

This will download GloVe and preprocess SNLI/MultiNLI data/senteval_data.

Use our sentence encoder

See encoder/play.ipynb for an example.

0) Download our model trained on AllNLI (SNLI and MultiNLI) [147MB]:

curl -Lo encoder/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle

1) Load our pre-trained model (in encoder/):

import torch
infersent = torch.load('infersent.allnli.pickle')

Note: to load it, you need the file "models.py" (in encoder/) that provides the definition of the model.

2) Set GloVe path for the model:

infersent.set_glove_path(glove_path)

where glove_path is the path to 'glove.840B.300d.txt', which contains the GloVe vectors our model was trained with. Using GloVe vectors gives coverage of more than 2 million English words.

3) Build the vocabulary of word vectors (i.e., keep only those needed):

infersent.build_vocab(sentences, tokenize=True)

where sentences is your list of n sentences. You can update your vocabulary using infersent.update_vocab(sentences), or directly load the K most common words with infersent.build_vocab_k_words(K=100000). If tokenize is True (the default), sentences will be tokenized using NLTK. Run nltk.download('punkt') once to download the NLTK tokenizer.
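Step 3 keeps only the word vectors that the sentences actually need, so the full GloVe file never has to sit in memory. The following is a minimal sketch of that filtering idea, assuming the usual GloVe text layout of one word per line followed by its space-separated values. The helper name load_needed_vectors and the whitespace tokenizer are illustrative assumptions, not the library's code (the encoder itself tokenizes with NLTK when tokenize is True).

```python
import io


def load_needed_vectors(glove_file, sentences):
    """Return {word: vector} for words that occur in `sentences`.

    Hypothetical helper for illustration; not part of InferSent.
    """
    # Assumption: plain whitespace tokenization stands in for NLTK here.
    needed = {word for sentence in sentences for word in sentence.split()}
    vectors = {}
    for line in glove_file:
        # Split on the first space only: the word, then the numbers.
        word, _, values = line.rstrip("\n").partition(" ")
        if word in needed:
            vectors[word] = [float(v) for v in values.split()]
    return vectors


# Tiny in-memory stand-in for glove.840B.300d.txt (3 dimensions, not 300).
fake_glove = io.StringIO(
    "man 0.1 0.2 0.3\n"
    "plays 0.4 0.5 0.6\n"
    "piano 0.7 0.8 0.9\n"
)
vectors = load_needed_vectors(fake_glove, ["A man plays"])
print(sorted(vectors))  # ['man', 'plays']
```

Words that appear in the sentences but not in the file ("A" in this toy case) simply get no vector, and unused entries such as "piano" are skipped.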

4) Encode your sentences (list of n sentences):

infersent.encode(sentences, tokenize=True)

This will output a NumPy array with n vectors of dimension 4096 (the dimension of the sentence embeddings). Speed is around 1000 sentences per second with batch size 128 on a single GPU.
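For large corpora it helps to estimate the size of the output and the encoding time before starting. The sketch below does that arithmetic. It assumes 4-byte (float32) values, which the text does not state, and that the throughput quoted above holds on your hardware; estimate_encoding_cost is a hypothetical helper.

```python
def estimate_encoding_cost(n_sentences, dim=4096, bytes_per_value=4,
                           sentences_per_second=1000):
    """Return (gigabytes, seconds) needed to encode n_sentences.

    Assumptions: float32 storage and the quoted single-GPU throughput.
    """
    total_bytes = n_sentences * dim * bytes_per_value
    return total_bytes / 1e9, n_sentences / sentences_per_second


gigabytes, seconds = estimate_encoding_cost(1_000_000)
print(f"{gigabytes:.1f} GB, {seconds / 60:.1f} minutes")  # 16.4 GB, 16.7 minutes
```

Under these assumptions, one million sentences already need about 16 GB for the embeddings alone, so encoding and storing them in chunks may be preferable.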

5) Visualize the importance that our model attributes to each word:

Our representations were trained to focus on semantic information such that a classifier can easily tell the difference between contradictory, neutral or entailed sentences. We provide a function to visualize the importance of each word in the encoding of a sentence:

infersent.visualize('A man plays an instrument.', tokenize=True)
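The text does not say how visualize computes importance. For encoders that max-pool over per-word hidden states, one common approach is to count, for each word, how many embedding dimensions take their maximum at that word. The sketch below shows that idea on toy numbers. word_importance is a hypothetical helper, and whether it matches visualize exactly is an assumption.

```python
def word_importance(hidden_states):
    """Share of dimensions whose maximum falls on each word.

    hidden_states: one list of floats per word, all of equal length.
    Hypothetical helper for illustration; not part of InferSent.
    """
    n_words = len(hidden_states)
    n_dims = len(hidden_states[0])
    wins = [0] * n_words
    for d in range(n_dims):
        # Index of the word with the largest value in dimension d
        # (ties go to the earliest word).
        best = max(range(n_words), key=lambda w: hidden_states[w][d])
        wins[best] += 1
    return [count / n_dims for count in wins]


words = ["A", "man", "plays"]
toy_states = [
    [0.1, 0.0, 0.2, 0.1],  # "A"
    [0.9, 0.3, 0.1, 0.8],  # "man"
    [0.2, 0.7, 0.6, 0.3],  # "plays"
]
for word, share in zip(words, word_importance(toy_states)):
    print(f"{word}: {share:.0%}")
```

In this toy case "man" and "plays" each account for half of the dimensions and "A" for none, the pattern one would expect for a function word.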

Train model on Natural Language Inference (SNLI)

To reproduce our results and train our models on SNLI, set GLOVE_PATH in train_nli.py, then run:

python train_nli.py

You should obtain a dev accuracy of 85% and a test accuracy of 84.5% with the default settings.

Reproduce our results on transfer tasks

To reproduce our results on transfer tasks, clone SentEval and set PATH_SENTEVAL, PATH_TRANSFER_TASKS in evaluate_model.py, then run:

python evaluate_model.py

Using our best model infersent.allnli.pickle, you should obtain the following test results:

Model | MR | CR | SUBJ | MPQA | STS14 | STS Benchmark | SICK Relatedness | SICK Entailment | SST | TREC | MRPC
InferSent | 81.1 | 86.3 | 92.4 | 90.2 | .68/.65 | 75.8/75.5 | 0.884 | 86.1 | 84.6 | 88.2 | 76.2/83.1
SkipThought | 79.4 | 83.1 | 93.7 | 89.3 | .44/.45 | 72.1/70.2 | 0.858 | 79.5 | 82.9 | 88.4 | -

Note that while InferSent provides good features for many different tasks, our approach also obtains strong results on STS tasks, which evaluate the quality of the cosine metric in the embedding space.
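The cosine metric mentioned here compares the directions of two embeddings and ignores their lengths. The following is a minimal sketch using only the standard library. The toy 3-dimensional vectors stand in for 4096-dimensional sentence embeddings, and cosine_similarity is an illustrative helper rather than part of this repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Raises ZeroDivisionError if either vector is all zeros.
    return dot / (norm_u * norm_v)


# Toy stand-ins for sentence embeddings.
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 3.0]  # orthogonal to emb_a
print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # 0: no shared direction
```

The same function accepts rows of the array returned by encode, since it only iterates over the values.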

Reference

Please cite [1] if you found this code useful.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, arXiv preprint arXiv:1705.02364, 2017.

@article{conneau2017supervised,
  title={Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Loic and Bordes, Antoine},
  journal={arXiv preprint arXiv:1705.02364},
  year={2017}
}

Contact: [email protected]
