wpm / bisemantic

Text pair classification

License: MIT License

Python 100.00%
deep-learning lstm textual-entailment keras spacy quora-question-pairs snli

Bisemantic

Bisemantic identifies semantic relationships between pairs of text. It uses a shared LSTM to map both texts to a common representation, which a classifier then maps to the training labels.

Installation

Bisemantic depends on the spaCy natural language processing toolkit. You may need to install spaCy's English language model before running, with a command like python -m spacy download en. See spaCy's models documentation for more information.

Running

Run Bisemantic with the command bisemantic. Subcommands enable you to train and use models and partition data into cross-validation sets. Run bisemantic --help for details about specific commands.

Input data takes the form of comma-separated-value documents. Training data has the columns text1, text2, and label. Test data takes the same form minus the label column. Command line options allow you to read in files with different formatting.
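As a minimal sketch of the default input format described above, the following Python writes a training file and a test file with the column names text1, text2, and label. The example sentences, labels, and the write_pairs helper are made up for illustration and are not part of Bisemantic.

```python
import csv
import io

# Hypothetical example pairs; the texts and labels are invented for illustration.
rows = [
    {"text1": "How do I learn Python?",
     "text2": "What is the best way to learn Python?",
     "label": 1},
    {"text1": "How do I learn Python?",
     "text2": "Where can I buy a python, the snake?",
     "label": 0},
]


def write_pairs(stream, rows, include_label=True):
    """Write text pairs as CSV. Test data omits the label column."""
    fieldnames = ["text1", "text2"] + (["label"] if include_label else [])
    # extrasaction="ignore" drops the label key when writing test data.
    writer = csv.DictWriter(stream, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


training_csv = io.StringIO()
write_pairs(training_csv, rows)

test_csv = io.StringIO()
write_pairs(test_csv, rows, include_label=False)

print(training_csv.getvalue())
```

The csv module quotes any text that itself contains a comma, so the second pair above stays in a single text2 field.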

Trained models are written to a directory that contains the following files:

  • model.info.text: a human-readable description of the model and training parameters
  • training-history.json: history of the training procedure, including the loss and accuracy for each epoch
  • model.h5: serialization of the model structure and its weights

Weights from the epoch with the best loss score are saved in model.h5.
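The selection of the best epoch can be illustrated with a short sketch. It assumes the training history is stored as one list of per-epoch values per metric, as in Keras's History.history; the actual keys in training-history.json may differ, and the README does not say whether the training or the validation loss is used. The numbers below are invented.

```python
import json

# Assumed layout: metric name -> list of per-epoch values (Keras-style).
# The keys and values here are hypothetical, not taken from a real run.
history_json = """
{"loss":     [0.61, 0.48, 0.41],
 "val_loss": [0.55, 0.47, 0.49],
 "val_acc":  [0.72, 0.79, 0.78]}
"""
history = json.loads(history_json)


def best_epoch(history, metric="val_loss"):
    """Return the 1-based epoch with the lowest value of metric, and that value."""
    values = history[metric]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


epoch, loss = best_epoch(history)
print(f"best epoch: {epoch}, loss: {loss}")
```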

The model directory can be used to predict probability distributions over labels and score test sets. Further training can be done using an existing model directory as a starting point.

Classifier Model

Text pair classification is framed as a supervised learning problem. Each sample is a pair of texts and its label is a categorical class label. The meaning of the class varies from data set to data set but usually represents some kind of semantic relationship between the two texts.

GloVe vectors are used to embed the texts into matrices of size maximum tokens × 300, clipping or padding the first dimension for each individual text as needed. If maximum tokens is not specified, the number of tokens in the longest text in the pairs is used. An (optionally bidirectional) shared LSTM converts these embeddings to single vectors, r1 and r2, which are then concatenated into the vector [r1, r2, r1 · r2, (r1 - r2)²]. A single-layer perceptron maps this vector to a softmax prediction over the labels.
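The steps after the LSTM can be sketched in plain Python. This is an illustration under stated assumptions, not Bisemantic's actual Keras code: the product and the squared difference are assumed to be elementwise, the toy vectors stand in for real LSTM outputs, and all function names are hypothetical.

```python
import math


def pad_or_clip(token_vectors, maximum_tokens, dimension):
    """Clip a list of token vectors to maximum_tokens rows, or pad with zero vectors."""
    clipped = token_vectors[:maximum_tokens]
    padding = [[0.0] * dimension for _ in range(maximum_tokens - len(clipped))]
    return clipped + padding


def combine(r1, r2):
    """Build [r1, r2, r1 * r2, (r1 - r2)^2]; elementwise operations are an assumption."""
    product = [a * b for a, b in zip(r1, r2)]
    squared_difference = [(a - b) ** 2 for a, b in zip(r1, r2)]
    return r1 + r2 + product + squared_difference


def softmax(scores):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases):
    """Single-layer perceptron: one weight row and one bias per label, then softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Toy 2-dimensional vectors standing in for the LSTM outputs r1 and r2.
r1 = [1.0, 2.0]
r2 = [3.0, 0.5]
features = combine(r1, r2)  # 4 * 2 = 8 values

# Hypothetical weights for two labels; a trained model would learn these.
weights = [[0.1] * 8, [-0.1] * 8]
biases = [0.0, 0.0]
probabilities = classify(features, weights, biases)
print(features, probabilities)
```

With real GloVe embeddings the dimension passed to pad_or_clip would be 300, and the combined vector would be four times the width of the LSTM output.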

Example Uses

Bisemantic can be used for tasks like question deduplication or textual entailment.

Question Deduplication

The Quora question pair corpus contains pairs of questions annotated as either asking the same thing or not.

Bisemantic creates a model similar to that described in [Homma et al.] and [Addair].
The following command can be used to train a model on the train.csv file in this data set.

bisemantic train train.csv \
    --text-1-name question1 --text-2-name question2 \
    --label-name is_duplicate --index-name id \
    --validation-fraction 0.2 --batch-size 1024 \
    --maximum-tokens 75 --dropout 0.5 --units 256 --bidirectional \
    --model-directory-name quora.model

This achieved an accuracy of 83.71% on the validation split after 9 epochs of training.

Textual Entailment

The Stanford Natural Language Inference corpus is a corpus for the recognizing textual entailment (RTE) task. It labels a "premise" sentence as either entailing, contradicting, or being neutral with respect to a "hypothesis" sentence.

Bisemantic creates a model similar to that described in [Bowman et al., 2015]. The following command can be used to train a model on the snli_1.0_train.txt and snli_1.0_dev.txt files in this data set.

bisemantic train snli_1.0_train.txt \
    --text-1-name sentence1 --text-2-name sentence2 \
    --label-name gold_label --index-name pairID \
    --invalid-labels "-" --not-comma-delimited \
    --validation-set snli_1.0_dev.txt --batch-size 1024 \
    --dropout 0.5 --units 256 --bidirectional \
    --model-directory-name snli.model

This achieved an accuracy of 80.16% on the development set and 79.49% on the snli_1.0_test.txt test set after 9 epochs of training.
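The SNLI files are tab-delimited and mark pairs without a consensus label with "-", which is why the command above passes --not-comma-delimited and --invalid-labels "-". The sketch below shows one way such a file could be read and filtered in Python. It is an illustration only: the read_pairs helper is hypothetical, the sample rows are invented, real SNLI files have more columns, and Bisemantic's own parsing may work differently.

```python
import csv
import io

# A tiny stand-in for an SNLI file, keeping only the columns named in the command.
sample = (
    "gold_label\tsentence1\tsentence2\tpairID\n"
    "entailment\tA man plays a guitar.\tA person makes music.\tp1\n"
    "-\tA dog runs.\tThe weather is nice.\tp2\n"
    "contradiction\tA woman is sleeping.\tA woman is running.\tp3\n"
)


def read_pairs(stream, text_1_name, text_2_name, label_name, invalid_labels=()):
    """Yield (text1, text2, label) tuples, skipping rows whose label is invalid."""
    # QUOTE_NONE treats quotation marks inside sentences as ordinary characters.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row[label_name] in invalid_labels:
            continue
        yield row[text_1_name], row[text_2_name], row[label_name]


pairs = list(read_pairs(io.StringIO(sample), "sentence1", "sentence2",
                        "gold_label", invalid_labels={"-"}))
for pair in pairs:
    print(pair)
```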

References

  • Travis Addair. Duplicate Question Pair Detection with Deep Learning. [pdf]

  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). [pdf]

  • Yushi Homma, Stuart Sy, Christopher Yeh. Detecting Duplicate Questions with Deep Learning. [pdf]
