pombredanne / get_started_with_deep_learning_for_text_with_allennlp
This project is a fork of argilla-io/get_started_with_deep_learning_for_text_with_allennlp.

Getting started with AllenNLP and PyTorch by training a tweet classifier


Introduction

This repository contains code and experiments using PyTorch, AllenNLP, and spaCy. It is intended as a learning resource for getting started with these libraries and with deep learning for NLP.

In particular, it contains:

  1. Custom modules for defining a SequenceClassifier and its Predictor.
  2. A basic custom DataReader for reading CSV files.
  3. An experiments folder containing several experiment JSON files to show how to define a baseline and refine it with more sophisticated approaches.
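As a rough sketch of what such a DataReader does (the function and column names below are assumptions for illustration, not the repository's actual code), reading labeled tweets from a CSV file boils down to:

```python
import csv
import io

def read_labeled_tweets(csv_file):
    """Yield (text, label) pairs from a CSV with 'tweet' and 'label' columns.

    Illustrative only: the repository's DataReader defines its own schema
    and wraps each row in an AllenNLP Instance instead of a plain tuple.
    """
    for row in csv.DictReader(csv_file):
        yield row["tweet"], row["label"]

# An in-memory CSV standing in for the COSET data:
sample = io.StringIO("tweet,label\nviva la democracia,political_issues\n")
pairs = list(read_labeled_tweets(sample))
# pairs == [("viva la democracia", "political_issues")]
```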

The overall goal is to classify tweets in Spanish from the COSET challenge dataset, a collection of tweets about a recent Spanish election. The winning approach to the challenge is described in the following paper: http://ceur-ws.org/Vol-1881/COSET_paper_7.pdf.

Setup

Create and activate a virtual environment, for example with Conda (the PyTorch wheel below requires Python 3.6):

conda create -n allennlp_spacy python=3.6
source activate allennlp_spacy

Install PyTorch for your platform (the wheel below targets Python 3.6 on macOS; see https://pytorch.org for other builds):

pip install http://download.pytorch.org/whl/torch-0.2.0.post3-cp36-cp36m-macosx_10_7_x86_64.whl

Install the spaCy Spanish model:

python -m spacy download es

Install AllenNLP and other dependencies:

pip install -r requirements.txt

Install this repository's package in development mode so the AllenNLP commands can find the custom models:

python setup.py develop

Install TensorBoard:

pip install tensorboard

Download and prepare the pre-trained word vectors from the fastText project:

./download_prepare_fasttext.sh

Goals

  1. Understand the basic components of AllenNLP and PyTorch.

  2. Understand how to configure AllenNLP to use spaCy models in different languages, in this case the Spanish model.

  3. Understand how to create and plug in custom models using AllenNLP, and how to extend its command line.

  4. Design and compare several experiments on a simple tweet classification task in Spanish: start by defining a simple baseline and progressively use more complex models.

  5. Use TensorBoard for monitoring the experiments.

  6. Compare your results with the existing literature (i.e., the results of the COSET tweet classification challenge).

  7. Learn how to prepare and use external pre-trained word embeddings, in this case fastText's Wikipedia-based word vectors.

Exercises

Inspecting Seq2VecEncoders and understanding the basic building blocks of AllenNLP:

Check the basic structure of these modules in AllenNLP.
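A Seq2VecEncoder turns a sequence of token vectors into a single fixed-size vector. The BagOfEmbeddingsEncoder used in the baseline essentially averages them; a minimal plain-Python sketch (ignoring masking and batching):

```python
def bag_of_embeddings(token_vectors):
    """Average equal-length embedding vectors into one fixed-size vector,
    mimicking what a bag-of-embeddings Seq2VecEncoder computes."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three 2-dimensional "token embeddings" collapse into one 2-d vector:
encoded = bag_of_embeddings([[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]])
# encoded == [2.0, 2.0]
```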

Defining and running our baseline:

In the folder experiments/definitions/ you can find the definition of our baseline, which uses a BagOfEmbeddingsEncoder.
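Its overall shape follows the usual AllenNLP experiment layout; roughly (the field names and values below are illustrative, check the actual baseline_boe_classifier.json for the real ones):

```json
{
  "dataset_reader": {"type": "csv_classification_reader"},
  "model": {
    "type": "sequence_classifier",
    "encoder": {
      "type": "boe",
      "embedding_dim": 300
    }
  },
  "trainer": {"num_epochs": 10, "optimizer": "adam"}
}
```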

Run the experiment using:

python -m recognai.run train experiments/definitions/baseline_boe_classifier.json -s experiments/output/baseline

Monitor your experiments using TensorBoard:

You can monitor your experiments by running TensorBoard and pointing it to the experiments output folder:

tensorboard --logdir=experiments/output

Defining and running a CNN classifier:

In the folder experiments/definitions/ you can find the definition of a CNN classifier. As you can see, we only need to configure a new encoder that uses a CNN.
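The relevant change is the encoder block; roughly (the parameter values here are illustrative, while the type and parameter names follow AllenNLP's CnnEncoder):

```json
"encoder": {
  "type": "cnn",
  "embedding_dim": 300,
  "num_filters": 100,
  "ngram_filter_sizes": [2, 3, 4]
}
```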

Run the experiment using:

python -m recognai.run train experiments/definitions/cnn_classifier.json -s experiments/output/cnn

Using pre-trained word embeddings:

Facebook's fastText team has made pre-trained word embeddings available for 294 languages (see https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md). Using the download_prepare_fasttext.sh script, you can download the Spanish vectors and use them as pre-trained weights in either of the models.

To use pre-trained embeddings, you can run the experiment using:

python -m recognai.run train experiments/definitions/cnn_classifier_fasttext_embeddings_fixed.json -s experiments/output/cnn_embeddings_fixed

Or use pre-trained embeddings and let the network tune their weights, using:

python -m recognai.run train experiments/definitions/cnn_classifier_fasttext_embeddings_tunable.json -s experiments/output/cnn_embeddings_tuned
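The two definitions differ only in the token embedder configuration: both point the embedding layer at the prepared fastText vectors, and the trainable flag decides whether the weights stay fixed or get tuned. Roughly (the path and dimension here are illustrative):

```json
"text_field_embedder": {
  "tokens": {
    "type": "embedding",
    "embedding_dim": 300,
    "pretrained_file": "wiki.es.vec",
    "trainable": false
  }
}
```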

Contributors: dvsrepo
