Automatic Tag Generation for StackOverflow Questions using fastText and StarSpace

Multi-label text classification with fastText and StarSpace

Tutorial: Automatic Tag Generation for StackOverflow Questions

By: Pablo Campos Viana

Overview

This tutorial shows how to perform multi-label text classification with two tools from Facebook AI Research: fastText and StarSpace. We will use the stacksample data to perform automatic tag generation. More precisely, given the (short) text of question titles, we want to predict their most probable tags.
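To make the task concrete, here is a minimal sketch of how scored labels could be turned into a list of tags for one title. It assumes the labels carry the `__label__` prefix that fastText uses by default and that scores arrive as a parallel sequence. The function name `top_tags`, the `min_score` cutoff, and the sample scores are hypothetical and are not taken from the notebook.

```python
from typing import List, Sequence

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def top_tags(labels: Sequence[str], scores: Sequence[float],
             k: int = 3, min_score: float = 0.0) -> List[str]:
    """Return up to k tag names, ordered by descending score."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    tags = []
    for label, score in ranked[:k]:
        if score < min_score:
            continue  # drop low-confidence tags (hypothetical cutoff)
        # Strip the prefix only when it is present
        if label.startswith(LABEL_PREFIX):
            label = label[len(LABEL_PREFIX):]
        tags.append(label)
    return tags


# Illustrative scores only, not real model output
labels = ["__label__python", "__label__pandas", "__label__java"]
scores = [0.61, 0.27, 0.02]
print(top_tags(labels, scores, k=2))  # ['python', 'pandas']
```

The same post-processing would apply to any model that returns label/score pairs. The cutoff is optional and only one of several ways to decide how many tags to keep.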

Requirements

This tutorial assumes a standard installation of miniconda (based on Python 3.7) that is ready to use, running on a Linux system.

The following tools are required:

Apart from the basic scientific stack (numpy, pandas, scikit-learn, matplotlib and, of course, jupyter), the Python libraries listed in the requirements.txt file are required.

Usage

Docker users: If you use the Docker image, you don't have to worry about any of the requirements, but make sure you have Docker installed on your machine. Just run the following command in your terminal to pull and run the latest version of the image.

docker run -p 8080:8888 pjcv89/autotag

If you have some experience with Docker, you may want to name your container with the --name flag and mount a volume with the -v flag to transfer data between your machine and the container. In particular, you may want to persist a model file so you can use it in another context (in local mode or in another container, for example, as part of a web application). Once you have chosen a working directory on your machine, you can run the following command in your terminal.

docker run --name myautotag -p 8080:8888 -v $PWD/persist:/AutoTag/persist pjcv89/autotag

Once you have generated the models inside the container, you can copy a model file from the /models folder to the /persist folder; it will then appear on your machine in the $PWD/persist folder.

In either case, a notebook instance will be launched and you can go to http://localhost:8080/ to use it. Copy the token displayed in the command line and paste it into the Jupyter welcome page.

Local-mode users: You will need to install the requirements yourself. Once you have chosen a working directory, you can run the commands shown below in your terminal.

# Install tools
apt-get update && apt-get -y install gcc g++ make cmake unzip
# Clone this repo.
git clone https://github.com/pjcv89/AutoTag.git && cd AutoTag
# Install Python libraries
pip install -r requirements.txt

# Give execution permissions to installation scripts
chmod u+x install_fasttext.sh install_starspace.sh
# fastText CLI and Python API installation
./install_fasttext.sh
# StarSpace CLI installation
./install_starspace.sh
# Install jupyter if it isn't installed yet
pip install jupyter
# Launch notebook
jupyter notebook

Files and folders

The following files are provided:

  • requirements.txt: Text file with the required Python libraries.
  • install_starspace.sh: The shell script to install StarSpace (CLI), based on its documentation.
  • install_fasttext.sh: The shell script to install fastText (CLI and Python API) using cmake, based on its documentation.
  • Models.ipynb: The development notebook used for this tutorial and required to reproduce the results. You can view the notebook with Jupyter Notebook Viewer here.

After executing the installation scripts, the following folders will be present:

  • /Starspace: It contains the StarSpace source code.
  • /fastText: It contains the fastText source code.

While running the notebook, the following folders will be created:

  • /stacksample: It contains the Questions.csv and Tags.csv tables downloaded via the Kaggle API after executing the notebook.
  • /data: It contains the train, valid, and test text files in the appropriate format required by both fastText and StarSpace. It also contains the train_weighted file, used to train StarSpace with label weights.
  • /models: It contains the fastText and StarSpace model files.
  • /predictions: It contains raw and processed text files with predictions for the test data, after inference with fastText and StarSpace.
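As a rough illustration of the text format mentioned for the /data folder, the sketch below builds one training line from a title and its tags. It assumes the `__label__` prefix, which is the default label marker in both fastText and StarSpace. The helper name `to_labeled_line`, the lowercasing step, and the choice to place labels before the text are assumptions made for illustration; the notebook may preprocess differently.

```python
import re
from typing import Iterable

LABEL_PREFIX = "__label__"  # default label prefix in fastText and StarSpace


def to_labeled_line(title: str, tags: Iterable[str]) -> str:
    """Format one example as a single line: labels first, then the text."""
    # Collapse newlines and repeated spaces so the example stays on one line
    text = re.sub(r"\s+", " ", title).strip().lower()
    prefix = " ".join(f"{LABEL_PREFIX}{tag}" for tag in tags)
    return f"{prefix} {text}".strip()


line = to_labeled_line("How to merge two DataFrames\nin pandas?", ["python", "pandas"])
print(line)
# __label__python __label__pandas how to merge two dataframes in pandas?
```

Writing one such line per question produces a file that both tools can read with their default label settings.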

Additionally, you will need to download a Kaggle token (a kaggle.json file containing your Kaggle API credentials, more info. here) so you can copy and paste the credentials to declare environment variables inside the notebook and download the stacksample data.
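As an alternative to pasting the values by hand, the following sketch reads the token file and sets the environment variables for you. It assumes the kaggle.json file holds the `username` and `key` fields and that the Kaggle API reads the `KAGGLE_USERNAME` and `KAGGLE_KEY` variables. The function name `load_kaggle_credentials` is hypothetical, and the notebook itself may declare the variables differently.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(token_path) -> None:
    """Read a kaggle.json token and expose it through environment variables."""
    creds = json.loads(Path(token_path).read_text())
    # Variable names read by the Kaggle API client
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
```

Call it with the path where you saved the token, for example `load_kaggle_credentials("kaggle.json")`, before running the download step. Avoid printing the values or committing the token file to version control.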

Structure of the notebook

The notebook is organized as follows.

  1. Getting the data
  2. Preparing the data
  3. (Quick) Data exploration & visualization
  4. Processing the data
  5. Creating training, validation and test sets
  6. Building the models
    1. fastText: baseline
    2. fastText: tuned model
    3. StarSpace: no label weights
    4. StarSpace: label weights
  7. Model evaluation

Please note that the aim of this tutorial is to show how to use fastText and StarSpace, so you should focus on parts 6 and 7.
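For readers who want a feel for how multi-label predictions can be scored, here is a minimal sketch of precision and recall at k, pooled over all examples. It is similar in spirit to the P@k and R@k figures that fastText's test command reports, but it is not taken from the notebook, which may use other metrics. The function name and the sample data are hypothetical.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(predicted: Sequence[Sequence[str]],
                          actual: Sequence[Set[str]],
                          k: int) -> Tuple[float, float]:
    """Precision@k and recall@k pooled over all examples.

    predicted: per example, tags ranked from most to least probable.
    actual: per example, the set of true tags.
    """
    hits = n_predicted = n_true = 0
    for ranked, truth in zip(predicted, actual):
        top = ranked[:k]
        hits += sum(1 for tag in top if tag in truth)
        n_predicted += len(top)
        n_true += len(truth)
    precision = hits / n_predicted if n_predicted else 0.0
    recall = hits / n_true if n_true else 0.0
    return precision, recall


# Made-up predictions for two titles
predicted = [["python", "pandas", "numpy"], ["java", "spring", "android"]]
actual = [{"python", "pandas"}, {"android"}]
p, r = precision_recall_at_k(predicted, actual, k=2)
print(f"P@2={p:.2f} R@2={r:.2f}")  # P@2=0.50 R@2=0.67
```

Raising k usually trades precision for recall, so it helps to compare models at the same k.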

Resources

  • fastText related papers:
    1. Bag of Tricks for Efficient Text Classification
    2. Enriching Word Vectors with Subword Information
    3. FastText.zip: Compressing text classification models
    4. Misspelling Oblivious Word Embeddings
  • Techniques used in fastText to improve scalability and training time:
    1. Hierarchical softmax based on the Huffman coding tree
    2. The hashing trick
    3. Hyperparameter autotuning for fastText
  • StarSpace paper:
    1. StarSpace: Embed All The Things!
