This tutorial shows how to perform multi-label text classification with two tools from Facebook AI Research: fastText and StarSpace. We will use the stacksample data to perform automatic tag generation. More precisely, given the (short) text of question titles, we want to predict their most probable tags.
This tutorial assumes a standard installation of miniconda (based on Python 3.7) that is ready to use, running on a Linux system.
The following tools are required:
Apart from the basic scientific stack (numpy, pandas, scikit-learn, matplotlib, and of course, jupyter), the following Python libraries are required (included in the `requirements.txt` file).
Docker users: If you will be using the Docker image, you don't have to worry about any of the requirements, but make sure you have Docker installed on your machine. Just run the following command in your terminal to pull and run the latest version of the image.
docker run -p 8080:8888 pjcv89/autotag
If you have some experience with Docker, you may want to give your container a name with the `--name` flag and mount a volume with the `-v` flag to transfer data between your machine and the container. More specifically, you may want to choose some model and persist its model file so you can use it in another context (local mode or another container, for example, to use it in a web application). Once you have chosen a working directory on your machine, you can run the following command in your terminal.
docker run --name myautotag -p 8080:8888 -v $PWD/persist:/AutoTag/persist pjcv89/autotag
Once inside the container, and once you have generated the models, you can copy a model file from the `/models` folder to the `/persist` folder, and it will be copied to your machine in the `$PWD/persist` folder.
In either case, a notebook instance will be launched and you can go to http://localhost:8080/ to use it. Copy the token displayed in the command line and paste it into the Jupyter welcome page.
Local-mode users: You will need to install the requirements yourself. Once you have chosen a working directory, you can follow the commands shown below in your terminal.
# Install tools
apt-get update && apt-get -y install gcc g++ make cmake unzip
# Clone this repo.
git clone https://github.com/pjcv89/AutoTag.git && cd AutoTag
# Install Python libraries
pip install -r requirements.txt
# Give execution permissions to installation scripts
chmod u+x install_fasttext.sh install_starspace.sh
# fastText CLI and Python API installation
./install_fasttext.sh
# StarSpace CLI installation
./install_starspace.sh
# Install jupyter if it isn't installed yet
pip install jupyter
# Launch notebook
jupyter notebook
The following files are provided:
- `requirements.txt`: Text file with the required Python libraries.
- `install_starspace.sh`: Shell script to install StarSpace (CLI), based on its documentation.
- `install_fasttext.sh`: Shell script to install fastText (CLI and Python API) using cmake, based on its documentation.
- `Models.ipynb`: The development notebook used for this tutorial and required to reproduce the results. You can view the notebook with Jupyter Notebook Viewer here.
After executing the installation scripts, the following folders will be present:
- /Starspace: Contains StarSpace's source code.
- /fastText: Contains fastText's source code.
While running the notebook, the following folders will be created:
- /stacksample: Contains the `Questions.csv` and `Tags.csv` tables downloaded via the Kaggle API after executing the notebook.
- /data: Contains the `train`, `valid`, and `test` text files in the appropriate format required by both fastText and StarSpace, as well as the `train_weighted` file used to train StarSpace with label weights.
- /models: Contains the fastText and StarSpace model files.
- /predictions: Contains raw and processed text files with predictions for the test data, after inference with fastText and StarSpace.
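As an illustration of the format used by the files in /data: both fastText and StarSpace expect one example per line, with the tags encoded as tokens carrying a label prefix (`__label__` by default) followed by the text. A minimal sketch of how a question title and its tags might be serialized (the helper name and sample data are hypothetical, not taken from the notebook):

```python
def to_fasttext_line(title, tags, prefix="__label__"):
    """Serialize one example: all label tokens first, then the lowercased text."""
    labels = " ".join(prefix + tag for tag in tags)
    return f"{labels} {title.lower()}"

line = to_fasttext_line("How to parse JSON in Python?", ["python", "json"])
print(line)  # __label__python __label__json how to parse json in python?
```

The actual preprocessing in the notebook may differ (e.g., punctuation handling), but each line in `train`, `valid`, and `test` follows this labels-then-text layout.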
Additionally, you will need to download a Kaggle token (a `kaggle.json` file containing your Kaggle API credentials; more info here) so you can copy and paste the credentials to declare environment variables inside the notebook and download the stacksample data.
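Declaring the credentials inside the notebook can look like the following minimal sketch; the placeholder values must be replaced with the username and key found in your own `kaggle.json`, and the dataset slug in the comment is an assumption about where stacksample lives on Kaggle:

```python
import os

# Values copied from your kaggle.json file (placeholders shown here)
os.environ["KAGGLE_USERNAME"] = "your_kaggle_username"
os.environ["KAGGLE_KEY"] = "your_kaggle_api_key"

# With the credentials set, the Kaggle CLI can fetch the data, e.g.:
#   kaggle datasets download -d stackoverflow/stacksample
```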
The notebook is organized as follows.
- Getting the data
- Preparing the data
- (Quick) Data exploration & visualization
- Processing the data
- Creating training, validation and test sets
- Building the models
- fastText: baseline
- fastText: tuned model
- StarSpace: no label weights
- StarSpace: label weights
- Model evaluation
Please note that the aim of this tutorial is to show how to use fastText and StarSpace, so you should focus on parts 6 and 7.
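Since part 7 compares the four models on the test set, it helps to see what per-example evaluation looks like. Precision@k and recall@k are the metrics fastText itself reports with its `test` command; the helper below is a hypothetical sketch with toy tags, not the notebook's exact evaluation code:

```python
def precision_recall_at_k(true_tags, predicted_tags, k=5):
    """Precision@k and recall@k for a single example's predicted tags."""
    top_k = predicted_tags[:k]
    hits = len(set(top_k) & set(true_tags))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(true_tags) if true_tags else 0.0
    return precision, recall

p, r = precision_recall_at_k(["python", "json"], ["python", "pandas", "regex"], k=3)
# p = 1/3 (one of three predictions is correct), r = 1/2 (one of two true tags found)
```

Averaging these quantities over all test examples gives a single score per model, which is the kind of comparison carried out in the evaluation section.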
- fastText related papers:
- Bag of Tricks for Efficient Text Classification
- Enriching Word Vectors with Subword Information
- FastText.zip: Compressing text classification models
- Misspelling Oblivious Word Embeddings
- Techniques used in fastText to improve scalability and training time:
- Hierarchical softmax based on the Huffman coding tree
- The hashing trick
- Hyperparameter autotuning for fastText
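The hashing trick above can be sketched in a few lines: rather than storing a vocabulary entry for every word n-gram, each n-gram is hashed into one of a fixed number of buckets (fastText uses an FNV-1a hash with 2,000,000 buckets by default). The helper below is only illustrative of the idea, not fastText's actual implementation:

```python
def ngram_bucket(ngram, num_buckets=2_000_000):
    """Map an n-gram string to a fixed bucket id via a 32-bit FNV-1a hash."""
    h = 2166136261  # FNV-1a 32-bit offset basis
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32  # 16777619 is the FNV-1a prime
    return h % num_buckets

# Bigrams from a tokenized title all map into the same fixed-size table,
# so the model's memory footprint is bounded regardless of corpus size.
tokens = "how to parse json".split()
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
buckets = [ngram_bucket(bg) for bg in bigrams]
```

The trade-off is that distinct n-grams can collide in a bucket and share an embedding, which in practice costs little accuracy while keeping training fast and models small.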
- StarSpace paper: