Build text classifiers using the three most popular machine learning and deep learning libraries: scikit-learn, PyTorch, and TensorFlow
You can download Anaconda Individual Edition from https://www.anaconda.com/products/individual, which bundles many of the libraries commonly used by data scientists. Another option is to install the following packages individually using the pip package manager.
- python 3.8.3
- tensorflow 2.2.0
- torch 1.5.1
- jupyterlab 2.1.4
- pandas 1.0.4
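To confirm that an environment matches the versions listed above, a small check along these lines may help. This is a minimal sketch, not part of the project: the `EXPECTED` table and the `installed_versions` helper are hypothetical names introduced here, and the lookup relies on the standard-library `importlib.metadata` module (available from Python 3.8).

```python
from importlib import metadata

# Versions copied from the list in this README (hypothetical helper table).
EXPECTED = {
    "tensorflow": "2.2.0",
    "torch": "1.5.1",
    "jupyterlab": "2.1.4",
    "pandas": "1.0.4",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        wanted = EXPECTED[name]
        if version is None:
            status = "missing"
        elif version == wanted:
            status = "ok"
        else:
            status = "differs"
        print(f"{name}: installed={version} expected={wanted} [{status}]")
```

A differing version does not necessarily break the notebooks; the listed versions are simply the ones the project was written against.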
Text classification is widely used in real-world business processes such as email spam detection, support ticket classification, and content recommendation based on text topics. I would like to build a multi-class text classifier with each of the three most popular open source machine learning and deep learning libraries: scikit-learn, PyTorch, and TensorFlow. I am interested in seeing how they perform compared with each other.
- gather_explore_data.ipynb: Gathers the sample data used for this project and explores what the data look like
- feature_extraction.ipynb: Transforms text into numerical vector representations that can be fed into models for training
- util.py: Helper functions for feature extraction
- model_scikit_learn.ipynb: Builds and trains text classifiers using scikit-learn
- model_pytorch.ipynb: Builds and trains a text classifier using PyTorch
- model_tensorflow_tfidf.ipynb: Builds and trains a text classifier using TensorFlow, encoding the input text with TF-IDF
- model_tensorflow.ipynb: Builds and trains a text classifier using TensorFlow, encoding the input text as padded sequences and applying word embeddings
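The two text encodings named in the list above, TF-IDF vectors and padded integer sequences, can be illustrated with a library-free sketch. This is not the code in `util.py` or the notebooks; the function names, the whitespace tokenizer, and the choice to pad at the end of each sequence are all assumptions made for illustration. In practice, library implementations such as scikit-learn's `TfidfVectorizer` or Keras' `pad_sequences` would normally be used, and they differ in details such as smoothing, normalization, and padding side.

```python
import math
from collections import Counter

PAD_ID = 0  # reserved id used to fill short sequences
OOV_ID = 1  # reserved id for words not seen when the index was built


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) weighted as tf * (ln(N / df) + 1)."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero on empty documents
        vectors.append([
            (counts[tok] / total) * (math.log(n_docs / doc_freq[tok]) + 1.0)
            for tok in vocab
        ])
    return vocab, vectors


def build_word_index(docs):
    """Assign each token an integer id in order of first appearance."""
    index = {}
    for doc in docs:
        for tok in tokenize(doc):
            if tok not in index:
                index[tok] = len(index) + 2  # ids 0 and 1 are reserved
    return index


def to_padded_sequences(docs, word_index, max_len):
    """Encode documents as id lists, truncated or end-padded to max_len."""
    sequences = []
    for doc in docs:
        ids = [word_index.get(tok, OOV_ID) for tok in tokenize(doc)][:max_len]
        sequences.append(ids + [PAD_ID] * (max_len - len(ids)))
    return sequences
```

TF-IDF produces one fixed-width vector per document and discards word order, which suits linear models and simple dense networks. Padded sequences keep word order and are the usual input to an embedding layer.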
The results are discussed in the post at https://medium.com/@donglinchen/text-classification-using-scikit-learn-pytorch-and-tensorflow-a3350808f9f7
Sample data are available at: https://www.kaggle.com/yufengdev/bbc-fulltext-and-category
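After downloading the data, a quick look at the class balance could be sketched as follows. The file name `bbc-text.csv` and the column name `category` are assumptions about the download, not something stated in this README, so adjust them to match the actual file. The `category_counts` helper is a hypothetical name; `pandas.read_csv` followed by `value_counts()` would work equally well.

```python
import csv
from collections import Counter


def category_counts(csv_path, label_column="category"):
    """Count how many rows carry each label in a CSV file with a header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[label_column] for row in reader)


# Example usage (assumes the file sits next to the notebooks):
# for label, count in category_counts("bbc-text.csv").most_common():
#     print(label, count)
```

Roughly balanced counts suggest plain accuracy is a reasonable first metric; a strongly skewed distribution would call for per-class metrics instead.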