flipkart-incubator / optimus

Train, evaluate and deploy Deep Learning based text classifiers. Currently supports CNN

License: Apache License 2.0


Optimus

Quickly train, evaluate and deploy an optimum classifier for your text classification task. Currently, it lets you train a CNN (Convolutional Neural Network) based text classifier. Using this toolkit, you should be able to train a classifier for most text classification tasks without writing a single line of code.

The main features of Optimus are:

  • Easily train a CNN classifier
  • Config driven to make hyperparameter tuning and experimentation easy
  • Debug mode, which lets you visualize what is happening in the model's internal layers
  • Flask server for querying the trained model through an API
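
As an illustration of the config-driven workflow, an experiment becomes a config change rather than a code change. The key names below are invented for illustration; the real schema is documented in the project's Quick Start guide:

```python
# Hypothetical config for one experiment; tuning a hyperparameter means
# editing this mapping (or a config file), not the training code.
config = {
    "filter_sizes": [3, 4, 5],   # widths of the CNN's convolution windows
    "num_filters": 100,          # feature maps per filter size
    "dropout_rate": 0.5,
    "batch_size": 50,
    "num_epochs": 25,
}

def run_experiment(cfg):
    # A real run would build, train and evaluate the model from `cfg`;
    # here we just echo the training-loop parameters it would consume.
    return {"batch_size": cfg["batch_size"], "epochs": cfg["num_epochs"]}
```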

This project is based on https://github.com/yoonkim/CNN_sentence (many thanks to Yoon Kim for open-sourcing the code for his paper http://arxiv.org/abs/1408.5882, arguably the best generic deep learning based text classifier at the time of writing). The improvements over the original code are:

  • Multi-channel mode
  • Complete refactoring to make the code modular
  • GPU/CPU unpickling of models
  • Config driven, for easy experimentation
  • Model serialization/deserialization
  • Detailed evaluation results
  • Model deployment on a Flask server
  • Multi-class classification [in progress]
  • Debug mode [in progress]

This project is also inspired by https://github.com/japerk/nltk-trainer, which allows users to easily train NLTK based statistical classifiers.

Why deep learning?

Deep learning has dominated pattern recognition in recent years, especially in image and speech recognition. Recently, deep learning models have outperformed statistical classifiers on a variety of NLP tasks as well. One of the biggest advantages of deep learning models is that task-specific feature engineering is not required. The wiki contains a summary of exciting results we obtained using Optimus on a variety of text classification tasks. Those interested in understanding how this model works can also check out my talk at Fifth Elephant, in which I give an introduction to NLP using deep learning. Other good resources can be found here and here.

Requirements

The code requires Python 2.7 and Theano 0.7. See the Setting Up page for instructions on quickly setting up the Python environment required for Optimus. Dependencies are also listed in the requirements.txt file.

Start Using it

Visit the Quick Start guide to get started on using Optimus! I have also written a small tutorial on Optimus on my blog.

You can compare models trained with Optimus against statistical models using https://github.com/japerk/nltk-trainer, an awesome tool for easily training statistical classifiers. If you get good results on a dataset, I would love to hear about them!

If you face any issue, you can open an issue on GitHub or send me a mail at [email protected]. Suggestions and improvements are most welcome; open GitHub issues are a good place to start. A contributor's guide is in the works.

Core contributors

devashishshankar, gireeshjoshi, kshitizg, shrayasr


optimus's Issues

Memory error on large test set

A memory error is thrown on a large test set. We can create mini-batches of the test set in the same way as we already do for the train set.
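
The fix proposed above (mini-batching the test set) can be sketched in plain numpy. `predict_fn` and the batch size are illustrative stand-ins, not the project's actual API:

```python
import numpy as np

def predict_in_batches(predict_fn, X, batch_size=64):
    # Feed the test set through the model one mini-batch at a time,
    # so only `batch_size` examples are resident in memory at once.
    preds = []
    for start in range(0, len(X), batch_size):
        preds.append(predict_fn(X[start:start + batch_size]))
    return np.concatenate(preds)

# Toy stand-in for a trained classifier: predict the sign of the row sum.
predict = lambda batch: (batch.sum(axis=1) > 0).astype(int)
X = np.random.randn(1000, 10)
y = predict_in_batches(predict, X, batch_size=64)
```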

Setting up the environment

For pip install -U scikit-learn

Error:

Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-vDJ91c/scikit-learn/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-P3laW8-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-vDJ91c/scikit-learn

I am not able to comprehend this error.

No link available to check the wiki which contains the results of deploying this algorithm

I appreciate the great work you have done, and it is nice to see Flipkart supporting research in deep learning for NLP. However, the Readme says that "The wiki contains a summary of exciting results we obtained using optimus, on a variety of different text classification tasks", but I cannot find the wiki. It would be interesting to see the results, especially on sentences containing double negatives like "Overall it wasn't a bad movie".

Segmentation Fault During Testing

python test.py sample/myFirstModel.p sample/datasets/sst_small_sample.csv sample/outputNonStatic true false

Performing this operation after training results in a segmentation fault

Reading line no. 0
['neg', 'pos']
lds 300
dss (64,)
Segmentation fault (core dumped)

EDIT: the problems arose from not using gpu_to_cpu.py; once I used it, everything worked perfectly.

NotImplementedError: The image and the kernel must have the same type.inputs

Hi, I'm trying to use the GPU to train on EC2 with a g2 instance, but with all sorts of configurations I get this error. I'm using Theano 0.7 and have set floatX=float32 and device=gpu0.
This doesn't work, and I noticed that the same thing happens with Yoon's original code too.

Has anyone found a way around this error?

If anybody has a configuration where GPU computation is working, please post your config!
Thanks!
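
The "image and the kernel must have the same type" error usually means float64 values are meeting float32 filters (or vice versa), so the floatX setting matters. A configuration along these lines was commonly used with Theano 0.7 on EC2 g2 (GRID K520) instances; treat it as a starting point rather than a guaranteed fix, and check that your input data is cast to float32 before it reaches the network:

```ini
# ~/.theanorc
[global]
floatX = float32
device = gpu0

[nvcc]
fastmath = True
```

Equivalently, the same flags can be passed per run via the environment: THEANO_FLAGS=floatX=float32,device=gpu0 python <your training script>.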

Integrate multi label classification

We have modified Yoon's original code to support multi-label classification. This was done by changing the last layer to a softmax. The task here is to integrate that code with the current refactored code.
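
The issue mentions a softmax last layer; note that for multi-label outputs the more common choice is independent per-class sigmoids, since softmax forces the class probabilities to sum to one and so allows only a single label. A minimal numpy sketch of the sigmoid variant, with illustrative names and shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_predict(features, W, b, threshold=0.5):
    # Each output unit is an independent Bernoulli probability, so
    # several labels can fire at once (unlike a softmax layer).
    scores = sigmoid(features.dot(W) + b)   # (n_samples, n_labels)
    return (scores >= threshold).astype(int)
```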

GPU/CPU pickling

Models trained on a GPU are not deserializable on a CPU. Many use cases involve training models on a GPU and deploying them on a CPU.

The exact problem is that GPU models store CudaNdarrays instead of numpy arrays. When the model is unpickled, an error is thrown saying CUDA is not found. This cannot be fixed simply by installing CUDA on the CPU machine; all the CudaNdarrays have to be converted into numpy arrays.

There are two ways to solve this problem:

  1. Write a script that opens the pickle file, recursively walks the object, and converts every CudaNdarray into a numpy array. This is what Pylearn2 does, but it is hacky.
  2. While serializing, convert all CudaNdarrays into numpy arrays. This would require:
    • Figuring out which variables are CudaNdarrays (this can be done by opening a model file trained on a GPU)
    • When pickle is called on the model object, converting those variables into numpy arrays before pickling
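
A minimal sketch of option 1, assuming only that the GPU arrays belong to a type named CudaNdarray that numpy can convert. The type check is by name, so the script can run on a machine without CUDA; the recursion over lists, dicts and object attributes is illustrative, and a real model may need more cases:

```python
import numpy as np

def to_numpy(obj, seen=None):
    # Recursively walk an unpickled model object, replacing every
    # CudaNdarray-like value with a plain numpy array in place.
    if seen is None:
        seen = set()
    if id(obj) in seen:          # guard against cycles in the object graph
        return obj
    seen.add(id(obj))
    if type(obj).__name__ == 'CudaNdarray':
        return np.asarray(obj)
    if isinstance(obj, list):
        return [to_numpy(v, seen) for v in obj]
    if isinstance(obj, dict):
        return {k: to_numpy(v, seen) for k, v in obj.items()}
    if hasattr(obj, '__dict__'):
        for k, v in vars(obj).items():
            setattr(obj, k, to_numpy(v, seen))
    return obj
```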
