CS221_Project

The purpose of this repository is to implement Quora insincere question classification (an NLP task) with deep learning. It is part of a CS221 course project.

Highlights

  • Naive Bayes Classifier as baseline
  • Bi-directional LSTM with combined embeddings and capsule network for aggregation as the final model
  • F1-score of 0.70

Environment:

Python 3.7.1
Tensorflow 1.13.1
Keras 2.2.4
pandas, NumPy, scikit-learn, scikit-plot, matplotlib
GPU with at least 16 GB of memory

Dataset:

The dataset is available on Kaggle at https://www.kaggle.com/c/quora-insincere-questions-classification/data.
A subset of this data is available as the train.csv file in the input directory.
The following pre-trained word embeddings are needed (a loading sketch follows this list):
GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html
Elmo - https://tfhub.dev/google/elmo/2
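
The snippet below is a minimal sketch of how one of these embedding files can be loaded and turned into a Keras embedding matrix, using GloVe as the example. The file paths, the question_text/target column names, and the vocabulary/length limits are assumptions for illustration, not the repository's exact code.

```python
# Minimal sketch (assumed paths and settings): tokenize train.csv and build an
# embedding matrix from glove.840B.300d for use in a Keras Embedding layer.
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

EMBED_DIM = 300
MAX_FEATURES = 50000   # vocabulary size kept by the tokenizer (assumed)
MAXLEN = 100           # questions are under 100 words

train = pd.read_csv("input/train.csv")            # columns assumed: question_text, target
texts = train["question_text"].fillna("").values
y = train["target"].values

tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(list(texts))
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAXLEN)

# Each line of the GloVe file is "word v1 v2 ... v300".
embeddings_index = {}
with open("input/glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().rsplit(" ", EMBED_DIM)
        embeddings_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Row i of the matrix is the vector for the word with tokenizer index i (0 = padding).
num_words = min(MAX_FEATURES, len(tokenizer.word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    if idx < num_words and word in embeddings_index:
        embedding_matrix[idx] = embeddings_index[word]
```

The same loop can be repeated per embedding file (Paragram, wiki-news) to produce one matrix per embedding.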

Models tried

  • Naive Bayes (Baseline)
  • Stacked BiLSTM + Embedding without aggregation layer
  • GloVe Embedding + BiLSTM + MaxPooling + Threshold optimization (a minimal sketch of this variant follows the list)
  • Paragram Embedding + BiLSTM + MaxPooling + Threshold optimization
  • ELMo Embedding + MaxPooling + Threshold optimization
  • Self-Trained Embedding + BiLSTM + MaxPooling + Threshold optimization
  • Wiki Embedding + BiLSTM + MaxPooling + Threshold optimization
  • GloVe+Paragram Embedding + BiLSTM + Capsule Network + Threshold optimization
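
For orientation, here is a minimal sketch of the "GloVe Embedding + BiLSTM + MaxPooling" variant referenced above. It reuses X, y, MAXLEN, num_words, EMBED_DIM, and embedding_matrix from the embedding sketch in the Dataset section; the layer sizes, dropout rate, and training settings are assumptions rather than the repository's exact hyperparameters.

```python
# Minimal sketch of the GloVe + BiLSTM + MaxPooling variant (hyperparameters assumed).
from keras.models import Model
from keras.layers import (Input, Embedding, Bidirectional, LSTM,
                          GlobalMaxPooling1D, Dense, Dropout)

inp = Input(shape=(MAXLEN,))
x = Embedding(num_words, EMBED_DIM, weights=[embedding_matrix], trainable=False)(inp)
x = Bidirectional(LSTM(64, return_sequences=True))(x)   # sequence of hidden states
x = GlobalMaxPooling1D()(x)                              # max-pooling aggregation over time
x = Dense(32, activation="relu")(x)
x = Dropout(0.2)(x)
out = Dense(1, activation="sigmoid")(x)                  # insincere-question probability

model = Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, batch_size=512, epochs=2, validation_split=0.1)
```

The threshold-optimization step listed with these variants is sketched at the end of the Conclusions section.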

How To Execute

The models were first implemented in notebooks and then converted to *.py scripts.

  • NB - python ./quora_baseline.py
  • Glove + LSTM - python ./quora_LSTM_glove.py
  • Elmo - python ./quora_LSTM_elmo.py
  • Paragram - python ./quora_LSTM_paragram.py
  • Self Train - python ./quora_LSTM_self_train.py
  • Capsule Network - python ./quora_LSTM_capsule.py

Final Model

(Architecture diagram: combined GloVe and Paragram embeddings, a Bi-directional LSTM encoder, and a capsule network aggregation layer.)
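
Since the diagram itself is not reproduced here, the sketch below reconstructs the final architecture described in this README: GloVe and Paragram embedding matrices combined into a single Embedding layer, a Bi-directional LSTM encoder, and a capsule layer with dynamic routing as the aggregation step. The Capsule layer is a compact dynamic-routing implementation written for this sketch; the capsule counts, dimensions, routing iterations, layer sizes, and the choice to average the two embedding matrices are all assumptions rather than the repository's exact code.

```python
# Minimal sketch of the final model: combined GloVe + Paragram embeddings -> BiLSTM
# -> capsule aggregation with dynamic routing -> sigmoid classifier.
# All sizes, the routing count, and the averaging of embeddings are assumed values.
import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.models import Model
from keras.layers import Layer, Input, Embedding, Bidirectional, LSTM, Flatten, Dense, Dropout


def squash(s, axis=-1):
    """Scale vector s to length in (0, 1) while preserving its direction."""
    sq_norm = K.sum(K.square(s), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / K.sqrt(sq_norm + K.epsilon())


class Capsule(Layer):
    """Aggregates a sequence of BiLSTM states into num_capsule output capsules."""

    def __init__(self, num_capsule=10, dim_capsule=16, routings=3, **kwargs):
        super(Capsule, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        # Shared linear map from every timestep to all output capsules.
        self.W = self.add_weight(name="capsule_W",
                                 shape=(in_dim, self.num_capsule * self.dim_capsule),
                                 initializer="glorot_uniform", trainable=True)
        super(Capsule, self).build(input_shape)

    def call(self, u):
        # u: (batch, T, in_dim) -> prediction vectors u_hat: (batch, num_capsule, T, dim_capsule)
        u_hat = K.dot(u, self.W)
        u_hat = K.reshape(u_hat, (-1, K.shape(u)[1], self.num_capsule, self.dim_capsule))
        u_hat = K.permute_dimensions(u_hat, (0, 2, 1, 3))

        b = K.zeros_like(u_hat[:, :, :, 0])              # routing logits: (batch, num_capsule, T)
        for i in range(self.routings):
            c = K.softmax(b, axis=1)                      # how strongly each timestep routes to each capsule
            s = tf.einsum("bct,bctd->bcd", c, u_hat)      # weighted sum of predictions over timesteps
            v = squash(s)
            if i < self.routings - 1:
                b = b + tf.einsum("bcd,bctd->bct", v, u_hat)  # raise logits where predictions agree
        return v

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.num_capsule, self.dim_capsule)


# glove_matrix and paragram_matrix are built as in the embedding sketch above, once per
# embedding file; averaging is one common way to combine them (concatenation is another).
combined_matrix = np.mean([glove_matrix, paragram_matrix], axis=0)

inp = Input(shape=(MAXLEN,))
x = Embedding(num_words, EMBED_DIM, weights=[combined_matrix], trainable=False)(inp)
x = Bidirectional(LSTM(64, return_sequences=True))(x)
x = Capsule(num_capsule=10, dim_capsule=16, routings=3)(x)
x = Flatten()(x)
x = Dropout(0.25)(x)
out = Dense(1, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```

Training and the validation-set threshold search then proceed as in the simpler variants.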

Results

Model                                                         F1 Score on Test Set
Baseline (Naive Bayes)                                        0.53
Initial Model (No Aggregation Layer)                          0.607
Updated Model (GloVe Only)                                    0.677
Updated Model (Paragram Only)                                 0.676
Updated Model (Wiki)                                          0.665
Updated Model (ELMo)                                          0.644
Updated Model (Self-Trained Emb.)                             0.653
Final Model (GloVe and Paragram, Capsule Aggregation Layer)   0.70
Augmented Data (Twitter Data with Final Architecture)         0.755

Conclusions

In this project we built a sequence model with word-embedding inputs and dynamic-routing (capsule) aggregation for semantic classification of medium-length sentences (under 100 words). The model was developed in stages, assessing the improvement at each step. The conclusions from our experiments are summarized below.

  • Tokenization with minimal pre-processing is adequate for representing the features of a sentence.
  • Word embeddings are essential for natural language processing tasks: they reduce the complexity of handling a large vocabulary and generalize well to words that do not appear in the training set.
  • Ensembling multiple embeddings improves performance through better generalization, since the combined embeddings carry richer context.
  • Capsule networks aggregate the LSTM encoder outputs better than a pooling layer. They improve classification performance by capturing the spatial/temporal relationships between entities, which they learn via dynamic routing.
  • If the dataset is imbalanced, optimizing the decision threshold on a validation set achieves a better precision-recall trade-off and improves performance (a minimal sketch follows this list).
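
As an illustration of that last point, the sketch below scans candidate probability thresholds on held-out validation predictions and keeps the one with the best F1 score. The names model, X_val, y_val, and X_test refer to the earlier sketches and are assumptions, not the repository's exact variables.

```python
# Minimal sketch: choose the decision threshold that maximizes F1 on a validation set.
import numpy as np
from sklearn.metrics import f1_score

val_probs = model.predict(X_val, batch_size=512).ravel()   # sigmoid outputs in [0, 1]

best_threshold, best_f1 = 0.5, 0.0
for threshold in np.arange(0.10, 0.90, 0.01):
    f1 = f1_score(y_val, (val_probs > threshold).astype(int))
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print("best threshold %.2f -> validation F1 %.3f" % (best_threshold, best_f1))

# Apply the tuned threshold to the test-set predictions.
test_preds = (model.predict(X_test, batch_size=512).ravel() > best_threshold).astype(int)
```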
