Kaggle: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- Records: 50,000
- Columns: 2
- review
- sentiment - "positive" and "negative" => binary classification problem
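Loading the CSV and encoding the two labels as 0/1 can be sketched as follows (the two-row DataFrame below stands in for the 50,000-row Kaggle file, and the `IMDB Dataset.csv` file name is assumed from the Kaggle page):

```python
import pandas as pd

# Tiny stand-in for pd.read_csv("IMDB Dataset.csv"); the real data has 50,000 rows
df = pd.DataFrame({
    "review": ["A wonderful little production.", "Boring from start to finish."],
    "sentiment": ["positive", "negative"],
})

# Binary target for classification: positive -> 1, negative -> 0
df["label"] = (df["sentiment"] == "positive").astype(int)
```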
- Pandas
- Seaborn
- Numpy
- Scikit-learn
- Tensorflow
- Keras
- Matplotlib
- Pickle
```
pip install -r requirements.txt
```
- Sequential model
- One Embedding layer
- Flattening layer
- Dense layer
- activation function
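The stack above can be sketched in Keras as follows (the vocabulary size, embedding dimension, and padded review length are illustrative assumptions, not the notebook's exact values):

```python
from tensorflow.keras import layers, models

# Illustrative hyperparameters (assumed, not taken from the notebook)
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 5000, 100, 100

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),           # padded, integer-encoded reviews
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # one embedding layer
    layers.Flatten(),                         # flattening layer
    layers.Dense(1, activation="sigmoid"),    # dense layer + activation for binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```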
[Notebook](https://github.com/likarajo/movie_sentiment/blob/master/model_NN.ipynb)
Convolutional neural networks are primarily used for classifying 2D data such as images, but they also work well with 1D text data. The first layer tries to detect specific low-level features; subsequent layers combine the initially detected features into larger ones. Ref: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
- Sequential model
- One Embedding layer
- 1D convolutional layer
  - filters (kernels)
- activation function
- Global max pooling layer
- reduce feature size
- Dense layer
- activation function
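A minimal Keras sketch of this architecture (the filter count, kernel size, and other hyperparameters below are illustrative assumptions):

```python
from tensorflow.keras import layers, models

# Illustrative hyperparameters (assumed, not taken from the notebook)
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 5000, 100, 100

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),  # 1D convolution
    layers.GlobalMaxPooling1D(),              # reduces each feature map to a single value
    layers.Dense(1, activation="sigmoid"),    # dense layer + activation for binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```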
[Notebook](https://github.com/likarajo/movie_sentiment/blob/master/model_CNN.ipynb)
Long Short-Term Memory (LSTM)
A variant of recurrent neural networks (RNNs).
- Sequential model
- One Embedding layer
- LSTM layer
- neurons
- Dense layer
- activation function
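The LSTM stack can be sketched as follows (the number of LSTM units and the other hyperparameters are illustrative assumptions):

```python
from tensorflow.keras import layers, models

# Illustrative hyperparameters (assumed, not taken from the notebook)
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 5000, 100, 100

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.LSTM(128),                         # 128 neurons is an illustrative choice
    layers.Dense(1, activation="sigmoid"),    # dense layer + activation for binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```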
[Notebook](https://github.com/likarajo/movie_sentiment/blob/master/model_RNN.ipynb)
- Keras Embedding Layer
- Stanford NLP GloVe word embeddings
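One common way to use pretrained GloVe vectors is to build a weight matrix indexed by the tokenizer's word index and load it into a frozen Embedding layer. A sketch under that assumption (the helper function and the `glove.6B.100d.txt` file name are illustrative, not the notebook's code):

```python
import numpy as np

def build_embedding_matrix(word_index, glove_vectors, dim=100):
    """Rows are indexed by the tokenizer's word ids; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim))  # +1: index 0 is the padding id
    for word, i in word_index.items():
        vec = glove_vectors.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix

# glove_vectors would normally be parsed from e.g. glove.6B.100d.txt
# (one "word v1 v2 ... v100" line per word). The matrix is then loaded into
# a frozen Keras Embedding layer, e.g. via
# embeddings_initializer=Constant(matrix) together with trainable=False.
```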
- The difference between the training and test accuracy is much smaller for the Recurrent NN than for the Simple NN or the Convolutional NN.
- The difference between the training and test loss values is negligible for the Recurrent NN.
- The model is NOT overfitting.
So the RNN (LSTM) is the best algorithm for our text classification model.
The number of layers, the number of neurons, the hyperparameter values, the activation functions, etc. can be tuned to find the best NN model.