The preprocessing code is available in the Google Colab notebook. Either of the following methods can be used to clean the data (see the sketch after this list):
- Cleaning Using Regular Expression
- Stemming
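As a rough illustration of the two approaches, the cleaning step could look something like the sketch below. The column name, regex patterns, and choice of `PorterStemmer` are assumptions, not the project's exact code:

```python
import re
from nltk.stem import PorterStemmer

def clean_with_regex(text):
    """Strip URLs and non-alphabetic characters, collapse whitespace (patterns assumed)."""
    text = re.sub(r"http\S+", " ", text)       # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # keep letters only
    return re.sub(r"\s+", " ", text).strip().lower()

stemmer = PorterStemmer()

def stem_text(text):
    """Reduce each token to its stem, e.g. 'running' -> 'run'."""
    return " ".join(stemmer.stem(word) for word in text.split())
```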
The programmer can either run the code to clean the dataset or run the code to import the already cleaned dataset directly from Drive. The second option is recommended, as it saves time.
This step converts the given sequences of text into vectors. Word vectorization maps each word in the vocabulary to a corresponding vector of real numbers, which can then be used for word prediction and word similarity/semantics. We have used the TF-IDF vectorizer.
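A minimal sketch of TF-IDF vectorization with scikit-learn; the variable names and the `max_features` value are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer on the training text only, then transform both splits
vectorizer = TfidfVectorizer(max_features=10000)  # vocabulary size is an assumption
X_train = vectorizer.fit_transform(train_texts)   # sparse matrix of TF-IDF weights
X_val = vectorizer.transform(val_texts)
```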
The dataset is highly imbalanced, so we have used SMOTE (Synthetic Minority Over-sampling Technique) on the minority class and RandomUnderSampler on the majority class to balance the dataset.
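With the imbalanced-learn library, combining the two resamplers could look roughly like this. The sampling ratios below are assumptions (and only meaningful for a binary problem; a multiclass setup would use the defaults or a per-class dict):

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Oversample the minority class, then undersample the majority class
resampler = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),       # ratio assumed
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=42)),
])
X_balanced, y_balanced = resampler.fit_resample(X_train, y_train)
```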
Label encoding is used to represent the labels in numeric form. However, the sets of labels appearing in the train and validation datasets are not identical, so one-hot encoding is used to ensure that the target classes have a uniform shape.
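A sketch of the encoding step; fitting the encoder on the combined label set is an assumption made here to guarantee both splits share one class mapping:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Fit on all labels so train and validation use the same numeric mapping
encoder = LabelEncoder()
encoder.fit(np.concatenate([y_train, y_val]))
num_classes = len(encoder.classes_)

# One-hot encode so both splits have shape (n_samples, num_classes)
y_train_oh = to_categorical(encoder.transform(y_train), num_classes=num_classes)
y_val_oh = to_categorical(encoder.transform(y_val), num_classes=num_classes)
```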
Train the neural network from the Colab code, or import the trained model directly from the model subdirectory in the Drive link.
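Importing the trained model from Drive might look like this; the mount point and model path below are placeholders, not the project's actual paths:

```python
from google.colab import drive
from tensorflow.keras.models import load_model

drive.mount("/content/drive")
# Placeholder path; substitute the model subdirectory from the Drive link
model = load_model("/content/drive/MyDrive/model")
```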
One can make predictions and calculate the F1 score on either the validation or the test dataset by running the function predictions_f1score.
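The exact signature of predictions_f1score lives in the notebook; conceptually it would reduce to something like the following sketch, where the argument names and the averaging mode are assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score

def predictions_f1score(model, X, y_true_onehot):
    """Predict class labels and score them against one-hot ground truth."""
    y_pred = np.argmax(model.predict(X), axis=1)
    y_true = np.argmax(y_true_onehot, axis=1)
    return y_pred, f1_score(y_true, y_pred, average="macro")  # averaging assumed
```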
We have used a fully connected neural network as our model for this project. Packages such as TensorFlow, scikit-learn, NumPy, pandas, Matplotlib, and Seaborn are used as required. Training the model takes roughly 1 hour 40 minutes.
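A minimal fully connected network in Keras, purely illustrative: the layer sizes, dropout rate, optimizer, and epoch count are assumptions, not the project's exact architecture:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# X_train is assumed dense here; a sparse TF-IDF matrix would need .toarray() first
model = Sequential([
    Dense(512, activation="relu", input_shape=(X_train.shape[1],)),  # width assumed
    Dropout(0.3),                                  # dropout rate is an assumption
    Dense(128, activation="relu"),
    Dense(num_classes, activation="softmax"),      # one output unit per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train_oh, validation_data=(X_val, y_val_oh), epochs=10)
```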
We have used Google Colab for training our model. Google Colab runs a Python 3 Google Compute Engine backend. Total RAM provided by Google Colab: 12.69 GB (we used almost 4 GB); disk space: 42.16 GB / 107.72 GB.
- Kriti Nyoupane - [email protected]
- Gaurav Jyakhwa - [email protected]