Classifying spam emails with CountVectorizer, TF-IDF, and LinearSVC
Making use of:
- Tokenization with lemmatization through NLTK's WordNetLemmatizer
- TF-IDF weighting through TfidfTransformer [https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Xoesx4hKiUk]
- LinearSVC for classification [https://www.youtube.com/watch?v=efR1C6CvhmE]
- Grid search with cross-validation to find the best hyperparameters for the classifier [https://www.youtube.com/watch?v=fSytzGwwBVw]
- Evaluation with plot_precision_recall_curve, so that the results stay meaningful if the dataset is imbalanced [https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/]
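
The steps listed above can be sketched end to end as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions, not the actual implementation: the messages are made-up stand-ins for the real dataset, the `C` grid and the `average_precision` scoring are illustrative choices, and CountVectorizer's default tokenizer is used in place of a WordNetLemmatizer-based one (which needs the NLTK WordNet corpus to be downloaded first). It also computes the curve with `precision_recall_curve` rather than `plot_precision_recall_curve`, because the latter was removed in scikit-learn 1.2 in favor of `PrecisionRecallDisplay`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy messages (1 = spam, 0 = ham); stand-ins for the real dataset.
train_texts = [
    "win a free prize now", "claim your free cash reward",
    "cheap loans approved instantly", "urgent offer click this link",
    "free entry to win money", "you won a cash prize call now",
    "meeting moved to monday morning", "please review the attached report",
    "lunch tomorrow at noon", "thanks for the project update",
    "see you at the team meeting", "can you send the report today",
]
train_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Small held-out set, also made up, used only for the evaluation step.
eval_texts = [
    "free cash prize click now", "win money with this urgent offer",
    "report for the monday meeting", "thanks see you at lunch",
]
eval_labels = [1, 1, 0, 0]

# Raw token counts -> TF-IDF weights -> linear support vector classifier.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("svc", LinearSVC()),
])

# Illustrative grid over the regularization strength C (assumed values).
param_grid = {"svc__C": [0.1, 1.0, 10.0]}

# cv=2 only because the toy set is tiny; a real dataset would use more folds.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="average_precision")
search.fit(train_texts, train_labels)

# LinearSVC has no predict_proba, so the signed distance to the decision
# boundary serves as the ranking score for the precision-recall curve.
scores = search.decision_function(eval_texts)
precision, recall, thresholds = precision_recall_curve(eval_labels, scores)
average_precision = average_precision_score(eval_labels, scores)

print("best parameters:", search.best_params_)
print("average precision on held-out messages:", average_precision)
```

To plot the curve on a recent scikit-learn version, `PrecisionRecallDisplay.from_estimator(search, eval_texts, eval_labels)` draws the same quantities with matplotlib.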