We're building this project for the Kavach Hackathon. It aims to detect fake news articles using machine learning: the goal is a model that accurately classifies news articles as either real or fake. To limit the spread of fake news, the tool would automatically send official, authenticated news content to the inboxes of people who have shared fake news. This would help to inform them and discourage them from spreading false information in the future.
To get started with this project, you will need to have Python 3 installed on your computer. You will also need to install the following packages:
- pandas
- numpy
- scikit-learn
- nltk
- matplotlib
You can install these packages using pip:
pip install pandas numpy scikit-learn nltk matplotlib
We're using the IFND dataset to train our model. It is a collection of real and fake news articles from various sources. You can find this dataset on Kaggle.
Before training the model, the text data needs to be preprocessed. The following steps are performed:
- Lowercasing: All text is converted to lowercase.
- Tokenization: The text is split into individual words.
- Stopword removal: Common words like "the" and "and" are removed.
- Stemming: Words are reduced to their stem form (e.g. "running" becomes "run").
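The four steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: the `preprocess` function, the regex tokenizer, and the tiny `STOPWORDS` set are all assumptions made so the example runs without downloading NLTK data. Only `PorterStemmer` comes from NLTK here.

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stopword set (an assumption). NLTK's full English list
# is available from nltk.corpus.stopwords after nltk.download("stopwords").
STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "are"}

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize, remove stopwords, and stem a piece of text."""
    lowered = text.lower()                       # 1. lowercasing
    tokens = re.findall(r"[a-z0-9]+", lowered)   # 2. simple regex tokenization
    kept = [t for t in tokens if t not in STOPWORDS]  # 3. stopword removal
    return [stemmer.stem(t) for t in kept]       # 4. stemming

tokens = preprocess("The dogs and the cats are running")
print(tokens)
```

A real pipeline would likely use a proper tokenizer and the full stopword list; the regex and the short set are only there to keep the sketch self-contained.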
We train two models: a logistic regression classifier and a KNN classifier. The data is split 80/20 into a training set and a validation set. Each model is trained on the training set and evaluated on the validation set.
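A training run along these lines might look like the sketch below. The toy headlines, the label encoding (1 = fake, 0 = real), the `random_state`, and the choice to fit the TF-IDF vectorizer on the training split only are all assumptions for illustration; the project's actual training script may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; 1 = fake, 0 = real (encoding is an assumption).
texts = [
    "aliens land in delhi and meet ministers",
    "central bank keeps interest rates unchanged",
    "miracle fruit cures all diseases overnight",
    "city council approves budget for public schools",
    "celebrity secretly replaced by a robot clone",
    "space agency launches weather satellite",
    "drinking seawater makes you immortal",
    "health ministry starts vaccination drive",
    "moon declared property of a private company",
    "national team wins cricket series",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# 80/20 split, stratified so both classes appear in each part.
X_train_text, X_val_text, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit TF-IDF on the training text only, then reuse it for validation.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_val = vectorizer.transform(X_val_text)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
val_predictions = model.predict(X_val)
print(len(X_train_text), len(X_val_text))
```

Fitting the vectorizer on the training split alone keeps validation vocabulary out of the features, which is one common way to avoid leaking information into the evaluation.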
The models are evaluated using accuracy and the F1 score, which is the harmonic mean of precision and recall. Both metrics range from 0 to 1, with 1 being the best possible score. Our logistic regression model (lr_model.pkl) reached an accuracy of 0.91, which ranks it above the decision tree classifier (0.89) and the naive Bayes classifier (0.90).
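To make the two metrics concrete, here is a small sketch using scikit-learn's metric functions on made-up labels. The `y_true` and `y_pred` lists are hypothetical and unrelated to the scores reported above.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical validation labels and predictions (1 = fake, 0 = real).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # fraction of correct predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall.
harmonic_mean = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1, harmonic_mean)
```

Here 6 of 8 predictions are correct, and there are 3 true positives, 1 false positive, and 1 false negative, so accuracy, precision, recall, and F1 all come out to 0.75.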
We haven't deployed the project publicly, and we don't plan to in the near future. You can still use our trained models to make predictions with the help of lr_model.pkl, knn_model.pkl and tfidf_model.pkl.
You can load them with the following Python code:
import pickle
# Load the fitted TF-IDF vectorizer and the two trained models.
tfidf_vectorizer = pickle.load(open('tfidf.pkl', 'rb'))
pickled_model = pickle.load(open('lr_model.pkl', 'rb'))
knn_model = pickle.load(open('knn_model.pkl', 'rb'))
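Once loaded, a vectorizer and classifier pair is typically used in two steps: transform the raw text into TF-IDF features, then call `predict`. The sketch below shows that flow under stated assumptions. Because the real .pkl files are not available here, it fits toy stand-ins and round-trips them through `pickle` in memory. The `classify` helper, the toy data, and the label encoding (1 = fake, 0 = real) are hypothetical, and any preprocessing the project applies before vectorizing is omitted.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the project's pickled objects (assumptions).
train_texts = [
    "miracle fruit cures all diseases overnight",
    "central bank keeps interest rates unchanged",
    "drinking seawater makes you immortal",
    "health ministry starts vaccination drive",
]
train_labels = [1, 0, 1, 0]  # 1 = fake, 0 = real (assumed encoding)

fitted_vectorizer = TfidfVectorizer().fit(train_texts)
fitted_model = LogisticRegression(max_iter=1000).fit(
    fitted_vectorizer.transform(train_texts), train_labels
)

# Round-trip through pickle to mimic loading the objects from disk.
tfidf_vectorizer = pickle.loads(pickle.dumps(fitted_vectorizer))
pickled_model = pickle.loads(pickle.dumps(fitted_model))

def classify(headlines):
    """Vectorize raw headlines and return the model's predicted labels."""
    features = tfidf_vectorizer.transform(headlines)
    return pickled_model.predict(features)

result = classify(["miracle fruit makes you immortal"])
print(result)
```

The important detail is that new text must go through the same fitted vectorizer that was used in training; fitting a fresh one would produce features the model does not understand.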
You can also use knn_model.pkl to look for likely real versions of a fake news item. Given a fake news article, it returns the 5 closest real articles.
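One way such a lookup could work is a nearest-neighbor search over TF-IDF vectors of real articles. The sketch below assumes scikit-learn's `NearestNeighbors` with cosine distance; the project's knn_model.pkl may be built differently, and the headlines and the `closest_real` helper are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical collection of real headlines.
real_headlines = [
    "central bank keeps interest rates unchanged",
    "monsoon rains arrive early in southern states",
    "national team wins cricket series",
    "space agency launches weather satellite",
    "health ministry starts vaccination drive",
    "city council approves budget for public schools",
]

vectorizer = TfidfVectorizer()
real_vectors = vectorizer.fit_transform(real_headlines)

# Index the real headlines; each query returns the 5 nearest ones.
knn = NearestNeighbors(n_neighbors=5, metric="cosine")
knn.fit(real_vectors)

def closest_real(fake_headline):
    """Return the 5 real headlines closest to the given text."""
    query = vectorizer.transform([fake_headline])
    distances, indices = knn.kneighbors(query)
    return [real_headlines[i] for i in indices[0]]

matches = closest_real("city council cancels budget for all public schools")
print(matches[0])
```

In this toy example only one real headline shares words with the query, so it comes back first; the remaining four results have no overlap and are effectively arbitrary.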
This project demonstrates how machine learning can be used to detect fake news articles. While the model is not perfect, it shows promise and can be improved with more data and better preprocessing techniques.
Huge thanks to our fellow contributors for making this possible!