Welcome to the GitHub repository for the Authorship Attribution project, which combines machine learning with linguistic analysis to classify the authors of texts by their writing styles. The dataset comprises texts from six different authors, so the task is a six-class supervised classification problem.
Authorship Attribution is the process of identifying the author of a text based on their unique writing style or 'fingerprint'. This project is split into two main parts:
- Data Cleaning and Feature Engineering - Where we prepare the text data and extract meaningful features that capture the essence of each author's style.
- Model Training and Evaluation - Where various machine learning models are trained and evaluated to find the one that best identifies the authors.
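To make the feature-engineering step concrete, here is a minimal sketch of the kind of style features such a step might extract. The function name `stylometric_features` and the four features are illustrative assumptions, not the features used in the Part 1 notebook.

```python
import re
import string


def stylometric_features(text: str) -> dict:
    """Compute a few simple style features for one text (illustrative only)."""
    # Lowercased word tokens; a deliberately simple regex tokenizer.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)   # guard against division by zero
    n_chars = max(len(text), 1)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / n_words,
        "punctuation_rate": sum(ch in string.punctuation for ch in text) / n_chars,
    }


sample = "It was a dark night. The wind howled, and the rain fell!"
features = stylometric_features(sample)
print(features)
```

Features like these give each text a fixed-length numeric description that a classifier can consume in the model-training part.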
The repository contains:
- Part1_Data_Cleaning_and_Feature_Engineering.ipynb
- Part2_Model_Training_and_Evaluation.ipynb
- cleaned_data.csv
- mwe_tokenizer.pkl
- Assignment_Data (folder containing the dataset)
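The name mwe_tokenizer.pkl suggests a pickled multi-word-expression tokenizer. The sketch below assumes it is an NLTK `MWETokenizer` (this README does not confirm that) and shows how such a tokenizer behaves and survives a pickle round trip. The expressions registered here are made up for illustration.

```python
import pickle

from nltk.tokenize import MWETokenizer

# Hypothetical multi-word expressions; the real file may hold different ones.
tokenizer = MWETokenizer([("Sherlock", "Holmes"), ("in", "spite", "of")], separator="_")

# Round-trip through pickle in memory, mirroring a save/load via a .pkl file.
restored = pickle.loads(pickle.dumps(tokenizer))

tokens = restored.tokenize("He went out in spite of Sherlock Holmes".split())
print(tokens)
```

Merging such expressions into single tokens keeps author-specific phrases intact as features instead of splitting them into unrelated words.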
The project uses:
- Python
- Pandas & NumPy for data manipulation
- NLTK for natural language processing
- scikit-learn for machine learning
- Matplotlib & Seaborn for visualization
- Jupyter Notebook for interactive development
The project achieves a 95% F1 score, measured with stratified 5-fold cross-validation. Because the score comes from five stratified folds rather than a single train/test split, it reflects consistent performance across different subsets of the data, not just accuracy on one split.
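For readers unfamiliar with the evaluation protocol, the following is a minimal sketch of stratified 5-fold cross-validation with an F1 metric in scikit-learn. It runs on synthetic six-class data, and the classifier (`LogisticRegression`) and the macro averaging (`f1_macro`) are assumptions for illustration; the model and F1 variant used in the Part 2 notebook may differ, and this sketch will not reproduce the 95% figure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the engineered feature matrix: six classes, one per author.
X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=6,
    n_classes=6,
    random_state=0,
)

# Each fold preserves the per-author class proportions of the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One macro-averaged F1 score per fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print("Per-fold F1:", scores)
print("Mean F1:", scores.mean())
```

Reporting the per-fold scores alongside the mean shows how much the result varies between folds, which is what supports a claim of consistency.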