Reddit Comments Classification

Access the Streamlit app here: Reddit Comments Classifier

Introduction
Prerequisites
Installation
Usage
Results
Detailed Use
Contributing

Introduction

This project classifies Reddit comments into three categories: Medical Doctor, Veterinarian, or Other. It uses OpenAI's GPT-3.5-turbo for initial classification and then refines the labels using fuzzy matching techniques. The final classification model is trained using a Random Forest classifier with the TF-IDF vectorizer for feature extraction and SMOTE for handling class imbalance.

Prerequisites

Python 3.8+
PostgreSQL database
OpenAI API key
pip (Python package installer)

Installation

Clone the repository:

git clone https://github.com/your-username/reddit-comments-classification.git
cd reddit-comments-classification

Install the required Python packages:

pip install pandas sqlalchemy tqdm python-dotenv openai fuzzywuzzy python-Levenshtein imbalanced-learn scikit-learn

Ensure you have a .env file with your OpenAI API key:
```
OPENAI_API_KEY=your_openai_api_key
```

Usage

Database Connection

Update the database connection string in the code to connect to your PostgreSQL database:

conn_str = "postgresql://username:password@host/database?options=option&sslmode=require"

Environment Variables

Ensure the environment variables are loaded properly from the .env file:

load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

Data Sampling

The script samples 800 comments from the Reddit comments dataset for classification:

sample_size = 800
df_sample = df.sample(n=sample_size, random_state=1)

Classification

The comments are classified using OpenAI's GPT-3.5-turbo:

for comment in tqdm(df_sample['comments'], desc="Classifying comments"):
    label = classify_comment(comment)
    labels.append(label)

Data Cleaning

The script uses fuzzy matching to ensure consistent labeling:

df_sample['real_label'] = df_sample['label'].apply(lambda x: get_best_match(x, valid_labels))

Model Training

A Random Forest classifier is trained using TF-IDF features and SMOTE for class balancing:

X = vectorizer.fit_transform(df_sample['comments'])
y = df_sample['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

classifier = RandomForestClassifier(class_weight='balanced', random_state=42)
classifier.fit(X_resampled, y_resampled)

Results

The script outputs classification reports, confusion matrix, and ROC AUC scores to evaluate the model's performance:

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, classifier.predict_proba(X_test), multi_class='ovr'))

Detailed Use

Setup

Clone this repository to your local machine.
Install the required Python packages using pip:
```
pip install -r requirements.txt
```
The pre-trained model and necessary files should be available your project directory:
- vectorizer.pkl: The pre-trained TF-IDF vectorizer.
- classifier_model.pkl: The pre-trained RandomForestClassifier model.

How to Run the Model on New Comments - Step-by-Step Instructions

Prepare the CSV File: Ensure you have a CSV file containing the new comments you wish to classify. The CSV file should have a column named comments that contains the text of the comments.
Load the Pre-trained Model and Vectorizer: The script will load the pre-trained TF-IDF vectorizer and RandomForestClassifier model from the pickle files.

Run the Script to Classify New Comments: Use the following script to classify the comments and view the results:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
import joblib

# Load the pre-trained TF-IDF vectorizer and classifier model
vectorizer = joblib.load('vectorizer.pkl')
classifier = joblib.load('classifier_model.pkl')

# Read CSV file into a DataFrame
csv_file_path = 'path_to_your_csv.csv'  # Replace with the path to your CSV file
df = pd.read_csv(csv_file_path)

# Transform the comments using the loaded TF-IDF vectorizer
X_new = vectorizer.transform(df['comments'])

# Predict the labels using the loaded classifier
df['label'] = classifier.predict(X_new)

# Display the labeled DataFrame
print("Labeled comments data:")
print(df.head())

# Save the labeled DataFrame to a CSV file
output_file = "labeled_comments.csv"
df.to_csv(output_file, index=False)
print(f"Labeled comments saved to {output_file}")

Execute the script to classify the comments:

python classify_new_comments.py

Contributing

Contributions/feedbacks are welcome!

bernard-rr / coding_challenge Goto Github PK

coding_challenge's Introduction

Reddit Comments Classification

Access the Streamlit app here: Reddit Comments Classifier

Table of Contents

Introduction

Prerequisites

Installation

Usage

Database Connection

Environment Variables

Data Sampling

Classification

Data Cleaning

Model Training

Results

Detailed Use

Setup

How to Run the Model on New Comments - Step-by-Step Instructions

Execute the script to classify the comments:

Contributing

coding_challenge's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org

Jobs