GithubHelp home page GithubHelp logo

coding_challenge's Introduction

Reddit Comments Classification

Access the Streamlit app here: Reddit Comments Classifier

Table of Contents

Introduction

This project classifies Reddit comments into three categories: Medical Doctor, Veterinarian, or Other. It uses OpenAI's GPT-3.5-turbo for initial classification and then refines the labels using fuzzy matching techniques. The final classification model is trained using a Random Forest classifier with the TF-IDF vectorizer for feature extraction and SMOTE for handling class imbalance.

Prerequisites

  • Python 3.8+
  • PostgreSQL database
  • OpenAI API key
  • pip (Python package installer)

Installation

  1. Clone the repository:

    git clone https://github.com/your-username/reddit-comments-classification.git
    cd reddit-comments-classification
  2. Install the required Python packages:

    pip install pandas sqlalchemy tqdm python-dotenv openai fuzzywuzzy python-Levenshtein imbalanced-learn scikit-learn
  3. Ensure you have a .env file with your OpenAI API key:

    OPENAI_API_KEY=your_openai_api_key
    

Usage

Database Connection

Update the database connection string in the code to connect to your PostgreSQL database:

conn_str = "postgresql://username:password@host/database?options=option&sslmode=require"

Environment Variables

Ensure the environment variables are loaded properly from the .env file:

load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

Data Sampling

The script samples 800 comments from the Reddit comments dataset for classification:

sample_size = 800
df_sample = df.sample(n=sample_size, random_state=1)

Classification

The comments are classified using OpenAI's GPT-3.5-turbo:

for comment in tqdm(df_sample['comments'], desc="Classifying comments"):
    label = classify_comment(comment)
    labels.append(label)

Data Cleaning

The script uses fuzzy matching to ensure consistent labeling:

df_sample['real_label'] = df_sample['label'].apply(lambda x: get_best_match(x, valid_labels))

Model Training

A Random Forest classifier is trained using TF-IDF features and SMOTE for class balancing:

X = vectorizer.fit_transform(df_sample['comments'])
y = df_sample['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

classifier = RandomForestClassifier(class_weight='balanced', random_state=42)
classifier.fit(X_resampled, y_resampled)

Results

The script outputs classification reports, confusion matrix, and ROC AUC scores to evaluate the model's performance:

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, classifier.predict_proba(X_test), multi_class='ovr'))

Detailed Use

Setup

  1. Clone this repository to your local machine.
  2. Install the required Python packages using pip:
    pip install -r requirements.txt
  3. The pre-trained model and necessary files should be available your project directory:
    • vectorizer.pkl: The pre-trained TF-IDF vectorizer.
    • classifier_model.pkl: The pre-trained RandomForestClassifier model.

How to Run the Model on New Comments - Step-by-Step Instructions

  1. Prepare the CSV File: Ensure you have a CSV file containing the new comments you wish to classify. The CSV file should have a column named comments that contains the text of the comments.

  2. Load the Pre-trained Model and Vectorizer: The script will load the pre-trained TF-IDF vectorizer and RandomForestClassifier model from the pickle files.

  3. Run the Script to Classify New Comments: Use the following script to classify the comments and view the results:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier
    import joblib
    
    # Load the pre-trained TF-IDF vectorizer and classifier model
    vectorizer = joblib.load('vectorizer.pkl')
    classifier = joblib.load('classifier_model.pkl')
    
    # Read CSV file into a DataFrame
    csv_file_path = 'path_to_your_csv.csv'  # Replace with the path to your CSV file
    df = pd.read_csv(csv_file_path)
    
    # Transform the comments using the loaded TF-IDF vectorizer
    X_new = vectorizer.transform(df['comments'])
    
    # Predict the labels using the loaded classifier
    df['label'] = classifier.predict(X_new)
    
    # Display the labeled DataFrame
    print("Labeled comments data:")
    print(df.head())
    
    # Save the labeled DataFrame to a CSV file
    output_file = "labeled_comments.csv"
    df.to_csv(output_file, index=False)
    print(f"Labeled comments saved to {output_file}")

    Execute the script to classify the comments:

    python classify_new_comments.py

Contributing

Contributions/feedbacks are welcome!

coding_challenge's People

Contributors

bernard-rr avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.