GithubHelp home page GithubHelp logo

saurabh9651 / short_text_headlines Goto Github PK

View Code? Open in Web Editor NEW
0.0 2.0 0.0 1.75 MB

A review of supervised classification techniques in R categorizing short text news headlines

License: MIT License

R 100.00%
classification machine-learning nlp nlp-machine-learning r-for-data-science

short_text_headlines's Introduction

News Article Classifier

This repository contains an R script to build various classification models for predicting news article categories based on their headlines. The dataset used is the News Category Dataset v2, which contains articles from various categories, published by HuffPost.

Getting Started

These instructions will help you set up the project on your local machine.

Prerequisites

To run the script, you need to have R and RStudio installed on your local machine. You can download and install them from the following links:

Installing Required Packages

Before running the script, make sure to install the required R packages. You can install them using the following commands in your RStudio console:

install.packages("tidyverse")
install.packages("lubridate")
install.packages("jsonlite")
install.packages("glue")
install.packages("ggplot2")
install.packages("tm")
install.packages("tokenizers")
install.packages("textstem")
install.packages("RSpectra")
install.packages("randomForest")
install.packages("xgboost")
install.packages("caret")
install.packages("parsnip")
install.packages("workflowsets")
install.packages("recipes")
install.packages("yardstick")
install.packages("rsample")
install.packages("keras")

Running the Script

  1. Clone the repository to your local machine.
  2. Open the R script file (news_category_classification.R) in RStudio.
  3. Download dataset from https://arxiv.org/abs/2209.11429
  4. Modify the path to the News_Category_Dataset_v2.json file as needed.
  5. Run the script to train and evaluate the classification models.

Description of the Models

The script trains and evaluates multiple classification models, including:

  • Null Model
  • Naive Bayes
  • K-Nearest Neighbors (KNN)
  • Decision Trees
  • Random Forests
  • Support Vector Machines (SVM) with linear and polynomial kernels
  • XGBoost
  • Multinomial Logistic Regression using Keras

Evaluation Metrics

The models are evaluated based on the following metrics:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • Area Under the Receiver Operating Characteristic curve (ROC AUC)
  • Processing Time A confusion matrix heatmap is also generated for each model to visualize the model's performance.

Results

The performance of each model is summarized in a table and visualized in a bar plot. The best model can be identified based on the highest F1 Score, Accuracy, and ROC AUC, as well as the lowest processing time.

Contributing

Feel free to submit pull requests for any improvements or bug fixes. Please make sure your code follows the same coding style and conventions as the existing code.

License

This project is licensed under the MIT License - see the LICENSE file for details.

short_text_headlines's People

Contributors

saurabh9651 avatar

Watchers

 avatar  avatar

short_text_headlines's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.