
parkernisbet / newsgroups-naive-bayes


Multinomial naive Bayes newsgroup document classification without relying on pre-built sklearn modules. Smoothing and inverse document frequencies utilized to improve model accuracy.

text-classification newsgroups-dataset without-sklearn inverse-document-frequency python3 jupyter-notebook laplace-smoothing multinomial-naive-bayes


Newsgroups Naive Bayes

This project seeks to build a multinomial naive Bayes model for text classification. Rather than relying on the pre-built sklearn.naive_bayes.MultinomialNB module for the bulk of the work, I will construct a .ipynb notebook that mimics that classifier's behaviour. The predictions themselves require two main calculations: the label priors for all classes in the dataset, and the per-class word probabilities. After some manipulation, these two data structures combine into values proportional to the posterior probabilities for each class. The final step in classification is to choose the most probable posterior hypothesis; that class becomes the prediction.
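
As a rough illustration of that prediction step, here is a minimal sketch (an assumed implementation, not the notebook's actual code; the names predict, log_priors, and log_word_probs are illustrative): the class log-priors are added to the count-weighted word log-probabilities, and the class with the highest score becomes the prediction.

```python
import numpy as np

def predict(doc_counts, log_priors, log_word_probs):
    """doc_counts: (n_docs, vocab_size) term-count matrix
    log_priors: (n_classes,) array of log P(class)
    log_word_probs: (n_classes, vocab_size) array of log P(word | class)"""
    # log-posterior up to a constant: log P(c) + sum_w count_w * log P(w | c)
    scores = log_priors + doc_counts @ log_word_probs.T
    # most probable posterior hypothesis per document
    return np.argmax(scores, axis=1)
```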

The first attempt at classifying the test dataset yielded an error rate of 20.92%, not terrible, but with clear room for improvement. Since my application of naive Bayes already uses Laplace smoothing and log-scaled frequencies to compensate for zero probabilities and word burstiness, the next reasonable step was to downweight the word probabilities by their inverse document frequencies. Admittedly, this didn't yield as large an improvement as expected, but it still dropped the error rate to 18.09%. As a follow-up, it might be useful to explore completely removing "particularly misleading" words from the dataset vocabulary (those appearing at similar frequencies in multiple classes). Making the list of words used in classification more distinctive to each label should further decrease the error rate.
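
To make the adjustments above concrete, the sketch below shows one plausible training routine under the same assumptions (function and variable names are illustrative, and the exact weighting used in the notebook may differ): add-alpha (Laplace) smoothing avoids zero probabilities, log-scaled counts dampen word burstiness, and inverse document frequencies downweight words common to many documents.

```python
import numpy as np

def fit(doc_counts, labels, n_classes, alpha=1.0):
    """doc_counts: (n_docs, vocab_size) raw term counts; labels: (n_docs,) class ids"""
    n_docs, vocab_size = doc_counts.shape
    # inverse document frequency: rarer words receive larger weights
    df = np.count_nonzero(doc_counts, axis=0)
    idf = np.log(n_docs / (1.0 + df))
    # log-scale counts to dampen burstiness, then apply the IDF weights
    weighted = np.log1p(doc_counts) * idf
    log_priors = np.empty(n_classes)
    log_word_probs = np.empty((n_classes, vocab_size))
    for c in range(n_classes):
        rows = weighted[labels == c]
        log_priors[c] = np.log(rows.shape[0] / n_docs)
        # Laplace (add-alpha) smoothing keeps every word probability nonzero
        totals = rows.sum(axis=0) + alpha
        log_word_probs[c] = np.log(totals / totals.sum())
    return log_priors, log_word_probs
```

These priors and word probabilities then feed directly into the prediction step sketched earlier.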
