mailclassifier's Introduction

MailClassifier

Spam Ham Mail Classifier

This project implements a Naive Bayes classifier on the Enron mail corpus.

Downloaded the preprocessed mail from http://www.aueb.gr/users/ion/data/enron-spam/

Consolidated the spam and ham mails into two separate text files.

Imported the two files to HDFS.

Step 1: Developed the MR job SpamClassifierDriver, which performs the tasks below:

  • Removed the numerals.
  • Removed special characters.
  • Filtered out the stop words, using the stop-word list available from wiki.
  • Calculated the count of each word.
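The cleaning and counting in Step 1 can be sketched in plain Python, outside Hadoop. This is only an illustration, not the SpamClassifierDriver code: the function names, the tiny stop-word set, and the lowercasing step are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set; the project used a list taken from wiki.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Drop numerals and special characters, then filter stop words."""
    # Assumption: text is lowercased first so counts are case-insensitive.
    # Every run of characters other than a-z becomes a single space.
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    return [word for word in letters_only.split() if word not in stop_words]

def count_words(lines):
    """Count cleaned words across all lines (the role of map + reduce)."""
    counts = Counter()
    for line in lines:
        counts.update(clean_tokens(line))
    return counts

counts = count_words(["Claim your FREE prize in 24 hours!!!", "The prize is free"])
for word, count in sorted(counts.items()):
    print(word, count)
```

In the MR job the same work would presumably be split in two: the mapper emits each cleaned word, and the reducer sums the occurrences per word.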

Step 2: Developed a second MR job, TfCalculatorDriver, which calculates the term frequency of each word (word count / total words in each class). A single file is created with one entry per word and class, for example:

    daughters@hamFile 1.673948341954167E-5
    claimed@hamFile 3.347896683908334E-5
    efface@spamFile 4.634435711107815E-6
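A minimal Python sketch of the term frequency calculation in Step 2 follows. It is not the TfCalculatorDriver code: the function name and the sample counts are made up for illustration, and only the `word@class` key format and the formula (word count / total words in the class) come from the description above.

```python
def term_frequencies(class_counts):
    """Turn {class_label: {word: count}} into {"word@class_label": frequency}."""
    frequencies = {}
    for label, counts in class_counts.items():
        total_words = sum(counts.values())  # total words in this class
        for word, count in counts.items():
            frequencies[f"{word}@{label}"] = count / total_words
    return frequencies

# Made-up counts, only to show the shape of the data.
sample_counts = {
    "hamFile": {"meeting": 3, "claimed": 1},
    "spamFile": {"prize": 2, "free": 2},
}

tf = term_frequencies(sample_counts)
for key, value in sorted(tf.items()):
    # One "word@class frequency" entry per line, as in the single output file.
    print(key, value)
```

Because each count is divided by the total for its own class, the frequencies within one class sum to 1.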

Step 3: Developed a utility, NaiveBayesClassifier, which takes the term frequency file and a new mail as input:

  • Parsed the new mail into individual words and applied the text cleaner utility.
  • Calculated the probability of each word and applied the log function.
  • Added all the log-probability values for each class.
  • The class with the highest total is the class the new mail belongs to.
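The scoring in Step 3 can be sketched as below. This is a hedged illustration rather than the NaiveBayesClassifier utility itself: the function names are hypothetical, the input is assumed to be already cleaned tokens, and the small `unseen` probability used for words missing from a class is an assumption, since the description does not say how such words are handled. Class priors are also left out because the description does not mention them.

```python
import math

def parse_tf_lines(lines):
    """Parse 'word@class frequency' lines into a dict of floats."""
    frequencies = {}
    for line in lines:
        key, value = line.split()
        frequencies[key] = float(value)
    return frequencies

def classify(tokens, frequencies, labels=("hamFile", "spamFile"), unseen=1e-9):
    """Sum the log term frequencies per class and return the best class."""
    scores = {}
    for label in labels:
        # Assumption: a word never seen in a class gets the tiny `unseen`
        # probability, which avoids taking log(0).
        scores[label] = sum(
            math.log(frequencies.get(f"{word}@{label}", unseen)) for word in tokens
        )
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Made-up frequencies, only to show the mechanics.
tf = parse_tf_lines([
    "meeting@hamFile 0.5",
    "report@hamFile 0.5",
    "prize@spamFile 0.5",
    "free@spamFile 0.5",
])

label, scores = classify(["free", "prize"], tf)
print(label, scores)
```

Summing logs instead of multiplying the raw probabilities keeps the score from underflowing to zero when a mail contains many words with very small frequencies.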

mailclassifier's People

Contributors

balakrishnadhanekula
