guilherme-deschamps / email-spam-detection

Project developed to learn about NLP, experimenting with the NLTK library and keeping Flask endpoints available to explore

Jupyter Notebook 97.16% Python 2.84%
flask logistic-regression machine-learning nlp nlp-machine-learning pandas python random-forest-classifier tokenization

Email-Spam-Detection

In this repository, I implemented a small project to practice NLP, specifically Email Spam classification. The dataset used for this project is available on Kaggle, and the implementation is based on a tutorial presented by Greg Hogg.

There are two main artifacts that can be executed here:

  • A Jupyter Notebook
  • A Flask application

Requirements

  • pandas
  • nltk
  • matplotlib
  • seaborn
  • tqdm
  • jupyter
  • flask
  • scikit-learn
  • numpy

How to run the Jupyter Notebook

  • Clone the repository
  • (Optional but recommended) Create a virtual environment for the project
  • Activate the virtual environment (On Windows: open the /venv/Scripts folder and run activate)
  • Install the dependencies:
pip install -r requirements.txt
  • Run your Jupyter server, either:
    • Using a platform like Visual Studio Code, OR
    • Using the terminal, in the venv you created, running the jupyter notebook command
  • Open the Email_Spam_Classification.ipynb file
  • Have fun 🙂!

How to run the Flask application

  • Open your terminal
  • Activate the virtual environment created (On Windows: open the /venv/Scripts folder and run activate)
  • Run:
python app.py

Trying out the end-points

A quick way to try the available end-points is to run the Flask app and execute the cURL commands below in a terminal. Note that the quoting differs between Linux and Windows (other operating systems were not checked): Linux wraps the JSON body in apostrophes, while Windows does not accept them. On Windows, wrap the JSON body in regular quotation marks instead, and escape each quotation mark inside the JSON as \".

With that explained, the available end-points are:

/convert_message

POST end-point that takes a String as input, and returns the tokens present in that message (except stopwords).
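
To illustrate the idea behind this end-point, here is a minimal sketch of tokenizing a message and dropping stopwords. It is not the project's code: the function name `tokenize_without_stopwords`, the regex tokenizer, and the tiny `STOPWORDS` set are assumptions made so the example is self-contained. The project itself relies on NLTK, whose tokenizer and stopword list will produce somewhat different results.

```python
import re

# Tiny illustrative stopword set (assumption). The project uses NLTK,
# whose English stopword list is much longer.
STOPWORDS = {"a", "an", "the", "this", "that", "is", "are", "to", "of", "and"}


def tokenize_without_stopwords(message: str) -> list[str]:
    """Lowercase the message, split it into word tokens, drop stopwords."""
    # Simple regex tokenizer (assumption): runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [token for token in tokens if token not in STOPWORDS]


print(tokenize_without_stopwords("Hey, this is a test message!"))
# prints ['hey', 'test', 'message']
```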

To test via terminal (Linux), run:

curl -X POST -H "Content-Type: application/json" -d '{"message": "Hey, this is a test message!"}' http://127.0.0.1:5000/convert_message

In Windows, the corresponding command would be:

curl -X POST -H "Content-Type: application/json" -d "{\"message\": \"Hey, this is a test message!\"}" http://127.0.0.1:5000/convert_message
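
If the shell quoting differences get in the way, the same request can be sent from Python using only the standard library, which behaves the same on Linux and Windows. The sketch below assumes the app is listening on the address used in the cURL commands; `build_request` and `post_message` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # address used in the cURL examples


def build_request(endpoint: str, message: str) -> urllib.request.Request:
    """Build a JSON POST request equivalent to the cURL commands."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint.lstrip('/')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_message(endpoint: str, message: str) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(endpoint, message)) as response:
        return response.read().decode("utf-8")


# Usage (requires the Flask app to be running, e.g. via `python app.py`):
#   print(post_message("convert_message", "Hey, this is a test message!"))
```

Because `json.dumps` produces the JSON body, no manual escaping of quotation marks is needed. The same helper should work for the other end-point by passing a different `endpoint` value.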

/predict

POST end-point that takes a String (such as an email message) as input and returns the trained classifier's prediction of whether the message is spam. The examples below use the Linux cURL syntax, so if you are on Windows, please convert the quoting as explained previously.

An example of a message that would NOT be classified as spam is in the following cURL command (for Linux):

curl -X POST -H "Content-Type: application/json" -d '{"message": "Hey, this is a test message (not spam)!"}' http://127.0.0.1:5000/predict

In contrast to the previous example, the command below sends a message that the model classifies as spam:

curl -X POST -H "Content-Type: application/json" -d '{"message": "Hi there, go get ur free drink today!"}' http://127.0.0.1:5000/predict

Understanding the classifications

Before training the model, some preprocessing steps were performed on the dataset (you can find them in detail in the Email_Spam_Classification.ipynb notebook). After counting how often each token appeared in the dataset, the 16 tokens that appeared more than 200 times were defined as 'important features'. These 16 tokens are the following:

['4', 'good', 'go', 'free', 'ok', 'å', 'get', 'gt', 'ur', 'like', 'call', 'day', 'u', '2', 'know', 'lt']
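
The selection step described above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: the helper names, the toy corpus, and the choice of per-message counts as feature values are all hypothetical, and the notebook may encode features differently. The toy example uses a threshold of 1, where the project used 200.

```python
from collections import Counter


def select_important_tokens(tokenized_messages, min_count):
    """Return tokens that appear more than min_count times overall (sorted)."""
    counts = Counter(token for tokens in tokenized_messages for token in tokens)
    return sorted(token for token, count in counts.items() if count > min_count)


def to_feature_vector(tokens, important_tokens):
    """Count how often each important token occurs in a single message."""
    counts = Counter(tokens)
    return [counts[token] for token in important_tokens]


# Toy corpus of already-tokenized messages (made up for illustration).
corpus = [["free", "call", "now"], ["call", "me"], ["free", "free", "drink"]]

important = select_important_tokens(corpus, min_count=1)
print(important)  # prints ['call', 'free']
print(to_feature_vector(["get", "free", "free", "call"], important))  # prints [1, 2]
```

A fixed, ordered list of important tokens means every message maps to a vector of the same length, which is what a classifier expects as input.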

Looking at the most common words in spam messages, I found that they are distributed roughly as shown in the chart below.

[Chart: most common words in spam messages]

So, to create a message that is classified as spam, you can simply write one that contains some of these most common spam words, and you will have your own spam message to play around with!

To-do

  • Implement a Flask application with end-points:
    • Receives a phrase and returns the vector of tokens after tokenization / lower casing / lemmatization
    • Receives an e-mail message and returns whether or not it is spam
