This project forked from npapernot/phishing-detection

Train a simple decision tree classifier to detect websites used for phishing

License: MIT License

Python 100.00%

Detecting phishing websites using a decision tree

This repository is a tutorial explaining how to train a simple decision tree classifier to detect websites that are used for phishing. Typically, phishing websites disguise themselves as trustworthy websites to gain the trust of their victims, and malicious parties use them to obtain sensitive information such as passwords or credit card numbers. In this tutorial, we train a decision tree that detects such websites with an accuracy of about 90.5% on websites it has not seen during training.

Installation

To get started, you should first clone this repository by running the following command from a UNIX terminal.

git clone https://github.com/npapernot/phishing-detection

This will download the code that trains the phishing detector, as well as the training data required for that operation.

You should also install scikit-learn, a collection of machine learning tools written in Python. Installation instructions are available in the scikit-learn documentation. On a UNIX machine configured with pip, the simplest way is to run:

pip install -U scikit-learn

Once you have installed scikit-learn, you can check whether the library is correctly set up by typing the following in a Python shell:

import sklearn

If the command runs with no error, you are ready to train the phishing detector!
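If you prefer a check that reports a missing installation instead of raising an error, a minimal sketch could look like the following. The helper name `is_installed` is hypothetical and not part of this repository; it only relies on the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("sklearn"):
    import sklearn
    print("scikit-learn version:", sklearn.__version__)
else:
    print("scikit-learn not found; try: pip install -U scikit-learn")
```

Either message tells you whether you can move on to training or need to revisit the installation step.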

Phishing Website Dataset

In this tutorial, we use a dataset of phishing websites that is publicly available from the UCI Machine Learning Repository. You don't have to download the dataset yourself: it is included in this repository (the dataset.csv file) and was downloaded to your machine when you cloned the repository.

The dataset was collected by analyzing 2,456 websites, some of which were used for phishing and others not. For each website included in the dataset, 30 attributes are given. You can find a list here. The list includes, for instance, the URL length, whether the website uses pop-up windows or iframes, and how old the domain registration is.

Each website in the dataset is labeled by -1 if it is not a phishing website and by 1 if it is a website used for phishing.
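To make the layout concrete, here is a small sketch that separates attributes from labels using only the standard library. It assumes that each CSV row holds integer attribute values followed by the label in the last column; check dataset.csv and decision_tree.py before relying on that. The three-attribute sample and the helper `load_rows` are hypothetical.

```python
import csv
import io

# Hypothetical three-attribute sample; the real dataset.csv has 30 attributes per row.
SAMPLE = "1,-1,1,-1\n-1,1,0,1\n1,1,1,1\n"


def load_rows(handle):
    """Split each CSV row into (features, label), assuming the label is the last column."""
    features, labels = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        values = [int(v) for v in row]
        features.append(values[:-1])
        labels.append(values[-1])
    return features, labels


features, labels = load_rows(io.StringIO(SAMPLE))
# Following the labeling convention above: 1 means phishing, -1 means not phishing.
phishing_count = sum(1 for label in labels if label == 1)
print(f"{len(labels)} websites, {phishing_count} labeled as phishing")
```

To try it on the real file, pass `open("dataset.csv", newline="")` instead of the in-memory sample.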

Training the decision tree to detect phishing websites

You can find the code that trains the decision tree in the decision_tree.py file. To run the code, simply execute it in a terminal:

python decision_tree.py

This will first train the decision tree on 2,000 websites, then use the trained model to predict whether the remaining 456 websites, which were not seen during training, are used for phishing. About 90.5% of the model's predictions should be correct, i.e. the accuracy of the model on the testing data should be about 90.5%. Here is the output printed by the script:

Tutorial: Training a decision tree to detect phishing websites
Training data loaded.
Decision tree classifier created.
Beginning model training.
Model training completed.
Predictions on testing data computed.
The accuracy of your decision tree on testing data is: 0.906129210381

To understand how this was done, you can read the line-by-line comments in the decision_tree.py file.
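As a rough illustration of the two ideas involved, holding out websites for testing and measuring accuracy, here is a sketch in plain Python. It assumes the split simply takes the first 2,000 rows for training, which may not match decision_tree.py exactly; the helper names are hypothetical.

```python
def split_train_test(rows, train_size):
    """Use the first `train_size` rows for training and the rest for testing."""
    return rows[:train_size], rows[train_size:]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Placeholder rows standing in for the 2,456 websites of the dataset.
rows = list(range(2456))
train_rows, test_rows = split_train_test(rows, 2000)
print(len(train_rows), len(test_rows))

# Toy labels: three of the four predictions are correct.
print(accuracy([1, -1, 1, -1], [1, -1, -1, -1]))
```

The accuracy reported by the script is the same kind of fraction, computed over the 456 testing websites.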

What next?

You can try improving the accuracy of this simple classifier by changing some of the default parameter values of the model. To do so, modify the line of decision_tree.py that calls DecisionTreeClassifier(). To learn more about the parameters you can pass to DecisionTreeClassifier(), take a look at the scikit-learn documentation.
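As one possible starting point, the sketch below compares a few parameter settings. It runs on synthetic stand-in data with a made-up labeling rule, not on dataset.csv, so the accuracies it prints say nothing about the phishing task. The parameters `max_depth`, `min_samples_leaf` and `random_state` are real DecisionTreeClassifier arguments; the values chosen here are arbitrary.

```python
import random

from sklearn.tree import DecisionTreeClassifier

random.seed(0)

# Synthetic stand-in for the dataset: 2,456 rows of 30 attributes in {-1, 0, 1}.
X = [[random.choice([-1, 0, 1]) for _ in range(30)] for _ in range(2456)]
# Made-up labeling rule for illustration only (1 = phishing, -1 = not phishing).
y = [1 if row[0] + row[1] + row[2] > 0 else -1 for row in X]

# Same sizes as in the tutorial: 2,000 rows for training, 456 for testing.
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]

candidate_params = [
    {},                                       # scikit-learn defaults
    {"max_depth": 3},                         # shallow tree
    {"max_depth": 10, "min_samples_leaf": 5}, # deeper tree with larger leaves
]

results = []
for params in candidate_params:
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)  # mean accuracy on the held-out rows
    results.append((params, score))
    print(params, score)
```

To apply the same idea to the tutorial, pass the parameters you want to try to the DecisionTreeClassifier() call in decision_tree.py and compare the accuracy printed at the end of each run.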

You can also try different models like Support Vector Machines or Neural Networks. For instance, in this file, the machine learning model is now a logistic regression, but it performs worse than the decision tree.

Questions or suggestions

If you have any questions or suggestions, feel free to send me an email at [email protected].
