A machine learning based phishing detection prototype

Phish-Page-Detection


   _____                   _   _____  _     _     _
  / ____|                 | | |  __ \| |   (_)   | |
 | (___   __ _ _   _  __ _| |_| |__) | |__  _ ___| |__
  \___ \ / _` | | | |/ _` | __|  ___/| '_ \| / __| '_ \
  ____) | (_| | |_| | (_| | |_| |    | | | | \__ \ | | | - Detection
 |_____/ \__, |\__,_|\__,_|\__|_|    |_| |_|_|___/_| |_|
            | |
            |_|

Welcome to SquatPhish-phishing-detection!

SquatPhish-phishing-detection is part of the SquatPhish project and detects general phishing pages.

It uses a machine learning model to identify phishing pages by looking at:

  • HTML text - searching for brand names and sign-in keywords in the HTML source code
  • HTML structure - searching for submission forms and their attributes
  • Image text - extracting text directly from the page image
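The first two feature groups can be illustrated with a minimal sketch using only the Python standard library. This is not the project's actual feature extractor: the keyword lists, the `FormCounter` class, and the `extract_html_features` function are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical keyword lists, for illustration only.
BRAND_NAMES = ("google", "facebook", "paypal")
SIGNIN_KEYWORDS = ("sign in", "signin", "log in", "login", "password")


class FormCounter(HTMLParser):
    """Collects simple structural facts about forms in an HTML document."""

    def __init__(self):
        super().__init__()
        self.form_count = 0
        self.password_inputs = 0
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr_map = dict(attrs)
        if tag == "form":
            self.form_count += 1
            self.form_actions.append(attr_map.get("action") or "")
        elif tag == "input" and (attr_map.get("type") or "").lower() == "password":
            self.password_inputs += 1


def extract_html_features(html):
    """Return keyword counts (HTML text) and form counts (HTML structure)."""
    text = html.lower()
    counter = FormCounter()
    counter.feed(html)
    return {
        "brand_hits": sum(text.count(b) for b in BRAND_NAMES),
        "signin_hits": sum(text.count(k) for k in SIGNIN_KEYWORDS),
        "form_count": counter.form_count,
        "password_inputs": counter.password_inputs,
    }


sample_html = (
    "<html><body><h1>Facebook</h1>"
    '<form action="/login.php" method="post">'
    '<input type="text" name="email">'
    '<input type="password" name="pass">'
    "<button>Log In</button></form></body></html>"
)
features = extract_html_features(sample_html)
print(features)
```

A dictionary like this could then be flattened into a numeric feature vector for a classifier.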

We apply Tesseract (a deep-learning-based OCR engine) to extract text from images.

We also apply NLP analysis to filter out nonsense words and clean the extracted text.

It supports:

  • Direct detection of potential phishing pages
  • A behavior-based model to investigate general phishing behaviors
  • A machine-learning-based classifier (RandomForest) that combines all the properties to make a final decision

Install OCR, NLTK and ML dependencies

bash install.sh

Demo

Run the demo to get predictions for the test samples in the test folder.

A prediction of 1 means a phishing page and 0 means a benign page.

python3 demo.py

Prediction result:	 googlw.lt----1.0----[0.28993355726427317, 0.7100664427357269]

Prediction result:	 googlw.cl----1.0----[0.09047988302948216, 0.9095201169705178]

Prediction result:	 googlw.co.il----1.0----[0.24020989615461588, 0.7597901038453841]

Prediction result:	 goofgle.se----1.0----[0.31088049798443707, 0.6891195020155627]

Prediction result:	 facebook-c.com----1.0----[0.22163562365428768, 0.7783643763457124]

Prediction result:	 faceb00k.bid----1.0----[0.17656695383979332, 0.8234330461602067]

Prediction result:	 sewauk.org----1.0----[0.25669347640761164, 0.7433065235923882]

Prediction result:	 100022538-facebook.com----0.0----[0.9193098619543378, 0.08069013804566202]
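Judging from the sample output above, each line appears to have the form `domain----label----[p0, p1]`, where the second probability is the larger one whenever the label is 1.0. Under that assumption, a minimal sketch for parsing such lines follows; `parse_prediction_line` is a hypothetical helper, not part of the project.

```python
import ast


def parse_prediction_line(line):
    """Split one demo output line into (domain, label, probabilities).

    Assumes the 'domain----label----[p0, p1]' layout seen in the demo output.
    """
    prefix = "Prediction result:"
    if not line.startswith(prefix):
        raise ValueError("not a prediction line: %r" % line)
    body = line[len(prefix):].strip()
    # rsplit keeps any dashes inside the domain name intact.
    domain, label, probs = body.rsplit("----", 2)
    return domain, int(float(label)), ast.literal_eval(probs)


line = "Prediction result:\t googlw.lt----1.0----[0.28993355726427317, 0.7100664427357269]"
domain, label, probabilities = parse_prediction_line(line)
print(domain, label, probabilities)
```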

Get feature vectors.

python3 feature_extract.py

API

We provide a prediction API.

usage: predict.py [-h] [-t HTML] [-i IMG]

running analysis...

optional arguments:
  -h, --help            show this help message and exit
  -t HTML, --html HTML  A html source data to extract features
  -i IMG, --img IMG     A image data to extract features

Example:

python3 predict.py --img=./test/facebook-c.com..screen.png \
                   --html=./test/facebook-c.com..source.txt
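The example suggests that each sample is stored as `<domain>..screen.png` and `<domain>..source.txt`. Assuming that naming convention holds for other samples, the command line can be assembled programmatically. `build_predict_command` is a hypothetical helper for illustration; only the `--img` and `--html` flags come from the usage text above.

```python
def build_predict_command(domain, sample_dir="./test"):
    """Build the predict.py argument list for one sample.

    Assumes files are named '<domain>..screen.png' and '<domain>..source.txt',
    as in the example above.
    """
    img_path = "%s/%s..screen.png" % (sample_dir, domain)
    html_path = "%s/%s..source.txt" % (sample_dir, domain)
    return ["python3", "predict.py", "--img=" + img_path, "--html=" + html_path]


command = build_predict_command("facebook-c.com")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root to run the prediction.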

Disclaimer and Reference

This is a research prototype; use it at your own risk.

If you find this tool useful, citing it as SquatPhish is highly appreciated.

Acknowledgement

Core contributor: Ke Tian @ririhedou

Thanks to Hang Hu @0xorz for reproduction testing.

The current version is 0.0.2, updated on June 14, 2018.
