GithubHelp home page GithubHelp logo

dgadetecter's Introduction

DGADetecter

Traditional machine learning schemes for detecting DGA domain names are limited by feature engineering and classification algorithms. Manual construction of features is time-consuming and laborious and the detection false alarm rate is high. This repo tries to solve the above limitations through manual features plus deep network to extract features automatically.

Feature Extractor

Manual Features

By far, 19 features are selected, detailed description listed blow.

Feature Description
entropy Shannon Entropy of Domain Names
readability Domain readability, implemented via textsat
vowel_ratio Proportion of vowels
gibberish Also a readability test, see here for details
number_ratio Proportion of numbers
character_ratio Proportion of characters
repeated_char_ratio Proportion of repeated chars
consecutive_ratio Proportion of consecutive chars
consecutive_consonants_ratio Proportion of consecutive consonants
unigram_rank_ave Mean of rankings after domain unigram cutting
unigram_rank_std Variance of rankings after domain unigram cutting
bigram_rank_ave Mean of rankings after domain bigram cutting
bigram_rank_std Variance of rankings after domain bigram cutting
trigram_rank_ave Mean of rankings after domain trigram cutting
trigram_rank_std Variance of rankings after domain trigram cutting
markov_prob_2 Based on the Markov implicit chain model, the transfer probability of dual chars
markov_prob_3 Based on the Markov implicit chain model, the transfer probability of tri-chars
length_domain Length of domain
tld_rank TLD rank of domain, based on Umbrella top 1m TLD ranking

Automatical Features

GRU is used to extract a 32-dimension feature matrix from purely raw char-level domain data.

Datasets

View here for concated datasets.

Go

Train

  1. (Optional) Run python dga_generater.py for help, if you want to generate dga domains via dga generating algorithms.
  2. Run python train.py to start the whole progress based on original datasets in Datasets folder, including pre-processing, training and evaluation.
  3. Run python train_processed.py to re-train the model based on processed datasets in Outputs/Datasets/ folder.

Inference

  • Run python api.py to start the API server, and default port is 8000.

  • Send POST requests to server nad embed query domains in the following json format.

{
    "id": 1,
    "domains": [
        "google.com",
        "oqdykzptntg33nzl38f52mxfyoxm49nyorau.ru"
    ]
}
  • Server will response with corresponding request id, labels and probs.
{
    "request_id": 1,
    "labels": [
        "benign",
        "malicious"
    ],
    "Probs": [
        1.0,
        1.0
    ]
}

dgadetecter's People

Contributors

coldwave96 avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.