briggs599 / hindroid-malware-detection-project

Based on the HinDroid architecture outlined in the following paper: https://www.cse.ust.hk/~yqsong/papers/2017-KDD-HINDROID.pdf


Spotting The Wolf In Sheep’s Clothing: Malware Detection for Android Applications Based on Structured Heterogeneous Information Networks

Braden Riggs, Raya Kavosh | Department of Data Science | University of California, San Diego, USA

Based on a paper: https://www.cse.ust.hk/~yqsong/papers/2017-KDD-HINDROID.pdf

Written report by Braden Riggs & Raya Kavosh: https://docs.google.com/document/d/1yZ8BqL1IgKfWMAvT7HqVKmIjasgUteZnEZf0Wczic34/edit?usp=sharing

Running Package:

In command line: python3 run.py

Or for running a test: python3 run.py -t, python3 run.py --test, or python3 run.py --Test

Docker Container: dockerfile
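
The three test-flag spellings listed above could all map onto one boolean option. The following is only a minimal sketch of how that might be done with argparse; the function name parse_test_flag is hypothetical and is not taken from run.py.

```python
import argparse


def parse_test_flag(argv):
    """Return True when any of the README's test spellings is present.

    Hypothetical helper: the real run.py may parse its arguments differently.
    """
    parser = argparse.ArgumentParser(
        description="HinDroid pipeline runner (illustrative sketch)"
    )
    # -t, --test and --Test are registered as aliases of the same flag.
    parser.add_argument(
        "-t", "--test", "--Test",
        dest="test",
        action="store_true",
        help="run the pipeline in test mode",
    )
    return parser.parse_args(argv).test
```

Calling parse_test_flag(["--Test"]) would select test mode, while an empty argument list would select the full pipeline.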

For EDA notebook:

First run the "python3 run.py" command; this creates the JSON files the EDA uses. Once the data_extract directory is populated with these files, the EDA notebook can run.

For adjusting data ingestion and params: config/data_params.json

There are 4 config files to adjust:

  • config/data_params.json

        mal_fp: Location of malware apps
    
        benign_fp: Location of benign apps
    
        limiter: if set to false, the pipeline processes every app in the directory; otherwise it processes only the number of apps specified below

        lim_mal: maximum number of malware apps parsed

        lim_benign: maximum number of benign apps parsed
    
  • config/dict_build.json

        directory: filepath to find processed files
        
        verbose: if set to true, more print statements are triggered to help track progress

        truncate: if set to true, matrices A, B, P, and I have every API that occurs no more than lower_bound_api_count times filtered out, speeding up runtime significantly

        lower_bound_api_count: APIs occurring this many times or fewer are filtered out; values greater than 1 can result in accuracy loss
    
  • config/parsing_data.json

        multithreading: if enabled, speeds up the feature parsing stage

        out_path: output path for created files

        verbose: if set to true, more print statements are triggered to help track progress
    
  • config/model.json

        multithreading: if enabled, speeds up the model training stage

        test_split: fraction of the data held out for testing model performance
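
As an illustration of the truncate and lower_bound_api_count options described above, here is a minimal sketch of how rare APIs might be filtered out before the matrices are built. The key names come from config/dict_build.json as listed; the example values, the function name filter_rare_apis, and the choice to count total occurrences across all apps are assumptions, not the project's confirmed behaviour.

```python
import json
from collections import Counter

# Hypothetical example values; only the key names come from the README.
EXAMPLE_DICT_BUILD = json.loads("""
{
  "directory": "data_extract",
  "verbose": true,
  "truncate": true,
  "lower_bound_api_count": 1
}
""")


def filter_rare_apis(app_apis, config):
    """Drop APIs that occur lower_bound_api_count times or fewer.

    app_apis maps an app name to the list of API calls found in it.
    Assumption: occurrences are counted across all apps combined.
    """
    if not config["truncate"]:
        return app_apis
    counts = Counter(api for apis in app_apis.values() for api in apis)
    bound = config["lower_bound_api_count"]
    return {
        app: [api for api in apis if counts[api] > bound]
        for app, apis in app_apis.items()
    }
```

With a bound of 1, any API seen only once in the whole dataset disappears, which shrinks the matrices but, as noted above, larger bounds may cost accuracy.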
    

Results:

The models were trained and tested on a selection of 96 apps: 48 benign and 48 malicious. We chose 96 because it divides evenly into 12 groups of 8, which allowed us to multithread the feature extraction and matrix creation, effectively cutting computation time by a factor of 8. Even so, extracting the features, training the models, and evaluating performance took a considerable amount of time, over 2 hours. The balanced dataset was then split so that 70% of the apps were used for training and 30% for testing. We also tested the logistic regression model included with the EDA portion of our project; this model is the baseline against which we evaluate the performance of our new SVM kernels. The logistic regression model was trained on a range of features, including the unique APIs in each app and various method counts. The performance of the baseline logistic regression model and the custom SVM kernels is as follows:

[Figure: performance of the baseline logistic regression model and the custom SVM kernels]

For analysis and further details see: https://docs.google.com/document/d/1yZ8BqL1IgKfWMAvT7HqVKmIjasgUteZnEZf0Wczic34/edit?usp=sharing
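
The dataset arithmetic in the results (96 apps, groups of 8, a 70/30 split) can be sketched as follows. This is only an illustration under stated assumptions: the app names, the helper names batches and split_train_test, the fixed seed, and the rounding of the test set size are all hypothetical, and reading "12 groups of 8" as 12 batches of 8 concurrently processed apps is one interpretation.

```python
import random


def batches(items, batch_size):
    """Split items into equal batches; fail if they do not divide evenly."""
    if len(items) % batch_size != 0:
        raise ValueError("items do not divide evenly into batches")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def split_train_test(items, test_split, seed=0):
    """Shuffle a copy of items and hold out a test_split fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_split)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical app names standing in for the 48 benign and 48 malicious apps.
apps = [f"benign_{i}" for i in range(48)] + [f"malware_{i}" for i in range(48)]

groups = batches(apps, 8)                  # 12 groups of 8 apps each
train, test = split_train_test(apps, 0.3)  # roughly 70% train, 30% test
```

Because 30% of 96 is not a whole number, the exact train and test sizes depend on the rounding rule, which this sketch picks arbitrarily.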

Acknowledgments:

Special Thanks to Aaron Fraenkel and Shivam Lakhotia for mentoring this project.

Thanks to the UCSD-DSMLP server for hosting the project.
