Spotting The Wolf In Sheep’s Clothing: Malware Detection for Android Applications Based on Structured Heterogeneous Information Networks

Braden Riggs, Raya Kavosh | Department of Data Science | University of California, San Diego, USA

Based on a paper: https://www.cse.ust.hk/~yqsong/papers/2017-KDD-HINDROID.pdf

Written report by Braden Riggs & Raya Kavosh: https://docs.google.com/document/d/1yZ8BqL1IgKfWMAvT7HqVKmIjasgUteZnEZf0Wczic34/edit?usp=sharing

Running Package:

In command line: >>> python3 run.py

Or for running a test: >>> python3 run.py -t or python3 run.py --test or python3 run.py --Test
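The three test flags above could be wired up along these lines. This is only a sketch: the actual argument handling inside run.py is not shown in this README, and the function name parse_args is a hypothetical name used for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; parse_args is a hypothetical helper name."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch of the run.py command line."
    )
    # Register all three spellings from the README as one boolean switch.
    parser.add_argument(
        "-t", "--test", "--Test",
        dest="test",
        action="store_true",
        help="run the pipeline in test mode",
    )
    return parser.parse_args(argv)
```

Calling parse_args(["--Test"]) sets test to True, while parse_args([]) leaves it False, so the three spellings behave identically.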

Docker Container: >>> dockerfile

For EDA notebook:

First run the "python3 run.py" command; this creates the JSON files the EDA uses. Once the data_extract directory is populated with these files, the EDA notebook can run.

For adjusting data ingestion and params: config/data_params.json

There are 4 config files to adjust:

config/data_params.json

    mal_fp: Location of malware apps

    benign_fp: Location of benign apps

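    limiter: if set to false, the pipeline processes every app in the directory; otherwise it processes only the number of apps specified by lim_mal and lim_benign

    lim_mal: maximum number of malware apps parsed when limiter is enabled

    lim_benign: maximum number of benign apps parsed when limiter is enabled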
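As a rough illustration of how limiter, lim_mal, and lim_benign interact, here is a minimal sketch. The helper name apply_limiter, the example paths, and the choice to take the first apps in sorted order are all assumptions; the README does not say how the pipeline picks which apps to keep.

```python
def apply_limiter(apps, limiter, limit):
    """Return every app when limiter is false, else at most `limit` apps.

    Hypothetical helper: taking a sorted prefix is an assumption, not
    necessarily how the project selects apps.
    """
    apps = sorted(apps)
    if not limiter:
        return apps
    return apps[:limit]


# Example values shaped like config/data_params.json (paths are made up).
params = {
    "mal_fp": "data/malware",
    "benign_fp": "data/benign",
    "limiter": True,
    "lim_mal": 2,
    "lim_benign": 1,
}

malware = apply_limiter(["m3", "m1", "m2"], params["limiter"], params["lim_mal"])
benign = apply_limiter(["b2", "b1"], params["limiter"], params["lim_benign"])
```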

config/dict_build.json

    directory: filepath to find processed files
    
    verbose: if set to true, more print statements will trigger, helping track progress
    
    truncate: if set to true, Matrices A, B, P, and I will have all APIs that occur lower_bound_api_count times or fewer filtered out, speeding up runtime significantly
    
    lower_bound_api_count: APIs occurring this many times or fewer will be filtered out; values greater than 1 can result in accuracy loss
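The truncate option can be pictured with a small sketch like the one below. It assumes that an API's count is the number of apps it appears in and that the data is a dict mapping app names to API lists; both the counting rule and the helper name filter_rare_apis are assumptions, since the README does not describe the internal representation.

```python
from collections import Counter


def filter_rare_apis(app_apis, lower_bound_api_count):
    """Drop APIs whose count is at or below the bound (hypothetical helper).

    app_apis maps an app name to the list of API names found in it.
    Counting each API once per app is an assumption made for this sketch.
    """
    counts = Counter(api for apis in app_apis.values() for api in set(apis))
    keep = {api for api, n in counts.items() if n > lower_bound_api_count}
    return {
        app: [api for api in apis if api in keep]
        for app, apis in app_apis.items()
    }


apps = {"app_a": ["sendSms", "getDeviceId"], "app_b": ["sendSms"]}
filtered = filter_rare_apis(apps, lower_bound_api_count=1)
```

With a bound of 1, getDeviceId (seen in one app) is dropped and sendSms (seen in two) is kept, which shrinks every matrix built from the API set.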

config/parsing_data.json

    multithreading: if enabled, speeds up the feature parsing stage
    
    out_path: output path for created files
    
    verbose: if set to true, more print statements will trigger, helping track progress

config/model.json

    multithreading: if enabled, speeds up the model training stage
    
    test_split: fraction of the data held out for testing model performance

Results:

The models were trained and tested on a selection of 96 apps: 48 benign and 48 malicious. We chose 96 because it divides evenly into 12 groups of 8, which let us multithread feature extraction and matrix creation, effectively cutting computation time by a factor of 8. Even so, extracting the features, training the model, and evaluating performance took over 2 hours. This balanced dataset was then split, with 70% of the apps used for training and 30% for testing.

Additionally, we tested a logistic regression model included with the EDA portion of our project; it serves as the “standard”, or baseline, against which we evaluate our new SVM kernels. This logistic regression model was trained on a range of features, including the unique APIs in each app and various method counts. The performance of the baseline logistic regression model and the custom SVM kernels is as follows:

For analysis and further details see: https://docs.google.com/document/d/1yZ8BqL1IgKfWMAvT7HqVKmIjasgUteZnEZf0Wczic34/edit?usp=sharing
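The batching and the 70/30 split described under Results can be sketched as follows. This is an illustration only: the app names, the fixed random seed, and the plain shuffle-then-slice split are assumptions, and the project's own split procedure may differ (for example, it may balance classes within each split).

```python
import random


def make_batches(items, batch_size):
    """Split items into consecutive groups of batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def split_train_test(items, test_split, seed=0):
    """Shuffle a copy of items and hold out a test_split fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_split)
    return shuffled[n_test:], shuffled[:n_test]


# 48 benign and 48 malicious placeholder app names (hypothetical).
apps = [f"benign_{i}" for i in range(48)] + [f"malware_{i}" for i in range(48)]

batches = make_batches(apps, 8)          # 12 groups of 8 apps each
train, test = split_train_test(apps, 0.3)
```

Because 96 is a multiple of 8, every batch is full, so eight workers can each take one app from a group with none left idle.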

Acknowledgments:

Special Thanks to Aaron Fraenkel and Shivam Lakhotia for mentoring this project.

Thanks to the UCSD-DSMLP server for hosting the project.
