
seclab-fudan / apigraph


Builds a relation graph of Android APIs to capture the semantics between APIs, and uses it to enhance Android malware detectors.

License: GNU General Public License v3.0

Python 100.00%

apigraph's Introduction

APIGraph

This repo hosts the source code and dataset of APIGraph. For more details about our CCS 2020 paper, please see APIGraph-website.

Update: The idea of APIGraph is not limited to Android malware detection; it can be extended to other tasks on other platforms, e.g. Windows malware detection/classification. To show the generality of APIGraph, we have adapted it to Windows malware detection tasks; see the windows branch.

Update2: The malware from the top 30 families in Table 10 of the paper has been uploaded, in the format (hash,time,family). Note that the family labels were obtained using the Euphony tool in 2020 and may have changed since then. All the malware was also downloaded from the three open repositories.
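As a rough illustration of working with the (hash,time,family) list, the sketch below parses such records and counts samples per family. It assumes one comma-separated record per line; the actual file name, time format, and delimiter details are not stated here, so the function names and sample rows are hypothetical.

```python
import csv
from collections import Counter
from io import StringIO


def parse_family_records(text):
    """Parse lines assumed to look like `hash,time,family` into tuples."""
    records = []
    for row in csv.reader(StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        sample_hash, timestamp, family = (field.strip() for field in row)
        records.append((sample_hash, timestamp, family))
    return records


def family_counts(records):
    """Count how many samples carry each family label."""
    return Counter(family for _, _, family in records)


# Made-up placeholder rows, not real entries from the released list.
sample = "aaa111,2016-03,familyA\nbbb222,2017-08,familyB\nccc333,2017-09,familyA\n"
records = parse_family_records(sample)
counts = family_counts(records)
```

Since the labels may have drifted since 2020, a count like this is best treated as a snapshot rather than ground truth.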

Source Code

The source code is located in the src directory, including:

  • getAllEntities.py - The script to get all entities from API documents.
  • getAllRelations.py - The script to extract relations between entities according to pre-defined templates.
  • TransE.py - The script to convert each API in the relation graph into an embedding representation.
  • clusterEmbedding.py - The script to cluster API embeddings into semantically similar groups through k-means.
  • res - This directory stores the resources used in the above scripts, including API documents (already parsed into JSON format), permission relations from PScout, and some intermediate files.
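The last two scripts can be pictured with a small, self-contained sketch: a TransE-style distance for a (head, relation, tail) triple, followed by plain k-means over the resulting API embeddings. This is not the code in TransE.py or clusterEmbedding.py; the function names and the 2-D vectors are made up for illustration, and the k-means here is a minimal Lloyd's iteration written with the standard library only.

```python
import math
import random


def transe_score(head, relation, tail):
    """TransE distance ||h + r - t||_2; a lower value means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Minimal Lloyd's k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, v in enumerate(vectors):
            assignment[i] = min(range(k), key=lambda c: squared_distance(v, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy embeddings (invented values) for four Android APIs.
embeddings = {
    "TelephonyManager.getDeviceId": [0.9, 0.1],
    "TelephonyManager.getImei": [1.0, 0.0],
    "SmsManager.sendTextMessage": [0.0, 1.0],
    "SmsManager.sendDataMessage": [0.1, 0.9],
}
names = list(embeddings)
clusters = kmeans([embeddings[n] for n in names], k=2)
api_to_cluster = dict(zip(names, clusters))

# A triple that fits h + r = t exactly scores lower than a corrupted one.
good = transe_score([0.0, 1.0], [1.0, 0.0], [1.0, 1.0])
bad = transe_score([0.0, 1.0], [1.0, 0.0], [0.0, 0.0])
```

Mapping each API to its cluster index, as in `api_to_cluster`, is one way a detector could treat semantically similar APIs as a single feature.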

Dataset

The dataset is located in the Dataset directory. This dataset contains 322,594 Android apps, including 32,089 malicious and 290,505 benign samples spanning 7 years, i.e. 2012 - 2018. The benign samples all come from Google Play and were downloaded from AndroZoo. The malware samples were downloaded from three sources: VirusShare, VirusTotal Academic Samples, and the AMD dataset. The hashes are organized by year and by maliciousness, in txt format.

Note: For security and copyright reasons, we can only release the md5 hashes of these samples. Interested users should download these samples from the above four sources.
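Because only md5 hashes are released, a first step before requesting samples from the sources is usually to read a hash list and check that every entry is well formed. The sketch below assumes one hash per line; the helper name is hypothetical and the exact layout of the txt files is not specified here.

```python
import re

# An md5 digest is 32 hexadecimal characters.
MD5_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def load_md5_hashes(lines):
    """Split lines into (valid md5 hashes, rejected entries), ignoring blanks."""
    hashes, rejected = [], []
    for line in lines:
        token = line.strip()
        if not token:
            continue
        if MD5_PATTERN.fullmatch(token):
            hashes.append(token.lower())
        else:
            rejected.append(token)
    return hashes, rejected


# Example input: the md5 of the empty string, a blank line, and a bad entry.
example_lines = ["D41D8CD98F00B204E9800998ECF8427E\n", "\n", "not-a-hash\n"]
valid, invalid = load_md5_hashes(example_lines)
```

With a real list, `lines` could simply be an open file object, for example one of the per-year txt files.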

Baselines

We tested four state-of-the-art Android malware classifiers as the baselines, as listed below.

Classifier    Publication     API feature format          Algorithm       Reproduction
MamaDroid     NDSS 2017       Markov Chain of API Calls   Random Forest   source code
DroidEvolver  Euro S&P 2019   API Occurrence              Model Pool      source code
Drebin        NDSS 2014       Selected API Occurrence     SVM             re-implemented
Drebin-DL     ESORICS 2017    Selected API Occurrence     DNN             re-implemented

These four classifiers were published in top venues, and their source code is either publicly available or could be re-implemented by us, sometimes with the help of their authors. In particular, we thank the authors of DroidEvolver for their help.
We strictly follow their configurations to make sure our reproductions achieve the results stated in their papers.
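The two API feature formats named in the table can be contrasted with a toy sketch: a binary occurrence vector over a fixed API vocabulary, and first-order Markov transition probabilities over a list of calls. This is a simplification for illustration only and not the feature extraction of any of the four baselines; the function names and the single-letter API names are placeholders.

```python
from collections import Counter, defaultdict


def occurrence_vector(calls, vocabulary):
    """Binary vector: 1 if the API appears anywhere in the app, else 0."""
    present = set(calls)
    return [1 if api in present else 0 for api in vocabulary]


def markov_transitions(calls):
    """Estimate P(next | current) from consecutive pairs in a call list."""
    counts = defaultdict(Counter)
    for src, dst in zip(calls, calls[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(nexts.values()) for dst, n in nexts.items()}
        for src, nexts in counts.items()
    }


# Placeholder call list and vocabulary.
calls = ["A", "B", "A", "C"]
vocabulary = ["A", "B", "C", "D"]
occurrence = occurrence_vector(calls, vocabulary)
transitions = markov_transitions(calls)
```

Either representation can be made coarser by replacing each API with a group identifier before building the features.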

apigraph's People

Contributors

xhzhang


apigraph's Issues

Re-implementation of Drebin

Hi, thanks for the contribution.
For the implementation of Drebin, may I ask whether the feature set of strings (e.g., component names, network addresses) is extracted from your 322K samples?
In my opinion, the large number of unique strings can lead to a huge feature dimension.
Am I wrong, or can you explain the size of the final feature vector that is sent to the ML classifier?
