This project forked from vulnerabilitydetection/vulnerabilitydetectionresearch

VulnerabilityDetectionResearch

Vulnerability Detection with Fine-grained Interpretations

This repository contains the code and data for Vulnerability Detection with Fine-grained Interpretations

Introduction

Despite the successes of machine learning-based vulnerability detectors (VD), they are limited to providing only the decision on whether a given piece of code is vulnerable, without details on which part of the code is relevant to the detected vulnerability. We present IVDetect, an interpretable vulnerability detector with the philosophy of using Artificial Intelligence (AI) to detect vulnerabilities, while using Intelligence Assistant (IA) by providing VD interpretations at the fine-grained level in terms of vulnerable statements. For vulnerability detection, we separately consider the vulnerable statements and their surrounding contexts via data and control dependencies. This allows our model to discriminate vulnerable statements better than using the mixture of vulnerable code and contextual code, as existing approaches do. In addition to the coarse-grained vulnerability detection result, we leverage interpretable ML to provide users with fine-grained interpretations that include the sub-graph in the program dependence graph (PDG) with the crucial statements that are relevant to the detected vulnerability. Our empirical evaluation on vulnerability databases shows that IVDetect outperforms the existing ML-based approaches by 64–122% and 105–255% in top-10 nDCG and MAP ranking scores. IVDetect correctly points out the vulnerable statements relevant to the vulnerability via its interpretations in 67% of the cases with a top-5 ranked list. It improves over the ATT and GRAD interpretation models by 12.3–400% and 9–400% in accuracy.
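
For readers unfamiliar with the ranking scores mentioned above, the following is a minimal sketch of the standard textbook definitions of nDCG@k and average precision@k (MAP is the mean of average precision over all queries). It is not the evaluation code used in the paper; the function names are hypothetical, and the average-precision normalization (dividing by the number of relevant items found in the top k) is one common variant among several.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def average_precision_at_k(relevances, k):
    """Mean of precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings, k):
    """MAP: average of average_precision_at_k over several ranked lists."""
    if not rankings:
        return 0.0
    return sum(average_precision_at_k(r, k) for r in rankings) / len(rankings)
```

Each input list holds relevance labels in ranked order, for example 1 for a truly vulnerable item and 0 otherwise.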


Contents

  1. Dataset
  2. AST and Graph Generation
  3. Preprocessing
  4. Requirement
  5. Settings
  6. Code
  7. Reference

Dataset

The datasets we used in the paper:

Fan et al. [1]: https://drive.google.com/file/d/1-0VhnHBp9IGh90s2wCNjeCMuy70HPl8X/view?usp=sharing

Reveal [2]: https://drive.google.com/drive/folders/1KuIYgFcvWUXheDhT--cBALsfy1I4utOy

FFmpeg+QEMU [3]: https://drive.google.com/file/d/1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF

AST and Graph Generation

In this study, we use Joern to generate the ASTs and graphs. However, Joern is updated frequently, and some of its functionality has changed. If you want to use the scripts that we used to generate the graphs, please use:

git checkout cbca30d2631a48aed47be1ba46c6d8b5aa23c103

to roll back Joern to the older version that we used. The scripts for generating the graphs can be found at:

https://github.com/vulnerabilitydetection/VulnerabilityDetectionResearch/tree/new_implementation/IVDetect/scripts/joern_graphs.sc

If you are using newer versions of Joern or you have any detailed questions about Joern, please go to Joern's website: https://github.com/joernio/joern for more details on AST and graph generation.

We provide an example CSV dataset to show what the generated dataset looks like: https://drive.google.com/file/d/1LHOC4JDpnQ7gWnEHGfc4soQYHAPomlNp/view?usp=sharing

See utils/process.py for more details about how to use the generated dataset.

Based on our findings, the ASTs and graphs generated by different versions of Joern may differ significantly. If you use a newer version of Joern to generate the ASTs and graphs, the model's performance may differ from the results we reported in the paper.
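
Before preprocessing, it can help to compare the header of your generated CSV against the example dataset. The following is a minimal sketch using only the Python standard library; the function name and the file name example.csv are hypothetical, and the actual column names are defined by the example dataset and utils/process.py, not by this snippet.

```python
import csv


def peek_csv(path, n_rows=1):
    """Return the header and the first n_rows data rows of a CSV file."""
    # Code and graph fields can be very long, so raise the per-field limit.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    # "example.csv" is a placeholder for wherever the example dataset was saved.
    header, rows = peek_csv("example.csv")
    print(header)
```

If the header of your own file differs from the example's, the preprocessing code is unlikely to read it as intended.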

Preprocessing

After you generate the ASTs and graphs and store them in the same format as the example data, you can use the preprocessing code provided in utils/process.py to preprocess the data and generate the features that are used in our model.

Alternatively, you can go directly to the Code section. gen_graphs.py shows how to use the preprocessing code in utils/process.py to generate the features for the model.

Requirement

Please check all requirements in requirement.txt.

Settings

Our approach can use NNI (Auto-ML) to tune the parameters. To do so, uncomment all lines containing nni in main.py and comment out line 195 in main.py. Then run nnictl create --config config.yml to tune the model parameters automatically.
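
For orientation, NNI-enabled trial scripts typically follow the pattern sketched below. This illustrates the general NNI trial API (nni.get_next_parameter and nni.report_final_result) and is not a copy of main.py; the parameter names, default values, and the train_and_evaluate stub are hypothetical.

```python
try:
    import nni  # available only when NNI is installed
except ImportError:
    nni = None

# Hypothetical defaults; the real parameters are defined in main.py and config.yml.
DEFAULT_PARAMS = {"lr": 1e-3, "hidden_size": 128}


def load_params():
    """Start from the defaults and overlay any values proposed by the NNI tuner."""
    params = dict(DEFAULT_PARAMS)
    if nni is not None:
        params.update(nni.get_next_parameter() or {})
    return params


def train_and_evaluate(params):
    """Placeholder for the real training loop; returns a validation score."""
    return 0.0


def run_trial():
    """Run one trial and report its final score back to NNI when available."""
    params = load_params()
    score = train_and_evaluate(params)
    if nni is not None:
        nni.report_final_result(score)
    return params, score
```

Falling back to default values when NNI is absent keeps the same script usable for both tuned and plain runs.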

Code

  1. Use git clone https://github.com/vulnerabilitydetection/VulnerabilityDetectionResearch.git to get the repository.

  2. Run gen_graphs.py. Line 166 sets the output directory and line 52 sets the input data name. This first run is expected to end with a file-not-found error.

  3. Run glove/ash.sh and glove/pdg.sh to generate the GloVe embeddings.

  4. Comment out line 55 in gen_graphs.py and run gen_graphs.py again.

  5. Run train_test_valid.py to split the dataset.

  6. Run main.py to train and test the model.

The pre-trained model can be downloaded from: https://drive.google.com/file/d/1KQv0aRUFCh-_jQCu8K7uQsB0c_5uCQKa/view?usp=sharing

The relevant test dataset can be downloaded from: https://drive.google.com/file/d/1uMnm7_W9DgXN4AbJ0iUir052H1AF4hA1/view?usp=sharing

Because of the randomness in deep learning models and differences in data splitting, the model's performance may differ from the results reported in the paper.
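
One way to remove the variation that comes from data splitting is to fix the random seed used for the split. The following is a minimal sketch, assuming an 80/10/10 split and seed 0; both values and the function name are illustrative, and this is not the logic of train_test_valid.py.

```python
import random


def split_indices(n_samples, seed=0, train_frac=0.8, valid_frac=0.1):
    """Shuffle sample indices with a fixed seed and cut them into three parts."""
    indices = list(range(n_samples))
    # A private Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_frac)
    n_valid = int(n_samples * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test
```

Calling split_indices twice with the same arguments returns the same partition. Randomness in model initialization and training is a separate source of variation that this does not address.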

Reference

[1] Jiahao Fan, Yi Li, Shaohua Wang, and Tien Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. In The 2020 International Conference on Mining Software Repositories (MSR). IEEE.

[2] Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2020. Deep Learning based Vulnerability Detection: Are We There Yet? arXiv preprint arXiv:2009.07235 (2020).

[3] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems. 10197–10207.
