This project forked from ibm/d2a

This repository is to support contributions for tools and new data entries for the D2A dataset hosted in DAX

License: Apache License 2.0

Shell 7.67% Python 92.33%

D2A Dataset and Generation Pipeline

This repository supports contributions to the tools that generate the D2A dataset, which is hosted on IBM Data Asset eXchange.

Introduction

D2A is a differential-analysis-based approach that labels issues reported by inter-procedural static analyzers as either more likely to be true positives or more likely to be false positives. Our goal is to generate a large labeled dataset that can be used by machine learning approaches to code understanding and vulnerability detection.

Why D2A?

Because programs can exhibit diverse behaviors, training machine learning models for code understanding and vulnerability detection requires large datasets. However, according to a recent survey, the lack of good real-world datasets has become a major barrier for this field. Many existing works built their own datasets using different criteria, and some released only part of them. The following table summarizes the characteristics of a few popular datasets for software vulnerability detection tasks.

Table: Dataset comparison.

Because there is no oracle, no dataset for AI-based vulnerability detection is both large enough and 100% correctly labeled. Datasets generated from manual reviews generally have better-quality labels. However, because manual review does not scale, they are usually not large enough for model training. On the other hand, although the quality of the D2A dataset is bounded by the capability of static analysis, D2A can produce large datasets with better labels than those labeled solely by static analysis, and can complement existing manually labeled datasets.

Differential Analysis and D2A Dataset Generation Pipeline

Intuition

For projects with commit histories, we assume some commits are code changes that fix bugs. We run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that were fixed by the commit. When we analyze a large number of consecutive version pairs and aggregate the results, we also find issues that appear in a before-commit version but never disappear in any after-commit version. We consider these unlikely to be real bugs because they were never fixed. Then, we de-duplicate the issues found in all versions and adjust their classifications according to the commit history. Finally, we label the issues that are very likely to be real bugs as positives and the remaining ones as negatives.
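The differential logic described above can be illustrated with a minimal sketch. It is not the repository's auto-labeler: the function name `label_issues` and the use of plain strings as issue identifiers are assumptions made for illustration, and the commit-history adjustments mentioned above are omitted.

```python
def label_issues(version_pairs):
    """Label de-duplicated issues from (before, after) analyzer results.

    version_pairs: iterable of (before_issues, after_issues), each a set of
    hashable issue identifiers. The identifiers are hypothetical; a real
    pipeline would derive them from the analyzer output.
    Returns a dict mapping each issue to 1 (likely real bug) or 0.
    """
    labels = {}
    for before, after in version_pairs:
        for issue in before:
            disappeared = issue not in after
            # An issue that disappeared across any commit stays positive.
            if disappeared or labels.get(issue) == 1:
                labels[issue] = 1
            else:
                labels[issue] = 0
        for issue in after:
            # Issues seen only after a commit start out as negatives.
            labels.setdefault(issue, 0)
    return labels


pairs = [
    ({"a", "b"}, {"b"}),            # "a" disappears after the first commit
    ({"b", "c"}, {"b", "c", "d"}),  # nothing disappears here
]
labels = label_issues(pairs)
```

In this toy input only `"a"` is labeled positive. Using a dict keyed by issue identifier also performs the de-duplication across versions.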

Components

The following figure shows the overview of the D2A dataset generation pipeline.

Figure: Overview of the D2A dataset generation pipeline.

  • Commit Message Analysis (scripts/infer_pipeline/commit_msg_analyzer) analyzes the commit messages and identifies the commits that are more likely to refer to vulnerability fixes.

  • Pairwise Static Analysis (scripts/infer_pipeline) runs the analyzer on the before-commit and after-commit versions for the commit hashes selected in the previous step.

  • Auto-labeler (scripts/auto_labeler) merges the analysis results for all selected commit versions and labels each issue based on differential logic and commit history heuristics.

  • Function Extractor (scripts/dataset_generator) extracts the bodies of the functions involved in the trace.
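To make the first pipeline step concrete, here is a deliberately naive stand-in for commit selection. It is not the analyzer in scripts/infer_pipeline/commit_msg_analyzer, whose method is not described here: the keyword pattern, the function name `likely_fix_commits`, and the `(hash, message)` input shape are all assumptions made for illustration.

```python
import re

# Hypothetical keyword pattern; not taken from the D2A repository.
FIX_PATTERN = re.compile(
    r"\b(fix(es|ed)?|bug|overflow|leak|null|cve-\d{4}-\d+)\b",
    re.IGNORECASE,
)


def likely_fix_commits(commits):
    """Return the hashes of commits whose message matches FIX_PATTERN.

    commits: iterable of (commit_hash, message) pairs.
    """
    return [h for h, msg in commits if FIX_PATTERN.search(msg)]


commits = [
    ("abc123", "Fix buffer overflow in parser"),
    ("def456", "Update documentation"),
    ("789abc", "Bump version; CVE-2020-1234 mitigated"),
]
selected = likely_fix_commits(commits)
```

The selected hashes would then feed the pairwise static analysis step, which analyzes the versions before and after each selected commit.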

Sample Types

There are two types of samples:

  • Samples based on static analyzer outputs. Such samples are reported by the static analyzer and thus all have analyzer outputs, including bug traces. We extract the functions mentioned in the trace together with other information. We use "label_source": "auto_labeler" to denote such samples. The labels can be 0 (e.g., auto-labeler_0.json) or 1 (e.g., auto-labeler_1.json) according to the auto-labeler. Please refer to Sec. III-C in the D2A paper for details.

  • Samples from the fixed versions. Such samples are not reported by the analyzer, so they are not generated directly from static analysis outputs and do not contain them. Instead, given samples with positive auto-labeler labels (i.e., label_source == "auto_labeler" && label == 1) found in the before-fix version, we extract the corresponding functions in the after-fix version and label them 0 (e.g., after_fix_0.json). We use "label_source": "after_fix_extractor" to denote such samples. More information can be found in Sec. III-D in the D2A paper.
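The two sample types can be told apart by the `label_source` and `label` fields described above. The sketch below assumes the samples are already loaded as a list of Python dicts. The `id` field and both function names are hypothetical, and loading the files is out of scope.

```python
from collections import Counter


def positive_auto_labeler_samples(samples):
    """Return samples the auto-labeler marked as likely true positives."""
    return [
        s for s in samples
        if s.get("label_source") == "auto_labeler" and s.get("label") == 1
    ]


def count_by_type(samples):
    """Count samples per (label_source, label) combination."""
    return Counter((s.get("label_source"), s.get("label")) for s in samples)


# Toy records; the "id" field is hypothetical and only aids readability.
samples = [
    {"id": "s1", "label_source": "auto_labeler", "label": 1},
    {"id": "s2", "label_source": "auto_labeler", "label": 0},
    {"id": "s3", "label_source": "after_fix_extractor", "label": 0},
]
positives = positive_auto_labeler_samples(samples)
counts = count_by_type(samples)
```

Counting by `(label_source, label)` is a quick way to check class balance before training, since after-fix samples are always labeled 0.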

Downloading the D2A Dataset and the Splits

The D2A dataset and the global splits can be downloaded from IBM Data Asset eXchange.

The latest version: v1.0.0.

Sample Description and Dataset Stats

Details can be found in Sample Description and Dataset Stats.

Using the Dataset

Please refer to Dataset Usage Examples for details.

D2A Leaderboard

Leaderboard

More details

Annotating More Projects

Please refer to Running the Dataset Generation Pipeline.

D2A Paper and Citation

Paper

D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

[arXiv] [ICSE-SEIP'21]

Citation

Please cite the following paper if the D2A dataset or generation pipeline is useful for your research.

@inproceedings{D2A,
  author = {Zheng, Yunhui and Pujar, Saurabh and Lewis, Burn and Buratti, Luca and Epstein, Edward and Yang, Bo and Laredo, Jim and Morari, Alessandro and Su, Zhong},
  title = {D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis},
  year = {2021},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  series = {ICSE-SEIP '21},
  booktitle = {Proceedings of the ACM/IEEE 43rd International Conference on Software Engineering: Software Engineering in Practice}
}
