License: Apache License 2.0

D2A Dataset and Generation Pipeline

This repository supports contributions of tools and new data entries for the D2A dataset hosted on IBM Data Asset eXchange.


Introduction

D2A is a differential-analysis-based approach to label issues reported by inter-procedural static analyzers as ones that are more likely to be true positives and ones that are more likely to be false positives. Our goal is to generate a large labeled dataset that can be used by machine-learning approaches to code understanding and vulnerability detection.

Why D2A?

Because programs can exhibit diverse behaviors, training machine learning models for code understanding and vulnerability detection requires large datasets. However, according to a recent survey, the lack of good, real-world datasets has become a major barrier for this field. Many existing works created their own datasets based on differing criteria and may release only parts of them. The following table summarizes the characteristics of a few popular datasets for software vulnerability detection tasks.

Dataset Comparison.

Due to the lack of an oracle, there is no perfect dataset that is both large enough and 100% correctly labeled for AI-based vulnerability detection tasks. Datasets generated from manual reviews generally have better-quality labels; however, limited by their nature, they are usually not large enough for model training. On the other hand, although the quality of the D2A dataset is bounded by the capability of the static analyzer, D2A can produce large datasets with better labels compared to ones labeled solely by static analysis, and it can complement existing manually labeled datasets.

Differential Analysis and D2A Dataset Generation Pipeline

Intuition

For projects with commit histories, we assume some commits are code changes that fix bugs. We run static analysis on the versions before and after such commits. If an issue detected in a before-commit version disappears in the corresponding after-commit version, it is very likely a real bug that the commit fixed. Conversely, when we analyze a large number of consecutive version pairs and aggregate the results, some issues found in before-commit versions never disappear in any after-commit version; these are not very likely to be real bugs, because they were never fixed. We then de-duplicate the issues found across all versions and adjust their classifications according to the commit history. Finally, we label the issues that are very likely real bugs as positives and the remaining ones as negatives.
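
To make the intuition concrete, here is a minimal sketch of the differential logic. It assumes issues have already been de-duplicated to stable fingerprints that can be matched across versions; the data structures are hypothetical, not the pipeline's actual representation.

# Minimal sketch of differential labeling (data structures are hypothetical).
# Each issue is assumed to be reduced to a stable fingerprint so the same
# issue can be matched across before-commit and after-commit versions.

def label_issues(version_pairs):
    """version_pairs: iterable of (before_issues, after_issues) fingerprint sets."""
    seen, fixed = set(), set()
    for before, after in version_pairs:
        seen |= before
        # An issue present before a fix commit but gone after it was
        # likely a real bug that the commit fixed.
        fixed |= before - after
    # Issues that never disappear in any pair were never fixed, so they
    # are labeled 0 (likely false positives); fixed issues are labeled 1.
    return {issue: int(issue in fixed) for issue in sorted(seen)}

# Example: issue "a" disappears after the first commit, "b" never does.
print(label_issues([({"a", "b"}, {"b"}), ({"b"}, {"b"})]))  # {'a': 1, 'b': 0}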

Components

The following figure gives an overview of the D2A dataset generation pipeline.

The Overview of D2A Dataset Generation Pipeline.

  • Commit Message Analysis (scripts/infer_pipeline/commit_msg_analyzer) analyzes commit messages and identifies the commits that are more likely to refer to vulnerability fixes (a simplified sketch follows this list).

  • Pairwise Static Analysis (scripts/infer_pipeline) runs the analyzer on the before-commit and after-commit versions for the commit hashes selected in the previous step.

  • Auto-labeler (scripts/auto_labeler) merges the analysis results for all selected commit versions and labels each issue based on the differential logic and commit-history heuristics.

  • Function Extractor (scripts/dataset_generator) extracts the bodies of the functions involved in each trace.
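
As a rough illustration of the first component, the sketch below filters commit messages with a keyword heuristic. This is only a hypothetical approximation; the actual analyzer in scripts/infer_pipeline/commit_msg_analyzer may use different and richer criteria.

import re

# Hypothetical keyword heuristic approximating the commit-message analysis
# step; the real analyzer may use different and more sophisticated criteria.
FIX_HINTS = re.compile(
    r"\b(fix(es|ed)?|overflow|out[- ]of[- ]bounds|use[- ]after[- ]free|"
    r"leak|CVE-\d{4}-\d+)\b",
    re.IGNORECASE,
)

def looks_like_vulnerability_fix(commit_message: str) -> bool:
    # Flag commits whose messages mention security-fix keywords.
    return bool(FIX_HINTS.search(commit_message))

print(looks_like_vulnerability_fix("Fix heap overflow in parse_header()"))  # True
print(looks_like_vulnerability_fix("Update README"))                        # False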

Sample Types

There are two types of samples:

  • Samples based on static analyzer outputs. Such samples are reported by the static analyzer, so they all have analyzer outputs, including bug traces. We extract the functions mentioned in the trace together with other information. We use "label_source": "auto_labeler" to denote such samples. The labels can be 0 (e.g., auto-labeler_0.json) or 1 (e.g., auto-labeler_1.json) according to the auto-labeler. Please refer to Sec. III-C of the D2A paper for details.

  • Samples from the fixed versions. Such samples are not generated directly from static analysis outputs because the analyzer does not report them, so they contain no analyzer outputs. Instead, given samples with positive auto-labeler labels (i.e., label_source == "auto_labeler" && label == 1) found in the before-fix version, we extract the corresponding functions in the after-fix version and label them 0 (e.g., after_fix_0.json). We use "label_source": "after_fix_extractor" to denote such samples. More information can be found in Sec. III-D of the D2A paper. A minimal sketch of reading shards and filtering by these fields follows this list.
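
For illustration, here is a minimal sketch of reading a downloaded shard and splitting samples by these fields. It assumes each shard is a gzip-compressed stream of pickled dicts carrying at least the "id", "label", and "label_source" fields described above; the shard name is an example, and the official access patterns are in the Dataset Usage Examples linked below.

import gzip
import pickle
from collections import Counter

# Hedged sketch: iterate over a gzip-compressed stream of pickled dicts.
def iter_samples(path):
    with gzip.open(path, "rb") as fp:
        while True:
            try:
                yield pickle.load(fp)
            except EOFError:
                break

# Count samples per (label_source, label) pair, e.g. ("auto_labeler", 1).
counts = Counter(
    (s["label_source"], s["label"])
    for s in iter_samples("ffmpeg_after_fix_extractor_0.pickle.gz")
)
print(counts)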

Downloading the D2A Dataset and the Splits

The D2A dataset and the global splits can be downloaded from IBM Data Asset eXchange.

The latest version: v1.0.0.

Sample Description and Dataset Stats

Details can be found in Sample Description and Dataset Stats.

Using the Dataset

Please refer to Dataset Usage Examples for details.

D2A Leaderboard

  • Leaderboard

  • More details

Annotating More Projects

Please refer to Running the Dataset Generation Pipeline.

D2A Paper and Citation

Paper

D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

[arXiv] [ICSE-SEIP'21]

Citation

Please cite the following paper if the D2A dataset or generation pipeline is useful for your research.

@inproceedings{D2A,
  author = {Zheng, Yunhui and Pujar, Saurabh and Lewis, Burn and Buratti, Luca and Epstein, Edward and Yang, Bo and Laredo, Jim and Morari, Alessandro and Su, Zhong},
  title = {D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis},
  year = {2021},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  series = {ICSE-SEIP '21},
  booktitle = {Proceedings of the ACM/IEEE 43rd International Conference on Software Engineering: Software Engineering in Practice}
}

d2a's People

Contributors

bstee615, imgbotapp, saurabhpujar, stevemar, zyh1121


d2a's Issues

commit and its corresponding functions

I found that the commit id does not match the functions and changes you provide in each data sample.

For example, I looked up a commit by its commit id to see which functions it changed, and found that the functions changed in the commit are not the functions shown in your data.

I wonder why. Am I correct, or am I making a mistake?

Infer version

Hi,

Thank you for providing this dataset! Good job!!

Which version of Infer was used to generate the dataset? How much effort would it take to regenerate the dataset with a new version of Infer?

Is it possible to provide a uniform id between Leaderboard and D2A?

Hi,
Thank you for your good work!
Is there a direct way to get the mapping between the id of a "trace" sample in the Leaderboard data and the id of the corresponding sample in the D2A dataset? E.g., does "36270" (an "id" of a sample in the Trace task) correspond to "openssl_78ad164cf891c8081d89d03a0c062090d0f10551_1" (an "id" in the D2A dataset)?

I noticed that the ids within each task are unique. But is it possible to provide a uniform id, especially across the "Trace" and "Trace+Code" tasks?

Thanks.

_pickle.UnpicklingError: could not find MARK

Hi friends,
Here is an issue: ffmpeg_after_fix_extractor_0.pickle.gz cannot be loaded with pickle. When I load it, it raises "_pickle.UnpicklingError: could not find MARK". I searched for solutions, but found no method that fixes this problem. How can I deal with it? If there is a solution, please tell me. Thanks very much!

My demo code:

import gzip
import pickle

def read_data(file_path, file_name):
    # Each shard is a gzip-compressed stream of pickled dicts; keep
    # loading records until the stream is exhausted.
    with gzip.open(file_path + file_name, mode='rb') as fp:
        while True:
            try:
                item = pickle.load(fp, encoding='bytes')
                print(item['id'])
                ...
            except EOFError:
                break

read_data('./', "ffmpeg_after_fix_extractor_0.pickle.gz")

raised error: _pickle.UnpicklingError: could not find MARK

python version == 3.7.9

thanks!

Missing dataset.

Hi.
I have extracted the dataset from the pickle files and found that 679 functions are missing from the whole dataset.
Please find attached a list of all the missing ids.
Is this a bug in the dataset, or am I doing something wrong? I'm reading the pickle files the same way as the example in this repository.

Thanks.

missing.txt

How to get Infer traces for Function leaderboard dataset?

I want to get the Infer traces for the Function dataset from the leaderboard, but according to issue #8, there is no uniform ID between the output of split_data.py and the leaderboard datasets. How can I get the Infer traces for the examples in the Function dataset? It is fine if I can only get the traces for the train and dev sets.

I included a screenshot showing the code and trace side by side for one example that I could match to an Infer trace by function name. For each example in the Function dataset, I want to get the "trace" key on the right, which details each step of the trace output by Infer.

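A minimal sketch of this name-based matching heuristic, assuming each step of a sample's "trace" carries a function name (the per-step field name func_name is a guess, not the dataset's documented schema):

# Hypothetical sketch: index D2A samples by the function names appearing
# in their Infer traces, so leaderboard functions can be looked up by name.
# The per-step field name "func_name" is an assumption.
def index_traces_by_function(d2a_samples):
    index = {}
    for sample in d2a_samples:
        for step in sample.get("trace", []):
            name = step.get("func_name")
            if name:
                index.setdefault(name, []).append(sample["id"])
    return index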
