This project forked from vul-lmgnn/vul-lmggnn

Code for the paper - Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graph

License: Apache License 2.0

Python 93.98% Scala 6.02%

Vul-LMGGNN

Introduction

In this work, we propose Vul-LMGNN, a unified model that combines pre-trained code language models with code property graphs for code vulnerability detection. Vul-LMGNN first constructs a code property graph, then uses a pre-trained code language model to extract local semantic features that serve as node embeddings in that graph. On top of these embeddings, we introduce a gated code Graph Neural Network (GNN). By jointly training the code language model and the gated code GNN modules, Vul-LMGNN efficiently leverages the strengths of both mechanisms. Finally, we use a pre-trained CodeBERT as an auxiliary classifier. The proposed method outperformed six state-of-the-art approaches.
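
As a rough illustration of the last step, the sketch below blends the class probabilities of a graph model and an auxiliary classifier. It assumes the two outputs are merged by linear interpolation with a weight `lam`; the interpolation rule, the helper names, the default `lam = 0.5`, and the label convention (0 = non-vulnerable, 1 = vulnerable) are all illustrative assumptions and are not confirmed by this README.

```python
from typing import List


def interpolate_predictions(gnn_probs: List[float],
                            lm_probs: List[float],
                            lam: float = 0.5) -> List[float]:
    """Blend two probability vectors as lam * gnn + (1 - lam) * lm.

    Hypothetical helper: the blending rule and `lam` are assumptions.
    """
    if len(gnn_probs) != len(lm_probs):
        raise ValueError("probability vectors must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * g + (1.0 - lam) * c for g, c in zip(gnn_probs, lm_probs)]


def predict_label(probs: List[float]) -> int:
    """Return the index of the most probable class (assumed: 1 = vulnerable)."""
    return max(range(len(probs)), key=probs.__getitem__)
```

With `lam = 1.0` only the graph model's output is used, and with `lam = 0.0` only the auxiliary classifier's output is used, so the weight can be tuned on a validation split.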

Getting Started

Create environment and install required packages for LMGGNN

Install packages

The experiments were executed on a single NVIDIA A100 80GB GPU. The system used NVIDIA driver version 525.85.12 and CUDA version 11.8.

Dataset

We evaluated the performance of our model using four publicly available datasets. The composition of the datasets is as follows, and you can click on the dataset names to download them. Please note that you need to modify the code in the CPG_generator function in run.py to adapt to different dataset formats.

Dataset      #Vulnerable   #Non-Vulnerable   Source
DiverseVul   18,945        330,492           Snyk, Bugzilla
Devign       11,888        14,149            GitHub
VDSIC        82,411        1,191,955         GitHub, Debian
ReVeal       1,664         16,505            Chrome, Debian
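
Adapting the pipeline to another dataset mostly means mapping that dataset's field names onto one common schema before CPG generation. The sketch below shows the idea in isolation. The helpers `normalize_record` and `normalize_dataset`, the output keys `func` and `target`, and the example input keys `code` and `label` are hypothetical; they would need to match whatever `CPG_generator` in `run.py` actually expects.

```python
from typing import Any, Dict, Iterable, List


def normalize_record(record: Dict[str, Any],
                     code_key: str,
                     label_key: str) -> Dict[str, Any]:
    """Map one dataset row onto a common {"func": str, "target": int} schema.

    The output keys are an assumption made for this sketch.
    """
    return {"func": str(record[code_key]), "target": int(record[label_key])}


def normalize_dataset(records: Iterable[Dict[str, Any]],
                      code_key: str,
                      label_key: str) -> List[Dict[str, Any]]:
    """Normalize every row and skip rows whose source code is empty."""
    normalized = []
    for record in records:
        row = normalize_record(record, code_key, label_key)
        if row["func"].strip():
            normalized.append(row)
    return normalized
```

For a dataset that stores code under `code` and labels under `label`, the call would be `normalize_dataset(rows, code_key="code", label_key="label")`.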

Usage

Some tips:
  • Modifications to the configs.json structure should be updated in the configs.py script.
  • Joern processing may be slow or potentially freeze your OS, depending on your system’s specs. To prevent this, reduce the chunk size processed during the CPG_generation process by adjusting the "slice_size" value in the "create" section of the configs.json file.
  • Within the "slice_size" parameter, nodes exceeding the configured size limit will be filtered out and discarded.
  • Follow the instructions in Joern's documentation and install Joern's command-line tools under 'project'\joern\joern-cli\.
  • You can find the implementation code of the baselines mentioned in the paper in the baselines.zip, which consists of four Jupyter notebooks.
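
To lower the chunk size without editing the file by hand, a small helper like the one below can rewrite the value. It is a minimal sketch that relies only on what the tips state, namely that "slice_size" lives in the "create" section of configs.json; the function name `set_slice_size` is hypothetical, and the rest of the file is left untouched because its structure is not described here.

```python
import json
from pathlib import Path
from typing import Optional


def set_slice_size(config_path: str, slice_size: int) -> Optional[int]:
    """Set configs["create"]["slice_size"] and return the previous value.

    Hypothetical helper: only the "create" -> "slice_size" location
    is taken from the README; all other keys are preserved as-is.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be a positive integer")
    path = Path(config_path)
    configs = json.loads(path.read_text(encoding="utf-8"))
    create_section = configs.setdefault("create", {})
    previous = create_section.get("slice_size")
    create_section["slice_size"] = slice_size
    path.write_text(json.dumps(configs, indent=2), encoding="utf-8")
    return previous
```

Calling `set_slice_size("configs.json", 50)` before the CPG step would make each Joern run process a smaller chunk. Remember that structural changes to configs.json also need to be reflected in configs.py.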
Preparing the CPG:
python run.py -cpg -embed -mode train -path /your/model/path

-cpg uses Joern to extract the code's CPG, and -embed generates the corresponding embeddings. -path specifies the path for saving the model.

Training and Testing:
python run.py -mode test -path /your/model/saved/path

-mode specifies whether only training is run or both training and testing are run. -path specifies the path for saving the model.
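
For readers who want to see how the four flags fit together, here is a minimal argparse sketch. It is not the actual parser in run.py: the flag names come from the commands above, while the defaults, the help texts, and the restriction of -mode to "train" and "test" are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the command-line flags used above."""
    parser = argparse.ArgumentParser(description="Sketch of the run.py flags")
    # Boolean switches: present means the step is executed.
    parser.add_argument("-cpg", action="store_true",
                        help="extract code property graphs with Joern")
    parser.add_argument("-embed", action="store_true",
                        help="generate embeddings for the extracted graphs")
    # Assumed value set; the README only shows 'train' and 'test'.
    parser.add_argument("-mode", choices=["train", "test"], default="train",
                        help="which phase(s) to run")
    parser.add_argument("-path", default=None,
                        help="path for saving the model")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Parsing `-cpg -embed -mode train -path /your/model/path` with this sketch sets both switches to true, `mode` to "train", and `path` to the given string.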

Fine-tuning process:

This command fine-tunes CodeBERT on a specific dataset; the fine-tuned model is then used to generate node embeddings. Pre-trained CodeBERT weights need to be downloaded from here.

python fine-tune.py

Main Results

Here only the accuracy results are displayed; for other metrics, please refer to the paper.

Model           DiverseVul   VDSIC   Devign   ReVeal
BERT            91.99        79.41   60.58    86.88
CodeBERT        92.40        83.13   64.80    88.64
GraphCodeBERT   92.96        83.98   64.80    89.25
TextCNN         92.16        66.54   60.38    85.43
TextGCN         91.50        67.55   60.47    87.25
Devign          70.21        59.30   57.66    65.47
Ours            93.06        84.38   65.70    90.80
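
To read the table at a glance, the short script below computes, for each dataset, the strongest baseline and the accuracy margin of our model over it. The numbers are copied from the table above; the helper name `margin_over_best_baseline` is only illustrative.

```python
from typing import Dict, Tuple

# Accuracy (%) per baseline and dataset, copied from the results table.
BASELINES: Dict[str, Dict[str, float]] = {
    "BERT":          {"DiverseVul": 91.99, "VDSIC": 79.41, "Devign": 60.58, "ReVeal": 86.88},
    "CodeBERT":      {"DiverseVul": 92.40, "VDSIC": 83.13, "Devign": 64.80, "ReVeal": 88.64},
    "GraphCodeBERT": {"DiverseVul": 92.96, "VDSIC": 83.98, "Devign": 64.80, "ReVeal": 89.25},
    "TextCNN":       {"DiverseVul": 92.16, "VDSIC": 66.54, "Devign": 60.38, "ReVeal": 85.43},
    "TextGCN":       {"DiverseVul": 91.50, "VDSIC": 67.55, "Devign": 60.47, "ReVeal": 87.25},
    "Devign":        {"DiverseVul": 70.21, "VDSIC": 59.30, "Devign": 57.66, "ReVeal": 65.47},
}
OURS: Dict[str, float] = {"DiverseVul": 93.06, "VDSIC": 84.38, "Devign": 65.70, "ReVeal": 90.80}


def margin_over_best_baseline(dataset: str) -> Tuple[str, float]:
    """Return (best baseline name, our accuracy minus its accuracy)."""
    best = max(BASELINES, key=lambda name: BASELINES[name][dataset])
    return best, round(OURS[dataset] - BASELINES[best][dataset], 2)


if __name__ == "__main__":
    for dataset in OURS:
        name, margin = margin_over_best_baseline(dataset)
        print(f"{dataset}: +{margin:.2f} points over {name}")
```

On Devign, CodeBERT and GraphCodeBERT are tied at 64.80, so the script reports whichever of the two comes first in the dictionary.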

Acknowledgement

Parts of the code for data preprocessing and graph construction using Joern are adapted from Devign. We appreciate their excellent work!
