
This project forked from daniellin1986/transferrepresentationlearning


Cross-Project Transfer Representation Learning for Vulnerable Function Discovery


Transferable Representation Learning

Hi there, welcome to this page!

This page contains the code and data used in the paper Cross-Project Transfer Representation Learning for Vulnerable Function Discovery by Guanjun Lin; Jun Zhang; Wei Luo; Lei Pan; Yang Xiang; Olivier De Vel and Paul Montague.

Instructions:

The Vulnerabilities_info.xlsx file contains information about the collected function-level vulnerabilities (it is just a record for reference). These vulnerabilities are from 6 open source projects: FFmpeg, LibTIFF, LibPNG, Pidgin, Asterisk and VLC Media Player. The vulnerability information was collected from the National Vulnerability Database (NVD) until the end of July 2017.

Requirements for code:

The dependencies can be installed using Anaconda. For example:

$ bash Anaconda3-5.0.1-Linux-x86_64.sh

The "Data" folder contains the following sub folders:

  1. VulnerabilityData -- It contains a ZIP file which stores the vulnerable functions and part of the non-vulnerable functions from 6 open source projects. After unzipping the file, one will find 6 folders named after the projects. Each folder contains the source code of the non-vulnerable functions (named with their function names) and vulnerable functions (named with the CVE IDs):
    • The vulnerable functions are all named with the CVE IDs (their names start with ‘cve-’ or ‘CVE-’). For example, “cve-2017-14005.c” is a vulnerable function.
    • The non-vulnerable functions are named with the format “xxxx_file_name_function_name.c” to avoid duplicated file/function names. For example, “1374_cmdutils.c_show_devices.c” is a non-vulnerable function.

In the pre-training phase, one can choose any 5 projects as the historical data for training an LSTM network. The labels can be generated from the file names, since vulnerable functions have the CVE IDs as their file names (please see the code for more details). Then, the remaining project can be used as the input to the pre-trained network for generating representations. Finally, the generated representations can be used as features for training a classifier.
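The labeling rule and the 5-versus-1 project split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the repository's own code: the helper names `label_from_filename` and `split_by_project` are hypothetical, and the project names and the vulnerable file name in the usage part are placeholders.

```python
def label_from_filename(file_name):
    """Return 1 for a vulnerable function file, 0 otherwise.

    Follows the naming rule in this README: vulnerable functions have
    file names starting with 'cve-' or 'CVE-'.
    """
    return 1 if file_name.lower().startswith("cve-") else 0


def split_by_project(samples, target_project):
    """Split (project, file_name) pairs into historical and target sets.

    Every record gets its label attached. Samples from target_project form
    the target set; samples from all other projects form the historical set.
    """
    historical, target = [], []
    for project, file_name in samples:
        record = (project, file_name, label_from_filename(file_name))
        if project == target_project:
            target.append(record)
        else:
            historical.append(record)
    return historical, target


# Usage with placeholder data ("ProjectA"/"ProjectB" and the
# "CVE-XXXX-YYYY.c" file name are made up for illustration).
samples = [
    ("ProjectA", "cve-2017-14005.c"),
    ("ProjectA", "1374_cmdutils.c_show_devices.c"),
    ("ProjectB", "CVE-XXXX-YYYY.c"),
]
historical, target = split_by_project(samples, "ProjectB")
```

Matching the prefix case-insensitively covers both the ‘cve-’ and ‘CVE-’ spellings mentioned above.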

  2. CodeMetrics -- It stores the code metrics extracted from the source code files of the open source projects. The code metrics are used as features to train a random forest classifier, which serves as the baseline for comparison with the method that uses transfer-learned representations as features. We used Understand, which is a commercial code enhancement tool, for extracting function-level code metrics. We included 23 code metrics extracted from the vulnerable functions of 6 projects.

  3. TrainedTokenizer -- It contains the trained tokenizer file which is used for converting the serialized AST lists to numeric tokens.

  4. TrainedWord2vecModel -- It includes the trained Word2vec model. The model was trained on the code base of 6 open source projects. The Word2vec model is used in the embedding layer of the LSTM network for converting input sequences to meaningful embeddings.
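To give a rough idea of what "converting the serialized AST lists to numeric tokens" means, here is a plain-Python stand-in. It is not the tokenizer stored in TrainedTokenizer: the functions `fit_vocabulary` and `encode`, the first-seen index order, the right-side zero padding, and the AST node names are all assumptions made for this sketch.

```python
def fit_vocabulary(sequences):
    """Assign each distinct token an integer index in first-seen order.

    Index 0 is reserved for padding, so real tokens start at 1.
    """
    vocab = {}
    for sequence in sequences:
        for token in sequence:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(sequence, vocab, max_len):
    """Convert tokens to indices and return a list of exactly max_len items.

    Tokens missing from the vocabulary are dropped. Longer sequences are
    truncated and shorter ones are padded on the right with 0.
    """
    ids = [vocab[token] for token in sequence if token in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


# Made-up serialized ASTs, one token list per function.
training_asts = [
    ["FunctionDecl", "int", "CompoundStmt"],
    ["FunctionDecl", "char", "ReturnStmt"],
]
vocab = fit_vocabulary(training_asts)
encoded = encode(["FunctionDecl", "char", "UnknownNode"], vocab, max_len=5)
```

The fixed-length integer sequences are the kind of input an embedding layer expects, which is where the Word2vec model described above comes in.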

The "Code" folder contains the Python code samples.

  1. The TransferableRepresentationLearning_LSTM_DNN.py file is for LSTM network training. It defines the structure of the Bi-LSTM network used in the paper. Its input is the labeled historical vulnerable functions. Its output is a trained LSTM network capable of generating vulnerable function representations.

  2. The ExtractLearnedFeaturesAndClassification.py file is for obtaining the function representations from the pre-trained LSTM network. It also includes the code for training a random forest classifier that uses the obtained function representations as features.

  3. The CodeMetrics.py file trains a random forest classifier based on the selected 23 code metrics.

If you are interested in our project, please contact [email protected] for more information. If you use our code and data in your work, please kindly cite our paper.

The BibTeX entry:

@article{lin2018cross,
  title={Cross-Project Transfer Representation Learning for Vulnerable Function Discovery},
  author={Lin, Guanjun and Zhang, Jun and Luo, Wei and Pan, Lei and Xiang, Yang and De Vel, Olivier and Montague, Paul},
  journal={IEEE Transactions on Industrial Informatics},
  year={2018},
  publisher={IEEE}
}

Thank you!
