This project forked from tech-srl/code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"

Home Page: https://arxiv.org/abs/1803.09473

License: MIT License

Java 13.58% Python 82.67% Shell 3.74%

THIS IS A DEPRECATED REPOSITORY. PLEASE REFER TO THIS LINK.

Adversarial Examples for Models of Code - Code2vec

An adversary for code2vec, a neural network for learning distributed representations of code. This is an official implementation of the model described in:

Noam Yefet, Uri Alon and Eran Yahav, "Adversarial Examples for Models of Code", 2019 https://arxiv.org/abs/1910.07517

This is a TensorFlow implementation, designed to be easy to use in research and for experimenting with new ideas for attacks on machine learning models of code. Contributions are welcome.

Requirements

On Ubuntu:

  • Python3. To check if you have it:

python3 --version

  • TensorFlow - version 1.13 or newer (install). To check TensorFlow version:

python3 -c 'import tensorflow as tf; print(tf.__version__)'
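
Both checks can also be combined in one script. The following is a minimal sketch: the helper name `version_at_least` is hypothetical, and it assumes dotted version strings such as `1.13.1`, comparing only the major and minor components.

```python
import re
import sys


def version_at_least(version, minimum):
    """Return True if a dotted version string is >= the given minimum tuple."""
    # Keep only the leading numeric components, e.g. "1.13.1" -> [1, 13].
    parts = [int(p) for p in re.findall(r"\d+", version)[: len(minimum)]]
    # Pad short versions such as "2" so both sides have the same length.
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)


if __name__ == "__main__":
    print("Python 3:", sys.version_info.major == 3)
    try:
        import tensorflow as tf
        print("TensorFlow >= 1.13:", version_at_least(tf.__version__, (1, 13)))
    except ImportError:
        print("TensorFlow is not installed")
```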

Quickstart

Step 0: Cloning this repository

git clone https://github.com/noamyft/code2vec.git
cd code2vec

Step 1: Creating a new dataset from Java sources

In order to have a preprocessed dataset to attack the network on, you can either download our preprocessed dataset, or create a new dataset of your own.

Download our preprocessed dataset (compressed: 200 MB, extracted: 1 GB)

We provide a preprocessed dataset (based on Uri Alon's Java-large dataset).

First, download the preprocessed dataset into the directory created earlier. Then extract it:

tar -xvzf java_large_adversarial_data.tar.gz

This will create a directory named "data" with all the relevant data for the model and the adversary.
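
The extraction step can also be done from Python with the standard `tarfile` module. This is a minimal sketch, assuming the archive has already been downloaded into the current directory; the helper name `extract_archive` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    # Report the top-level names so the caller can confirm "data" was created.
    return sorted({Path(name).parts[0] for name in names if Path(name).parts})


if __name__ == "__main__":
    archive = Path("java_large_adversarial_data.tar.gz")
    if archive.exists():
        print(extract_archive(archive))
    else:
        print(f"{archive} not found; download it first")
```

Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.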

Step 2: Downloading a trained model

We provide a trained code2vec model that was trained on the Java-large dataset (thanks to Uri Alon). Trainable model (3.5 GB):

wget https://code2vec.s3.amazonaws.com/model/java-large-model.tar.gz
tar -xvzf java-large-model.tar.gz

You can also train your own model; see Code2Vec.

Step 3: Run adversary on the trained model

Once you have downloaded the preprocessed dataset and the pretrained model, you can run the adversary on the model by running:

  • For the Varname attack:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial.test.c2v --test_adversarial --adversarial_type targeted --adversarial_target add
  • For the Deadcode attack (the target list is quoted so that the shell does not treat '|' as a pipe):
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial_with_deadcode.test.c2v --test_adversarial --adversarial_type nontargeted --adversarial_deadcode --adversarial_target 'merge|from'

Where:

  • --load - the path to the pretrained model.
  • --load_dict - the path to the preprocessed dictionary.
  • --adversarial_deadcode - use the Deadcode attack (note: you should also specify the path to the deadcode dataset).
  • --adversarial_type - targeted or nontargeted.
  • --adversarial_target - the desired target (for the "targeted" type). Names are separated by '|' (e.g. "merge|from").

You can also set the depth and width of the BFS search with the --adversarial_depth and --adversarial_topk parameters, respectively (both are 2 by default).
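
When launching runs from a script, the command can be assembled as an argument list, which avoids shell quoting problems with the '|' separator. The sketch below uses only the flags described above; the function name `build_adversary_command` and its parameter names are hypothetical and not part of this repository.

```python
import subprocess


def build_adversary_command(model, dictionary, test_file, attack_type,
                            targets=(), deadcode=False, depth=None, topk=None):
    """Assemble the argument list for an adversarial evaluation run."""
    if attack_type not in ("targeted", "nontargeted"):
        raise ValueError("attack_type must be 'targeted' or 'nontargeted'")
    cmd = ["python3", "code2vec.py",
           "--load", model,
           "--load_dict", dictionary,
           "--test", test_file,
           "--test_adversarial",
           "--adversarial_type", attack_type]
    if deadcode:
        cmd.append("--adversarial_deadcode")
    if targets:
        # Names are joined with '|'; as a list element this needs no quoting.
        cmd += ["--adversarial_target", "|".join(targets)]
    if depth is not None:
        cmd += ["--adversarial_depth", str(depth)]
    if topk is not None:
        cmd += ["--adversarial_topk", str(topk)]
    return cmd


if __name__ == "__main__":
    cmd = build_adversary_command(
        model="models/java-large/saved_model_iter3",
        dictionary="data/java_large_adversarial/java-large",
        test_file="data/java_large_adversarial/java_large_adversarial.test.c2v",
        attack_type="targeted",
        targets=["add"],
    )
    print(" ".join(cmd))
    # Uncomment to launch the run from the repository root:
    # subprocess.run(cmd, check=True)
```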

Manually examine adversarial examples

You can run the examples from the paper on the code2vec online demo, available at https://code2vec.org/.

  • You can copy and paste the sort example from here.

  • You can type the following code into each example to get the prediction sort:

int introsorter = 0;
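
To try this on several methods without editing each by hand, the declaration can be inserted programmatically. This is a naive sketch: the helper name `add_unused_declaration` is hypothetical, and placing the declaration right after the first '{' is an assumption (it would misfire on code whose first brace is not the method body, such as an annotation with an array argument).

```python
def add_unused_declaration(java_method, declaration="int introsorter = 0;"):
    """Insert a declaration right after the first '{' of a Java method."""
    brace = java_method.find("{")
    if brace == -1:
        raise ValueError("no method body found")
    return java_method[: brace + 1] + " " + declaration + java_method[brace + 1:]


if __name__ == "__main__":
    print(add_unused_declaration("int f(int n) { return n; }"))
```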

Defense

You can run the Outlier Detection defense by adding the --guard_input flag with a threshold to either:

  • regular evaluation, e.g.:
python3 code2vec.py --load models/java-large/saved_model_iter3 --test data/java_large_adversarial/java_large_adversarial.test.c2v --guard_input 2.7
  • adversarial evaluation, e.g.:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial.test.c2v --test_adversarial --adversarial_type targeted --adversarial_target add --guard_input 2.7
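
To compare several thresholds, the same command can be generated once per value. A minimal sketch follows; the helper name `with_guard` is hypothetical, and apart from 2.7 the threshold values are illustrative only, not recommendations.

```python
def with_guard(cmd, threshold):
    """Return a copy of an argument list with --guard_input appended."""
    if "--guard_input" in cmd:
        raise ValueError("--guard_input is already set")
    return list(cmd) + ["--guard_input", str(threshold)]


base = ["python3", "code2vec.py",
        "--load", "models/java-large/saved_model_iter3",
        "--test", "data/java_large_adversarial/java_large_adversarial.test.c2v"]

# 2.7 is the threshold from the commands above; 2.0 and 3.5 are placeholders.
for threshold in (2.0, 2.7, 3.5):
    print(" ".join(with_guard(base, threshold)))
```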

Configuration

Changing hyper-parameters is possible by editing the file config.py. Here are some of the parameters and their description:

config.MAX_WORDS_FROM_VOCAB_FOR_ADVERSARIAL = 100000

The vocabulary size of the adversary.

config.ADVERSARIAL_MINI_BATCH_SIZE = 256

The batch size for the adversary's gradient steps.

config.TEST_BATCH_SIZE = config.BATCH_SIZE = 1024

Batch size during evaluation. Affects only evaluation speed and memory consumption, not the results.

config.READING_BATCH_SIZE = 1300 * 4

The batch size of reading text lines to the queue that feeds examples to the network during training.

config.NUM_BATCHING_THREADS = 2

The number of threads enqueuing examples.

config.BATCH_QUEUE_SIZE = 300000

Max number of elements in the feeding queue.

config.DATA_NUM_CONTEXTS = 200

The number of contexts in a single example, as was created in preprocessing.

config.MAX_CONTEXTS = 200

The number of contexts to use in each example.

config.WORDS_VOCAB_SIZE = 1301136

The max size of the token vocabulary.

config.TARGET_VOCAB_SIZE = 261245

The max size of the target words vocabulary.

config.PATHS_VOCAB_SIZE = 911417

The max size of the path vocabulary.

config.EMBEDDINGS_SIZE = 128

Embedding size for tokens and paths.
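
The vocabulary and embedding sizes together determine how large the embedding tables can get. The arithmetic below is a rough sketch: it assumes each entry is stored as a 4-byte float32 (not stated in this README), treats the vocabulary sizes as upper bounds, and leaves out the target vocabulary because its embedding width is not given here.

```python
WORDS_VOCAB_SIZE = 1301136
PATHS_VOCAB_SIZE = 911417
EMBEDDINGS_SIZE = 128
BYTES_PER_VALUE = 4  # assumption: float32 storage


def embedding_params(vocab_size, embedding_size=EMBEDDINGS_SIZE):
    """Number of values in a vocab_size x embedding_size embedding table."""
    return vocab_size * embedding_size


def to_mib(num_params, bytes_per_value=BYTES_PER_VALUE):
    """Convert a parameter count to mebibytes."""
    return num_params * bytes_per_value / 2 ** 20


if __name__ == "__main__":
    for name, size in (("tokens", WORDS_VOCAB_SIZE), ("paths", PATHS_VOCAB_SIZE)):
        params = embedding_params(size)
        print(f"{name}: {params} parameters, about {to_mib(params):.0f} MiB")
```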
