THIS IS A DEPRECATED REPOSITORY. PLEASE REFER TO THIS LINK.
An adversary for code2vec - a neural network for learning distributed representations of code. This is an official implementation of the model described in:
Noam Yefet, Uri Alon and Eran Yahav, "Adversarial Examples for Models of Code", 2019 https://arxiv.org/abs/1910.07517
This is a TensorFlow implementation, designed to be easy to use and useful for research, and for experimenting with new attack ideas in machine learning on code tasks. Contributions are welcome.
On Ubuntu:
- Python3. To check if you have it:
python3 --version
- TensorFlow, version 1.13 or newer (see TensorFlow's installation instructions). To check your TensorFlow version:
python3 -c 'import tensorflow as tf; print(tf.__version__)'
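If you prefer to check the requirement programmatically, here is a minimal sketch that compares a version string against the 1.13 minimum (the helper name is ours, not part of the repository):

```python
# Minimal sketch: compare a TensorFlow version string (e.g. the value of
# tf.__version__) against the 1.13 minimum, without importing TensorFlow.
def meets_minimum(version, minimum=(1, 13)):
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= minimum

print(meets_minimum("1.13.1"))  # True
print(meets_minimum("1.12.0"))  # False
```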
git clone https://github.com/noamyft/code2vec.git
cd code2vec
To have a preprocessed dataset on which to attack the network, you can either download our preprocessed dataset or create a new dataset of your own.
We provide a preprocessed dataset (based on Uri Alon's Java-large dataset).
First, download the preprocessed dataset below into the directory created earlier:
Then extract it:
tar -xvzf java_large_adversarial_data.tar.gz
This will create a directory named "data" with all the relevant data for the model and the adversary.
We provide a trained code2vec model that was trained on the Java-large dataset (thanks to Uri Alon). Trainable model (3.5 GB):
wget https://code2vec.s3.amazonaws.com/model/java-large-model.tar.gz
tar -xvzf java-large-model.tar.gz
You can also train your own model; see Code2Vec.
Once you have downloaded the preprocessed dataset and the pretrained model, you can run the adversary on the model:
- For the VarName attack:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial.test.c2v --test_adversarial --adversarial_type targeted --adversarial_target add
- For the DeadCode attack:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial_with_deadcode.test.c2v --test_adversarial --adversarial_type nontargeted --adversarial_deadcode --adversarial_target "merge|from"
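The DeadCode attack perturbs a program by inserting unused code rather than renaming existing variables. Here is a minimal sketch of that idea; the helper name insert_dead_code is ours for illustration, not the repository's API:

```python
# Toy sketch of the dead-code idea: insert an unused variable declaration
# right after a Java method's opening brace. The method's behavior is
# unchanged, but the model sees new tokens and paths.
def insert_dead_code(method_src, var_name):
    open_brace = method_src.index("{")
    dead_stmt = " int " + var_name + " = 0;"
    return method_src[:open_brace + 1] + dead_stmt + method_src[open_brace + 1:]

original = "void f(int[] a) { java.util.Arrays.sort(a); }"
print(insert_dead_code(original, "introsorter"))
```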
Where:
- --load - the path to the pretrained model.
- --load_dict - the path to the preprocessed dictionary.
- --adversarial_deadcode - use the DeadCode attack (note: you should also specify the path to the dead-code dataset).
- --adversarial_type - targeted/nontargeted.
- --adversarial_target - the desired target(s), for the "targeted" type; names separated by '|' (e.g. "merge|from").
You can also set the depth and width of the BFS search via the --adversarial_depth and --adversarial_topk parameters, respectively (both default to 2).
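The attack can be pictured as a breadth-limited search over variable renamings, where the depth bounds how many renaming steps are taken and top-k bounds how many candidates survive each level. Below is a toy sketch under that reading, with a dummy scoring function standing in for the model's probability of the target label; none of these names come from the repository's code:

```python
def score(renaming):
    # Dummy stand-in for the model's score of the target label:
    # reward renamings that introduce "sort"-like names.
    return sum(1 for v in renaming.values() if "sort" in v)

def attack(variables, candidates, depth=2, topk=2):
    # Start from the identity renaming; at each BFS level, try renaming
    # one variable and keep only the top-k scoring renamings (the width).
    frontier = [{v: v for v in variables}]
    for _ in range(depth):
        expanded = []
        for ren in frontier:
            for var in variables:
                for cand in candidates:
                    new = dict(ren)
                    new[var] = cand
                    expanded.append(new)
        expanded.sort(key=score, reverse=True)
        frontier = expanded[:topk]
    return frontier[0]

best = attack(["a", "b"], ["introsorter", "tmp"])
print(best)  # both variables end up renamed toward the target
```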
You can run the examples we provided in the paper on code2vec's online demo, available at https://code2vec.org/.
- You can copy & paste the sort example from here.
- You can type the following line into each example to get a prediction of "sort":
int introsorter = 0;
You can run the Outlier Detection defense by adding --guard_input with a threshold value to either:
- a regular evaluation, e.g.:
python3 code2vec.py --load models/java-large/saved_model_iter3 --test data/java_large_adversarial/java_large_adversarial.test.c2v --guard_input 2.7
- an adversarial evaluation, e.g.:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial.test.c2v --test_adversarial --adversarial_type targeted --adversarial_target add --guard_input 2.7
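As a rough sketch of the outlier-detection idea: flag input tokens whose embedding lies unusually far from the others, using the --guard_input value (2.7 above) as the distance threshold. The helper name and the exact statistic here are our assumptions, not the repository's implementation:

```python
import math

def guard_input(token_vectors, threshold):
    # Distance of each token's embedding from the mean embedding;
    # tokens beyond the threshold are flagged as suspected adversarial.
    dim = len(token_vectors[0])
    n = len(token_vectors)
    center = [sum(v[i] for v in token_vectors) / n for i in range(dim)]
    flagged = []
    for v in token_vectors:
        dist = math.sqrt(sum((v[i] - center[i]) ** 2 for i in range(dim)))
        flagged.append(dist > threshold)
    return flagged

vecs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
print(guard_input(vecs, 2.7))  # only the far-away third vector is flagged
```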
You can change hyper-parameters by editing the file config.py. Here are some of the parameters and their descriptions:
- The vocabulary size of the adversary.
- The batch size for the adversary's gradient steps.
- The batch size during evaluation. Affects only evaluation speed and memory consumption, not the results.
- The batch size for reading text lines into the queue that feeds examples to the network during training.
- The number of threads enqueuing examples.
- The maximum number of elements in the feeding queue.
- The number of contexts in a single example, as created during preprocessing.
- The number of contexts to use in each example.
- The maximum size of the token vocabulary.
- The maximum size of the target-word vocabulary.
- The maximum size of the path vocabulary.
- The embedding size for tokens and paths.
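As a rough picture only, settings of this kind might look like the following inside config.py. All attribute names and values below are assumptions for illustration; consult the actual file for the real ones:

```python
# Hypothetical sketch of config.py-style hyper-parameters.
# Every name and default here is assumed, not taken from the repository.
class Config:
    ADVERSARIAL_VOCAB_SIZE = 20000  # vocabulary size of the adversary (assumed)
    ADVERSARIAL_BATCH_SIZE = 256    # batch size for the adversary's gradient steps (assumed)
    TEST_BATCH_SIZE = 256           # evaluation batch size; affects speed/memory only
    READING_BATCH_SIZE = 1300 * 4   # text lines read per enqueue during training
    NUM_BATCHING_THREADS = 2        # threads enqueuing examples
    BATCH_QUEUE_SIZE = 300000       # max elements in the feeding queue
    MAX_CONTEXTS = 200              # contexts per example, fixed at preprocessing
    WORDS_VOCAB_SIZE = 1301136      # max token vocabulary size
    TARGET_VOCAB_SIZE = 261245      # max target-word vocabulary size
    PATHS_VOCAB_SIZE = 911417       # max path vocabulary size
    EMBEDDINGS_SIZE = 128           # embedding size for tokens and paths
```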