THIS IS A DEPRECATED REPOSITORY. PLEASE REFER TO THIS LINK.
An adversary for code2vec - a neural network for learning distributed representations of code. This is an official implementation of the model described in:
Noam Yefet, Uri Alon and Eran Yahav, "Adversarial Examples for Models of Code", 2019 https://arxiv.org/abs/1910.07517
This is a TensorFlow implementation, designed to be easy to use and useful for research, and for experimenting with new attack ideas in machine learning on code tasks. Contributions are welcome.
On Ubuntu:
- Python3. To check if you have it:
python3 --version
- TensorFlow, version 1.13 or newer (see TensorFlow's installation instructions). To check your TensorFlow version:
python3 -c 'import tensorflow as tf; print(tf.__version__)'
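If you prefer to check the requirement programmatically, here is a minimal sketch that compares a version string against the 1.13 minimum (the helper name is ours, not part of the repository):

```python
# Minimal sketch: compare a TensorFlow version string (e.g. the value of
# tf.__version__) against the 1.13 minimum, without importing TensorFlow.
def meets_minimum(version, minimum=(1, 13)):
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= minimum

print(meets_minimum("1.13.1"))  # True
print(meets_minimum("1.12.0"))  # False
```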
git clone https://github.com/noamyft/code2vec.git
cd code2vec
To have a preprocessed dataset on which to attack the network, you can either download our preprocessed dataset or create a new dataset of your own.
We provide a preprocessed dataset (based on Uri Alon's Java-large dataset).
First, download the preprocessed dataset below into the directory created earlier:
Then extract it:
tar -xvzf java_large_adversarial_data.tar.gz
This will create a directory named "data" with all the relevant data for the model and the adversary.
We provide a trained code2vec model that was trained on the Java-large dataset (thanks to Uri Alon). Trainable model (3.5 GB):
wget https://code2vec.s3.amazonaws.com/model/java-large-model.tar.gz
tar -xvzf java-large-model.tar.gz
You can also train your own model; see Code2Vec.
Once you have downloaded the preprocessed dataset and the pretrained model, you can run the adversary on the model:
- For the VarName attack:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial.test.c2v --test_adversarial --adversarial_type targeted --adversarial_target add
- For the DeadCode attack:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial_with_deadcode.test.c2v --test_adversarial --adversarial_type nontargeted --adversarial_deadcode --adversarial_target "merge|from"
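The DeadCode attack perturbs a program by inserting unused code rather than renaming existing variables. Here is a minimal sketch of that idea; the helper name insert_dead_code is ours for illustration, not the repository's API:

```python
# Toy sketch of the dead-code idea: insert an unused variable declaration
# right after a Java method's opening brace. The method's behavior is
# unchanged, but the model sees new tokens and paths.
def insert_dead_code(method_src, var_name):
    open_brace = method_src.index("{")
    dead_stmt = " int " + var_name + " = 0;"
    return method_src[:open_brace + 1] + dead_stmt + method_src[open_brace + 1:]

original = "void f(int[] a) { java.util.Arrays.sort(a); }"
print(insert_dead_code(original, "introsorter"))
```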
Where:
- --load - the path to the pretrained model.
- --load_dict - the path to the preprocessed dictionary.
- --adversarial_deadcode - use the DeadCode attack (note: you should also specify the path to the dead-code dataset).
- --adversarial_type - targeted/nontargeted.
- --adversarial_target - the desired target(s), for the "targeted" type; names separated by '|' (e.g. "merge|from").
You can also set the depth and width of the BFS search via the --adversarial_depth and --adversarial_topk parameters, respectively (both default to 2).
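The attack can be pictured as a breadth-limited search over variable renamings, where the depth bounds how many renaming steps are taken and top-k bounds how many candidates survive each level. Below is a toy sketch under that reading, with a dummy scoring function standing in for the model's probability of the target label; none of these names come from the repository's code:

```python
def score(renaming):
    # Dummy stand-in for the model's score of the target label:
    # reward renamings that introduce "sort"-like names.
    return sum(1 for v in renaming.values() if "sort" in v)

def attack(variables, candidates, depth=2, topk=2):
    # Start from the identity renaming; at each BFS level, try renaming
    # one variable and keep only the top-k scoring renamings (the width).
    frontier = [{v: v for v in variables}]
    for _ in range(depth):
        expanded = []
        for ren in frontier:
            for var in variables:
                for cand in candidates:
                    new = dict(ren)
                    new[var] = cand
                    expanded.append(new)
        expanded.sort(key=score, reverse=True)
        frontier = expanded[:topk]
    return frontier[0]

best = attack(["a", "b"], ["introsorter", "tmp"])
print(best)  # both variables end up renamed toward the target
```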
You can run the examples we provided in the paper on code2vec's online demo, available at https://code2vec.org/.
- You can copy & paste the sort example from here.
- You can type the following line into each example to get a prediction of "sort":
int introsorter = 0;
You can run the Outlier Detection defense by adding --guard_input with a threshold value to either:
- a regular evaluation, e.g.:
python3 code2vec.py --load models/java-large/saved_model_iter3 --test data/java_large_adversarial/java_large_adversarial.test.c2v --guard_input 2.7
- an adversarial evaluation, e.g.:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial.test.c2v --test_adversarial --adversarial_type targeted --adversarial_target add --guard_input 2.7
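As a rough sketch of the outlier-detection idea: flag input tokens whose embedding lies unusually far from the others, using the --guard_input value (2.7 above) as the distance threshold. The helper name and the exact statistic here are our assumptions, not the repository's implementation:

```python
import math

def guard_input(token_vectors, threshold):
    # Distance of each token's embedding from the mean embedding;
    # tokens beyond the threshold are flagged as suspected adversarial.
    dim = len(token_vectors[0])
    n = len(token_vectors)
    center = [sum(v[i] for v in token_vectors) / n for i in range(dim)]
    flagged = []
    for v in token_vectors:
        dist = math.sqrt(sum((v[i] - center[i]) ** 2 for i in range(dim)))
        flagged.append(dist > threshold)
    return flagged

vecs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
print(guard_input(vecs, 2.7))  # only the far-away third vector is flagged
```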
You can change hyper-parameters by editing the file config.py. Here are some of the parameters and their descriptions:
- The vocabulary size of the adversary.
- The batch size for the adversary's gradient steps.
- The batch size during evaluation. Affects only evaluation speed and memory consumption, not the results.
- The batch size for reading text lines into the queue that feeds examples to the network during training.
- The number of threads enqueuing examples.
- The maximum number of elements in the feeding queue.
- The number of contexts in a single example, as created during preprocessing.
- The number of contexts to use in each example.
- The maximum size of the token vocabulary.
- The maximum size of the target-word vocabulary.
- The maximum size of the path vocabulary.
- The embedding size for tokens and paths.
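As a rough picture only, settings of this kind might look like the following inside config.py. All attribute names and values below are assumptions for illustration; consult the actual file for the real ones:

```python
# Hypothetical sketch of config.py-style hyper-parameters.
# Every name and default here is assumed, not taken from the repository.
class Config:
    ADVERSARIAL_VOCAB_SIZE = 20000  # vocabulary size of the adversary (assumed)
    ADVERSARIAL_BATCH_SIZE = 256    # batch size for the adversary's gradient steps (assumed)
    TEST_BATCH_SIZE = 256           # evaluation batch size; affects speed/memory only
    READING_BATCH_SIZE = 1300 * 4   # text lines read per enqueue during training
    NUM_BATCHING_THREADS = 2        # threads enqueuing examples
    BATCH_QUEUE_SIZE = 300000       # max elements in the feeding queue
    MAX_CONTEXTS = 200              # contexts per example, fixed at preprocessing
    WORDS_VOCAB_SIZE = 1301136      # max token vocabulary size
    TARGET_VOCAB_SIZE = 261245      # max target-word vocabulary size
    PATHS_VOCAB_SIZE = 911417       # max path vocabulary size
    EMBEDDINGS_SIZE = 128           # embedding size for tokens and paths
```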