SAFE: Self-Attentive Function Embeddings for binary similarity

SAFE : Self Attentive Function Embedding

Paper

This software is the outcome of our academic research. See our arXiv paper: arxiv

If you use this code, please cite our academic paper as:

@inproceedings{massarelli2018safe,
  title={SAFE: Self-Attentive Function Embeddings for Binary Similarity},
  author={Massarelli, Luca and Di Luna, Giuseppe Antonio and Petroni, Fabio and Querzoni, Leonardo and Baldoni, Roberto},
  booktitle={Proceedings of 16th Conference on Detection of Intrusions and Malware \& Vulnerability Assessment (DIMVA)},
  year={2019}
}

What you need

You need radare2 installed on your system.

Quickstart

To create the embedding of a function:

git clone https://github.com/gadiluna/SAFE.git
cd SAFE
pip install -r requirements
chmod +x download_model.sh
./download_model.sh
python safe.py -m data/safe.pb -i helloworld.o -a 100000F30

What to do with an embedding?

Once you have two embeddings, embedding_x and embedding_y, you can compute the similarity of the corresponding functions as:

from sklearn.metrics.pairwise import cosine_similarity

sim=cosine_similarity(embedding_x, embedding_y)
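
Note that scikit-learn's cosine_similarity expects 2-D inputs of shape (n_samples, n_features), so a single flat embedding has to be reshaped to (1, -1) first. If you prefer to avoid the dependency, the same quantity can be computed by hand. The sketch below assumes each embedding is a flat sequence of floats; the helper name cosine and the toy vectors are illustrative and not part of SAFE.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real SAFE embeddings (assumption).
embedding_x = [1.0, 2.0, 3.0]
embedding_y = [2.0, 4.0, 6.0]
sim = cosine(embedding_x, embedding_y)
```

A value close to 1 means the two functions are considered similar; a value close to 0 means they are unrelated.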
 

Data Needed

SAFE needs only a few pieces of data to work. Two are essential: a model that tells SAFE how to convert assembly instructions into vectors (the i2v model) and a model that tells SAFE how to convert a binary function into a vector. Both models can be downloaded with the command:

./download_model.sh

The downloader fetches the models and places them in the data directory. After the download, the directory tree should look like this:

safe/-- githubcode
     \
      \--data/-----safe.pb
               \
                \---i2v/
            

The safe.pb file contains the SAFE model used to convert binary functions to vectors. The i2v folder contains the i2v model.

Hardcore Details

This section contains the details needed to replicate our experiments; if you are a user of SAFE, you can skip it.

Safe.pb

This is the frozen, trained TensorFlow model for the AMD64 architecture. You can import it into your project using:

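import tensorflow as tf  # TensorFlow 1.x API (tf.gfile, tf.GraphDef, tf.Session)

# Read the serialized frozen graph from disk.
with tf.gfile.GFile("safe.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# Import the graph definition into a fresh graph.
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def)

# Open a session bound to the imported graph.
sess = tf.Session(graph=graph)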

See the file neural_network/SAFEEmbedder.py.

i2v

The i2v folder contains two files: a matrix in which each row is the embedding of an asm instruction, and a JSON file containing a dictionary that maps asm instructions to row numbers of that matrix. See the file asm_embedding/InstructionsConverter.py.
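
The lookup described above can be sketched as follows. This is a minimal illustration, not the code in InstructionsConverter.py: the instruction strings, the toy matrix, and the choice of row 0 for unknown instructions are all assumptions made for the example.

```python
import json

def instructions_to_rows(instructions, word2id, unknown_id=0):
    """Map each asm instruction string to its row number in the matrix.

    Instructions missing from the dictionary fall back to unknown_id
    (an assumption of this sketch).
    """
    return [word2id.get(ins, unknown_id) for ins in instructions]

def rows_to_vectors(row_ids, matrix):
    """Fetch the embedding (matrix row) for each row number."""
    return [matrix[i] for i in row_ids]

# Toy stand-ins for the JSON dictionary and the embedding matrix.
word2id = json.loads('{"push rbp": 1, "mov rbp, rsp": 2}')
matrix = [
    [0.0, 0.0],  # row 0: reserved here for unknown instructions
    [0.1, 0.2],  # row 1: "push rbp"
    [0.3, 0.4],  # row 2: "mov rbp, rsp"
]

row_ids = instructions_to_rows(["push rbp", "ret", "mov rbp, rsp"], word2id)
vectors = rows_to_vectors(row_ids, matrix)
```

With the real files, word2id would come from json.load on the JSON file and matrix from the matrix file in the i2v folder.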

Train the model

If you want to train the model using our datasets, first run:

 python3 downloader.py -td

This will download the datasets into the data folder. Note that the datasets are compressed, so you have to decompress them yourself. The decompressed data are SQLite databases. To start training, use neural_network/train.sh. The database can be selected by changing the parameter in train.sh. For information on the datasets, see our paper.
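
Since the datasets are SQLite databases, it can be useful to check which tables a decompressed file contains before pointing train.sh at it. The following is a small sketch using only the Python standard library; the helper name list_tables and the demo table are illustrative assumptions.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in an open SQLite connection, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Demo on a throwaway in-memory database (not a real SAFE dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (id INTEGER PRIMARY KEY)")
tables = list_tables(conn)
conn.close()
```

For a downloaded dataset, open the connection with sqlite3.connect on the path of the decompressed file instead of ":memory:".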

Create your own dataset

If you want to create your own dataset, you can use the script ExperimentUtil in the folder dataset creation.

Create a functions knowledge base

If you want to use the SAFE binary code search engine, you can use the script ExperimentUtil to create the knowledge base. Then you can search through it using the script in function_search.
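
Conceptually, searching a knowledge base of embeddings is a nearest-neighbour lookup by cosine similarity. The sketch below shows that idea on an in-memory dictionary; it is not how the scripts in function_search are implemented, and the names top_k and knowledge_base and the toy vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, knowledge_base, k=3):
    """Return the k (name, similarity) pairs most similar to the query."""
    scored = [(name, cosine(query, emb)) for name, emb in knowledge_base.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy knowledge base: function name -> embedding (illustrative values).
knowledge_base = {
    "func_a": [1.0, 0.0],
    "func_b": [0.0, 1.0],
    "func_c": [0.9, 0.1],
}
results = top_k([1.0, 0.0], knowledge_base, k=2)
```

A linear scan like this is fine for small collections; larger knowledge bases usually call for an indexed similarity search.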

Thanks

In our code we use godown to download data from Google Drive. We thank circulosmeos, the creator of godown.

We thank Davide Italiano for the useful discussions.

safe's Issues

Function not found

I compiled hello world using gcc on my 64-bit host machine and used the objdump utility to retrieve the address of the main function that I would like to embed.

I typed the following commands:

python3.7 safe.py -m data/safe_trained_X86.pb -i a.out -a 0000000000400526
python3.7 safe.py -m data/safe_trained_X86.pb -i a.out -a 400526

and set the base to 16, 32, 36 ... but I always got the same error: "function not found".

About the word2id.json file

Hi, I would like to know how to generate the word2id.json file. This is causing me significant confusion, and I hope to receive your answer. Thanks!

tensorflow version question

I tried several different versions, but each one reports different errors.
So I would like to ask which TensorFlow version is officially used.

Labels in OpenSSL dataset

Hi,

We couldn't find any field corresponding to labels in the OpenSSL dataset (as mentioned by you in your publication). Can you please explain how to extract/compute them? We were looking at the AMD64MultipleCompilers.db dataset.

We are planning to perform supervised learning along with similarity measures (similar to what you did in the paper).

My colleague has mailed Luca Massarelli regarding the same.

Thanks in advance!

No count_func table in the AMD64PostgreSQL.db

When I wanted to do a search evaluation, I ran python EvaluateSearchEngine.py and got the following error:

File "EvaluateSearchEngine.py", line 54, in find_target_fcn
q = cur.execute("SELECT num FROM count_func WHERE file_name='{}' and function_name='{}'".format(fi_name,f_name))
sqlite3.OperationalError: no such table: count_func

Then I opened AMD64PostgreSQL.db and checked it. There is no count_func table in the database, only the following tables:

sqlite> .tables
filtered_functions safe_embeddings train             
functions test validation
