yueduan / deepbindiff
Official repository for DeepBinDiff
License: BSD 3-Clause "New" or "Revised" License
The README file offers zero information about how to actually use or evaluate this code base, nor does it provide any information on system/environment requirements.
I want to run DeepBinDiff and have read some of your code.
It seems DeepBinDiff leverages angr to extract basic block information and generate the CFG, instead of IDA?
Thanks for your outstanding work, but I encountered some problems with your source code.
In preprocessing.py:
def offsetStrMappingGen(cfg1, cfg2, binary1, binary2, mneList):
    # count type of constants for feature vector generation
    # offsetStrMapping[offset] = strRef.strip()
    offsetStrMapping = {}
    # lists that store all the non-binary functions in bin1 and 2
    externFuncNamesBin1 = []
    externFuncNamesBin2 = []
    for func in cfg1.functions.values():
        if func.binary_name == binary1:
            for offset, strRef in func.string_references(vex_only=True):
                offset = str(offset)
                # offset = str(hex(offset))[:-1]
                if offset not in offsetStrMapping:
                    offsetStrMapping[offset] = ''.join(strRef.split())
        elif func.binary_name not in externFuncNamesBin1:
            externFuncNamesBin1.append(func.name)
def externBlocksAndFuncsToBeMerged(cfg1, cfg2, nodelist1, nodelist2, binary1, binary2, nodeDic1, nodeDic2, externFuncNamesBin1, externFuncNamesBin2, string_bid1, string_bid2):
    # toBeMerged[node1_id] = node2_id
    toBeMergedBlocks = {}
    toBeMergedBlocksReverse = {}
    # toBeMergedFuncs[func1_addr] = func2_addr
    toBeMergedFuncs = {}
    toBeMergedFuncsReverse = {}
    externFuncNameBlockMappingBin1 = {}
    externFuncNameBlockMappingBin2 = {}
    funcNameAddrMappingBin1 = {}
    funcNameAddrMappingBin2 = {}
    for func in cfg1.functions.values():
        binName = func.binary_name
        funcName = func.name
        funcAddr = func.addr
        blockList = list(func.blocks)
        if (binName == binary1) and (func.name in externFuncNamesBin1) and (len(blockList) == 1):
            for node in nodelist1:
                if (node.block is not None) and (node.block.addr == blockList[0].addr):
                    externFuncNameBlockMappingBin1[funcName] = nodeDic1[node]
                    funcNameAddrMappingBin1[funcName] = funcAddr
The non-binary functions stored in externFuncNamesBin1 are those whose binary name is not binary1. So when execution reaches
if (binName == binary1) and (func.name in externFuncNamesBin1) and (len(blockList) == 1):
the condition will never be satisfied.
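The reporter's reasoning can be illustrated with a small stand-in that replaces angr Function objects with plain dicts. This is an illustration only; in a real run the outcome also depends on whether the main binary contains PLT stub functions sharing names with the externs, which would satisfy the name check.

```python
# Toy records standing in for cfg1.functions.values(); 'extern-address space'
# is how angr typically labels the externs object (an assumption here).
funcs = [
    {'binary_name': 'bin1', 'name': 'main'},
    {'binary_name': 'extern-address space', 'name': 'memset'},
]

# Mirrors offsetStrMappingGen: names are collected only from functions whose
# binary_name is NOT binary1.
externFuncNamesBin1 = [f['name'] for f in funcs if f['binary_name'] != 'bin1']

# Mirrors the merge condition: binName == binary1 AND name in the extern list.
hits = [f['name'] for f in funcs
        if f['binary_name'] == 'bin1' and f['name'] in externFuncNamesBin1]
print(hits)  # with these toy records, no function passes both clauses
```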
The error message says that the ".\vec_all" file is missing.
The codebase requires specific versions of tensorflow and gensim to run.
Please add the following requirements.txt to the repo:
tensorflow==1.15
gensim==3.8.3
angr
networkx
lapjv
scikit-learn
Nice work.
However, I also have some questions about the instruction embedding, and I hope to get your answer.
As I understand the paper, an instruction embedding is the opcode embedding scaled by its TF-IDF weight, concatenated with the average of the operand embeddings. For example, if a single opcode or operand embedding has dimension 64, the final instruction embedding has dimension 64+64=128 (I hope I understand correctly).
However, some opcodes take no operands, such as 'retn', 'pusha', 'popa', 'cdq', etc. How did you deal with such instructions? Ignore them, or concatenate a zero embedding for the operands?
Thanks.
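For reference, the scheme described above can be sketched as follows. The 64-dimensional vectors, the function and variable names, and the zero-vector fallback for operand-less instructions are my assumptions, not the authors' confirmed implementation:

```python
import numpy as np

DIM = 64  # assumed token-embedding dimension

def instruction_embedding(opcode_vec, operand_vecs, tfidf_weight):
    """Opcode embedding scaled by TF-IDF, concatenated with operand average.

    opcode_vec: shape (DIM,); operand_vecs: list of shape-(DIM,) arrays.
    """
    opcode_part = tfidf_weight * opcode_vec
    if operand_vecs:                      # e.g. 'mov eax, ebx'
        operand_part = np.mean(operand_vecs, axis=0)
    else:                                 # e.g. 'retn', 'cdq': no operands;
        operand_part = np.zeros(DIM)      # one option is a zero embedding
    return np.concatenate([opcode_part, operand_part])   # shape (2 * DIM,)
```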
How can I solve this problem?
Hello,
I am writing to ask for some advice on interpreting the output of DeepBinDiff. In particular, I have two questions, as follows:
➜ DeepBinDiff git:(master) ✗ python3 src/deepbindiff.py --input1 experiment_data/coreutils/binaries/coreutils-7.6-O0/true --input2 experiment_data/coreutils/binaries/coreutils-7.6-O3/true --outputDir output/
And the processing time is:
python3 src/deepbindiff.py --input1 --input2 --outputDir output/ 63233.15s user 103785.18s system 1966% cpu 2:21:33.48 total
It takes quite a long time (we are running it on a 32-core server machine with 256GB RAM). Is this normal?
The output of true vs. true is as follows:
Reading...
time: 7696.551887512207
Saving embeddings...
Perform matching...
[[0.8654591 0.92791235 0.7441185 ... 0.9215279 0.9736992 0.97301173]
[0.74939525 0.8753574 0.6855561 ... 0.9378971 0.95241654 0.9951242 ]
[0.6706515 0.82596886 0.8066987 ... 0.805171 0.9380803 0.9999579 ]
...
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]]
(1044, 1044)
matched pairs:
[[161, 875], [164, 867], [84, 828], [389, 1309], [71, 811], [346, 1287], [302, 1212], [208, 987],
[90, 833], [74, 814], [218, 1292], [467, 1556], [110, 844], [91, 1102], [456, 1562], [279, 1196]
, [75, 816], [213, 1317], [264, 1166], [77, 815], [76, 819], [102, 834], [291, 1206], [70, 856],
[329, 1248], [602, 1578], [560, 1581], [455, 1584], [692, 1543], [222, 999], [375, 1301], [392, 1
218], [49, 789], [1, 809], [374, 1219], [635, 1724], [267, 1177], [458, 1541], [201, 977], [250,
1139], [597, 1699], [410, 1338], [257, 1154], [341, 1319], [244, 1161], [248, 1100], [203, 978],
[734, 1920], [546, 1644], [598, 1696], [304, 1327], [372, 1302], ...
May I ask how to interpret the results? I am familiar with BinDiff and was expecting a similar output format (function-level and binary-level similarity). Is it possible to convert the current output into a function- or binary-level similarity score? Thank you very much!
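One possible post-processing step (my own sketch, not part of DeepBinDiff): collapse the block-level matched pairs printed above into a single binary-level score, e.g. a Dice-style ratio of matched blocks to total blocks. Whether this matches the paper's intended metric is an assumption:

```python
def binary_similarity(matched_pairs, num_blocks1, num_blocks2):
    """Dice-style score: 2 * |matches| / (|blocks1| + |blocks2|)."""
    return 2.0 * len(matched_pairs) / (num_blocks1 + num_blocks2)

# Excerpt of the matched pairs from the console dump above; 1044 comes from
# the printed similarity-matrix shape.
pairs = [[161, 875], [164, 867], [84, 828], [389, 1309]]
print(binary_similarity(pairs, 1044, 1044))
```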
Hello there,
Thanks a lot for providing such a nice tool. I am writing to ask about an error encountered when running it. Here is the error message I received:
python3 src/deepbindiff.py --input1 experiment_data/coreutils/binaries/coreutils-7.6-O0/ls --input2 experiment_data/coreutils/binaries/coreutils-7.6-O3/ls --outputDir output/
....
....
Initialized
Average loss at step 0 : 134.17404174804688
Average loss at step 2000 : 5.122461198568344
Average loss at step 4000 : 3.3189247410297393
Average loss at step 6000 : 3.2604737248420714
Traceback (most recent call last):
File "src/deepbindiff.py", line 230, in <module>
main()
File "src/deepbindiff.py", line 223, in main
copyEverythingOver(outputDir, 'data/DeepBD/')
File "src/deepbindiff.py", line 172, in copyEverythingOver
copyfile(src_dir + node_features, dst_dir + node_features)
File "/usr/lib/python3.6/shutil.py", line 121, in copyfile
with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'data/DeepBD/features'
Can anyone shed some light on this? Thank you very much!
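A possible workaround sketch, based only on the traceback above: shutil.copyfile cannot create intermediate directories, so pre-creating the data/DeepBD/ destination directory before the copy may avoid the FileNotFoundError (whether the rest of the pipeline then succeeds is untested):

```python
import os

# copyfile() opens the destination directly and will not create parent
# directories, so make sure the repo-relative destination exists first.
dst_dir = 'data/DeepBD/'
os.makedirs(dst_dir, exist_ok=True)
# then, as in copyEverythingOver:
# shutil.copyfile(src_dir + node_features, dst_dir + node_features)
```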
Dear sir or Madam,
When I run this code, it appears to run only on the CPU, but running the model on the CPU takes too much time. I want to know which part of the code prevents the model from running on the GPU, and whether it can be ported to the GPU. Thank you with my greatest respect.
I run this command
python3 src/deepbindiff.py --input1 ./experiment_data/findutils/binaries/findutils-4.41-O2/find --input2 ./experiment_data/findutils/binaries/findutils-4.6-O2/find --outputDir ./result
and it raises a KeyError:
Traceback (most recent call last):
File "src/deepbindiff.py", line 243, in <module>
main()
File "src/deepbindiff.py", line 224, in main
block_embeddings = cal_block_embeddings(blockIdxToTokens, blockIdxToOpcodeNum, blockIdxToOpcodeCounts, insToBlockCounts, tokenEmbeddings, reversed_dictionary)
File "src/deepbindiff.py", line 118, in cal_block_embeddings
tf_weight = opcodeCounts[token] / opcodeNum
KeyError: 'and'
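A defensive patch sketch for the line named in the traceback. My assumption is that a token absent from a block's opcode counts should simply get zero term frequency; the variable names follow the traceback:

```python
# In cal_block_embeddings, the direct lookup
#     tf_weight = opcodeCounts[token] / opcodeNum
# could use a .get() fallback so unseen opcodes such as 'and' default to 0:
opcodeCounts = {'mov': 3, 'push': 1}   # toy example
opcodeNum = 4
token = 'and'
tf_weight = opcodeCounts.get(token, 0) / opcodeNum
print(tf_weight)  # → 0.0
```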
Hi,
Why is the OpenSSL example not included in the shared binaries?
I have compiled both versions for a test example. Please provide yours so I can verify the results.
If I want to run this script under IDA Pro, what should I modify?
It seems that DeepBinDiff produces the following files, but how should I interpret them?
edgelist
edgelist_merged_tadw
nodeIndexToCode
thank you.
There are two types of function names: one is a string, and the other is a memory address. I didn't find how DeepBinDiff handles them. Thank you.
push eax
call memset
push eax
call sub_8084480
Does the function 'normalization' handle that?
def normalization(opstr, offsetStrMapping):
    optoken = ''
    opstrNum = ""
    if opstr.startswith("0x") or opstr.startswith("0X"):
        opstrNum = str(int(opstr, 16))
    # normalize ptr
    if "ptr" in opstr:
        optoken = 'ptr'
        # nodeToIndex.write("ptr\n")
    # substitude offset with strings
    elif opstrNum in offsetStrMapping:
        optoken = offsetStrMapping[opstrNum]
        # nodeToIndex.write("str\n")
        # nodeToIndex.write(offsetStrMapping[opstr] + "\n")
    elif opstr.startswith("0x") or opstr.startswith("-0x") or opstr.replace('.','',1).replace('-','',1).isdigit():
        optoken = 'imme'
        # nodeToIndex.write("IMME\n")
    elif opstr in register_list_1_byte:
        optoken = 'reg1'
    elif opstr in register_list_2_byte:
        optoken = 'reg2'
    elif opstr in register_list_4_byte:
        optoken = 'reg4'
    elif opstr in register_list_8_byte:
        optoken = 'reg8'
    else:
        optoken = str(opstr)
        # nodeToIndex.write(opstr + "\n")
    return optoken
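To answer the question concretely, here is a condensed, self-contained restatement of the branches above (register sets here are small stand-ins for the repo's real lists): named call targets such as 'memset' and address-style names such as 'sub_8084480' match none of the ptr/string/imme/register cases, so both fall through to the final else and are kept verbatim.

```python
register_lists = {'reg1': {'al'}, 'reg2': {'ax'}, 'reg4': {'eax'}, 'reg8': {'rax'}}

def normalize(opstr, offsetStrMapping):
    opstrNum = str(int(opstr, 16)) if opstr.lower().startswith('0x') else ''
    if 'ptr' in opstr:
        return 'ptr'
    if opstrNum in offsetStrMapping:
        return offsetStrMapping[opstrNum]    # offset replaced by its string
    if opstr.startswith(('0x', '-0x')) or opstr.replace('.', '', 1).replace('-', '', 1).isdigit():
        return 'imme'
    for token, regs in register_lists.items():
        if opstr in regs:
            return token
    return opstr                             # function names land here, unchanged

print(normalize('memset', {}))       # → memset
print(normalize('sub_8084480', {}))  # → sub_8084480
print(normalize('eax', {}))          # → reg4
```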
Hi, I am trying to run this code on a server, but the results are not good. I noticed a sentence in the paper: "We randomly select half of the binaries in our dataset for token embedding model training", but I cannot find a function in this code that loads half of the binaries of a dataset at once. Did I miss any important details? Or can this code only run on two binaries at a time?
Nice work.
However, could the ground-truth collection script be open-sourced?
I was trying to execute the code, and I came across an error where:
embedding_file = "\\vec_all"
But this file is being called without being created. How can I go about resolving this issue?
Any and every help is appreciated.
UPD: Resolved. The issue stems from the fact that the python3 command malfunctions. Using python instead in terminal commands seems to solve the problem.
Hey, I would love to get this up and running. It seems, though, that I'm using an incorrect version of TensorFlow.
I think having a requirements.txt would make installation easier.
I use DeepBinDiff as follows:
➜ DeepBinDiff git:(master) ✗ cat test.c
#include<stdio.h>
void main()
{
printf("hello world\n");
}
➜ DeepBinDiff git:(master) ✗ gcc test.c -o test1
➜ DeepBinDiff git:(master) ✗ gcc test.c -o test2
➜ DeepBinDiff git:(master) ✗ python3 src/deepbindiff.py --input1 ./test1 --input2 ./test2 --outputDir ./out
The full error output is as follows. The problem seems to be "Sampler's range is too small."
Traceback (most recent call last):
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1349, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1441, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Sampler's range is too small.
[[{{node nce_loss/LogUniformCandidateSampler}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/deepbindiff.py", line 233, in <module>
main()
File "src/deepbindiff.py", line 220, in main
tokenEmbeddings = featureGen.tokenEmbeddingGeneration(article, blockBoundaryIndex, insnStartingIndices, indexToCurrentInsnsStart, dictionary, reversed_dictionary, opcode_idx_list)
File "/mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py", line 392, in tokenEmbeddingGeneration
embeddings = buildAndTraining(article, blockBoundaryIndex, insnStartingIndices, indexToCurrentInsnsStart, dictionary, opcode_idx_list)
File "/mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py", line 348, in buildAndTraining
_, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 957, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1180, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1358, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Sampler's range is too small.
[[node nce_loss/LogUniformCandidateSampler (defined at /mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py:310) ]]
Original stack trace for 'nce_loss/LogUniformCandidateSampler':
File "src/deepbindiff.py", line 233, in <module>
main()
File "src/deepbindiff.py", line 220, in main
tokenEmbeddings = featureGen.tokenEmbeddingGeneration(article, blockBoundaryIndex, insnStartingIndices, indexToCurrentInsnsStart, dictionary, reversed_dictionary, opcode_idx_list)
File "/mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py", line 392, in tokenEmbeddingGeneration
embeddings = buildAndTraining(article, blockBoundaryIndex, insnStartingIndices, indexToCurrentInsnsStart, dictionary, opcode_idx_list)
File "/mnt/hgfs/deepbindiff/DeepBinDiff/src/featureGen.py", line 310, in buildAndTraining
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/ops/nn_impl.py", line 2046, in nce_loss
logits, labels = _compute_sampled_logits(
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/ops/nn_impl.py", line 1742, in _compute_sampled_logits
sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/ops/candidate_sampling_ops.py", line 149, in log_uniform_candidate_sampler
return gen_candidate_sampling_ops.log_uniform_candidate_sampler(
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/ops/gen_candidate_sampling_ops.py", line 656, in log_uniform_candidate_sampler
_, _, _op, _outputs = _op_def_library._apply_op_helper(
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 742, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3319, in _create_op_internal
ret = Operation(
File "/home/ling/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1791, in __init__
self._traceback = tf_stack.extract_stack()
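A plausible explanation (my reading of the traceback, not a confirmed diagnosis): tf.nn.nce_loss draws num_sampled negative classes via LogUniformCandidateSampler, which fails when the vocabulary is smaller than the number of requested samples, and a two-line hello-world binary yields very few distinct tokens. Clamping the sample count is one possible workaround; the variable names are assumptions about featureGen.py:

```python
# Hypothetical pre-check before building the NCE loss in featureGen.py.
# With unique sampling, LogUniformCandidateSampler needs the sample count
# to fit inside the sampling range, so clamp it to the vocabulary size:
num_sampled = 64                       # a typical word2vec-style default
vocabulary_size = 10                   # tiny vocabulary from a trivial binary
safe_num_sampled = min(num_sampled, max(vocabulary_size - 1, 1))
print(safe_num_sampled)  # → 9
# ...then pass num_sampled=safe_num_sampled to tf.nn.nce_loss(...)
```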