lifting-bits / anvill
anvill forges beautiful LLVM bitcode out of raw machine code
License: GNU Affero General Public License v3.0
E.g., for the SysV ABI, this is often in al.
Sometimes we generate bitcode that does not pass the verifier.
As a sanity check, let's ensure that we can always disassemble the bitcode via llvm-dis
as part of our test suite.
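Such a check could be a thin wrapper around llvm-dis in the test suite. A minimal sketch, assuming llvm-dis is on PATH; the helper names are made up for illustration:

```python
import subprocess

def llvm_dis_cmd(bitcode_path):
    # Build the llvm-dis invocation; output is discarded, we only
    # care whether disassembly succeeds.
    return ["llvm-dis", bitcode_path, "-o", "/dev/null"]

def bitcode_disassembles(bitcode_path):
    # Return True iff llvm-dis can round-trip the given .bc file.
    result = subprocess.run(llvm_dis_cmd(bitcode_path),
                            capture_output=True)
    return result.returncode == 0
```

A test harness could then assert `bitcode_disassembles(out_bc)` for every spec it lifts.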
We want to be able to mark certain instructions and stack locations with attributes.
While talking with Peter, he suggested that these two features might be useful to start with.
So it looks like Clang segfaults on bitcode generated by Anvill :). The current suspicion is that it is caused by inline assembly we generate to mark capture of registers not in the specification.
Proposing the following fixes
%9 = getelementptr inbounds %struct.State, %struct.State* %8, i32 0, i32 6, i32 1, i32 0, i32 0, !remill_register !5
%10 = call i64 asm sideeffect "# read register RAX", "=r"()
store i64 %10, i64* %9, align 8
%11 = getelementptr inbounds %str
Make this register-marking behavior optional via a feature flag.
Really, the most accurate way to lift this would be with two functions:
the first function would have the naked attribute; it would contain only the inline assembly, read out the "missing" dependent registers, and pass them as explicit arguments to the second function, along with the already stated arguments.
The idea is that, if we are using a disassembler such as IDA Pro or Binary Ninja to generate the spec, then we will be giving concrete addresses for stuff in .tbss, .tls, and other TLS-related sections. We should be able to add an address space to memory in those ranges that is carried through into the LLVM side.
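A sketch of what the tagging might look like on the spec-generation side; the address-space number and helper name here are assumptions for illustration, not anything Anvill currently defines:

```python
# Assumed, non-zero LLVM address space reserved for TLS references.
TLS_ADDRESS_SPACE = 256

def address_space_for(addr, tls_ranges):
    # tls_ranges: iterable of (begin, end) byte ranges covering
    # .tbss/.tls-style sections discovered by the disassembler.
    for begin, end in tls_ranges:
        if begin <= addr < end:
            return TLS_ADDRESS_SPACE
    return 0  # default address space for everything else
```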
Use something like:
to have extensive single-function tests.
These will match similar tests in Rellic.
Anvill does a limited amount of what could be charitably described as type propagation. It is mostly centred around the GetPointer function, and what it invokes. The merging of #62 means that, besides the address operands to memory read/write intrinsics, we have another jumping-off point for type information.
The goal of upstream and downstream type propagation is to lift low-level operations to a slightly higher level. Here are some examples:
Suppose we have used hinted type information to recover the following:
%y_ptr = __anvill_type_...(%y_orig)
%y = ptrtoint %y_ptr
%x = add %y, 8
Then our goal is to produce the following:
%y_ptr = __anvill_type_...(%y_orig)
%x_ptr = getelementptr %y_ptr, 0, N
%x = ptrtoint %x_ptr ; Add to work list for processing
What is gained by this transformation, what is N, and how is this an improvement? This depends on the pointer type returned by __anvill_type_.... If it's a pointer to a structure, then N might be the index of one of the elements of the structure, so long as there is an element in that structure at byte offset 8. If it is a pointer to a pointer, then N might be 1 on a 64-bit architecture, or 2 on a 32-bit architecture, i.e. 8 / sizeof(void *), telling us what index of an array we're accessing. For non-pointer types we can do similar arithmetic to try to convert the byte offset 8 into an array index. When this isn't possible, we can bitcast to a type and convert to GEPs, if we can get some better downstream type information.
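The offset-to-index arithmetic described above can be sketched as a small helper; the function name is hypothetical:

```python
def byte_offset_to_index(byte_offset, element_size):
    # Convert a raw byte displacement into an array index for a GEP,
    # but only when the displacement is an exact multiple of the
    # element size; otherwise signal "fall back to bitcast".
    if element_size > 0 and byte_offset % element_size == 0:
        return byte_offset // element_size
    return None
```

For example, offset 8 over 8-byte pointers (64-bit) yields index 1, while the same offset over 4-byte pointers (32-bit) yields index 2, matching the 8 / sizeof(void *) reasoning above.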
In x86 especially, the following pattern is common in function prologues:
push eax
push ebx
...
push ...
In bitcode, this might manifest as the following:
%esp = load %state, 0, ... ; state->gpr.rsp.dword
%eax = load %state, 0, ... ; state->gpr.rax.dword
%ebx = load %state, 0, ... ; state->gpr.rbx.dword
...
%esp_less_4 = sub %esp, 4
%memory.0 = __remill_write_memory_32(%memory, %esp_less_4, %eax)
%esp_less_8 = sub %esp, 8
%memory.1 = __remill_write_memory_32(%memory.0, %esp_less_8, %ebx)
...
store %esp_less_N, ... ; state->gpr.rsp.dword -= N
From the perspective of GetPointer, we can observe that %esp_less_4 and %esp_less_8 both need to be i32 *. These two values share a common data dependency at their root, and a common base type. Ideally, we'd like to be able to treat %esp as an i32 * so that we can use getelementptr with negative indices to index into the stack, and then treat the prologue like writing into elements of an array. This can be done locally with an aggressive algorithm, or perhaps via some more global means.
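Treating the entry %esp as an i32 * and each push as a negative array index amounts to dividing each displacement by the element size. A hedged sketch, with a hypothetical helper name:

```python
def stack_slot_indices(displacements, element_size=4):
    # Map stack-pointer displacements (esp-4, esp-8 written as -4, -8)
    # to negative GEP indices off the entry %esp, treating the stack
    # as an array of i32; misaligned displacements yield None.
    return [d // element_size if d % element_size == 0 else None
            for d in displacements]
```

So the two writes in the prologue above would become stores to indices -1 and -2 of the stack "array".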
In practice, I have just looked at bitcode, looked for integer arithmetic that eventually leads to pointer casts, and tried to re-imagine how they could be lifted to a slightly higher level of abstraction.
What is needed to work toward a solution to this problem? The current state of the codebase is just a mishmash of complex, recursive functions that look for fixed patterns -- it's not extensible. I don't know what the right solution is just yet. In theory, if we have C++ codegen for Dr. Lojekyll, then maybe that could be useful, but that might be trying to fit a square peg into a round hole. If we had MLIR as an intermediate representation, then I'd say we should use its transform infrastructure. I think we should also look into how people have tried to automate InstCombine rules. There's probably a whole lot of work here.
Thus, we have two problems:
The current CMake project can only auto-install the Python frontend to the local machine, which breaks packaging
To have other tools understand and parse the JSON spec, we should describe what is in it, and what the values mean.
This is one step toward support of lifting jump tables. The devirtualization list can be an array of pairs of integers that is at the top-most "scope" of the specification.
When parsing a *v type, we generate an LLVM pointer to void, which is not an actual valid type. Instead, we should generate a pointer to an i8.
Right now, varargs functions are detected as varargs, but we only emit the formal parameters, not the variable ones.
As an example, we see:
unsigned int printf(unsigned char *arg0, ...);
...
unsigned int val6 = printf(data_4007c1);
Let's handle varargs functions better. For now, it is fine if it just hits *printf and *scanf and handles static strings. This may be some kind of specific pass or set of heuristics, as long as it is identified as such and not commingled with other code.
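For *printf-style callees with a static format string, one such heuristic could count the conversion specifiers to guess how many variadic arguments to emit. A rough sketch (it deliberately ignores '*' width/precision arguments and positional specifiers):

```python
def count_format_args(fmt):
    # Rough count of variadic arguments a printf-style format string
    # consumes: every '%' that is not part of a '%%' escape.
    return fmt.replace("%%", "").count("%")
```

E.g. a recovered call to printf("%d %s\n", ...) would then be emitted with two variadic arguments instead of none.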
We want to be able to distinguish between unsigned and signed integers when translating an LLVM type to a spec. This is nontrivial because LLVM doesn't support the notion of signedness in its integer type.
Given a binary compiled from tests/branch.c via clang-9.0, anvill's binja frontend crashes with:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/__main__.py", line 77, in <module>
File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/__main__.py", line 66, in main
File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/program.py", line 97, in add_function_definition
File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/binja.py", line 348, in visit
File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/binja.py", line 398, in _extract_types
File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/binja.py", line 405, in _extract_types
File "/home/my_user/git/binaryninja/python/binaryninja/lowlevelil.py", line 1078, in get_reg_value
reg = self._function.arch.get_reg_index(reg)
File "/home/my_user/git/binaryninja/python/binaryninja/architecture.py", line 1564, in get_reg_index
return self.regs[reg].index
KeyError: 'temp0'
This is likely due to binaryninja.LowLevelILInstruction.get_reg_value() being called on
temp0.q = divs.dp.q(rdx:rax, rcx)
where temp0 is a virtual register, as opposed to a real one like rax or al.
The anvill Python code should provide version information via --version.
This version should look like the version information for the C++ code; specifically, it should output the latest git commit hash.
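A best-effort sketch of deriving that version string, assuming the package is built from a git checkout; the helper name is made up:

```python
import subprocess

def get_version():
    # Best-effort version string: the short hash of the latest git
    # commit, falling back to "unknown" outside a git checkout.
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            stderr=subprocess.DEVNULL)
        return out.decode("utf-8").strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
```

In practice the hash would likely be baked in at build/install time rather than queried at runtime.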
Following the instructions for building on Linux, and after make finishes with no errors:
$ sudo make install
make: *** No rule to make target 'install'. Stop.
It may be difficult to remember which bitcode goes with which anvill.
Let's track the bitcode generator via a debug annotation or comments in the bitcode.
This should emit the anvill version and commit hash used to generate the bitcode.
Things that would have been nice to have while getting ANVILL up and running
There is the alloca function in C, which dynamically allocates data on the stack. We need to investigate how to lift that.
When pushing new commits to an existing PR, any running or queued workflow for the same PR should be aborted. This is currently causing macOS builds to get stuck (due to the org-wide limit of 5 concurrent macOS workers).
ARM64 binaries often reference data via a combination of adrp and add (or similar instructions). The initial reference via adrp is (correctly) detected as not present in the binary, and then Anvill (incorrectly) bails out in terror and abandons lifting.
Anvill should not abandon lifting, and adrp/add combination references should work in bitcode generated from ARM64 binaries.
Example output:
Traceback (most recent call last):
File "/home/artem/.local/lib/python3.9/site-packages/anvill-1.0-py3.9.egg/anvill/program.py", line 116, in try_add_referenced_entity
File "/home/artem/.local/lib/python3.9/site-packages/anvill-1.0-py3.9.egg/anvill/program.py", line 90, in add_function_definition
File "/home/artem/.local/lib/python3.9/site-packages/anvill-1.0-py3.9.egg/anvill/program.py", line 45, in get_function
File "/home/artem/.local/lib/python3.9/site-packages/anvill-1.0-py3.9.egg/anvill/binja.py", line 484, in get_function_impl
anvill.exc.InvalidFunctionException: No function defined at or containing address 410000
Python>import anvill
Python>p = anvill.get_program()
Python>p.add_function_definition(0x100000F40)
True
Python>p.add_function_definition(0x100000F60)
True
Python>p.add_function_definition(0x100000F70)
True
Python>p.add_function_definition(0x100000F80)
True
Python>p.proto()
{"os": "macos", "functions": [{"return_stack_pointer": {"register": "RSP", "type": "L", "offset": 8}, "name": "_main", "parameters": [{"register": "EDI", "type": "i", "name": "argc"}, {"register": "RSI", "type": "**b", "name": "argv"}, {"register": "RDX", "type": "**b", "name": "envp"}], "return_address": {"type": "L", "memory": {"register": "RSP", "offset": 0}}, "return_values": [{"register": "EAX", "type": "i"}], "address": 4294971200}, {"return_stack_pointer": {"register": "RSP", "type": "L", "offset": 8}, "name": "_voidptr_function", "parameters": [], "return_address": {"type": "L", "memory": {"register": "RSP", "offset": 0}}, "return_values": [{"register": "RAX", "type": "l"}], "address": 4294971232}, {"return_stack_pointer": {"register": "RSP", "type": "L", "offset": 8}, "name": "_uint64_function", "parameters": [], "return_address": {"type": "L", "memory": {"register": "RSP", "offset": 0}}, "return_values": [{"register": "RAX", "type": "l"}], "address": 4294971264}, {"return_stack_pointer": {"register": "RSP", "type": "L", "offset": 8}, "name": "_intptr_function", "parameters": [], "return_address": {"type": "L", "memory": {"register": "RSP", "offset": 0}}, "return_values": [{"register": "RAX", "type": "l"}], "address": 4294971248}], "arch": "amd64", "stack": {"size": 24576, "start_offset": 4096, "address": 87960930217984}, "memory": []}
When running Anvill on the solo.elf binary from the solokey firmware, I get a Python type error:
prog = anvill.get_program(input_bin)
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/binja.py", line 475, in get_program
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/binja.py", line 386, in __init__
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf.py", line 169, in __init__
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf.py", line 184, in _load_types
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf.py", line 179, in _process_dies
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf.py", line 177, in _process_dies
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf_type.py", line 287, in _process_types
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf_type.py", line 259, in _process_indirect_types
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf_type.py", line 291, in _process_types
File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf_type.py", line 308, in _process_pointer_types
TypeError: can't concat str to bytes
I will hunt this down and make a fix.
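The usual fix for this class of error is to normalize DWARF DIE attribute values, which pyelftools hands back as bytes, before concatenating them with other strings. A hedged sketch with a hypothetical helper name:

```python
def die_name_to_str(name):
    # DIE attribute values (e.g. DW_AT_name) may arrive as bytes;
    # normalize to str so pointer-type name building can concatenate.
    if isinstance(name, bytes):
        return name.decode("utf-8", errors="replace")
    return name
```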
After successfully building and installing anvill (on both macOS and Linux), 'python3 -m anvill' results in a 'No module named anvill' error.
After manually installing the anvill Python module, 'python3 -m anvill' gives the following error:
NotImplementedError: Could not find either IDA or Binary Ninja APIs
So it doesn't seem usable from the CLI as per the README.
Trying to use the IDA plugin from within IDA results in:
AttributeError: module 'anvill' has no attribute 'get_program'
Some documentation on how this relates to mcsema (successor? stripped-down version?), how to build and use it would be nice.
If the CI test runner fails, it should print out the failing JSON spec to help with debugging.
These should be able to be populated from LLVM plus debug metadata, from DWARF parsing, and from IDA's database.
Using binja, we're sometimes missing type information for global variables. In this case, binja assigns them a VoidType. When creating a JSON spec, Anvill translates this into a 1-byte byte array, which then gets translated into a [1 x i8] type.
We should try to improve this via heuristics.
The key idea is this: instead of storing a hard-coded constant value into the State's stack pointer, we'll have a global, __anvill_stack_pointer, create a ptrtoint constant expression of it, and store that. Then we can investigate all uses of __anvill_stack_pointer and the subsequent integer displacements of those uses, and implement some folding over those.
The ELF thunk recognition code of McSema should be copied and adapted for Anvill so that if a function references an ELF thunk, then we go and follow through and find the referenced external and use its name in the prototype, rather than the name of the function itself, which may be prefixed with junk.
That is, instead of a prototype of this function having the name _signal or .signal:
We should instead follow through to the .plt segment...
And take the info from here:
The relevant code to adapt from McSema is:
https://github.com/lifting-bits/mcsema/blob/master/tools/mcsema_disass/ida7/get_cfg.py#L334-L466
Also, basic block addresses can be observed. Also, this function uses a retn, which is not recognized correctly in IDA.
{"functions": [{"return_stack_pointer": {"register": "ESP", "type": "I", "offset": 4}, "return_values": [{"register": "EAX", "type": "i"}], "return_address": {"type": "I", "memory": {"register": "ESP", "offset": 0}}, "parameters": [{"type": "i", "memory": {"register": "ESP", "offset": 4}}, {"type": "i", "memory": {"register": "ESP", "offset": 8}}, {"type": "i", "memory": {"register": "ESP", "offset": 12}}], "address": 6622921}, {"return_stack_pointer": {"register": "ESP", "type": "I", "offset": 4}, "return_values": [{"register": "EAX", "type": "i"}], "return_address": {"type": "I", "memory": {"register": "ESP", "offset": 0}}, "parameters": [{"type": "i", "memory": {"register": "ESP", "offset": 4}}, {"type": "i", "memory": {"register": "ESP", "offset": 8}}, {"type": "i", "memory": {"register": "ESP", "offset": 12}}, {"type": "i", "memory": {"register": "ESP", "offset": 16}}, {"type": "i", "memory": {"register": "ESP", "offset": 20}}, {"type": "i", "memory": {"register": "ESP", "offset": 24}}], "address": 6595488}], "arch": "x86", "variables": [{"type": "o", "address": 7436608}], "memory": [{"is_writeable": false, "data": 
"9081ec0c01000089ac240001000089b4240401000089bc24080100008bec83e4f08b8d240100008b85100100008b952001000083f9047d428bb51801000083f9000f84d901000083ec0c8904248974240489542408e8d06a00000385140100008bb51801000003b51c01000089b518010000ff8d24010000ebc7f30f1002f30f104a04f30f105208f30f105a0c0fc6c0000fc6c9000fc6d2000fc6db00f30f106210f30f106a14f30f107218f30f107a1c0fc6e4000fc6ed000fc6f6000fc6ff000f2904240f294c24100f295424200f295c24300f296424400f296c24500f297424600f297c2470f30f104230f30f104a34f30f105238f30f105a3c0fc6c0000fc6c9000fc6d2000fc6db000f298424c00000000f298c24d00000000f299424e00000000f299c24f00000008b95180100008bb51c0100008bbd1401000083f9040f8cb10000000f12020f120c168d14720f14c10f12220f122c160f14e50f28c80f16c40f12e18d14720f28c80f28e80f5904240f594c24100f596c24300f28d40f28fc0f596424400f595424500f597c24700f58c40f58ca0f58ef0f2815407971000f588424c00000000f588c24d00000000f58ac24f00000000f53e50f59ec0f5cd50f59e20f59c40f59cc0f28d00f14c10f15d10f13000f12c00f1304078d04780f13100f12d20f1314078d047883e904e946ffffff83f900742383e9040faff90faff103d603c7b9040000008bb51c0100008bbd14010000e91effffff8b85100100008be58bac24000100008bb424040100008bbc240801000081c40c010000c218008da424000000008da424000000008da424000000008d9b00000000", "is_executable": true, "is_readable": true, "address": 6595487}], "os": "windows", "stack": {"size": 24576, "start_offset": 4096, "address": 1344647168}}
This feature will exist to support data-to-code and data-to-data references. This will be an array of entries, kind of like:
[
[<addr of relocation>,
<size of value, in bytes>,
<op str>,
<expr values>]
// .. more entries
]
The op string will be a short string of characters that act as a stack-based interpreter, and where the starting stack state is provided by the following sequence of expression values, which must be integral constants.
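A tiny sketch of such a stack-based interpreter, assuming, purely for illustration, that the op characters include '+' and '-' over the seeded expression values:

```python
def eval_op_str(ops, init_stack):
    # The starting stack is seeded with the integral expression
    # values; each op character pops two operands and pushes a result.
    stack = list(init_stack)
    for op in ops:
        b = stack.pop()
        a = stack.pop()
        if op == "+":
            stack.append(a + b)
        elif op == "-":
            stack.append(a - b)
        else:
            raise ValueError("unknown op: " + op)
    return stack[-1]
```

E.g. a relocation expressed as "base address plus addend" would seed the stack with the two constants and use a single "+" op.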
%2 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 8
%3 = ptrtoint i8* %2 to i64
...
%27 = phi i64 [ %33, %32 ], [ %3, %0 ]
...
%33 = ptrtoint %main.frame_type* %1 to i64
...
%37 = inttoptr i64 %27 to i64*
; Function Attrs: noinline nounwind ssp
define i32 @main(i32 %argc, i8** %argv) local_unnamed_addr #0 {
%1 = alloca %main.frame_type, align 8
%2 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 8
%3 = ptrtoint i8* %2 to i64
%4 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 16
%5 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 28
%6 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 32
%7 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 40
%8 = tail call i8* @llvm.returnaddress(i32 0)
%9 = ptrtoint i8* %8 to i64
%10 = bitcast i8* %7 to i64*
store i64 %9, i64* %10, align 8
%11 = bitcast i8* %6 to i64*
store i64 0, i64* %11, align 8
%12 = bitcast i8* %5 to i32*
store i32 %argc, i32* %12, align 4
%13 = ptrtoint i8** %argv to i64
%14 = bitcast i8* %4 to i64*
store i64 %13, i64* %14, align 8
%15 = add i32 %argc, -2
%16 = icmp eq i32 %15, 0
%17 = lshr i32 %15, 31
%18 = trunc i32 %17 to i8
%19 = lshr i32 %argc, 31
%20 = xor i32 %17, %19
%21 = add nuw nsw i32 %20, %19
%22 = icmp eq i32 %21, 2
%23 = icmp ne i8 %18, 0
%24 = xor i1 %23, %22
%25 = or i1 %16, %24
br i1 %25, label %26, label %32
26: ; preds = %32, %0
%27 = phi i64 [ %33, %32 ], [ %3, %0 ]
%28 = icmp ne i32 %15, 0
%29 = icmp eq i8 %18, 0
%30 = xor i1 %29, %22
%31 = and i1 %28, %30
br i1 %31, label %39, label %36
32: ; preds = %0
%33 = ptrtoint %main.frame_type* %1 to i64
%34 = bitcast i8* %2 to i64*
store i64 48, i64* %34, align 8
%35 = tail call i32 @foo() #3, !noalias !0
br label %26
36: ; preds = %26
%37 = inttoptr i64 %27 to i64*
store i64 59, i64* %37, align 8
%38 = tail call i32 @bar() #3, !noalias !3
%.pre = load i64, i64* %10, align 8
br label %39
39: ; preds = %26, %36
%40 = phi i64 [ %9, %26 ], [ %.pre, %36 ]
%41 = tail call %struct.Memory* @__remill_function_return(%struct.State* nonnull undef, i64 %40, %struct.Memory* null) #3
ret i32 0
}
{
"arch":"amd64",
"variables":[
],
"functions":[
{
"name":"foo",
"address":0,
"return_values":[
{
"register":"EAX",
"type":"i"
}
],
"return_stack_pointer":{
"register":"RSP",
"type":"L"
},
"parameters":[
],
"return_address":{
"memory":{
"register":"RSP"
},
"type":"L"
}
},
{
"name":"bar",
"address":11,
"return_values":[
{
"register":"EAX",
"type":"i"
}
],
"return_stack_pointer":{
"register":"RSP",
"type":"L"
},
"parameters":[
],
"return_address":{
"memory":{
"register":"RSP"
},
"type":"L"
}
},
{
"name":"main",
"address":22,
"return_values":[
{
"register":"EAX",
"type":"i"
}
],
"return_stack_pointer":{
"register":"RSP",
"type":"L"
},
"parameters":[
{
"name":"argc",
"type":"i",
"register":"EDI"
},
{
"name":"argv",
"type":"**b",
"register":"RSI"
}
],
"return_address":{
"memory":{
"register":"RSP"
},
"type":"L"
}
}
],
"os":"linux",
"stack":{
"size":24576,
"start_offset":4096,
"address":87960955441152
},
"memory":[
{
"is_writeable":false,
"data":"554889E5B8010000005DC3554889E5B8020000005DC3554889E54883EC10897DFC488975F0837DFC027E05E8D0FFFFFF837DFC027F05E8D0FFFFFFB800000000C9C3",
"is_executable":true,
"is_readable":true,
"address":0
}
]
}
Since we are working on ARMv7 and eventually Thumb2 support in Remill, we should include the architecture descriptions for these in Anvill to begin testing Anvill on real programs.
There is enough ARMv7 support that we should be able to start testing Anvill in its current state.
I have encountered an issue where anvill-decompile-json
behaves incorrectly.
C++ code:
struct Dummy {
long i;
long j;
long k;
};
void dummy_function(struct Dummy& dummy, long i, long j, long k) {
dummy.i = i;
dummy.j = j;
dummy.k = k;
}
Bitcode from LLVM:
; Function Attrs: noinline norecurse nounwind ssp uwtable writeonly
define void @_Z14dummy_functionR5Dummylll(%struct.Dummy* nocapture dereferenceable(24), i64, i64, i64) local_unnamed_addr #5 !dbg !2476 {
call void @llvm.dbg.value(metadata %struct.Dummy* %0, metadata !2486, metadata !DIExpression()), !dbg !2490
call void @llvm.dbg.value(metadata i64 %1, metadata !2487, metadata !DIExpression()), !dbg !2491
call void @llvm.dbg.value(metadata i64 %2, metadata !2488, metadata !DIExpression()), !dbg !2492
call void @llvm.dbg.value(metadata i64 %3, metadata !2489, metadata !DIExpression()), !dbg !2493
%5 = getelementptr inbounds %struct.Dummy, %struct.Dummy* %0, i64 0, i32 0, !dbg !2494
store i64 %1, i64* %5, align 8, !dbg !2495, !tbaa !2496
%6 = getelementptr inbounds %struct.Dummy, %struct.Dummy* %0, i64 0, i32 1, !dbg !2498
store i64 %2, i64* %6, align 8, !dbg !2499, !tbaa !2500
%7 = getelementptr inbounds %struct.Dummy, %struct.Dummy* %0, i64 0, i32 2, !dbg !2501
store i64 %3, i64* %7, align 8, !dbg !2502, !tbaa !2503
ret void, !dbg !2504
}
JSON Specification:
{
"arch": "amd64",
"functions": [
{
"name": "dummy_function(Dummy&, long, long, long)",
"address": 4096,
"parameters": [
{
"name": "dummy",
"register": "RDI",
"type": "*{lll}"
},
{
"name": "i",
"register": "RSI",
"type": "l"
},
{
"name": "j",
"register": "RDX",
"type": "l"
},
{
"name": "k",
"register": "RCX",
"type": "l"
}
],
"return_stack_pointer": {
"offset": 8,
"register": "RSP",
"type": "l"
},
"return_address": {
"memory": {
"register": "RSP"
},
"type": "L"
},
"return_values": []
}
],
"memory": [
{
"address": 4096,
"data": "554889E54889374889570848894F105DC3",
"is_readable": true,
"is_executable": true
}
],
"stack": {
"address": 10000,
"size": 4176,
"start_offset": 4096
},
"os": "macos"
}
Lifted Bitcode
; Function Attrs: noinline nounwind ssp
define void @"dummy_function(Dummy&, long, long, long)"({ i64, i64, i64 }* nocapture %dummy, i64 %i, i64 %j, i64 %k) local_unnamed_addr #0 {
%1 = tail call i8* @llvm.returnaddress(i32 0)
%2 = ptrtoint i8* %1 to i64
%3 = getelementptr inbounds { i64, i64, i64 }, { i64, i64, i64 }* %dummy, i64 0, i32 0
store i64 %k, i64* %3, align 8
%4 = tail call %struct.Memory* @__remill_function_return(%struct.State* nonnull undef, i64 %2, %struct.Memory* null) #3
ret void
}
The lifted bitcode is wrong. One easy way to tell is that there are no offsets and no references to %i or %j. But basically, it is not handling assignment to struct elements correctly.
Go through Anvill code and document, at a high level, what it should be doing & why, to capture the information somewhere in written form.
We should be able to use the Binary Ninja API to determine the file type, rather than python-magic.
We're already adding type hints, and the Python 2.7 EOL should mean that we don't even bother trying to run a Python 2 setup.py.
So if the same address has multiple names, then the second call will declare a new version of the thing. For global variables, I think you'd want to use a GlobalAlias, so that all the "public" names can be aliases to internal private data. For functions, a similar thing could be done, where "public" versions tail-call the internal private version.
Originally posted by @pgoodman in #45 (comment)
Currently the docker image does not include lift.py; we should include it, since it's a core part of using Anvill.
define i32 @main(i32 %argc, i8** %argv) local_unnamed_addr #0 {
%1 = alloca { [128 x i8], i64, [8 x i8], i64, [8 x i8], i64, [16 x i8], i8**, [20 x i8], i32, [4 x i8], i32, [4 x i8], i32, [8 x i8], i64, [16 x i8], i8*, [8 x i8], i64 }, align 8
%2 = ptrtoint { [128 x i8], i64, [8 x i8], i64, [8 x i8], i64, [16 x i8], i8**, [20 x i8], i32, [4 x i8], i32, [4 x i8], i32, [8 x i8], i64, [16 x i8], i8*, [8 x i8], i64 }* %1 to i64
...
25: ; preds = %31, %0
%26 = phi i64 [ 87960955445208, %31 ], [ %2, %0 ]
...
34: ; preds = %25
%35 = inttoptr i64 %26 to i64*
store i64 59, i64* %35, align 8
...
{
"arch":"amd64",
"variables":[
],
"functions":[
{
"name":"foo",
"address":0,
"return_values":[
{
"register":"EAX",
"type":"i"
}
],
"return_stack_pointer":{
"register":"RSP",
"type":"L"
},
"parameters":[
],
"return_address":{
"memory":{
"register":"RSP"
},
"type":"L"
}
},
{
"name":"bar",
"address":11,
"return_values":[
{
"register":"EAX",
"type":"i"
}
],
"return_stack_pointer":{
"register":"RSP",
"type":"L"
},
"parameters":[
],
"return_address":{
"memory":{
"register":"RSP"
},
"type":"L"
}
},
{
"name":"main",
"address":22,
"return_values":[
{
"register":"EAX",
"type":"i"
}
],
"return_stack_pointer":{
"register":"RSP",
"type":"L"
},
"parameters":[
{
"name":"argc",
"type":"i",
"register":"EDI"
},
{
"name":"argv",
"type":"**b",
"register":"RSI"
}
],
"return_address":{
"memory":{
"register":"RSP"
},
"type":"L"
}
}
],
"os":"linux",
"stack":{
"size":24576,
"start_offset":4096,
"address":87960955441152
},
"memory":[
{
"is_writeable":false,
"data":"554889E5B8010000005DC3554889E5B8020000005DC3554889E54883EC10897DFC488975F0837DFC027E05E8D0FFFFFF837DFC027F05E8D0FFFFFFB800000000C9C3",
"is_executable":true,
"is_readable":true,
"address":0
}
]
}
foo:
push rbp
mov rbp, rsp
mov eax, 1
pop rbp
ret
bar:
push rbp
mov rbp, rsp
mov eax, 2
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], edi
mov QWORD PTR [rbp-16], rsi
cmp DWORD PTR [rbp-4], 2
jle L6
call foo
L6:
cmp DWORD PTR [rbp-4], 2
jg L7
call bar
L7:
mov eax, 0
leave
ret
Sometimes BinaryNinja doesn't know the size of a global. In that case, we report it as a one-byte array.
Let's use some kind of heuristic to pick a better size instead.
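One plausible heuristic, sketched here with a hypothetical helper: let the global extend to the next known symbol, or to the end of its section, with a sanity cap:

```python
import bisect

def guess_global_size(addr, known_addrs, section_end, max_size=4096):
    # A global of unknown size is assumed to run up to the next known
    # symbol address (or the end of its section), capped at max_size
    # and never smaller than one byte.
    addrs = sorted(known_addrs)
    i = bisect.bisect_right(addrs, addr)
    next_boundary = addrs[i] if i < len(addrs) else section_end
    return max(1, min(next_boundary - addr, max_size))
```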
While lifting function calls from machine code (MC) to LLVM IR, we lift code that stores return addresses into the stack frame of the callee. This is of course needed for proper return from the callee back into the caller. Due to having to support ABIs of several architectures, this can require special attention. Anvill currently in some cases stores incorrect addresses into the callee stack frame, which down the line can cause accesses / references to code memory to remain in the output bitcode.
The core issue seems to be how the "registers" PC and NEXT_PC of Remill's State structure are initialized at the start of lifted functions and manipulated before and after function calls. PC and NEXT_PC can be found via uses of remill::kPCVariableName and remill::kNextPCVariableName.
McSema, which also uses Remill, doesn't suffer from these issues, so it can be used as a reference.
McSema currently contains code for folding and fixing up pointer usage into GEP operations and similar. If anvill has any use for this, it would be great to port it back, possibly to Analyze.cpp. Example:
https://github.com/lifting-bits/mcsema/blob/d9bbd1ea7c31a83d3b201b1c8711675ca7522408/mcsema/BC/Optimize.cpp#L474-L516
I'm attempting to run McSema on a reasonably large binary. After a number of hours (about 12 I believe) when using a machine equipped with enough RAM (256 GiB seems to be enough) I finally hit this point:
F20201031 04:22:21.503217 6 Analyze.cpp:481] Check failed: dest_size < 64 (64 vs. 64)
*** Check failure stack trace: ***
@ 0x9c6bcc google::LogMessageFatal::~LogMessageFatal()
@ 0x70d856 anvill::XrefExprFolder::VisitSExt()
@ 0x5f76cb mcsema::(anonymous namespace)::LowerXrefs()::$_1::operator()()
@ 0x5f2b37 mcsema::(anonymous namespace)::LowerXrefs()
@ 0x5f1ca4 mcsema::OptimizeModule()
@ 0x5ef329 mcsema::LiftCodeIntoModule()
@ 0x6057bc main
@ 0x7f9be066d0b3 __libc_start_main
@ 0x5b71ae _start
Unfortunately I don't see any other context that might help identify where things are getting stuck :)
I used McSema at git master with IDA Pro v7.5.200728 and IDAPython with Python 3 on Windows for flow control reconstruction. I used the latest Docker image of McSema to perform the lifting phase, on a Ubuntu 20.04 VM (An Amazon r5.8xlarge. It OOMs on my local machine with 64 GiB. It also OOMs with 128 GiB of RAM, which is the maximum my local machine is able to handle, so that leaves me in a position of being unable to run locally. Probably will switch to a better CPU platform in the future to reduce runtime though.)
The binary is an x86 Windows binary compiled with Microsoft Visual C++, not obfuscated, and with the accompanied PDB.
It seems like this means that somewhere along the line an LLVM sext instruction is created that attempts to sign-extend an integer, but for some reason a destination size of 64 bits is an error case for this particular code in anvill. Is this a bug, or perhaps missing functionality?
Let's make some basic tests that run on CI.
This would let the LLVM side of things know what is mapped vs. not mapped, and then that would better inform our integer-to-pointer conversion stuff.
Currently, the CI will download the latest version of cxx-common and link against the HEAD commit in the remill branch. This means that running the CI jobs multiple times will likely cause the process to change as new commits appear in branches or new releases of cxx-common are tagged.
Previously, the .remill_commit_id file was being used to pin the remill version to a known commit/tag, but this seems to have been deprecated.
In order to fix this issue, we have to select a cxx-common version when we download it on the CI (vs. using latest), and either bring back the .remill_commit_id special file or add remill as a submodule.
The goal of this issue is to instrument the bitcode pre-optimization in such a way that post-optimization, we can reason about register values/types at specific points in time. The hope is that this will help us diagnose the provenance of "unusual values" that show up in the bitcode. Similarly, the output of this step should feasibly be able to be piped back into disassembly or even to produce updated specs that include augmented type or value information.
The key idea is that we want to create a printf
-like function, declared as variadic, and that reads the values of registers just before each instruction. These function calls will be chained together, taking and returning memory pointers. These functions will be declared, not defined, thus be external. Finally, they will be attributed as ReadNone
so LLVM treats them as not touching global state, and thus not interfering with optimizations.
Memory *__anvill_trace_<hexaddr>(Memory *, ...)
Use a std::vector<llvm::Value *> to collect the arguments to pass to a function call.
Okay, so now the bitcode is instrumented. It gets optimized. What we should observe is that some of the loaded register values will now have constant values, some will have the arguments, and some will have other expressions. It'll be a big mix of things.
Our goal now is to report what we know, and then erase/remove all such function calls. We want to do this as a last step, just before removing the function that lets the memory pointer escape.
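Per the naming scheme above, generating the external trace symbol for each instruction is straightforward; a small sketch:

```python
def trace_func_name(inst_addr):
    # Name of the external, variadic trace function for the
    # instruction at inst_addr, following __anvill_trace_<hexaddr>.
    return "__anvill_trace_{:x}".format(inst_addr)
```

The reporting pass can later recover the instruction address by parsing the hex suffix back out of the callee name.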