
lifting-bits / anvill


anvill forges beautiful LLVM bitcode out of raw machine code

License: GNU Affero General Public License v3.0

CMake 1.13% C++ 27.12% Python 7.53% Dockerfile 0.12% Shell 0.91% C 0.12% LLVM 63.00% SMT 0.06%
decompiler llvm remill

anvill's People

Contributors

2over12, aaronyoo, alessandrogario, artemdinaburg, azeezah, carsonharmon, dependabot[bot], ekilmer, frabert, kumarak, maxammann, ninja3047, oldsj, pgoodman, sschriner, surovic, tetsuo-cpp

anvill's Issues

Verify bitcode can pass `llvm-dis` in tests

Sometimes we generate bitcode that does not pass the verifier.
As a sanity check, let's ensure that we can always disassemble the bitcode via llvm-dis as part of our test suite.
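A minimal sketch of such a check, assuming the test suite can shell out to an `llvm-dis` binary on the PATH. The helper name `passes_llvm_dis` and the `tool` parameter are hypothetical and not part of Anvill; the only LLVM-specific detail used is `llvm-dis <file> -o -`.

```python
import subprocess
from typing import Sequence


def passes_llvm_dis(bitcode_path: str, tool: Sequence[str] = ("llvm-dis",)) -> bool:
    """Return True if the disassembler exits cleanly on the given bitcode file.

    `tool` is a hypothetical hook so tests can point at a specific llvm-dis
    build (or a stand-in command).
    """
    # "-o -" sends the textual IR to stdout, which is discarded here.
    cmd = [*tool, bitcode_path, "-o", "-"]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    return result.returncode == 0
```

A test could then assert `passes_llvm_dis(path)` for every bitcode file the suite produces.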

Binja.py Byte/Instruction Metadata Tagging

We want to be able to mark certain instructions and stack locations with attributes.

While talking with Peter, he suggested that these two features would be a useful place to start:

  • Mark and discover operands/stack locations that use pointer operands
  • Mark and discover jump tables (if you can)

Clang segfaults on Anvill-generated bitcode

So it looks like Clang segfaults on bitcode generated by Anvill :). The current suspicion is that it is caused by inline assembly we generate to mark capture of registers not in the specification.

Proposing the following fixes

  1. Gate the generation of inline assembly like:
  %9 = getelementptr inbounds %struct.State, %struct.State* %8, i32 0, i32 6, i32 1, i32 0, i32 0, !remill_register !5
  %10 = call i64 asm sideeffect "# read register RAX", "=r"()
  store i64 %10, i64* %9, align 8
  %11 = getelementptr inbounds %str

via a feature flag.

  2. Implement a proper fix for capturing those registers. The ideal solution, from Slack:
really, the most accurate way to lift this would be two functions:
the first function would have the naked attribute; it'd contain only the inline assembly, read out the "missing" dependent registers, and pass them as explicit args to the second function, along with the already stated args
  3. Submit a bug report to Clang since, no matter how awful the bitcode, Clang shouldn't segfault.

Add an address_space field into the memory range specifications

The idea is that, if we are using a disassembler such as IDA Pro or Binary Ninja to generate the spec, then we will be giving concrete addresses for stuff in .tbss, .tls, and other TLS-related sections. We should be able to add on an address space to memory in those ranges, that is carried through into the LLVM side.

Upstream and downstream type propagation

Anvill does a limited amount of what could be charitably described as type propagation. It is mostly centred around the GetPointer function, and what it invokes. The merging of #62 means that, besides the address operands to memory read/write intrinsics, we have another jumping off point for type information.

The goal of upstream and downstream type propagation is to lift low-level operations to a slightly higher level. Here are some examples:

Problem examples

Downstream rewriting

Suppose we have used hinted type information to recover the following:

%y_ptr = __anvill_type_...(%y_orig)
%y = ptrtoint %y_ptr
%x = add %y, 8

Then our goal is to produce the following:

%y_ptr = __anvill_type_...(%y_orig)
%x_ptr = getelementptr %y_ptr, 0, N
%x = ptrtoint %x_ptr  ; Add to work list for processing

What is gained by this transformation, what is N, and how is this an improvement? This depends on the pointer type returned by __anvill_type_.... If it's a pointer to a structure, then N might be the index of one of the elements of the structure, so long as there is an element in that structure at byte offset 8. If it is a pointer to pointer, then N might be 1 on a 64-bit architecture, or 2 on a 32-bit architecture, i.e. 8 / sizeof(void *), telling us what index of an array we're accessing. For non-pointer types we can do similar arithmetic to try to convert the byte offset 8 into an array index. When this isn't possible, we can bitcast to a type and convert to GEPs, if we can get some better downstream type information.
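The arithmetic for N described above can be sketched as follows. This is only an illustration in Python rather than against LLVM's type API; the helper names and the idea of passing in precomputed field offsets are assumptions.

```python
from typing import Optional, Sequence


def struct_field_index(field_offsets: Sequence[int], byte_offset: int) -> Optional[int]:
    """Index of the struct element that starts exactly at byte_offset, if any."""
    for index, offset in enumerate(field_offsets):
        if offset == byte_offset:
            return index
    return None


def array_index(byte_offset: int, element_size: int) -> Optional[int]:
    """Array index for byte_offset, or None if the offset is not element-aligned."""
    if element_size <= 0 or byte_offset % element_size != 0:
        return None
    return byte_offset // element_size
```

For the `add %y, 8` example, `array_index(8, 8)` gives 1 for a pointer-to-pointer on a 64-bit target and `array_index(8, 4)` gives 2 on a 32-bit target. A `None` result corresponds to the fallback case where a bitcast is needed.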

Up-and-around rewriting

In x86 especially, the following pattern is common in function prologues:

push eax
push ebx
...
push ...

In bitcode, this might manifest as the following:

%esp = load %state, 0, ...  ; state->gpr.rsp.dword
%eax = load %state, 0, ... ; state->gpr.rax.dword
%ebx = load %state, 0, ... ; state->gpr.rbx.dword
...
%esp_less_4 = sub %esp, 4
%memory.0 = __remill_write_memory_32(%memory, %esp_less_4, %eax)
%esp_less_8 = sub %esp, 8
%memory.1 = __remill_write_memory_32(%memory.0, %esp_less_8, %ebx)
...
store %esp_less_N, ... ; state->gpr.rsp.dword -= N

From the perspective of GetPointer, we can observe that %esp_less_4 and %esp_less_8 both need to be i32 *. These two values share a common data dependency at their root, and a common base type. Ideally, we'd like to be able to treat %esp as an i32 * so that we can use getelementptr with negative indices to index into the stack, and then treat the prologue like writing into elements of an array. This can be done locally with an aggressive algorithm, or perhaps via some more global means.
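As a rough illustration of the "negative indices" idea, here is a sketch that turns byte displacements from a shared base (such as %esp) into element indices. The function name is hypothetical, and a real pass would work on LLVM values rather than plain integers.

```python
from typing import Dict, List, Optional


def stack_slot_indices(displacements: List[int], element_size: int = 4) -> Optional[Dict[int, int]]:
    """Map each byte displacement from the base to a GEP-style element index.

    Returns None when some displacement is not a multiple of element_size,
    meaning the base cannot be treated as a plain array of that element type.
    """
    if any(d % element_size != 0 for d in displacements):
        return None
    return {d: d // element_size for d in displacements}
```

With the prologue above, the displacements -4 and -8 map to indices -1 and -2 of an i32 array rooted at %esp.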

Other examples

In practice, I have just looked at bitcode, looked for integer arithmetic that eventually leads to pointer casts, and tried to re-imagine how they could be lifted to a slightly higher level of abstraction.

What is needed

What is needed to work toward a solution to this problem? The current state of the codebase is just a mishmash of complex, recursive functions that look for fixed patterns -- it's not extensible. I don't know what the right solution is just yet. In theory, if we have c++ codegen for dr. lojekyll, then maaaaybe that could be useful, but that might be trying to fit a square peg into a round hole. If we had MLIR as an intermediate representation then I'd say we should use its transform infrastructure. I think we should also look into how people have tried to automate instcombine rules. There's probably a whole lot of work here.

Thus, we have two problems:

  • How can we effectively do these transforms?
  • What are the transforms that we want to do?

Describe the JSON Spec

To have other tools understand and parse the JSON spec, we should describe what is in it, and what the values mean.

Handle varargs functions, specifically printf and scanf and friends

Right now, varargs functions are detected as varargs, but we only emit the formal parameters and not the variable ones.

As an example, we see:

unsigned int printf(unsigned char *arg0, ...);
...
unsigned int val6 = printf(data_4007c1);

Let's handle varargs functions better. For now, it is fine if it only covers *printf and *scanf and handles static strings. This may be some kind of specific pass or set of heuristics, as long as it is identified as such and not commingled with other code.
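One possible heuristic for the static-string case is to parse the format string and derive the kinds of the variable arguments from its conversion specifiers. The sketch below is an assumption about how such a pass could start, not existing Anvill code; the coarse kinds ("int", "float", "string", "pointer") are made up for illustration, and it only covers printf-style formats.

```python
import re
from typing import List

# Groups: flags, width, precision, length modifier, conversion character.
_SPEC = re.compile(
    r"%([-+ #0]*)(\*|\d+)?(?:\.(\*|\d*))?(hh|h|ll|l|L|j|z|t)?([diouxXeEfFgGaAcspn%])"
)

_KIND_BY_CONVERSION = {
    **{c: "int" for c in "diouxXc"},
    **{c: "float" for c in "eEfFgGaA"},
    "s": "string",
    "p": "pointer",
    "n": "pointer",
}


def printf_vararg_kinds(fmt: str) -> List[str]:
    """Return a coarse kind for each variadic argument a printf format consumes."""
    kinds: List[str] = []
    for _flags, width, precision, _length, conv in _SPEC.findall(fmt):
        if conv == "%":
            continue  # "%%" is a literal percent sign and consumes no argument
        if width == "*":
            kinds.append("int")  # a "*" width is read from an int argument
        if precision == "*":
            kinds.append("int")  # a "*" precision is read from an int argument
        kinds.append(_KIND_BY_CONVERSION[conv])
    return kinds
```

The resulting list could then be used to extend the call site with the matching number and kinds of arguments.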

Distinguish between signed and unsigned integers

We want to be able to distinguish between unsigned and signed integers when translating an LLVM type to the spec. This is nontrivial because LLVM's integer types carry no notion of signedness.
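Since the type itself says nothing, one option is to guess signedness from how a value is used. The sketch below is a hypothetical voting heuristic over operation names given as strings. It does not use the LLVM API, and the choice of which operations count as hints is an assumption.

```python
from typing import Iterable

# Operations whose semantics depend on a signed or unsigned interpretation.
SIGNED_HINTS = {"sext", "sdiv", "srem", "ashr",
                "icmp slt", "icmp sle", "icmp sgt", "icmp sge"}
UNSIGNED_HINTS = {"zext", "udiv", "urem",
                  "icmp ult", "icmp ule", "icmp ugt", "icmp uge"}


def guess_signedness(uses: Iterable[str]) -> str:
    """Return "signed", "unsigned", or "unknown" based on a majority of hints."""
    uses = list(uses)
    signed = sum(1 for use in uses if use in SIGNED_HINTS)
    unsigned = sum(1 for use in uses if use in UNSIGNED_HINTS)
    if signed > unsigned:
        return "signed"
    if unsigned > signed:
        return "unsigned"
    return "unknown"
```

Ties and values with no hinting uses stay "unknown", so the caller can fall back to whatever default the spec uses today.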

Binja frontend attempts to get value of LLIL virtual register

Given a binary compiled from tests/branch.c with clang-9.0, anvill's binja frontend crashes with:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/__main__.py", line 77, in <module>
  File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/__main__.py", line 66, in main
  File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/program.py", line 97, in add_function_definition
  File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/binja.py", line 348, in visit
  File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/binja.py", line 398, in _extract_types
  File "/home/my_user/.local/lib/python3.8/site-packages/anvill-1.0-py3.8.egg/anvill/binja.py", line 405, in _extract_types
  File "/home/my_user/git/binaryninja/python/binaryninja/lowlevelil.py", line 1078, in get_reg_value
    reg = self._function.arch.get_reg_index(reg)
  File "/home/my_user/git/binaryninja/python/binaryninja/architecture.py", line 1564, in get_reg_index
    return self.regs[reg].index
KeyError: 'temp0'

This is likely due to binaryninja.LowLevelILInstruction.get_reg_value() being called on

temp0.q = divs.dp.q(rdx:rax, rcx)

where temp0 is a virtual register as opposed to a real one like rax or al.
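A possible guard, sketched under the assumption (taken from the traceback above) that the architecture object exposes a `regs` mapping keyed by register name and that the IL instruction has `get_reg_value`. The helper names are hypothetical, and the stand-in objects exist only so the sketch runs without Binary Ninja installed.

```python
from types import SimpleNamespace


def is_architectural_register(arch, reg_name: str) -> bool:
    """True if reg_name is a real register known to the architecture."""
    return reg_name in arch.regs


def safe_reg_value(il_instr, arch, reg_name: str, default=None):
    """Query the register value only for real registers; skip temporaries like temp0."""
    if not is_architectural_register(arch, reg_name):
        return default
    return il_instr.get_reg_value(reg_name)


# Stand-ins for the Binary Ninja objects (assumptions for illustration only).
fake_arch = SimpleNamespace(regs={"rax": object(), "rcx": object()})
fake_instr = SimpleNamespace(get_reg_value=lambda reg: f"value-of-{reg}")
```

With this guard, `safe_reg_value(fake_instr, fake_arch, "temp0")` returns the default instead of raising KeyError.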

No rule to make target 'install'

Following the instructions for building on Linux, and after make finishes with no errors:

$ sudo make install
make: *** No rule to make target 'install'.  Stop.

Update README.md

Things that would have been nice to have while getting ANVILL up and running

  • Table of dependencies similar to remill, rellic, ...
  • Step by step build process
  • Brief usage example

Adding commits to a PR aborts stale workflows

When pushing new commits to an existing PR, any running or queued workflow for the same PR should be aborted. This is currently causing macOS builds to get stuck (due to the org-wide limit of 5 concurrent macOS workers).

ARM64 binaries make the frontend die on invalid references

ARM64 binaries often reference data via a combination of adrp and add (or similar instructions). The initial reference via adrp is (correctly) detected as not present in the binary, and then Anvill (incorrectly) bails out in terror and abandons lifting.

Anvill should not abandon lifting, and adrp/add combination references should work in bitcode generated from ARM64 binaries.

Example output:

Traceback (most recent call last):
  File "/home/artem/.local/lib/python3.9/site-packages/anvill-1.0-py3.9.egg/anvill/program.py", line 116, in try_add_referenced_entity
  File "/home/artem/.local/lib/python3.9/site-packages/anvill-1.0-py3.9.egg/anvill/program.py", line 90, in add_function_definition
  File "/home/artem/.local/lib/python3.9/site-packages/anvill-1.0-py3.9.egg/anvill/program.py", line 45, in get_function
  File "/home/artem/.local/lib/python3.9/site-packages/anvill-1.0-py3.9.egg/anvill/binja.py", line 484, in get_function_impl
anvill.exc.InvalidFunctionException: No function defined at or containing address 410000
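For reference, the address that an adrp/add pair ultimately refers to can be computed as in the sketch below. The function names are hypothetical, and the immediates are assumed to be already decoded (with the adrp immediate sign-extended), so this is not a disassembler.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def adrp_target(pc: int, page_imm: int) -> int:
    """Page address produced by ADRP: the 4 KiB page of pc plus a signed page count."""
    return ((pc & ~0xFFF) + (page_imm << 12)) & MASK64


def adrp_add_target(pc: int, page_imm: int, add_imm: int) -> int:
    """Final address after the follow-up ADD applies its offset within the page."""
    return (adrp_target(pc, page_imm) + add_imm) & MASK64
```

The intermediate page address alone may not correspond to anything in the binary, which is why only the combined result should be treated as a reference.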

Empty memory dump in JSON spec for IDA Pro

Python>import anvill
Python>p = anvill.get_program()
Python>p.add_function_definition(0x100000F40)
True
Python>p.add_function_definition(0x100000F60)
True
Python>p.add_function_definition(0x100000F70)
True
Python>p.add_function_definition(0x100000F80)
True
Python>p.proto()
{"os": "macos", "functions": [{"return_stack_pointer": {"register": "RSP", "type": "L", "offset": 8}, "name": "_main", "parameters": [{"register": "EDI", "type": "i", "name": "argc"}, {"register": "RSI", "type": "**b", "name": "argv"}, {"register": "RDX", "type": "**b", "name": "envp"}], "return_address": {"type": "L", "memory": {"register": "RSP", "offset": 0}}, "return_values": [{"register": "EAX", "type": "i"}], "address": 4294971200}, {"return_stack_pointer": {"register": "RSP", "type": "L", "offset": 8}, "name": "_voidptr_function", "parameters": [], "return_address": {"type": "L", "memory": {"register": "RSP", "offset": 0}}, "return_values": [{"register": "RAX", "type": "l"}], "address": 4294971232}, {"return_stack_pointer": {"register": "RSP", "type": "L", "offset": 8}, "name": "_uint64_function", "parameters": [], "return_address": {"type": "L", "memory": {"register": "RSP", "offset": 0}}, "return_values": [{"register": "RAX", "type": "l"}], "address": 4294971264}, {"return_stack_pointer": {"register": "RSP", "type": "L", "offset": 8}, "name": "_intptr_function", "parameters": [], "return_address": {"type": "L", "memory": {"register": "RSP", "offset": 0}}, "return_values": [{"register": "RAX", "type": "l"}], "address": 4294971248}], "arch": "amd64", "stack": {"size": 24576, "start_offset": 4096, "address": 87960930217984}, "memory": []}

hello.out.zip
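A quick way to detect this situation in a test is to check that every function address in the spec is covered by some memory range. The sketch below assumes only the fields visible in the JSON above ("functions", "memory", "address", and hex-encoded "data"); the helper name is hypothetical.

```python
from typing import Any, Dict, List


def unmapped_functions(spec: Dict[str, Any]) -> List[int]:
    """Return addresses of functions that no memory range in the spec covers."""
    ranges = []
    for mem in spec.get("memory", []):
        size = len(mem["data"]) // 2  # hex string: two characters per byte
        ranges.append((mem["address"], mem["address"] + size))
    return [
        func["address"]
        for func in spec.get("functions", [])
        if not any(lo <= func["address"] < hi for lo, hi in ranges)
    ]
```

For the spec above, with `"memory": []`, all four function addresses would be reported.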

Binja.py TypeError: Str/Bytes

When running Anvill on the solo.elf binary from the solokey firmware, I get a Python type error:

    prog = anvill.get_program(input_bin)
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/binja.py", line 475, in get_program
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/binja.py", line 386, in __init__
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf.py", line 169, in __init__
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf.py", line 184, in _load_types
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf.py", line 179, in _process_dies
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf.py", line 177, in _process_dies
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf_type.py", line 287, in _process_types
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf_type.py", line 259, in _process_indirect_types
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf_type.py", line 291, in _process_types
  File "/root/.local/lib/python3.6/site-packages/anvill-1.0-py3.6.egg/anvill/dwarf_type.py", line 308, in _process_pointer_types
TypeError: can't concat str to bytes

I will hunt this down and make a fix
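One common way to avoid this class of error is to normalize names to str before concatenating. The sketch below is a guess at the shape of such a fix, not the actual patch; both helper names are hypothetical.

```python
from typing import Union


def as_text(value: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Decode bytes to str; pass str through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return value


def pointer_type_name(pointee_name: Union[str, bytes]) -> str:
    """Build a pointer type name without mixing str and bytes."""
    return as_text(pointee_name) + " *"
```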

Python module errors

After successfully building and installing anvill (on both macOS and Linux), 'python3 -m anvill' results in a 'No module named anvill' error.

After manually installing the anvill Python module, 'python3 -m anvill' gives the following error:

NotImplementedError: Could not find either IDA or Binary Ninja APIs

So it doesn't seem usable from the CLI as per the README.

Trying to use the IDA plugin from within IDA results in:

AttributeError: module 'anvill' has no attribute 'get_program'

Documentation

Some documentation on how this relates to mcsema (successor? stripped-down version?), and on how to build and use it, would be nice.

Recovering global variable types

Using binja, we're sometimes missing type information for global variables. In this case, binja assigns them a VoidType. When creating a JSON spec, Anvill translates this into a 1-byte byte array, which then gets translated into a [1 x i8] type.

We should try to improve this via heuristics.

Eliminate "fake stack" address range

The key idea is this: instead of storing a hard-coded constant value into the State's stack pointer, we'll have a global, __anvill_stack_pointer, create a ptrtoint constant expression of it, and store that. Then we can investigate all uses of __anvill_stack_pointer and the subsequent integer displacements of those uses, and implement some folding over those.

ELF external thunk recognition

The ELF thunk recognition code of McSema should be copied and adapted for Anvill so that if a function references an ELF thunk, then we go and follow through and find the referenced external and use its name in the prototype, rather than the name of the function itself, which may be prefixed with junk.

That is, instead of a prototype of this function having the name _signal or .signal:
image

We should instead follow through to the .plt segment...

image

And take the info from here:

image

The relevant code to adapt from McSema is:

https://github.com/lifting-bits/mcsema/blob/master/tools/mcsema_disass/ida7/get_cfg.py#L334-L466

State structure isn't eliminated

Also, basic block addresses can be observed. Also, this function uses a retn, which is not recognized correctly in IDA.

{"functions": [{"return_stack_pointer": {"register": "ESP", "type": "I", "offset": 4}, "return_values": [{"register": "EAX", "type": "i"}], "return_address": {"type": "I", "memory": {"register": "ESP", "offset": 0}}, "parameters": [{"type": "i", "memory": {"register": "ESP", "offset": 4}}, {"type": "i", "memory": {"register": "ESP", "offset": 8}}, {"type": "i", "memory": {"register": "ESP", "offset": 12}}], "address": 6622921}, {"return_stack_pointer": {"register": "ESP", "type": "I", "offset": 4}, "return_values": [{"register": "EAX", "type": "i"}], "return_address": {"type": "I", "memory": {"register": "ESP", "offset": 0}}, "parameters": [{"type": "i", "memory": {"register": "ESP", "offset": 4}}, {"type": "i", "memory": {"register": "ESP", "offset": 8}}, {"type": "i", "memory": {"register": "ESP", "offset": 12}}, {"type": "i", "memory": {"register": "ESP", "offset": 16}}, {"type": "i", "memory": {"register": "ESP", "offset": 20}}, {"type": "i", "memory": {"register": "ESP", "offset": 24}}], "address": 6595488}], "arch": "x86", "variables": [{"type": "o", "address": 7436608}], "memory": [{"is_writeable": false, "data": 
"9081ec0c01000089ac240001000089b4240401000089bc24080100008bec83e4f08b8d240100008b85100100008b952001000083f9047d428bb51801000083f9000f84d901000083ec0c8904248974240489542408e8d06a00000385140100008bb51801000003b51c01000089b518010000ff8d24010000ebc7f30f1002f30f104a04f30f105208f30f105a0c0fc6c0000fc6c9000fc6d2000fc6db00f30f106210f30f106a14f30f107218f30f107a1c0fc6e4000fc6ed000fc6f6000fc6ff000f2904240f294c24100f295424200f295c24300f296424400f296c24500f297424600f297c2470f30f104230f30f104a34f30f105238f30f105a3c0fc6c0000fc6c9000fc6d2000fc6db000f298424c00000000f298c24d00000000f299424e00000000f299c24f00000008b95180100008bb51c0100008bbd1401000083f9040f8cb10000000f12020f120c168d14720f14c10f12220f122c160f14e50f28c80f16c40f12e18d14720f28c80f28e80f5904240f594c24100f596c24300f28d40f28fc0f596424400f595424500f597c24700f58c40f58ca0f58ef0f2815407971000f588424c00000000f588c24d00000000f58ac24f00000000f53e50f59ec0f5cd50f59e20f59c40f59cc0f28d00f14c10f15d10f13000f12c00f1304078d04780f13100f12d20f1314078d047883e904e946ffffff83f900742383e9040faff90faff103d603c7b9040000008bb51c0100008bbd14010000e91effffff8b85100100008be58bac24000100008bb424040100008bbc240801000081c40c010000c218008da424000000008da424000000008da424000000008d9b00000000", "is_executable": true, "is_readable": true, "address": 6595487}], "os": "windows", "stack": {"size": 24576, "start_offset": 4096, "address": 1344647168}}

Extend specification to include relocations

This feature will exist to support data-to-code and data-to-data references. This will be an array of entries, kind of like:

[
  [<addr of relocation>,
   <size of value, in bytes>,
   <op str>,
   <expr values>],
  // .. more entries
]

The op string will be a short string of characters that is run by a stack-based interpreter, one operation per character. The starting stack state is provided by the sequence of expression values that follows, which must be integral constants.
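To make the idea concrete, here is a toy interpreter for such entries. The issue does not define the operation characters, so the "+" and "-" opcodes, the masking to the value size, and the function name are all assumptions for illustration.

```python
from typing import List


def eval_relocation(size: int, ops: str, values: List[int]) -> int:
    """Evaluate a relocation expression; the expression values form the initial stack."""
    stack = list(values)
    for op in ops:
        if len(stack) < 2:
            raise ValueError(f"stack underflow at op {op!r}")
        rhs = stack.pop()
        lhs = stack.pop()
        if op == "+":      # hypothetical opcode: add
            stack.append(lhs + rhs)
        elif op == "-":    # hypothetical opcode: subtract
            stack.append(lhs - rhs)
        else:
            raise ValueError(f"unknown op {op!r}")
    if len(stack) != 1:
        raise ValueError("expression must leave exactly one value on the stack")
    # Truncate the result to the relocated value's size in bytes.
    return stack[0] & ((1 << (8 * size)) - 1)
```

For example, an 8-byte entry with op string "+" and values [0x1000, 0x20] evaluates to 0x1020.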

Convert ptrtoint+phi?+inttoptr into getelementptrs and bitcasts

Bitcode

  %2 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 8
  %3 = ptrtoint i8* %2 to i64
  ...
  %27 = phi i64 [ %33, %32 ], [ %3, %0 ]
  ...
  %33 = ptrtoint %main.frame_type* %1 to i64
  ...
  %37 = inttoptr i64 %27 to i64*
; Function Attrs: noinline nounwind ssp
define i32 @main(i32 %argc, i8** %argv) local_unnamed_addr #0 {
  %1 = alloca %main.frame_type, align 8
  %2 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 8
  %3 = ptrtoint i8* %2 to i64
  %4 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 16
  %5 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 28
  %6 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 32
  %7 = getelementptr inbounds %main.frame_type, %main.frame_type* %1, i64 0, i32 0, i64 40
  %8 = tail call i8* @llvm.returnaddress(i32 0)
  %9 = ptrtoint i8* %8 to i64
  %10 = bitcast i8* %7 to i64*
  store i64 %9, i64* %10, align 8
  %11 = bitcast i8* %6 to i64*
  store i64 0, i64* %11, align 8
  %12 = bitcast i8* %5 to i32*
  store i32 %argc, i32* %12, align 4
  %13 = ptrtoint i8** %argv to i64
  %14 = bitcast i8* %4 to i64*
  store i64 %13, i64* %14, align 8
  %15 = add i32 %argc, -2
  %16 = icmp eq i32 %15, 0
  %17 = lshr i32 %15, 31
  %18 = trunc i32 %17 to i8
  %19 = lshr i32 %argc, 31
  %20 = xor i32 %17, %19
  %21 = add nuw nsw i32 %20, %19
  %22 = icmp eq i32 %21, 2
  %23 = icmp ne i8 %18, 0
  %24 = xor i1 %23, %22
  %25 = or i1 %16, %24
  br i1 %25, label %26, label %32

26:                                               ; preds = %32, %0
  %27 = phi i64 [ %33, %32 ], [ %3, %0 ]
  %28 = icmp ne i32 %15, 0
  %29 = icmp eq i8 %18, 0
  %30 = xor i1 %29, %22
  %31 = and i1 %28, %30
  br i1 %31, label %39, label %36

32:                                               ; preds = %0
  %33 = ptrtoint %main.frame_type* %1 to i64
  %34 = bitcast i8* %2 to i64*
  store i64 48, i64* %34, align 8
  %35 = tail call i32 @foo() #3, !noalias !0
  br label %26

36:                                               ; preds = %26
  %37 = inttoptr i64 %27 to i64*
  store i64 59, i64* %37, align 8
  %38 = tail call i32 @bar() #3, !noalias !3
  %.pre = load i64, i64* %10, align 8
  br label %39

39:                                               ; preds = %26, %36
  %40 = phi i64 [ %9, %26 ], [ %.pre, %36 ]
  %41 = tail call %struct.Memory* @__remill_function_return(%struct.State* nonnull undef, i64 %40, %struct.Memory* null) #3
  ret i32 0
}

JSON

{ 
  "arch":"amd64",
  "variables":[ 

  ],
  "functions":[ 
    { 
      "name":"foo",
      "address":0,
      "return_values":[ 
        { 
          "register":"EAX",
          "type":"i"
        }
      ],
      "return_stack_pointer":{ 
        "register":"RSP",
        "type":"L"
      },
      "parameters":[ 

      ],
      "return_address":{ 
        "memory":{ 
          "register":"RSP"
        },
        "type":"L"
      }
    },
    { 
      "name":"bar",
      "address":11,
      "return_values":[ 
        { 
          "register":"EAX",
          "type":"i"
        }
      ],
      "return_stack_pointer":{ 
        "register":"RSP",
        "type":"L"
      },
      "parameters":[ 

      ],
      "return_address":{ 
        "memory":{ 
          "register":"RSP"
        },
        "type":"L"
      }
    },
    { 
      "name":"main",
      "address":22,
      "return_values":[ 
        { 
          "register":"EAX",
          "type":"i"
        }
      ],
      "return_stack_pointer":{ 
        "register":"RSP",
        "type":"L"
      },
      "parameters":[ 
        { 
          "name":"argc",
          "type":"i",
          "register":"EDI"
        },
        { 
          "name":"argv",
          "type":"**b",
          "register":"RSI"
        }
      ],
      "return_address":{ 
        "memory":{ 
          "register":"RSP"
        },
        "type":"L"
      }
    }
  ],
  "os":"linux",
  "stack":{ 
    "size":24576,
    "start_offset":4096,
    "address":87960955441152
  },
  "memory":[ 
    { 
      "is_writeable":false,
      "data":"554889E5B8010000005DC3554889E5B8020000005DC3554889E54883EC10897DFC488975F0837DFC027E05E8D0FFFFFF837DFC027F05E8D0FFFFFFB800000000C9C3",
      "is_executable":true,
      "is_readable":true,
      "address":0
    }
  ]
}

ARMv7 and Thumb2 Architecture Descriptions

Since we are working on ARMv7 and eventually Thumb2 support in Remill, we should include the architecture descriptions for these in Anvill to begin testing Anvill on real programs.

There is enough ARMv7 support that we should be able to start testing Anvill in its current state.

Incorrect JSON-based Lift

I have encountered an issue where anvill-decompile-json behaves incorrectly.

C++ code:

struct Dummy {
    long i;
    long j;
    long k;
};

void dummy_function(struct Dummy& dummy, long i, long j, long k) {
    dummy.i = i; 
    dummy.j = j; 
    dummy.k = k;
}

Bitcode from LLVM:

; Function Attrs: noinline norecurse nounwind ssp uwtable writeonly
define void @_Z14dummy_functionR5Dummylll(%struct.Dummy* nocapture dereferenceable(24), i64, i64, i64) local_unnamed_addr #5 !dbg !2476 {
  call void @llvm.dbg.value(metadata %struct.Dummy* %0, metadata !2486, metadata !DIExpression()), !dbg !2490
  call void @llvm.dbg.value(metadata i64 %1, metadata !2487, metadata !DIExpression()), !dbg !2491
  call void @llvm.dbg.value(metadata i64 %2, metadata !2488, metadata !DIExpression()), !dbg !2492
  call void @llvm.dbg.value(metadata i64 %3, metadata !2489, metadata !DIExpression()), !dbg !2493
  %5 = getelementptr inbounds %struct.Dummy, %struct.Dummy* %0, i64 0, i32 0, !dbg !2494
  store i64 %1, i64* %5, align 8, !dbg !2495, !tbaa !2496
  %6 = getelementptr inbounds %struct.Dummy, %struct.Dummy* %0, i64 0, i32 1, !dbg !2498
  store i64 %2, i64* %6, align 8, !dbg !2499, !tbaa !2500
  %7 = getelementptr inbounds %struct.Dummy, %struct.Dummy* %0, i64 0, i32 2, !dbg !2501
  store i64 %3, i64* %7, align 8, !dbg !2502, !tbaa !2503
  ret void, !dbg !2504
}

JSON Specification:

{
    "arch": "amd64",
    "functions": [
        {
            "name": "dummy_function(Dummy&, long, long, long)",
            "address": 4096,
            "parameters": [
                {
                    "name": "dummy",
                    "register": "RDI",
                    "type": "*{lll}"
                },
                {
                    "name": "i",
                    "register": "RSI",
                    "type": "l"
                },
                {
                    "name": "j",
                    "register": "RDX",
                    "type": "l"
                },
                {
                    "name": "k",
                    "register": "RCX",
                    "type": "l"
                }
            ],
            "return_stack_pointer": {
                "offset": 8,
                "register": "RSP",
                "type": "l"
            },
            "return_address": {
                "memory": {
                    "register": "RSP"
                },
                "type": "L"
            },
            "return_values": []
        }
    ],
    "memory": [
        {
            "address": 4096,
            "data": "554889E54889374889570848894F105DC3",
            "is_readable": true,
            "is_executable": true
        }
    ],
    "stack": {
        "address": 10000,
        "size": 4176,
        "start_offset": 4096
    },
    "os": "macos"
}

Lifted Bitcode

; Function Attrs: noinline nounwind ssp
define void @"dummy_function(Dummy&, long, long, long)"({ i64, i64, i64 }* nocapture %dummy, i64 %i, i64 %j, i64 %k) local_unnamed_addr #0 {
  %1 = tail call i8* @llvm.returnaddress(i32 0)
  %2 = ptrtoint i8* %1 to i64
  %3 = getelementptr inbounds { i64, i64, i64 }, { i64, i64, i64 }* %dummy, i64 0, i32 0
  store i64 %k, i64* %3, align 8
  %4 = tail call %struct.Memory* @__remill_function_return(%struct.State* nonnull undef, i64 %2, %struct.Memory* null) #3
  ret void
}

The lifted bitcode is wrong. One easy way to tell is that there are no offsets and no references to %i or %j. But basically, it is not handling assignment to struct elements correctly.

Code Documentation & Comments

Go through Anvill code and document, at a high level, what it should be doing & why, to capture the information somewhere in written form.

Remove Python 2.x support

We're already adding type hints, and the Python 2.7 EOL should mean that we don't even bother trying to run a Python 2 setup.py.

Support N:1 symbol:address mapping

So if the same address has multiple names, then the second call will declare a new version of the thing. For global variables, I think you'd want to use a GlobalAlias, so that all the "public" names can be aliases to an internal private variable. For functions, a similar thing could be done, where "public" versions tail-call the internal private version.

Originally posted by @pgoodman in #45 (comment)
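On the spec-producing side, the first step would be to group names by address so that one name becomes the canonical definition and the rest become aliases. A minimal sketch follows; the function name and the "first name wins" rule are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_aliases(symbols: Iterable[Tuple[int, str]]) -> Dict[int, Tuple[str, List[str]]]:
    """Map each address to (canonical name, [alias names]), keeping first-seen order."""
    by_address: Dict[int, List[str]] = defaultdict(list)
    for address, name in symbols:
        if name not in by_address[address]:
            by_address[address].append(name)
    return {addr: (names[0], names[1:]) for addr, names in by_address.items()}
```

Each alias list could then be emitted as GlobalAlias entries for variables, or as tail-calling wrappers for functions, as described above.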

Missed stack pointer reference

Bitcode

define i32 @main(i32 %argc, i8** %argv) local_unnamed_addr #0 {
  %1 = alloca { [128 x i8], i64, [8 x i8], i64, [8 x i8], i64, [16 x i8], i8**, [20 x i8], i32, [4 x i8], i32, [4 x i8], i32, [8 x i8], i64, [16 x i8], i8*, [8 x i8], i64 }, align 8
  %2 = ptrtoint { [128 x i8], i64, [8 x i8], i64, [8 x i8], i64, [16 x i8], i8**, [20 x i8], i32, [4 x i8], i32, [4 x i8], i32, [8 x i8], i64, [16 x i8], i8*, [8 x i8], i64 }* %1 to i64
  ...
25:                                               ; preds = %31, %0
  %26 = phi i64 [ 87960955445208, %31 ], [ %2, %0 ]
  ...
34:                                               ; preds = %25
  %35 = inttoptr i64 %26 to i64*
  store i64 59, i64* %35, align 8
  ...

Specification

{ 
  "arch":"amd64",
  "variables":[ 

  ],
  "functions":[ 
    { 
      "name":"foo",
      "address":0,
      "return_values":[ 
        { 
          "register":"EAX",
          "type":"i"
        }
      ],
      "return_stack_pointer":{ 
        "register":"RSP",
        "type":"L"
      },
      "parameters":[ 

      ],
      "return_address":{ 
        "memory":{ 
          "register":"RSP"
        },
        "type":"L"
      }
    },
    { 
      "name":"bar",
      "address":11,
      "return_values":[ 
        { 
          "register":"EAX",
          "type":"i"
        }
      ],
      "return_stack_pointer":{ 
        "register":"RSP",
        "type":"L"
      },
      "parameters":[ 

      ],
      "return_address":{ 
        "memory":{ 
          "register":"RSP"
        },
        "type":"L"
      }
    },
    { 
      "name":"main",
      "address":22,
      "return_values":[ 
        { 
          "register":"EAX",
          "type":"i"
        }
      ],
      "return_stack_pointer":{ 
        "register":"RSP",
        "type":"L"
      },
      "parameters":[ 
        { 
          "name":"argc",
          "type":"i",
          "register":"EDI"
        },
        { 
          "name":"argv",
          "type":"**b",
          "register":"RSI"
        }
      ],
      "return_address":{ 
        "memory":{ 
          "register":"RSP"
        },
        "type":"L"
      }
    }
  ],
  "os":"linux",
  "stack":{ 
    "size":24576,
    "start_offset":4096,
    "address":87960955441152
  },
  "memory":[ 
    { 
      "is_writeable":false,
      "data":"554889E5B8010000005DC3554889E5B8020000005DC3554889E54883EC10897DFC488975F0837DFC027E05E8D0FFFFFF837DFC027F05E8D0FFFFFFB800000000C9C3",
      "is_executable":true,
      "is_readable":true,
      "address":0
    }
  ]
}

Assembly:

foo:
        push    rbp
        mov     rbp, rsp
        mov     eax, 1
        pop     rbp
        ret
bar:
        push    rbp
        mov     rbp, rsp
        mov     eax, 2
        pop     rbp
        ret
main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], edi
        mov     QWORD PTR [rbp-16], rsi
        cmp     DWORD PTR [rbp-4], 2
        jle     L6
        call    foo
L6:
        cmp     DWORD PTR [rbp-4], 2
        jg      L7
        call    bar
L7:
        mov     eax, 0
        leave
        ret
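
The spec and the assembly can be cross-checked against each other. The sketch below decodes the "data" bytes from the spec, confirms that each listed function address starts with `push rbp`, and resolves the two `call` instructions in `main`. The call offsets 43 and 54 are not in the spec; they were computed by hand from the instruction lengths, so treat them as an assumption.

```python
import struct

# Hex string copied from the "data" field of the spec above.
DATA = bytes.fromhex(
    "554889E5B8010000005DC3"
    "554889E5B8020000005DC3"
    "554889E54883EC10897DFC488975F0837DFC027E05E8D0FFFFFF"
    "837DFC027F05E8D0FFFFFFB800000000C9C3"
)
BASE = 0  # "address" of the memory range in the spec

# Function entry points, copied from the "functions" list.
FUNCTIONS = {"foo": 0, "bar": 11, "main": 22}

PUSH_RBP = 0x55    # one-byte encoding of `push rbp`
CALL_REL32 = 0xE8  # opcode of `call rel32`
CALL_SIZE = 5      # opcode byte + 32-bit displacement


def entry_bytes_ok():
    """Every listed function should begin with `push rbp` (0x55)."""
    return all(DATA[addr - BASE] == PUSH_RBP for addr in FUNCTIONS.values())


def call_target(call_addr):
    """Resolve the target of an `E8 rel32` call located at call_addr."""
    offset = call_addr - BASE
    assert DATA[offset] == CALL_REL32
    (rel,) = struct.unpack_from("<i", DATA, offset + 1)
    # The displacement is relative to the address of the next instruction.
    return call_addr + CALL_SIZE + rel


# Hand-computed offsets of the two calls inside main (assumption).
FIRST_CALL, SECOND_CALL = 43, 54
```

With these definitions, `call_target(FIRST_CALL)` resolves to the address of `foo` and `call_target(SECOND_CALL)` to the address of `bar`, matching the assembly listing.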

Lifted calls store incorrect return addresses into stack frame

When lifting function calls from machine code (MC) to LLVM IR, we emit code that stores the return address into the stack frame of the callee. This is needed for the callee to return properly to the caller. Because Anvill has to support the ABIs of several architectures, this requires special attention. In some cases Anvill currently stores an incorrect address into the callee's stack frame, which can later cause accesses or references to code memory to remain in the output bitcode.

The core issue seems to be how "registers" PC and NEXT_PC of Remill's State structure are initialized at the start of lifted functions and manipulated before and after function calls. PC and NEXT_PC can be found via uses of remill::kPCVariableName and remill::kNextPCVariableName.

McSema, which also uses Remill, doesn't suffer from these issues, so it can be used as a reference.
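
As a point of reference for what the lifted code should store, here is a minimal Python model of an x86-64 `call`. It is a sketch only: `ToyState` and `lift_call` are hypothetical names, not Remill's State structure or any anvill API. The one fact it relies on is that the return address is the address of the instruction following the call, which is the value NEXT_PC should hold at that point.

```python
from dataclasses import dataclass, field

ADDRESS_SIZE = 8  # bytes per return address on x86-64


@dataclass
class ToyState:
    """Hypothetical stand-in for the lifted register state; not Remill's State."""
    pc: int = 0
    next_pc: int = 0
    rsp: int = 0
    stack: dict = field(default_factory=dict)  # address -> stored 64-bit value


def lift_call(state, call_addr, call_size, target):
    """Model an x86-64 `call`: push the fall-through address, then jump."""
    state.pc = call_addr
    state.next_pc = call_addr + call_size   # instruction after the call
    state.rsp -= ADDRESS_SIZE               # make room on the stack
    state.stack[state.rsp] = state.next_pc  # return address seen by the callee
    state.pc = target                       # control transfers to the callee
    return state
```

For a 5-byte call at address 43, `lift_call(ToyState(rsp=0x1000), 43, 5, 0)` leaves 48 at the new top of stack. A lifted call that stores any other value there would show the symptom described in this issue.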

Hitting "Check failed: dest_size < 64" on binary

I'm attempting to run McSema on a reasonably large binary. After a number of hours (about 12 I believe) when using a machine equipped with enough RAM (256 GiB seems to be enough) I finally hit this point:

F20201031 04:22:21.503217     6 Analyze.cpp:481] Check failed: dest_size < 64 (64 vs. 64)
*** Check failure stack trace: ***
    @           0x9c6bcc  google::LogMessageFatal::~LogMessageFatal()
    @           0x70d856  anvill::XrefExprFolder::VisitSExt()
    @           0x5f76cb  mcsema::(anonymous namespace)::LowerXrefs()::$_1::operator()()
    @           0x5f2b37  mcsema::(anonymous namespace)::LowerXrefs()
    @           0x5f1ca4  mcsema::OptimizeModule()
    @           0x5ef329  mcsema::LiftCodeIntoModule()
    @           0x6057bc  main
    @     0x7f9be066d0b3  __libc_start_main
    @           0x5b71ae  _start

Unfortunately, I don't see any other context that might help identify where things are getting stuck :)

I used McSema at git master with IDA Pro v7.5.200728 and IDAPython with Python 3 on Windows for control flow reconstruction. I used the latest Docker image of McSema to perform the lifting phase, on an Ubuntu 20.04 VM (an Amazon r5.8xlarge). It OOMs on my local machine with 64 GiB of RAM, and also with 128 GiB, which is the maximum my local machine can handle, so I am unable to run it locally. I will probably switch to a better CPU platform in the future to reduce runtime, though.

The binary is an x86 Windows binary compiled with Microsoft Visual C++, not obfuscated, and with the accompanied PDB.

It seems like this means that somewhere along the line an LLVM sext instruction is created that sign-extends an integer, but for some reason a destination size of 64 bits is an error case for this particular code in anvill. Is this a bug, or perhaps missing functionality?
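
For context, a `sext` to a 64-bit destination is valid LLVM IR (for example `sext i32 to i64`), so the operation itself is well defined at that width. The sketch below is a plain Python model of sign extension, not the code in `anvill::XrefExprFolder::VisitSExt`; the function name and parameters are illustrative. It only shows that a destination width of 64 needs no special case when the masks are computed with arbitrary-precision integers. Whether the failing check guards a fixed-width shift in the C++ code is an assumption, not something the log confirms.

```python
def sign_extend(value, src_bits, dest_bits):
    """Sign-extend the low src_bits of value to dest_bits (unsigned result)."""
    if not 0 < src_bits <= dest_bits:
        raise ValueError("need 0 < src_bits <= dest_bits")
    value &= (1 << src_bits) - 1            # keep only the source bits
    sign_bit = 1 << (src_bits - 1)
    signed = (value ^ sign_bit) - sign_bit  # reinterpret as two's complement
    return signed & ((1 << dest_bits) - 1)  # wrap into the destination width
```

For instance, extending the 32-bit displacement `0xFFFFFFD0` to 64 bits gives `0xFFFFFFFFFFFFFFD0`.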

Test suite for Anvill

Let's make some basic tests that run on CI:

  1. Tests with already working JSON to just lift
  2. Tests to convert binaries to JSON, and then lift. We can use headless Binary Ninja to automate these.

The CI builds and tests against known dependencies

Currently, the CI downloads the latest version of cxx-common and links against the HEAD commit of the remill branch. This means that running the same CI job multiple times can produce different results as new commits land on branches or new releases of cxx-common are tagged.

Previously, the .remill_commit_id was being used to pin the remill version to a known commit/tag, but this seems to have been deprecated.

To fix this issue, we have to select a specific cxx-common version when we download it on the CI (instead of using latest), and either bring back the .remill_commit_id special file or add remill as a submodule.

Provenance of values in post-optimized bitcode

The goal of this issue is to instrument the bitcode pre-optimization in such a way that, post-optimization, we can reason about register values/types at specific points in time. The hope is that this will help us diagnose the provenance of "unusual values" that show up in the bitcode. Similarly, it should be feasible to pipe the output of this step back into disassembly, or even to use it to produce updated specs that include augmented type or value information.

The key idea is to create a printf-like function, declared as variadic, that reads the values of registers just before each instruction. These function calls will be chained together, taking and returning memory pointers. The functions will be declared but not defined, and thus external. Finally, they will be attributed as ReadNone so LLVM treats them as not touching global state, and thus not interfering with optimizations.

  • Create a command-line flag that will turn on this feature. Default is off.
  • Before lifting each instruction, but after type injection, create a function like: Memory *__anvill_trace_<hexaddr>(Memory *, ...)
  • Create a std::vector<llvm::Value *> to collect the arguments to pass to a function call.
  • Load the current memory pointer value, append to the arg vec.
  • For each register in the architecture: if the widest version of the register is itself, and if it has an integer type, then load the register's value, and append it to the argvec.
  • Inject a call, passing in the argvec.
  • Store the call (i.e. its return value) to the memory pointer.
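
The argument-collection steps above can be sketched in Python as a model of what the lifter would do; it does not build LLVM IR. `Reg`, `REGS`, and `build_trace_call` are hypothetical names, the register table is an illustrative subset rather than Remill's, and formatting `<hexaddr>` as lowercase hex without a `0x` prefix is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    """Hypothetical register description; not Remill's register type."""
    name: str
    widest: str       # name of the widest enclosing register
    is_integer: bool


# Illustrative subset of x86-64 registers (assumed table, not from anvill).
REGS = [
    Reg("RAX", "RAX", True),
    Reg("EAX", "RAX", True),
    Reg("RDI", "RDI", True),
    Reg("EDI", "RDI", True),
    Reg("ST0", "ST0", False),
]


def trace_function_name(inst_addr):
    """Name of the per-instruction trace function: __anvill_trace_<hexaddr>."""
    return f"__anvill_trace_{inst_addr:x}"


def build_trace_call(inst_addr, regs):
    """Return (callee name, argument names) for the call before one instruction."""
    args = ["memory"]  # the current memory pointer always comes first
    for reg in regs:
        # Skip sub-registers (EAX inside RAX) and non-integer registers.
        if reg.widest == reg.name and reg.is_integer:
            args.append(reg.name)
    return trace_function_name(inst_addr), args
```

With this table, `build_trace_call(0x16, REGS)` yields the callee `__anvill_trace_16` and the arguments `memory`, `RAX`, `RDI`. In the real pass each name would be an `llvm::Value *` loaded from the state, and the call's return value would be stored back to the memory pointer.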

Okay, so now the bitcode is instrumented. It gets optimized. What we should observe is that some of the loaded register values will now have constant values, some will have the arguments, some will have other expressions. It'll be a big mix of things.

Our goal now is to report what we know, and then erase/remove all such function calls. We want to do this as a last step, just before removing the function that lets the memory pointer escape.
