danielplohmann / smda

SMDA is a minimalist recursive disassembler library that is optimized for accurate Control Flow Graph (CFG) recovery from memory dumps.

License: BSD 2-Clause "Simplified" License

smda's Introduction

SMDA

SMDA is a minimalist recursive disassembler library that is optimized for accurate Control Flow Graph (CFG) recovery from memory dumps. It is based on Capstone and currently supports x86/x64 Intel machine code. As input, arbitrary memory dumps (ideally with known base address) can be processed. The output is a collection of functions, basic blocks, and instructions with their respective edges between blocks and functions (in/out). Optionally, references to the Windows API can be inferred by using the ApiScout method.

Installation

With version 1.2.0, we have finally simplified things by moving to PyPI!
So installation now is as easy as:

$ pip install smda

Usage

A typical workflow using SMDA could look like this:

>>> from smda.Disassembler import Disassembler
>>> disassembler = Disassembler()
>>> report = disassembler.disassembleFile("/bin/cat")
>>> print(report)
 0.777s -> (architecture: intel.64bit, base_addr: 0x00000000): 143 functions
>>> for fn in report.getFunctions():
...     print(fn)
...     for ins in fn.getInstructions():
...         print(ins)
...
0x00001720: (->   1,    1->)   3 blocks,    7 instructions.
0x00001720: (      4883ec08) - sub rsp, 8
0x00001724: (488b05bd682000) - mov rax, qword ptr [rip + 0x2068bd]
0x0000172b: (        4885c0) - test rax, rax
0x0000172e: (          7402) - je 0x1732
0x00001730: (          ffd0) - call rax
0x00001732: (      4883c408) - add rsp, 8
0x00001736: (            c3) - ret 
0x00001ad0: (->   1,    4->)   1 blocks,   12 instructions.
[...]
>>> json_report = report.toDict()

There is also a demo script:

  • analyze.py -- example usage: perform disassembly on a file or memory dump and optionally store results in JSON to a given output path.

The code should be fully compatible with Python 2 and 3. Further explanations of the inner workings will follow in separate publications and will be referenced here.

To take full advantage of SMDA's capabilities, make sure to (optionally) install:

Version History

  • 2024-03-12: v1.13.18 - Added functionality to extract and store all referenced strings along SmdaFunctions (has to be enabled via SmdaConfig).
  • 2024-03-12: v1.13.17 - Extended disassembleBuffer() to now take additional arguments code_areas and oep.
  • 2024-02-21: v1.13.16 - BREAKING IntelInstructionEscaper.escapeMnemonic: Escaper now handles another 200 instruction names found in other capstone source files (THX for reporting @malwarefrank!).
  • 2024-02-15: v1.13.15 - Fixed issues with version recognition in SmdaFunction which caused issues in MCRIT (THX to @malwarefrank!)
  • 2024-02-02: v1.13.12 - Versions might be non-numerical, addressed that in SmdaFunction.
  • 2024-01-23: v1.13.11 - Introduced indicator in SmdaConfig for compatibility of instruction escaping.
  • 2024-01-23: v1.13.10 - Parsing of PE files should work again with lief >=0.14.0.
  • 2024-01-23: v1.13.9 - Improved parsing robustness for section/segment tables in ELF files, also now padding with zeroes when finding less content than expected physical size in a segment (THX for reporting @schrodyn!).
  • 2024-01-23: v1.13.8 - BREAKING adjustments to IntelInstructionEscaper.escapeMnemonic: Escaper now is capable of handling all known x86/x64 instructions in capstone (THX for reporting @schrodyn!).
  • 2023-12-01: v1.13.7 - Skip processing of Delphi structs for large files, workaround until this is properly reimplemented.
  • 2023-11-29: v1.13.6 - Made OpcodeHash an attribute with on-demand calculation to save processing time.
  • 2023-11-29: v1.13.3 - Implemented an alternative queue working with reference count based brackets in pursuit of accelerated processing.
  • 2023-11-28: v1.13.2 - IndirectCallAnalyzer will now analyze at most a configurable amount of calls per basic block, default 50.
  • 2023-11-21: v1.13.1 - SmdaBasicBlock now has getPredecessors() and getSuccessors().
  • 2023-11-21: v1.13.0 - BREAKING adjustments to PicHashing (now wildcarding intraprocedural jumps in functions, additionally more immediates if within address space). Introduction of OpcodeHash (OpcHash), which wildcards all but prefixes and opcode bytes.
  • 2023-10-12: v1.12.7 - Bugfix for parsing Delphi structs.
  • 2023-09-15: v1.12.6 - Bugfix in BlockLocator (THX to @cccs-ay!).
  • 2023-08-28: v1.12.5 - Bugfix for address dereferencing where buffer sizes were not properly checked (THX to @yankovs!).
  • 2023-08-08: v1.12.4 - SmdaBasicBlock can now do getPicBlockHash().
  • 2023-05-23: v1.12.3 - Fixed bugs in PE parser and Go parser.
  • 2023-05-08: v1.12.1 - Get rid of deprecation warning in IDA 8.0+.
  • 2023-03-24: v1.12.0 - SMDA now parses PE export directories for symbols, as well as MinGW DWARF information if available.
  • 2023-03-14: v1.11.2 - SMDA report now also contains SHA1 and MD5.
  • 2023-03-14: v1.11.1 - rendering dotGraph can now include API references instead of plain calls.
  • 2023-02-06: v1.11.0 - SmdaReport now has functionality to find a function/block by a given offset contained within it (THX to @cccs-ay!).
  • 2023-02-06: v1.10.0 - Adjusted to LIEF 0.12.3 API for binary parsing (THX to @lainswork!).
  • 2022-08-12: v1.9.1 - Added support for parsing intel MachO files, including Go parsing.
  • 2022-08-01: v1.8.0 - Added support for parsing Go function information (THX to @danielenders1!).
  • 2022-01-27: v1.7.0 - SmdaReports now contains a field oep; SmdaFunctions now indicate is_exported and can provide CodeXrefs via getCodeInrefs() and getCodeOutrefs(). (THX for the ideas: @mr-tz)
  • 2021-08-20: v1.6.0 - Bugfix for alignment calculation of binary mappings. (THX: @williballenthin)
  • 2021-08-19: v1.6.0 - Bugfix for truncation during ELF segment/section loading. API usage in ELF files is now resolved as well! (THX: @williballenthin)
  • 2020-10-30: v1.5.0 - PE section table now contained in SmdaReport and added SmdaReport.getSection(offset).
  • 2020-10-26: v1.4.0 - Adding SmdaBasicBlock. Some convenience code to ease integration with capa. (GeekWeek edition!)
  • 2020-06-22: v1.3.0 - Added DominatorTree (Implementation by Armin Rigo) to calculate function nesting depth, shortened PIC hash to 8 byte, added some missing instructions for the InstructionEscaper, IdaInterface now demangles names.
  • 2020-04-29: v1.2.0 - Restructured config.py into smda/SmdaConfig.py to simplify usage and now available via PyPI! The smda/Disassembler.py now emits a report object (smda.common.SmdaReport) that allows direct (pythonic) interaction with the results - a JSON can still be easily generated by using toDict() on the report.
  • 2020-04-28: v1.1.0 - Several improvements, including: x64 jump table handling, better data flow handling for calls using registers and tailcalls, extended list of common prologues based on much more groundtruth data, extended padding instruction list for gap function discovery, adjusted weights in candidate priority score, filtering code areas based on section tables, using exported symbols as candidates, new function output metadata: confidence score based on instruction mnemonic histogram, PIC hash based on escaped binary instruction sequence
  • 2018-07-01: v1.0.0 - Initial Release.

Credits

Thanks to Steffen Enders for his extensive contributions to this project! Thanks to Paul Hordiienko for adding symbol parsing support (ELF+PDB)! Thanks to Jonathan Crussell for helping me to beef up SMDA enough to make it a disassembler backend in capa! Thanks to Willi Ballenthin for improving handling of ELF files, including properly handling API usage! Thanks to Daniel Enders for his contributions to the parsing of the Golang function registry and label information! The project uses the implementation of Tarjan's Algorithm by Bas Westerbaan and the implementation of Lengauer-Tarjan's Algorithm for the DominatorTree by Armin Rigo.

Pull requests welcome! :)

smda's People

Contributors

bonusplay, cccs-ay, danielenders1, danielplohmann, jcrussell, lainswork, steffenenders, williballenthin

smda's Issues

Does it handle indirect jumps?

I am building a Value Set Analysis on top of this, but first question is does it handle indirect jumps, if so, how? Thanks!

Unhandled mnemonics related to XMM

I initiated a minhash recalculation in MCRIT, and while it was running, I saw numerous repeats of these errors (not in this exact order):

ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpeqpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpleps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpltss
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpnleps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpnles
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpnltpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpnltps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpunordpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: cmpunordps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpeqpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpeqps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpge_oqpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpge_oqps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpgt_oqpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpgt_oqps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmple_oqpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmple_oqps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmplt_oqpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmplt_oqps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpnle_uqpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpnle_uqps
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpnlt_uqpd
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcmpnlt_uqps

According to the Intel documentation, they are Pseudo-ops for the opcodes CMPPD, CMPPS, CMPSD, and CMPSS. There may be more in the Intel documentation than those listed above.

Do they just need to be added to the _xmm_group in the smda/intel/IntelInstructionEscaper.py file? If so, then I am happy to make the edits and submit a PR if that would help.
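
As an alternative to listing every pseudo-op individually, the mnemonics could be folded back onto their base instruction before lookup. The following is only a rough sketch of that idea: the function name and the regular expression are assumptions made for illustration and are not part of SMDA.

```python
import re

# Hypothetical helper, not part of SMDA: folds compare pseudo-ops such as
# "cmpeqpd" or "vcmpge_oqps" back onto CMPPD/CMPPS/CMPSD/CMPSS (or the
# v-prefixed forms). The pattern is an assumption made for this sketch.
_CMP_PSEUDO_OP = re.compile(r"(v?)cmp[a-z_]+?(pd|ps|sd|ss)")


def normalize_cmp_pseudo_op(mnemonic):
    """Return the base compare mnemonic, or the input if it is no pseudo-op."""
    match = _CMP_PSEUDO_OP.fullmatch(mnemonic)
    if match is None:
        return mnemonic
    return "{}cmp{}".format(match.group(1), match.group(2))


for name in ("cmpeqpd", "cmpltss", "vcmpge_oqps", "cmpxchg"):
    print(name, "->", normalize_cmp_pseudo_op(name))
```

Base mnemonics such as cmpsd and unrelated ones such as cmpxchg pass through unchanged, because the pattern requires a predicate between "cmp" and the type suffix.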

Crash when handling InRefs in SMDAReport

In an edge case the disassembler can run into a situation where an incoming Xref is not coming from an instruction offset. This will cause a crash at:

function.code_inrefs.append(CodeXref(offset2ins[inref], offset2ins[function.offset]))

The attached file is an example of this. (it is malware, password is infected)
check.zip

I added a quick check that will work for our use case, but I'm not sure if it will break anything.

for inref in function.inrefs:
    if inref in offset2ins:
        function.code_inrefs.append(CodeXref(offset2ins[inref], offset2ins[function.offset]))

Investigate breaking functions in Go

In the binary hello.exe, analysis of function runtime.cgoIsGoPointer ends abruptly when it encounters a multibyte NOP (0F 1F 00, nop dword ptr [rax]). This leads to a separate gap function being found at 0x4039a0 that should instead be part of the previous function.

OverflowError: cannot fit 'int' into an index-sized integer

Spotted this error while submitting files with MCRIT in dir mode and wanted to report it. File hash f3af394d9c3f68dff50b467340ca59a11a14a3d56361e6cffd1cf2312a7028ad

0.001s -> (architecture: intel.32bit, base_addr: 0x00000000): 0 functions
ERROR: SMDA caused an exception while processing this file: Samples/f3af394d9c3f68dff50b467340ca59a11a14a3d56361e6cffd1cf2312a7028ad
Traceback (most recent call last):
  File "/usr/home/schrodinger/.local/src/mcrit/mcrit/client/McritConsole.py", line 109, in getSmdaReportFromFilepath
    smda_report = disassembler.disassembleFile(filepath)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/schrodinger/.venv/mcrit/lib/python3.11/site-packages/smda/Disassembler.py", line 42, in disassembleFile
    loader = FileLoader(file_path, map_file=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/schrodinger/.venv/mcrit/lib/python3.11/site-packages/smda/utility/FileLoader.py", line 22, in __init__
    self._loadFile()
  File "/home/schrodinger/.venv/mcrit/lib/python3.11/site-packages/smda/utility/FileLoader.py", line 36, in _loadFile
    self._data = loader.mapBinary(self._raw_data)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/schrodinger/.venv/mcrit/lib/python3.11/site-packages/smda/utility/ElfFileLoader.py", line 97, in mapBinary
    mapped_binary = bytearray(align(virtual_size, 0x1000))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: cannot fit 'int' into an index-sized integer
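
The traceback shows bytearray() being asked for a size that does not fit into an index-sized integer. One way to fail more gracefully would be to reject implausible sizes before allocating. The sketch below is an assumption-laden illustration: align() here is a stand-in written for this example (not SMDA's own helper), and the 1 GiB limit is an arbitrary value.

```python
# Assumed upper bound for a mapped image; chosen arbitrarily for this sketch.
MAX_MAPPED_SIZE = 1 << 30


def align(value, alignment):
    """Stand-in helper: round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def allocate_mapping(virtual_size, alignment=0x1000, limit=MAX_MAPPED_SIZE):
    """Allocate a zeroed buffer, rejecting sizes that are clearly bogus."""
    size = align(virtual_size, alignment)
    if size < 0 or size > limit:
        raise ValueError("implausible mapped size: 0x%x" % size)
    return bytearray(size)


print(len(allocate_mapping(0x1001)))
```

A caller could catch the ValueError and skip the file instead of aborting on an OverflowError deep inside the loader.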

Output Consistency

Using methods from DisassemblyResult gives less information than iterating over the output JSON files.

Bug: dereferencing a buffer that may be too small

Hey!

I've found a few examples where SMDA analysis fails, both for the same reason. Looking into the issue, it seems to be in dereferenceQword method inside DisassemblyResult.py. It seems like, in some cases, the rel_start_addr may be too close to the end of the binary buffer. Toy example: binary is 100 bytes, rel_start_addr is index 94 in the binary buffer, and rel_end_addr in this case is 102. In this case binary[rel_start_addr:rel_end_addr] will just return the last 6 bytes of that binary, resulting in an error since struct.unpack("Q", ...) expects it to be 8 bytes.

List of affected files I've found:

  • 0deaaed7b6b6bfb3d96b1354377b9dbc01c44f6a09e72d1620a38296e61adb48
  • 136960b3e46c46c5057d4d222190c2bad2e490f6b5fe0c62765a3a7644cab276
  • 395f83caedfdf3b3403bbaa950c08515919ab3799cef9a6c72f19e4462a216bf

Traceback:

An error occurred while disassembling file.
 0.085s -> Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/smda/Disassembler.py", line 57, in disassembleFile
    smda_report = self._disassemble(binary_info, timeout=self.config.TIMEOUT)
  File "/usr/local/lib/python3.8/dist-packages/smda/Disassembler.py", line 109, in _disassemble
    self.disassembly = self.disassembler.analyzeBuffer(binary_info, self._callbackAnalysisTimeout)
  File "/usr/local/lib/python3.8/dist-packages/smda/intel/IntelDisassembler.py", line 450, in analyzeBuffer
    state = self.analyzeFunction(candidate.addr)
  File "/usr/local/lib/python3.8/dist-packages/smda/intel/IntelDisassembler.py", line 331, in analyzeFunction
    self._analyzeCallInstruction(i, state)
  File "/usr/local/lib/python3.8/dist-packages/smda/intel/IntelDisassembler.py", line 168, in _analyzeCallInstruction
    dereferenced = self.disassembly.dereferenceQword(call_destination)
  File "/usr/local/lib/python3.8/dist-packages/smda/DisassemblyResult.py", line 154, in dereferenceQword
    return struct.unpack("Q", extracted_qword)[0]
struct.error: unpack requires a buffer of 8 bytes

Not sure if it has any connection to the issue but both files seem to be dotnet drivers, so I tried some other similar files and they seemed to be ok. Not sure if this file type has anything to do with this or they're both that just by chance.
In any case, maybe some guard checks regarding size should be implemented? This seems to be a potential issue also in similar methods like dereferenceDword so it may be good to look at them too.
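
A guard of the kind suggested above could look roughly like this. It is a minimal sketch, not the actual SMDA code: the function is standalone, returning None on out-of-range reads is an assumption, and it uses an explicit little-endian format ("<Q") where the traceback shows "Q".

```python
import struct


def dereference_qword(binary, rel_start_addr):
    """Read a little-endian QWORD, or return None if fewer than 8 bytes remain."""
    rel_end_addr = rel_start_addr + 8
    if rel_start_addr < 0 or rel_end_addr > len(binary):
        return None
    return struct.unpack("<Q", binary[rel_start_addr:rel_end_addr])[0]


# Toy example from the report: a 100-byte buffer read at index 94.
buffer_100 = bytes(range(100))
print(dereference_qword(buffer_100, 94))  # too close to the end
print(hex(dereference_qword(buffer_100, 0)))
```

The same bounds check would apply to a DWORD variant with a size of 4 and the "<I" format.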

Are entrypoint(s) and calls_from exposed?

Are the below features exposed and if so, how can I access them?
I'd like to update the SMDA capa backend with this data.

  1. entry points, e.g. AddressOfEntryPoint and exported functions
  2. optionally calls/xrefs from a function (on a higher level, otherwise I can reuse some existing code)

Thanks!

Issue with Push+Call String Obfuscation

Hello,

It appears that SMDA is disassembling instructions that are strings. The malicious sample 57E441BBB7345E63F2EE9547A4C108C1B70448C107E17CBBC92739A56F77F12E uses a call to mimic a stack push of a string.

.flat:0040100F FF 15 28 31 40 00                       call    ds:LoadLibraryA
.flat:00401015 83 F8 00                                cmp     eax, 0
.flat:00401018 74 32                                   jz      short loc_40104C
.flat:0040101A E8 13 00 00 00                          call    loc_401032
.flat:0040101A                               ; ---------------------------------------------------------------------------
.flat:0040101F 52 74 6C 41 64 6A 75 73 74 50+aRtladjustprivi db 'RtlAdjustPrivilege',0
.flat:00401032                               ; ---------------------------------------------------------------------------
.flat:00401032
.flat:00401032                               loc_401032:                             ; CODE XREF: .flat:0040101A↑p
.flat:00401032 50                                      push    eax

Output from the smda json. You can see at offset 543 that smda is disassembling the string RtlAdjustPrivilege.

    ],
    "538": [
     [
      538,
      "e813000000",
      "call",
      "0x232"
     ],
     [
      543,
      "52",
      "push",
      "edx"
     ],
     [
      544,
      "746c",
      "je",
      "0x28e"
     ]
    ],

This issue is usually caused when a disassembler assumes that the offset following a call is a return address.
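
One conceivable heuristic for spotting this pattern is to check whether the bytes skipped by a relative call form a NUL-terminated printable string that ends exactly at the call target. The sketch below illustrates that idea only; the function name, the minimum length, and the printable-ASCII criterion are assumptions and do not reflect how SMDA works.

```python
import struct

PRINTABLE_ASCII = set(range(0x20, 0x7F))


def call_skips_inline_string(buf, call_offset, min_length=4):
    """Heuristic: does the E8 call at call_offset jump over an inline C string?"""
    if call_offset + 5 > len(buf) or buf[call_offset] != 0xE8:
        return False
    (rel32,) = struct.unpack_from("<i", buf, call_offset + 1)
    start = call_offset + 5          # address a normal call would return to
    end = start + rel32              # call target
    if rel32 < min_length + 1 or end > len(buf):
        return False
    skipped = buf[start:end]
    return skipped[-1] == 0 and all(b in PRINTABLE_ASCII for b in skipped[:-1])


# Bytes from the listing above: call +0x13, the string, then push eax.
sample = b"\xe8\x13\x00\x00\x00" + b"RtlAdjustPrivilege\x00" + b"\x50"
print(call_skips_inline_string(sample, 0))
```

If the check succeeds, the bytes between the call and its target could be treated as data rather than queued for disassembly.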

Integrate with IDA

IDA won't recognize any code/functions when loaded as binary buffer for x64.
Therefore, have SMDA run over the buffer loaded in IDA and provide CFG information to be then used by IDA.
This should be also usable for x86 with incomplete function discovery.

Broken pypi build

Sadly, the package on PyPI does not include its dependencies, which means that running

python -m venv .venv
source .venv/bin/activate
pip install smda
python3 -c "from smda.Disassembler import Disassembler"

will fail, as the capstone dependency is missing.

Unhandled mnemonics

While bulk processing some files into MCRIT I witnessed some errors related to smda and unhandled mnemonics.

ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: endbr64
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vpsllw
ERROR:smda.intel.IntelInstructionEscaper:********************************************** Unhandled mnemonic: vcvtss2sd

Unfortunately I don't have hashes available as it was a bulk job of a large directory of samples. At least endbr64 is a common instruction in ELF files compiled with GCC.

Windows: struct.error: argument out of range

I'm not sure I was able to locate the right error source here, but running this on Windows we get two exceptions in capa, see https://github.com/fireeye/capa/pull/470/checks?check_run_id=2277082296#step:7:28

================================== FAILURES ===================================
_ test_smda_features[al-khaser x64-function=0x14004B4F0-api(__vcrt_GetModuleHandle)-True] _

ins = <smda.common.SmdaInstruction.SmdaInstruction object at 0x0000025A8DC23550>

    @staticmethod
    def escapeBinaryPtrRef(ins):
        escaped_sequence = ins.bytes
        addr_match = re.search(r"\[(rip (\+|\-) )?(?P<dword_offset>0x[a-fA-F0-9]+)\]", ins.operands)
        if addr_match:
            offset = int(addr_match.group("dword_offset"), 16)
            if "rip -" in ins.operands:
                offset = 0x100000000 - offset
            #TODO we need to check if this is actually a 64bit absolute offset (e.g. used by movabs)
            try:
>               packed_hex = str(codecs.encode(struct.pack("I", offset), 'hex').decode('ascii'))
E               struct.error: argument out of range

c:\hostedtoolcache\windows\python\3.9.2\x64\lib\site-packages\smda\intel\IntelInstructionEscaper.py:334: error

During handling of the above exception, another exception occurred:

self = <smda.Disassembler.Disassembler object at 0x0000025A97BF9310>
file_path = 'D:\\a\\capa\\capa\\tests\\data\\al-khaser_x64.exe_', pdb_path = ''

    def disassembleFile(self, file_path, pdb_path=""):
        loader = FileLoader(file_path, map_file=True)
        file_content = loader.getData()
        binary_info = BinaryInfo(file_content)
        binary_info.raw_data = loader.getRawData()
        binary_info.file_path = file_path
        binary_info.base_addr = loader.getBaseAddress()
        binary_info.bitness = loader.getBitness()
        binary_info.code_areas = loader.getCodeAreas()
        start = datetime.datetime.utcnow()
        try:
            self.disassembler.addPdbFile(binary_info, pdb_path)
>           smda_report = self._disassemble(binary_info, timeout=self.config.TIMEOUT)

c:\hostedtoolcache\windows\python\3.9.2\x64\lib\site-packages\smda\Disassembler.py:52: 
...
ERROR    smda.Disassembler:Disassembler.py:56 An error occurred while disassembling file.
_ test_smda_features[a1982...-function=0x4014D0-characteristic(cross section flow)-True] _

ins = <smda.common.SmdaInstruction.SmdaInstruction object at 0x0000025A9AA2A9A0>

    @staticmethod
    def escapeBinaryPtrRef(ins):
        escaped_sequence = ins.bytes
        addr_match = re.search(r"\[(rip (\+|\-) )?(?P<dword_offset>0x[a-fA-F0-9]+)\]", ins.operands)
        if addr_match:
            offset = int(addr_match.group("dword_offset"), 16)
            if "rip -" in ins.operands:
                offset = 0x100000000 - offset
            #TODO we need to check if this is actually a 64bit absolute offset (e.g. used by movabs)
            try:
>               packed_hex = str(codecs.encode(struct.pack("I", offset), 'hex').decode('ascii'))
E               struct.error: argument out of range

c:\hostedtoolcache\windows\python\3.9.2\x64\lib\site-packages\smda\intel\IntelInstructionEscaper.py:334: error

During handling of the above exception, another exception occurred:

self = <smda.Disassembler.Disassembler object at 0x0000025A9B4FFAC0>
file_path = 'D:\\a\\capa\\capa\\tests\\data\\a198216798ca38f280dc413f8c57f2c2.exe_'
pdb_path = ''

    def disassembleFile(self, file_path, pdb_path=""):
        loader = FileLoader(file_path, map_file=True)
        file_content = loader.getData()
        binary_info = BinaryInfo(file_content)
        binary_info.raw_data = loader.getRawData()
        binary_info.file_path = file_path
        binary_info.base_addr = loader.getBaseAddress()
        binary_info.bitness = loader.getBitness()
        binary_info.code_areas = loader.getCodeAreas()
        start = datetime.datetime.utcnow()
        try:
            self.disassembler.addPdbFile(binary_info, pdb_path)
>           smda_report = self._disassemble(binary_info, timeout=self.config.TIMEOUT)

c:\hostedtoolcache\windows\python\3.9.2\x64\lib\site-packages\smda\Disassembler.py:52: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <smda.Disassembler.Disassembler object at 0x0000025A9B4FFAC0>
binary_info = <smda.common.BinaryInfo.BinaryInfo object at 0x0000025A9B4FFE20>
timeout = 300

    def _disassemble(self, binary_info, timeout=0):
        self._start_time = datetime.datetime.utcnow()
        self._timeout = timeout
        self.disassembly = self.disassembler.analyzeBuffer(binary_info, self._callbackAnalysisTimeout)
>       return SmdaReport(self.disassembly, config=self.config)

c:\hostedtoolcache\windows\python\3.9.2\x64\lib\site-packages\smda\Disassembler.py:101: 
...
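
The failing call is struct.pack("I", offset), which only accepts values from 0 to 0xFFFFFFFF. A defensive variant could select the pack width by value, which also covers the 64-bit case mentioned in the TODO comment. This is a sketch under assumptions: the helper name is invented, it uses explicit little-endian formats, and returning None for unrepresentable values is one possible choice among several.

```python
import binascii
import struct


def pack_offset_hex(offset):
    """Hex string of the little-endian offset (4 or 8 bytes), or None if it fits neither."""
    if 0 <= offset <= 0xFFFFFFFF:
        packed = struct.pack("<I", offset)
    elif 0 <= offset <= 0xFFFFFFFFFFFFFFFF:
        packed = struct.pack("<Q", offset)
    else:
        return None
    return binascii.hexlify(packed).decode("ascii")


print(pack_offset_hex(0x4A97AC))
print(pack_offset_hex(0x100000000))
print(pack_offset_hex(-1))
```

A caller receiving None could skip escaping for that instruction instead of raising.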

SMDA incorrectly maps sections/segments from ELF files

The variable min_raw_segment_offset (likewise min_raw_section_offset), as used in

mapped_binary[0:min_raw_section_offset] = binary[0:min_raw_section_offset]
and
mapped_binary[0:min_raw_segment_offset] = binary[0:min_raw_segment_offset]
in the ElfFileLoader, is used incorrectly, leading to sections/segments being truncated.

This variable contains a virtual address; however, it is used as a raw file offset (the raw file offset of the first section/segment). When the virtual address is something like 0x401000, the assignment tries to copy 0x401000 bytes from the source binary but gets much less data (only as many bytes as the file contains), which truncates the mapped data to the size of the file. Subsequent assignments to the mapped data somehow don't throw exceptions despite failing to write beyond the end of the truncated mapping.

A reasonable fix is to operate on physical/raw file offsets rather than virtual addresses.
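
To illustrate the proposed fix, here is a simplified mapping routine that copies the header using the smallest raw file offset and places each segment at its virtual address. It is a sketch only: the (file_offset, virtual_address, physical_size) tuples and the function name are assumptions for this example and do not mirror the ElfFileLoader code.

```python
def map_segments(binary, segments, base_addr):
    """Map (file_offset, virtual_address, physical_size) segments into one buffer."""
    mapped_size = max(vaddr - base_addr + size for _, vaddr, size in segments)
    mapped = bytearray(mapped_size)

    # Header bytes: everything before the first segment's RAW file offset.
    first_raw_offset = min(offset for offset, _, _ in segments)
    header_len = min(first_raw_offset, len(binary), len(mapped))
    mapped[0:header_len] = binary[0:header_len]

    for offset, vaddr, size in segments:
        chunk = binary[offset:offset + size]
        start = vaddr - base_addr
        # Equal-length slice assignment, so the buffer is never resized.
        mapped[start:start + len(chunk)] = chunk
    return mapped


binary = b"HDR!" + b"AAAA" + b"BB"
segments = [(4, 0x401000, 4), (8, 0x402000, 2)]
mapped = map_segments(binary, segments, base_addr=0x400000)
print(hex(len(mapped)), bytes(mapped[0x1000:0x1004]))
```

Keeping both sides of every slice assignment the same length matters here, because assigning a shorter sequence to a bytearray slice shrinks the bytearray, which matches the truncation described above.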

Double check calculation of PIC hashes for functions

Looking at the example provided in the IDA blog post for Diaphora, the two DLLs are almost identical, which should lead to many overlapping PIC hashes.
Given binaries:

  • 408cb1604d003f38715833a48485b6a4e620edf163fb59aef792595866e4796b
  • c115d15807b96dcb9871ebc69618ef77473f1451c427e7349f9aa3c72891ddc2

This is, however, not the case: e.g., function 0x633b3b81 (same offset in both binaries) appears identical in both binaries, yet produces two distinct PIC hashes:

  • 0x304cda7a0ed10b98
  • 0xaa2ceb7c973bc0c5

Speed up analysis of bigger buffers

It's a known issue that disassembly of dumps containing a couple thousand functions is currently very slow. This is likely due to some suboptimal choices of data structures and will be investigated asap.

Escaping instructions replaces mnemonic bytes

In this case only the [ptr] is replaced.
IntelInstructionEscaper.escapeBinaryPtrRef: 2 occurrences for ac974a00 in c705ac974a00ac974a00 (mov dword ptr [0x4a97ac], 0x4a97ac), escaping only the first one

Here might be a problem:
IntelInstructionEscaper.escapeBinaryPtrRef: 2 occurrences for 05050505 in 010505050505 (add dword ptr [0x5050505], eax), escaping only the first one

IntelInstructionEscaper.escapeBinaryPtrRef: 2 occurrences for 05050505 in 020505050505 (add al, byte ptr [0x5050505]), escaping only the first one

IntelInstructionEscaper.escapeBinaryPtrRef: 2 occurrences for 05050505 in 030505050505 (add eax, dword ptr [0x5050505]), escaping only the first one

IntelInstructionEscaper.escapeBinaryPtrRef: 2 occurrences for 15151515 in 011515151515 (add dword ptr [0x15151515], edx), escaping only the first one

IntelInstructionEscaper.escapeBinaryPtrRef: 2 occurrences for 15151515 in 0f101515151515 (movups xmm2, xmmword ptr [0x15151515]), escaping only the first one
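
The cases above show why a text search for the first occurrence can hit opcode or ModRM bytes. In 010505050505, for instance, the first match for 05050505 starts at the ModRM byte rather than at the displacement. Escaping by byte position avoids this, provided the displacement offset is known. The sketch below assumes that offset is supplied by the caller (for example, taken from the disassembler's instruction details), and it uses "?" as a wildcard character purely for illustration.

```python
def escape_first_occurrence(hex_bytes, hex_pattern):
    """Naive approach: wildcard the first textual match of the pattern."""
    return hex_bytes.replace(hex_pattern, "?" * len(hex_pattern), 1)


def escape_at(hex_bytes, byte_offset, byte_size):
    """Position-based approach: wildcard byte_size bytes starting at byte_offset."""
    start = 2 * byte_offset
    end = start + 2 * byte_size
    if byte_offset < 0 or end > len(hex_bytes):
        raise ValueError("escape range exceeds instruction length")
    return hex_bytes[:start] + "?" * (end - start) + hex_bytes[end:]


# add dword ptr [0x5050505], eax: opcode 01, ModRM 05, displacement 05050505
print(escape_first_occurrence("010505050505", "05050505"))
print(escape_at("010505050505", byte_offset=2, byte_size=4))
```

The first call wildcards the ModRM byte and leaves the last displacement byte intact, while the second wildcards exactly the four displacement bytes.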

Investigate gap function analysis

When analyzing the Go binary hello.exe, a gap function consisting of just a single int3 is found at 0x4032f0. Why is this identified as a gap function, and not the surrounding int3s etc.?

Support more architectures

The same method for CFG recovery should be applicable to other popular architectures such as ARM and MIPS(el).
Therefore, refactor and expand the disassembler part to be mostly architecture-agnostic.

Logging is broken

Currently all files are using

LOGGER = logging.getLogger(__name__)

which creates a new logger for each of the files. Moreover, config:

smda/smda/SmdaConfig.py

Lines 36 to 38 in 97a5a6d

def __init__(self, log_level=logging.INFO):
    if len(logging._handlerList) == 0:
        logging.basicConfig(level=log_level, format=self.LOG_FORMAT)

  • doesn't actually assign self.LOG_LEVEL to variable passed
  • uses logging.basicConfig which overwrites root logger config (that means, any application using smda)

The solution I propose is to replace

logging.getLogger(__name__)

with

logging.getLogger("smda")

which would make smda use only 1 logger in all of the files and remove call to logging.basicConfig in SmdaConfig.
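
A library-friendly setup along these lines might look as follows. This is a sketch of a common convention rather than the project's actual code: ExampleConfig is a hypothetical stand-in for SmdaConfig, and attaching a NullHandler to the package logger is an assumption borrowed from general Python logging practice.

```python
import logging

PACKAGE_LOGGER_NAME = "smda"


def get_package_logger():
    """Return the package-level logger with a NullHandler attached exactly once."""
    logger = logging.getLogger(PACKAGE_LOGGER_NAME)
    if not any(isinstance(h, logging.NullHandler) for h in logger.handlers):
        logger.addHandler(logging.NullHandler())
    return logger


class ExampleConfig(object):
    """Hypothetical stand-in for SmdaConfig, reduced to the logging part."""

    LOG_LEVEL = logging.INFO

    def __init__(self, log_level=logging.INFO):
        # Store the requested level and apply it to the package logger only;
        # the root logger of the host application is left untouched.
        self.LOG_LEVEL = log_level
        get_package_logger().setLevel(log_level)


config = ExampleConfig(log_level=logging.DEBUG)
print(logging.getLogger("smda.intel").getEffectiveLevel() == logging.DEBUG)
```

Because logger names are hierarchical, a level set on "smda" also applies to child loggers such as "smda.intel", so the host application keeps full control over handlers and output format.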

IntelInstructionEscaper escapeBinary check logic bug

In the IntelInstructionEscaper on line 272 there is a flaw in the logic.

if lower_addr is not None and upper_addr is not None and ins.operands.startswith("0x") or ", 0x" in ins.operands:

This will sometimes crash if ins.operands.startswith("0x") or ", 0x" in ins.operands is true.

I think the intended logic here should be:

if lower_addr is not None and upper_addr is not None and (ins.operands.startswith("0x") or ", 0x" in ins.operands):
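
The difference comes from operator precedence: in Python, "and" binds more tightly than "or", so the original condition is evaluated as (A and B and C) or D. The two small functions below reproduce both variants in isolation; the function names are made up for this demonstration.

```python
def original_check(lower_addr, upper_addr, operands):
    # Parsed as: (lower and upper and startswith) or (", 0x" in operands)
    return bool(lower_addr is not None and upper_addr is not None
                and operands.startswith("0x") or ", 0x" in operands)


def intended_check(lower_addr, upper_addr, operands):
    # Address bounds must be present before either operand test counts.
    return bool(lower_addr is not None and upper_addr is not None
                and (operands.startswith("0x") or ", 0x" in operands))


# With missing bounds, the original still passes for operands like "eax, 0x10".
print(original_check(None, None, "eax, 0x10"))
print(intended_check(None, None, "eax, 0x10"))
```

With the original condition, code inside the branch can therefore run while lower_addr or upper_addr is still None.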

Exception when parsing Delphi structs

When trying to parse Delphi structs, processing may fail due to exceptions involving negative offsets.

Example file: 62f2adbc73cbdde282ae3749aa63c2bc9c5ded8888f23160801db2db851cde8f
Trace:

  File "smda/Disassembler.py", line 57, in disassembleFile
    smda_report = self._disassemble(binary_info, timeout=self.config.TIMEOUT)
  File "smda/Disassembler.py", line 109, in _disassemble
    self.disassembly = self.disassembler.analyzeBuffer(binary_info, self._callbackAnalysisTimeout)
  File "smda/intel/IntelDisassembler.py", line 443, in analyzeBuffer
    self.fc_manager.init(self.disassembly)
  File "smda/intel/FunctionCandidateManager.py", line 46, in init
    self.disassembly.language = self.lang_analyzer.identify()
  File "smda/intel/LanguageAnalyzer.py", line 222, in identify
    t_objects = self.getDelphiObjects()
  File "smda/intel/LanguageAnalyzer.py", line 164, in getDelphiObjects
    data.seek(method_table - image_base)
ValueError: negative seek value -4194260
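
The negative seek value in the trace equals method_table - image_base, so a pointer below the image base ends up as a negative stream position. A bounds check before seeking would avoid the exception. The helper below is a hypothetical sketch, not SMDA code; the example values are chosen so that the first call reproduces the offset -4194260 from the trace, assuming an image base of 0x400000.

```python
import io


def seek_to_address(stream, address, image_base, data_size):
    """Seek to address relative to image_base; return False if it is out of range."""
    offset = address - image_base
    if offset < 0 or offset >= data_size:
        return False
    stream.seek(offset)
    return True


data = io.BytesIO(b"\x00" * 64)
print(seek_to_address(data, 44, 0x400000, 64))        # 44 - 0x400000 = -4194260
print(seek_to_address(data, 0x400010, 0x400000, 64))  # valid, seeks to offset 16
```

Callers can then skip the malformed structure instead of aborting the whole analysis.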

Unhandled AssertionError Processing ELF File

Spotted this error while processing an ELF file with MCRIT in dir mode and wanted to report it. File hash d2f94e178c254669fb9656d5513356d2

ERROR: SMDA caused an exception while processing this file: Samples/d2f94e178c254669fb9656d5513356d2
Traceback (most recent call last):
  File "/usr/home/schrodinger/.local/src/mcrit/mcrit/client/McritConsole.py", line 109, in getSmdaReportFromFilepath
    smda_report = disassembler.disassembleFile(filepath)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/schrodinger/.venv/mcrit/lib/python3.11/site-packages/smda/Disassembler.py", line 42, in disassembleFile
    loader = FileLoader(file_path, map_file=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/schrodinger/.venv/mcrit/lib/python3.11/site-packages/smda/utility/FileLoader.py", line 22, in __init__
    self._loadFile()
  File "/home/schrodinger/.venv/mcrit/lib/python3.11/site-packages/smda/utility/FileLoader.py", line 36, in _loadFile
    self._data = loader.mapBinary(self._raw_data)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/schrodinger/.venv/mcrit/lib/python3.11/site-packages/smda/utility/ElfFileLoader.py", line 113, in mapBinary
    assert len(segment.content) == segment.physical_size
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
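
The version history entry for v1.13.9 mentions padding with zeroes when a segment provides less content than its expected physical size. That idea can be sketched as follows. This is an illustration only, with an invented function name, and not the code that was actually merged.

```python
def pad_segment_content(content, physical_size):
    """Zero-pad content up to physical_size; longer content is returned unchanged."""
    content = bytes(content)
    missing = physical_size - len(content)
    if missing > 0:
        content += b"\x00" * missing
    return content


print(pad_segment_content(b"ab", 4))
```

Replacing the hard assert with padding lets the remaining segments still be mapped, even when one of them is shorter than declared.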
