Disassembly binary code
Assembly language source code generally permits the use of symbolic constants and programmer comments. These are usually removed from the assembled machine code by the assembler. When that happens, a disassembler operating on the machine code produces output that lacks these constants and comments, and the disassembled output is more difficult for a human to interpret than the original annotated source code.
Some disassemblers provide a built-in code commenting feature where the generated output gets enriched with comments regarding called API functions or parameters of called functions.
Some disassemblers make use of the symbolic debugging information present in object files such as ELF. Interactive tools go further: IDA, for example, allows the human user to make up mnemonic symbols for values or regions of code in an interactive session.
On CISC platforms with variable-width instructions, more than one disassembly may be valid.
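The ambiguity on variable-width instruction sets can be sketched with a deliberately tiny decoder. The function name `decode` and the three-opcode table are assumptions made for this illustration; the opcodes themselves (0x90 `nop`, 0xC3 `ret`, 0xB8 `mov eax, imm32`) are genuine 32-bit x86 encodings, but a real decoder handles far more.

```python
def decode(code: bytes, start: int) -> list[str]:
    """Linearly decode a tiny x86 subset beginning at offset `start`."""
    out = []
    pc = start
    while pc < len(code):
        op = code[pc]
        if op == 0x90:                              # nop, 1 byte
            out.append(f"{pc:02x}: nop")
            pc += 1
        elif op == 0xC3:                            # ret, 1 byte
            out.append(f"{pc:02x}: ret")
            pc += 1
        elif op == 0xB8 and pc + 5 <= len(code):    # mov eax, imm32, 5 bytes
            imm = int.from_bytes(code[pc + 1:pc + 5], "little")
            out.append(f"{pc:02x}: mov eax, 0x{imm:08x}")
            pc += 5
        else:                                       # unknown to this sketch
            out.append(f"{pc:02x}: db 0x{op:02x}")
            pc += 1
    return out


CODE = bytes.fromhex("b8909090c3")

# The same five bytes decode to two different, equally valid listings.
print(decode(CODE, 0))   # one mov instruction
print(decode(CODE, 1))   # three nops and a ret
```

Which listing is the intended one depends on where execution actually enters the byte sequence, which the bytes alone do not reveal.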
Disassemblers do not handle code that changes during execution. It is possible to write a disassembler whose output, when reassembled, produces exactly the original binary; however, there are often differences.
Exact reproduction places demands on the expressivity of the assembler. For example, when an instruction has more than one valid encoding, the assembler picks one of them; if the original code uses the other choice, the original binary simply cannot be reproduced. However, even when a fully correct disassembly is produced, problems remain if the program requires modification. For example, the same machine language jump instruction can be generated by assembly code to jump to a specified location (for example, to execute specific code), or to jump a specified number of bytes (for example, to skip over an unwanted branch).
A disassembler cannot know what is intended, and may use either syntax to generate a disassembly which reproduces the original binary. However, if a programmer wants to add instructions between the jump instruction and its destination, it is necessary to understand the program's operation to determine whether the jump should be absolute or relative.
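The two readings of one jump can be sketched as follows, using the x86 short jump (`EB rel8`) as the example. The helper names, the `loc_` label convention, and the load address 0x1000 are assumptions for illustration; the `$+n` form follows the NASM convention in which `$` is the address of the current instruction.

```python
def short_jump_target(addr: int, insn: bytes) -> int:
    """Destination of an x86 short jump (EB rel8) located at `addr`."""
    assert len(insn) == 2 and insn[0] == 0xEB
    rel = int.from_bytes(insn[1:2], "little", signed=True)
    return addr + 2 + rel          # offset is relative to the next instruction


def render_both(addr: int, insn: bytes) -> tuple[str, str]:
    """The same bytes written as 'go to a location' and as 'skip n bytes'."""
    target = short_jump_target(addr, insn)
    return (f"jmp loc_{target:04x}", f"jmp ${target - addr:+d}")


def reencode_after_insert(insn: bytes, inserted: int, keep_destination: bool) -> bytes:
    """Re-encode the jump after `inserted` bytes are added between the jump
    and its destination. Only the programmer's intent decides the result."""
    rel = int.from_bytes(insn[1:2], "little", signed=True)
    if keep_destination:           # intent: reach the same code as before
        rel += inserted
    assert -128 <= rel <= 127, "no longer fits in a short jump"
    return bytes([0xEB]) + rel.to_bytes(1, "little", signed=True)


JUMP = bytes([0xEB, 0x05])
print(render_both(0x1000, JUMP))
print(reencode_after_insert(JUMP, 3, keep_destination=True).hex())
print(reencode_after_insert(JUMP, 3, keep_destination=False).hex())
```

Both renderings reassemble to the same two bytes, but once three bytes are inserted they diverge: one intent requires the offset to grow, the other leaves it unchanged.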
A disassembler may be stand-alone or interactive. A stand-alone disassembler, when executed, generates an assembly language file which can be examined; an interactive one shows the effect of any change the user makes immediately. For example, the disassembler may initially not know that a section of the program is actually code, and treat it as data; if the user specifies that it is code, the resulting disassembled code is shown immediately, allowing the user to examine it and take further action during the same run.
Scripting your own "crawler" in this way is more efficient; for large programs, interactive disassembling may be impractical to the point of being infeasible.
The general problem of separating code from data in arbitrary executable programs is equivalent to the halting problem. As a consequence, it is not possible to write a disassembler that will correctly separate code and data for all possible input programs.
Reverse engineering is full of such theoretical limitations. They are not unique to it, however: by Rice's theorem, every non-trivial question about the behavior of programs is undecidable, so compilers and many other tools that deal with programs in any form run into such limits as well.
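The difficulty of separating code from data can be illustrated with a recursive-traversal sketch over an invented instruction set. Everything here is hypothetical: the opcodes (0x01 NOP, 0x02 JMP addr, 0x03 JZ addr, 0x04 JMPR for a jump through a register, 0xFF HALT), the function `find_code`, and the sample program exist only for this example.

```python
# Hypothetical toy instruction set: opcode -> instruction length in bytes.
SIZES = {0x01: 1, 0x02: 2, 0x03: 2, 0x04: 1, 0xFF: 1}
NOP, JMP, JZ, JMPR, HALT = 0x01, 0x02, 0x03, 0x04, 0xFF


def find_code(mem: bytes, entry: int) -> set[int]:
    """Follow control flow from `entry`; return the offsets proven to be code."""
    code: set[int] = set()
    work = [entry]
    while work:
        pc = work.pop()
        while pc < len(mem) and pc not in code:
            op = mem[pc]
            size = SIZES.get(op)
            if size is None or pc + size > len(mem):
                break                      # not decodable: stop this path
            code.update(range(pc, pc + size))
            if op == JMP:                  # unconditional: continue at target
                pc = mem[pc + 1]
                continue
            if op == JZ:                   # conditional: remember the target
                work.append(mem[pc + 1])
            if op in (JMPR, HALT):         # target unknown, or execution ends
                break
            pc += size
    return code


# 0: NOP   1: JZ 6   3: JMPR   4: NOP   5: HALT   6: NOP   7: HALT   8: data
PROGRAM = bytes([NOP, JZ, 0x06, JMPR, NOP, HALT, NOP, HALT, 0x2A])
print(sorted(find_code(PROGRAM, 0)))
```

Offsets 4 and 5 hold real instructions that are reached only through the computed jump at offset 3, yet the traversal cannot distinguish them from the genuine data byte at offset 8. Deciding where a computed jump can land requires reasoning about the program's run-time behavior, which is where the undecidability enters.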
In practice, a combination of interactive and automatic analysis and perseverance can handle all but programs specifically designed to thwart reverse engineering, for example by encrypting code and decrypting it just prior to use, or by moving code around in memory.
User-defined textual identifiers, such as variable names, label names, and macros, are removed by the assembly process. They may still be present in generated object files, for use by tools like debuggers and relocating linkers, but the direct connection is lost, and re-establishing that connection requires more than a mere disassembler.
Small constants in particular may have more than one possible name. Operating system calls (such as calls into DLLs on MS-Windows, or syscalls on Unix-like systems) may be reconstructed, as their names appear in a separate segment or are known beforehand. Many disassemblers allow the user to attach a name to a label or constant based on their understanding of the code. These identifiers, in addition to comments in the source file, help to make the code more readable to a human, and can also give some clues about the purpose of the code.
Without these comments and identifiers, it is harder to understand the purpose of the source code, and it can be difficult to determine the algorithm being used by that code.
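Attaching user-chosen names to a listing can be sketched in a few lines. The function `annotate`, the listing text, and the names in the table are all hypothetical; real disassemblers keep such names in their own databases rather than rewriting text.

```python
import re


def annotate(listing: list[str], names: dict[int, str]) -> list[str]:
    """Replace hexadecimal operands that have a user-supplied name."""
    def substitute(match: re.Match) -> str:
        value = int(match.group(0), 16)
        return names.get(value, match.group(0))   # unnamed values stay as-is

    return [re.sub(r"0x[0-9a-fA-F]+", substitute, line) for line in listing]


LISTING = ["call 0x401000", "mov eax, 0x10", "jmp 0x401020"]
NAMES = {0x401000: "parse_header", 0x401020: "done"}   # chosen by the analyst

for line in annotate(LISTING, NAMES):
    print(line)
```

The constant `0x10` is left untouched on purpose: a small value like this could be a buffer size, a flag, or an offset, so a name for it has to come from the analyst's understanding rather than from the value itself.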
When you combine this problem with the possibility that the code you are trying to read may, in reality, be data (as outlined above), it can be even harder to determine what is going on. Another challenge is posed by modern optimizing compilers: they inline small subroutines, then combine instructions across call and return boundaries. This loses valuable information about the way the program is structured.
Akin to disassemblers, decompilers take the process a step further and try to reproduce the code in a high-level language. Frequently, this high-level language is C, because C is simple and primitive enough to facilitate the decompilation process. Decompilation does have its drawbacks, because much data and many readability constructs are lost during the original compilation process and cannot be reproduced. Since the science of decompilation is still young, and results are "good" but not "great", this page will limit itself to a listing of decompilers, and a general but brief discussion of the possibilities of decompilation.
In the face of optimizing compilers, it is not uncommon to be asked "Is decompilation even possible?"
Make no mistake, however: optimization discards information that cannot be recovered. An example is in-lining, as explained above, where the called code is combined with its surroundings, such that the places where the original subroutine is called cannot even be identified.
A decompiler that reverses that process is comparable to an artificial intelligence program that recreates a poem in a different language. So perfectly operational decompilers are a long way off. At most, current decompilers can be used as an aid in the reverse engineering process, leaving a lot of arduous work.
Normally, when a subroutine is finished, execution returns to the address immediately following the call instruction. However, assembly-language programmers occasionally use several different techniques that adjust the return address, making disassembly more difficult:
On 8-bit CPUs, calculated jumps are often implemented by pushing a calculated "return" address to the stack, then jumping to that address using the "return" instruction. For example, the RTS Trick uses this technique to implement jump tables.
Instead of picking up their parameters off the stack or out of some fixed global address, some subroutines provide parameters in the addresses of memory that follow the instruction that called that subroutine.
Subroutines that use this technique adjust the return address to skip over all the constant parameter data, then return to an address many bytes after the "call" instruction.
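The inline-parameter technique can be simulated with a toy interpreter. The one-byte opcodes `CALL_PRINT` and `HALT`, the zero-terminated string convention, and the function `run` are assumptions invented for this sketch and do not correspond to any real CPU.

```python
CALL_PRINT = 0x01   # hypothetical 1-byte call to a "print inline string" routine
HALT = 0x02         # hypothetical stop instruction


def run(mem: bytes) -> tuple[str, list[int]]:
    """Execute `mem`; return the printed text and the offsets actually executed."""
    printed: list[str] = []
    executed: list[int] = []
    stack: list[int] = []
    pc = 0
    while True:
        executed.append(pc)
        op = mem[pc]
        if op == CALL_PRINT:
            stack.append(pc + 1)            # normal return address: byte after call
            # Body of the called routine: its parameter starts at the return address.
            param = stack.pop()
            end = mem.index(0, param)       # find the terminating zero byte
            printed.append(mem[param:end].decode("ascii"))
            stack.append(end + 1)           # adjusted return address skips the data
            pc = stack.pop()                # "return"
        elif op == HALT:
            return "".join(printed), executed
        else:
            raise ValueError(f"data executed as code at offset {pc}")


PROGRAM = bytes([CALL_PRINT]) + b"Hi\x00" + bytes([CALL_PRINT]) + b"!\x00" + bytes([HALT])
print(run(PROGRAM))
```

Only offsets 0, 4 and 7 are ever executed. A disassembler that assumes the call returns to the next byte would try to decode the string bytes at offsets 1 to 3 as instructions, which is why this technique makes automatic disassembly harder.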