RetDec is an open-source machine-code decompiler based on LLVM. It isn’t limited by a target architecture, operating system, or executable file format:
- Runs on Windows, Linux, and macOS.
- Supports all the major object-file formats: Windows PE, Unix ELF, macOS Mach-O.
- Supports all the prevailing architectures: x86, x64, arm, arm64, mips, powerpc.
RetDec has had four stable releases so far, counting the initial public release in December 2017:
- v3.0: The initial public release.
- v3.1: Added macOS support, simplified the repository structure, reimplemented the recursive traversal decoder.
- v3.2: Replaced all shell scripts with Python and thus made the usage much simpler.
- v3.3: Added x64 architecture, added FreeBSD support (maintained by the community), deployed a new LLVM-IR-to-BIR converter.
Now, we are glad to announce a new version 4.0 release with the following major features:
- added arm64 architecture,
- added JSON output option,
- implemented a new build system, and
- implemented the retdec library.
See the changelog for the complete list of new features, enhancements, and fixes.
1. arm64 architecture
This one is clear — now you can decompile arm64 binary files with RetDec!
Adding a new architecture is isolated to the capstone2llvmir library. Thus, it is doable with little knowledge about the rest of RetDec. In fact, the library already supports mips64 and powerpc64 as well. These aren’t yet enabled by RetDec itself because we haven’t gotten around to testing them adequately. Any architecture included in Capstone could be implemented. We even put together a how-to-do-it wiki page so that anyone can contribute.
2. JSON output option
As one would expect, RetDec by default produces C source code as its output. This is fine for consumption by humans, but what if another program wants to make use of it? Parsing high-level-language source code isn’t trivial. Furthermore, additional meta-information may be required to enhance user experience or automated analysis, and such information is hard to convey in a traditional high-level language.
For this reason, we added an option to generate the output as a sequence of annotated lexer tokens. Two output formats are possible:
- Human-readable JSON containing proper indentation (option -f json-human).
- Machine-readable JSON without any indentation (option -f json).
This means that if you run retdec-decompiler.py -f json-human input, you get the following output:
{
    "tokens": [
        { "addr": "0x804851c" },
        { "kind": "i_var", "val": "result" },
        { "addr": "0x804854c" },
        { "kind": "ws", "val": " " },
        { "kind": "op", "val": "=" },
        { "kind": "ws", "val": " " },
        { "kind": "i_var", "val": "ack" },
        { "kind": "punc", "val": "(" },
        { "kind": "i_var", "val": "m" },
        { "kind": "ws", "val": " " },
        { "kind": "op", "val": "-" },
        { "kind": "ws", "val": " " },
        { "kind": "l_int", "val": "1" },
        { "kind": "op", "val": "," },
        { "kind": "ws", "val": " " },
        { "kind": "l_int", "val": "1" },
        { "kind": "punc", "val": ")" },
        { "kind": "punc", "val": ";" }
    ],
    "language": "C"
}
instead of this one:
result = ack(m - 1, 1);
In addition to the source-code token values, there is meta-information on token types, and even assembly instruction addresses from which these tokens were generated. The addresses are on a per-command basis at the moment, but we plan to make them even more granular in the future. See the Decompiler outputs wiki page for more details.
The JSON output option is currently used in RetDec’s Radare2 plugin and in the upcoming IDA plugin v1.0. Feel free to use it in your projects as well.
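To give an idea of how another program might consume this output, here is a minimal Python sketch. It only relies on the token layout visible in the sample above (entries with "addr", or with "kind" and "val"). The function names render_source and group_by_address are our own illustration, not part of RetDec, and the grouping assumes that an address entry applies to all tokens that follow it until the next address entry.

```python
import json

# The sample from this article, in the compact form.
SAMPLE = """
{"tokens": [
  {"addr": "0x804851c"}, {"kind": "i_var", "val": "result"},
  {"addr": "0x804854c"}, {"kind": "ws", "val": " "},
  {"kind": "op", "val": "="}, {"kind": "ws", "val": " "},
  {"kind": "i_var", "val": "ack"}, {"kind": "punc", "val": "("},
  {"kind": "i_var", "val": "m"}, {"kind": "ws", "val": " "},
  {"kind": "op", "val": "-"}, {"kind": "ws", "val": " "},
  {"kind": "l_int", "val": "1"}, {"kind": "op", "val": ","},
  {"kind": "ws", "val": " "}, {"kind": "l_int", "val": "1"},
  {"kind": "punc", "val": ")"}, {"kind": "punc", "val": ";"}
], "language": "C"}
"""


def render_source(tokens):
    """Rebuild the source text by joining every token that carries a value.

    Address entries have no "val", so they contribute no text.
    """
    return "".join(t["val"] for t in tokens if "val" in t)


def group_by_address(tokens):
    """Return (address, text) pairs in source order.

    Assumption (ours): an "addr" entry covers all following tokens
    until the next "addr" entry. Text seen before any address entry
    is reported with address None.
    """
    groups = []
    current_addr = None
    current_text = []
    for t in tokens:
        if "addr" in t:
            # A new address starts: flush the text collected so far.
            if current_text:
                groups.append((current_addr, "".join(current_text)))
            current_addr = t["addr"]
            current_text = []
        elif "val" in t:
            current_text.append(t["val"])
    if current_text:
        groups.append((current_addr, "".join(current_text)))
    return groups


doc = json.loads(SAMPLE)
print(render_source(doc["tokens"]))  # prints: result = ack(m - 1, 1);
for addr, text in group_by_address(doc["tokens"]):
    print(addr, repr(text))
```

Joining the values gives back the plain C line shown earlier, while the grouping keeps the link between each piece of source text and the instruction address it came from. A front-end could use that link for things like jumping from source to disassembly.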
3. New build system
RetDec is a collection of libraries, executables, and resources. Chained together in a script, they form the decompiler itself: retdec-decompiler.py. But what about all the individual components? Couldn’t they be useful on their own?
Most definitely they could!
Until now the RetDec components weren’t easy to use. As of version 4.0, the installation contains all the resources necessary to utilize them in other CMake projects.
If RetDec is installed into a standard system location (e.g. /usr), its library components can be used as simply as:
find_package(retdec 4.0
    REQUIRED
    COMPONENTS
        <component>
        [...]
)
target_link_libraries(your-project
    PUBLIC
        retdec::<component>
        [...]
)
If it isn’t installed somewhere where it can be discovered, CMake needs help before find_package() is used. There are generally two ways to do it:
- Add the RetDec installation directory to CMAKE_PREFIX_PATH:
  list(APPEND CMAKE_PREFIX_PATH ${RETDEC_INSTALL_DIR})
- Set the path to installed RetDec CMake scripts to retdec_DIR:
  set(retdec_DIR ${RETDEC_INSTALL_DIR}/share/retdec/cmake)
It is also possible to configure the build system to produce only the selected component(s). This can significantly speed up compilation. The desired components can be enabled at CMake-configuration time by one of these parameters:
-D RETDEC_ENABLE_<component>=ON [...]
-D RETDEC_ENABLE=component[,...]
See Repository Overview for the list of available RetDec components, retdec-build-system-tests for component demos, and Build Instructions for the list of possible CMake options.
4. retdec library
Well, now that we can use various RetDec libraries, can we use the whole RetDec decompiler as a library?
Not yet. But we should!
In fact, the vast majority of RetDec functionality is in libraries as it is. The retdec-decompiler.py script and other related scripts are just putting it all together. But they are kinda remnants of the past. There is no reason why even the decompilation itself couldn’t be provided by a library. Then, we could use it in various front-ends, replacing hacked-together Python scripts. Other prime users would be the already mentioned RetDec’s IDA and Radare2 plugins.
We aren’t there yet, but version 4.0 moves in this direction. It adds a new library called retdec, which will eventually implement a comprehensive decompilation interface. As a first step, it currently offers disassembling functionality: a full recursive traversal decoding of a given input file into an LLVM IR module and a structured (functions & basic blocks) Capstone disassembly.
It also provides us with a good opportunity to demonstrate most of the things this article talked about. The following source code is all that’s needed to get to a complete LLVM IR and Capstone disassembly of an input file:
#include <iostream>
#include <string>

#include <retdec/retdec/retdec.h>
#include <retdec/llvm/Support/raw_ostream.h>

int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        llvm::errs() << "Expecting path to input\n";
        return 1;
    }
    std::string input = argv[1];

    retdec::common::FunctionSet fs;
    retdec::LlvmModuleContextPair llvm = retdec::disassemble(input, &fs);

    // Dump entire LLVM IR module.
    llvm::outs() << *llvm.module;

    // Dump functions, basic blocks, instructions.
    for (auto& f : fs)
    {
        llvm::outs() << f.getName() << " @ " << f << "\n";
        for (auto& bb : f.basicBlocks)
        {
            llvm::outs() << "\t" << "bb @ " << bb << "\n";
            // These are not only text entries.
            // There is a full Capstone instruction.
            for (auto* i : bb.instructions)
            {
                llvm::outs() << "\t\t" << retdec::common::Address(i->address)
                    << ": " << i->mnemonic << " " << i->op_str << "\n";
            }
        }
    }

    return 0;
}
The CMake script building it looks simply like this:
cmake_minimum_required(VERSION 3.6)
project(demo)

find_package(retdec 4.0
    REQUIRED
    COMPONENTS
        retdec
        llvm
)

add_executable(demo demo.cpp)
target_link_libraries(demo
    retdec::retdec
    retdec::deps::llvm
)
If RetDec is installed somewhere where it can be discovered, the demo can be built simply with:
cmake ..
make
If it is not, one option is to set the path to installed CMake scripts:
cmake .. -Dretdec_DIR=$RETDEC_INSTALL_DIR/share/retdec/cmake
make
If we are building RetDec ourselves, we can configure CMake to enable only the retdec library with cmake .. -DRETDEC_ENABLE_RETDEC=ON.
What’s next?
We believe that for effective and efficient manual malware analysis, it is best to selectively decompile only the interesting functions, interact with the results, and gradually compose an understanding of the inspected binary. Such a workflow is enabled by RetDec’s IDA and Radare2 plugins, but not so much by its native one-off mode of operation, especially while performance on medium-to-large files is still an ongoing issue. We also believe in the ever-increasing role of advanced automated malware analysis.
For these reasons, RetDec will move further in the direction outlined in the previous section. Having all the decompilation functionality available in a set of libraries will enable us to build better tools for both manual and automated malware analysis.
Reversing tools series
With this introductory piece, we are starting a series of articles focused on the engineering behind reversing. So, if you are interested in the inner workings of such tools, keep an eye out for new posts here!