FLARE IDA Pro Script Series: Automatic Recovery of Constructed Strings in Malware
The FireEye Labs Advanced Reverse Engineering (FLARE) Team is dedicated to sharing knowledge and tools with the community. We started with the release of the FLARE On Challenge in early July where thousands of reverse engineers and security enthusiasts participated. Stay tuned for a write-up of the challenge solutions in an upcoming blog post.
This post is the start of a series where we look to aid other malware analysts in the field. Since IDA Pro is the most popular tool used by malware analysts, we’ll focus on releasing scripts and plug-ins to help make it an even more effective tool for fighting evil. In the past, at Mandiant we released scripts on GitHub and we’ll continue to do so at the following location. This is where you will also find the plug-ins we released in the past: Shellcode Hashes and Struct Typer. We hope you find all these scripts as useful as we do.
Let’s start with a simple challenge. What two strings are printed when executing the disassembly shown in Figure 1?
If you answered
“Hello world\n” and
“Hello there\n”, good job! If you didn’t see it then Figure 2 makes this more obvious. The bytes that make up the strings have been converted to characters and the local variables are converted to arrays to show buffer offsets.
Reverse engineers are likely more accustomed to strings that are a consecutive sequence of human-readable characters in the file, as shown in Figure 3. IDA generally does a good job of cross-referencing these strings in code as can be seen in Figure 4.
Manually constructed strings like in Figure 1 are often seen in malware. The bytes that make up the strings are stored within the actual instructions rather than a traditional consecutive sequence of bytes. Simple static analysis with tools such as strings cannot detect these strings. The code in Figure 5, used to create the challenge disassembly, shows how easy it is for a malware author to use this technique.
Automating the recovery of these strings during malware analysis is simple if the compiler follows a basic pattern. A quick examination of the disassembly in Figure 1 could lead you to write a script that searches for
mov instructions that begin with the opcodes
C6 45 and then extract the stack offset and character bytes. Modern compilers with optimizations enabled often complicate matters as they may:
- Load frequently used characters in registers which are used to copy bytes into the buffer
- Reuse a buffer for multiple strings
- Construct the string out of order
Figure 6 shows the disassembly of the same source code that was compiled with optimizations enabled. This caused the compiler to load some of the frequently occurring characters in registers to reduce the size of the resulting assembly. Extra instructions are required to load the registers with a value like the 2-byte mov instruction at
0040115A, but using these registers requires only a 4-byte mov instruction like at
mov instructions that contain hard-coded byte values are 5-bytes, such as at
The StackStrings IDA Pro Plug-in
To help you defeat malware that contains these manually constructed strings we’re releasing an IDA Pro plug-in named StackStrings that is available at our Github page. The plug-in relies heavily on analysis by a Python library called Vivisect. Vivisect is a binary analysis framework frequently used to augment our analysis. StackStrings uses Vivisect’s analysis and emulation capabilities to track simple memory usage by the malware. The plug-in identifies memory writes to consecutive memory addresses of likely string data and then prints the strings and locations, and creates comments where the string is constructed. Figure 7 shows the result of running the above program with the plug-in.
While the plug-in is called StackStrings, its analysis is not just limited to the stack. It also tracks all memory segments accessed during Vivisect’s analysis, so manually constructed strings in global data are identified as well as shown in Figure 8.
Simple, manually constructed WCHAR strings are also identified by the plug-in as shown in Figure 9.
Download Vivisect and add the package to your PYTHONPATH environment variable if you don’t already have it installed.
Clone the git repository at our Github page. The
python\stackstring.py file is the IDA Python script that contains the plug-in logic. This can either be copied to your
%IDADIR%\python directory, or it can be in any directory found in your PYTHONPATH. The
plugins\stackstrings_plugin.py file must be copied to the %IDADIR%\plugins directory.
Test the installation by running the following Python commands within IDA Pro and ensure no error messages are produced:
To run the plugin in IDA Pro go to Edit – Plugins – StackStrings or press Alt+0.
The compiler may aggressively optimize memory and register usage when constructing strings. The worst-case scenario for recovering these strings occurs when a memory buffer is reused multiple times within a function, and if string construction spans multiple basic blocks. Figure 10 shows the construction of
“Hello world\n” and
“Hello there\n”. The plug-in attempts to deal with this by prompting the user by asking whether you want to use the basic-block aggregator or function aggregator. Often the basic-block level of memory aggregation is fine, but in this situation running the plug-in both ways provides additional results.
You’ll likely get some false positives due to how Vivisect initializes some data for its emulation. False positives should be obvious when reviewing results, as seen in Figure 11.
The plug-in aggressively checks for strings during aggregation steps, so you’ll likely get some false positives if the compiler sets null bytes in a stack buffer before the complete string is constructed.
The plug-in currently loads a separate Vivisect workspace for the same executable loaded in IDA. If you’ve manually loaded additional memory segments within your IDB file, Vivisect won’t be aware of that and won’t process those.
Vivisect’s analysis does not always exactly match that of IDA Pro, and differences in the way the stack pointer is tracked between the two programs may affect the reconstruction of stack strings.
If the malware is storing a binary string that is later decoded, even with a simple XOR mask, this plug-in likely won’t work.
The plug-in was originally written to analyze 32-bit x86 samples. It has worked on test 64-bit samples, but it hasn’t been extensively tested for that architecture.
StackStrings is just one of many internally developed tools we use on the FLARE team to speed up our analysis. We hope it will help speed up your analysis too. Stay tuned for our next post where we’ll release another tool to improve your malware analysis workflow.