capa: Automatically Identify Malware Capabilities
capa is the FLARE team’s newest open-source tool for analyzing malicious programs. Our tool provides a framework for the community to encode, recognize, and share behaviors that we’ve seen in malware. Regardless of your background, when you use capa, you invoke decades of cumulative reverse engineering experience to figure out what a program does. In this post you will learn how capa works, how to install and use the tool, and why you should integrate it into your triage workflow starting today.
Effective analysts can quickly understand and prioritize unknown files in investigations. However, determining if a program is malicious, the role it plays during an attack, and its potential capabilities requires at least basic malware analysis skills. And often, it takes an experienced reverse engineer to recover a file’s complete functionality and guess at the author’s intent.
Malware experts can quickly triage unknown binaries to gain first insights and guide further analysis steps. Less experienced analysts, on the other hand, often don’t know what to look for and have trouble distinguishing the usual from the unusual. Unfortunately, common tools like strings / FLOSS or PE viewers display only the lowest level of detail, leaving it to their users to combine and interpret the individual data points.
Malware Triage 01-01
To illustrate this, let us look at Lab 01-01 from Practical Malware Analysis (PMA). Our goal is to understand the program’s functionality. Figure 1 shows the file’s strings and import table with interesting values highlighted.
With this data, reverse engineers can hypothesize about the strings and imported API functions to guess at the program’s functionality—but no more. The sample may create a mutex, start a process, or communicate over the network—potentially to IP address 127.26.152.13. The Winsock (WS2_32) imports make us think about network functionality, but the names are not available here because they are, as is common, imported by ordinal.
Dynamically analyzing this sample can confirm or disprove initial suspicions and reveal additional functionality. However, sandbox reports and dynamic analysis tools are limited to capturing behavior from the code paths that were actually exercised. This excludes, for example, any functionality triggered only after a successful connection to the command and control (C2) server, since we don’t usually recommend analyzing malware with a live Internet connection.
To really understand this file, we need to reverse engineer it. Figure 2 shows IDA Pro’s decompilation of the program’s main function. While we use the decompilation instead of disassembly to simplify our explanation, similar concepts apply to both representations.
With a basic understanding of programming and the Windows API, we observe the following functionality. The malware:
- creates a mutex to ensure only one instance is running
- creates a TCP socket, as indicated by the constants 2 = AF_INET, 1 = SOCK_STREAM, and 6 = IPPROTO_TCP
- connects to IP address 127.26.152.13 on port 80
- sends and receives data
- compares received data to the strings sleep and exec
- creates a new process
Although not every code path may execute on each run, we say that the malware has the capability to execute these behaviors. And, by combining the individual conclusions, we can reason that the malware is a backdoor that can run an arbitrary program specified by a hard-coded C2 server. This high-level conclusion enables us to scope an investigation and decide how to respond to the threat.
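The socket constants in the list above can be double-checked against Python’s socket module, which exposes the same values on common platforms. The following is a minimal sketch for decoding such arguments during triage; the helper name describe_socket_args is made up for this illustration and is not part of any tool discussed here.

```python
import socket


def describe_socket_args(family: int, sock_type: int, protocol: int) -> str:
    """Translate raw socket() arguments into their symbolic names."""
    # AddressFamily and SocketKind are enums, so a lookup by value works.
    family_name = socket.AddressFamily(family).name
    type_name = socket.SocketKind(sock_type).name
    # Protocol numbers are plain ints in the socket module, so map them by hand.
    proto_names = {
        socket.IPPROTO_TCP: "IPPROTO_TCP",
        socket.IPPROTO_UDP: "IPPROTO_UDP",
    }
    proto_name = proto_names.get(protocol, f"protocol {protocol}")
    return f"{family_name}, {type_name}, {proto_name}"


# The three constants seen in the decompiled socket call.
print(describe_socket_args(2, 1, 6))
```

Seeing the symbolic names side by side is what lets an analyst read the call as “create a TCP socket” rather than as three unexplained integers.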
Automating Capability Identification
Of course, malware analysis is rarely this straightforward. The artifacts of intent may be spread through a binary that contains hundreds or thousands of functions. Furthermore, reverse engineering has a fairly steep learning curve and requires a solid understanding of many low-level concepts such as assembly language and operating system internals.
However, with enough practice, we can recognize capabilities in programs simply from repetitive patterns of API calls, strings, constants, and other features. With capa, we demonstrate that some of our key analysis conclusions can actually be reached automatically. The tool provides a common yet flexible way to codify expert knowledge and make it available to the entire community. When you run capa, it recognizes features and patterns as a human might, producing high-level conclusions that can drive subsequent investigative steps. For example, when capa recognizes the ability for unencrypted HTTP communication, this might be the hint you need to pivot into proxy logs or other network traces.
When we run capa against our example program, the tool output in Figure 3 almost speaks for itself. The main table shows all identified capabilities in this sample, with each entry on the left describing a capability. The associated namespace on the right helps to group related capabilities. capa did a fantastic job and described all the program capabilities we’ve discussed in the previous section.
We find that capa often provides surprisingly good results. That’s why we want capa to always be able to show the evidence used to identify a capability. Figure 4 shows capa’s detailed output for the “create TCP socket” conclusion. Here, we can inspect the exact locations in the binary where capa found the relevant features. We’ll see the syntax of rules a bit later – in the meantime, we can surmise that they’re made up of a logic tree combining low-level features.
How capa Works
capa consists of two main components that algorithmically triage unknown programs. First, a code analysis engine extracts features from files, such as strings, disassembly, and control flow. Second, a logic engine finds combinations of features that are expressed in a common rule format. When the logic engine finds a match, capa reports on the capability described by the rule.
The code analysis engine extracts low-level features from programs. All the features are consistent with what a human might recognize, such as strings or numbers, and enable capa to explain its work. These features typically fall into two large categories: file features and disassembly features.
File features are extracted from the raw file data and its structure, e.g. the PE file header. This is information that you might notice by scrolling across the entire file. Besides the strings and imported APIs discussed above, these include exported function names and section names.
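To make the idea of a file feature concrete, here is a minimal sketch of how printable strings can be pulled out of raw bytes, in the spirit of the strings utility. It is an illustration only and not capa’s extractor; the function name extract_ascii_strings and the sample bytes are assumptions made for this example.

```python
import re


def extract_ascii_strings(data: bytes, min_length: int = 4) -> list[str]:
    """Return every run of printable ASCII that is at least min_length bytes long."""
    # 0x20-0x7e covers the printable ASCII range, space through tilde.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Made-up bytes standing in for a slice of a PE file.
sample = b"\x00\x01kernel32.dll\x00\xffCreateProcessA\x00ab\x00127.26.152.13\x00"
print(extract_ascii_strings(sample))
```

Runs shorter than the minimum length, such as the two-byte “ab” in the sample, are dropped, which is the usual trade-off between noise and completeness when scanning a whole file.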
Disassembly features are extracted from an advanced static analysis of a file – this means disassembling and reconstructing control flow. Figure 5 shows selected disassembly features including API calls, instruction mnemonics, numbers, and string references.
Because the advanced analysis can distinguish between functions and other scopes in a program, capa can apply its logic at an appropriate level of detail. For example, it doesn’t get confused when unrelated APIs are used in different functions since capa rules can specify that they should be matched against each function independently.
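The effect of scopes can be sketched with plain sets. In this simplified illustration, which does not reflect capa’s internal data structures, each function is reduced to the set of API names it calls; the function names and feature sets are hypothetical.

```python
# Hypothetical feature sets: one set of called API names per function.
features_by_function = {
    "sub_401000": {"socket", "connect"},
    "sub_402000": {"CreateProcessA"},
}

# A pattern that only makes sense if both APIs are used together.
required = {"socket", "CreateProcessA"}


def match_file_scope(functions, required):
    """True if the required features appear anywhere in the file."""
    all_features = set().union(*functions.values())
    return required <= all_features


def match_function_scope(functions, required):
    """Return the names of functions that contain every required feature on their own."""
    return [name for name, feats in functions.items() if required <= feats]


print(match_file_scope(features_by_function, required))
print(match_function_scope(features_by_function, required))
```

Matched across the whole file, the two unrelated APIs look like one pattern; matched per function, no single function satisfies the requirement, so nothing is reported.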
We’ve designed capa with flexible and extendable feature extraction in mind. Additional code analysis backends can be integrated easily. Currently, the capa standalone version relies on the vivisect analysis framework. If you’re using IDA Pro, you can also run capa using the IDAPython backend. Note that sometimes differences among code analysis engines may result in divergent feature sets and hence different results. Fortunately, this usually isn’t a serious problem in practice.
A capa rule uses a structured combination of features to describe a capability that may be implemented in a program. If all required features are present, capa concludes that the program contains the capability.
capa rules are YAML documents that contain metadata and a tree of statements to express their logic. Among other things, the rule language supports logical operators and counting. In Figure 6, the “create TCP socket” rule says that the numbers 6, 1, and 2, and calls to either of the API functions socket or WSASocket must be present in the scope of a single basic block. Basic blocks group assembly code at a very low level, making them an ideal place to match tightly related code segments. Besides matching within basic blocks, capa supports matching at the function and the file level. The function scope ties together all features in a disassembled function, while the file scope contains all features across the entire file.
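The logic described for the “create TCP socket” rule can be sketched as a small tree evaluator. This is a simplified model under our own assumptions, not capa’s rule engine or its YAML syntax: the tuple encoding, the evaluate function, and the example feature sets are invented for illustration, and counting is left out.

```python
def evaluate(node, features):
    """Recursively evaluate a rule tree against a set of (kind, value) features."""
    op = node[0]
    if op == "and":
        return all(evaluate(child, features) for child in node[1])
    if op == "or":
        return any(evaluate(child, features) for child in node[1])
    # Anything else is a leaf such as ("number", 6) or ("api", "socket").
    return node in features


# The statement tree as the text describes it: three numbers plus one of two APIs.
create_tcp_socket = ("and", (
    ("number", 6),
    ("number", 1),
    ("number", 2),
    ("or", (("api", "socket"), ("api", "WSASocket"))),
))

# Hypothetical features extracted from two basic blocks.
socket_block = {("number", 2), ("number", 1), ("number", 6), ("api", "socket")}
connect_block = {("number", 2), ("api", "connect")}

print(evaluate(create_tcp_socket, socket_block))
print(evaluate(create_tcp_socket, connect_block))
```

Because the evaluator only ever sees the features of one basic block at a time, the three numbers and the API call have to occur close together in the code before the rule matches.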
Figure 7 highlights the rule metadata that enables capa to display high-level, meaningful results to its users. The rule name describes the identified capability while the namespace associates it with a technique or analysis category. We already saw the name and namespace in the capability table of capa’s output. The metadata section can also include fields like author or examples. We use examples to reference files and offsets where we know a capability to be present, enabling unit testing and validation of every rule. Moreover, capa rules serve as great documentation for behaviors seen in real-world malware, so feel free to keep a copy around as a reference. In a future post we will discuss other meta information, including capa’s support for the ATT&CK and the Malware Behavior Catalog frameworks.
To make using capa as easy as possible, we provide standalone executables for Windows, Linux, and OSX. The tool is written in Python and the source code is available on our GitHub. Additional and up-to-date installation instructions are available in the capa repository.
To identify capabilities in a program, run capa and specify the input file:
$ capa suspicious.exe
capa supports Windows PE files (EXE, DLL, SYS) and shellcode. To run capa on a shellcode file, you must explicitly specify the file format and architecture. For example, to analyze 32-bit shellcode:
$ capa -f sc32 shellcode.bin
To obtain detailed information on identified capabilities, capa supports two additional verbosity levels. To get the most detailed output on where and why capa matched on rules, use the very verbose option:
$ capa -vv suspicious.exe
If you only want to focus on specific rules, you can use the tag option to filter on fields in the rule meta section:
$ capa -t "create TCP socket" suspicious.exe
Display capa’s help to see all supported options and consult the documentation:
$ capa -h
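If you want to drive capa from a triage script, the invocations above can be assembled programmatically. The following is a minimal sketch that uses only the options shown in this post; the helper names build_capa_command and run_capa are our own, and the sketch assumes a capa executable is available on your PATH.

```python
import subprocess


def build_capa_command(path, very_verbose=False, shellcode32=False, rule_filter=None):
    """Assemble a capa command line from the options shown in this post."""
    command = ["capa"]
    if shellcode32:
        command += ["-f", "sc32"]  # treat the input as 32-bit shellcode
    if very_verbose:
        command.append("-vv")  # most detailed output on where and why rules matched
    if rule_filter is not None:
        command += ["-t", rule_filter]  # filter on fields in the rule meta section
    command.append(path)
    return command


def run_capa(path, **options):
    """Run capa and return its text output (assumes capa is installed and on PATH)."""
    result = subprocess.run(
        build_capa_command(path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(build_capa_command("suspicious.exe", very_verbose=True))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with rule names that contain spaces, such as "create TCP socket".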
We hope that capa brings value to the community and encourage any type of contribution. Your feedback, ideas, and pull requests are very welcome. The contributing document is a great starting point.
Rules are the foundation of capa’s identification algorithm. We want to make it easy and fun to write them. If you have any rule ideas, please open an issue or, even better, submit a pull request to capa-rules. This way, everyone can benefit from the collective knowledge of our malware analysis community.
To separate our work and discussions between the capa source code and the supported rules, we use a second GitHub repository for all rules that come embedded within capa. The capa main repository embeds the rule repository as a git submodule. Please refer to the rules repository for further details, including the rule format documentation.
In this blog post we have introduced the FLARE team’s newest contribution to the malware analysis community. capa is an open-source framework to encode, recognize, and share behaviors seen in malware. We think that the community needs this type of tool to fight back against the volume of malware that we encounter during investigations, hunting, and triage. Regardless of your background, when you use capa, you invoke decades of cumulative experience to figure out what a program does.
Try out capa in your next malware analysis. The tool is extremely easy to use and can provide valuable information for forensic analysts, incident responders, and reverse engineers. If you enjoy the tool, run into issues using it, or have any other comments, please contact us via the project’s GitHub page.