
Delving into Dalvik: A Look Into DEX Files

Aseel Kayal
Mar 06, 2024
Threat Intelligence

During the analysis of a banking trojan sample targeting Android smartphones, Mandiant identified the repeated use of a string obfuscation mechanism throughout the application code. To fully analyze and understand the application's functionality, one possibility is to manually decode the strings in each obfuscated method encountered, which can be a time-consuming and repetitive process. 

Another possibility is to use paid tools such as the JEB decompiler, which allow quick identification and patching of code in Android applications, but we found that free static analysis tools offer only limited ability to do the same. We therefore explored the possibility of finding and modifying the obfuscated methods by inspecting the Dalvik bytecode.

Through a case study of the banking trojan sample, this blog post aims to give an insight into the Dalvik Executable file format, how it is constructed, and how it can be altered to make analysis easier. Additionally, we are releasing a tool called dexmod that exemplifies Dalvik bytecode patching and helps modify DEX files.

Case Study

In this case study, we will examine a malicious sample of the Nexus banking trojan (File MD5: d87e04db4f4a36df263ecbfe8a8605bd). Nexus is a framework offered for sale in an underground forum, and it is capable of stealing funds from numerous banking applications on Android phones. A report published by Cyble offers more details about this framework and a thorough analysis of the sample.

Analyzing the sample with jadx shows that the application's AndroidManifest.xml file (d87...) requests access to the device's SMS messages, contacts, phone calls, and other sensitive information. The main activity declared in AndroidManifest.xml is not initially present in the application because it is unpacked later, but another class named in the manifest, "com.toss.soda.RWzFxGbGeHaKi", extends the Application class, meaning it will be the first class to run in the application:

Figure 1: Main activity and Application subclass in AndroidManifest.xml

The onCreate() callback in the Application subclass, "com.toss.soda.RWzFxGbGeHaKi", refers to two additional methods: melodynight() and justclinic(), and the latter only calls another method: bleakperfect().

Figure 2: onCreate() method in the Application subclass

The bleakperfect() method, along with several others in the application, contains a large amount of dead code that assigns values to variables and performs arithmetic operations on them in multiple loops, but the variables are ultimately never used.

Furthermore, this method is used to decode strings that are referenced elsewhere in the code. This is done by XORing a byte array (the encoded string) with another byte array (the XOR key), and storing the result in a third byte array that is converted into a string.

Figure 3: Excerpt from obfuscated method to decode a string
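
The decoding step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the sample's actual code: the key and encoded bytes are made up, and repeating the key when it is shorter than the data is an assumption of this sketch.

```python
def xor_decode(encoded: bytes, key: bytes) -> str:
    """XOR every encoded byte with the key byte at the same position.

    Assumption of this sketch: the key wraps around if it is shorter
    than the encoded data.
    """
    decoded = bytes(b ^ key[i % len(key)] for i, b in enumerate(encoded))
    return decoded.decode("utf-8")


# Made-up key and data, for illustration only.
key = bytes([0x13, 0x37, 0x42, 0x99])
encoded = bytes(b ^ key[i % len(key)] for i, b in enumerate(b"sampleValue"))

print(xor_decode(encoded, key))
```

Because XOR is its own inverse, the same routine that produced the encoded array also recovers the original string.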

Patching methods such as this one to remove the redundant code and replace the lengthy XOR operation with a simple string return can make the analysis of the application much easier and more time efficient. To do this, we must understand how this code appears in DEX files.

DEX Overview

Android applications are primarily written in Java. To run on Android devices, the Java code is compiled into Java bytecode, and then translated into Dalvik bytecode. The Dalvik bytecode can be found in DEX (Dalvik Executable) files in the APK. An APK (Android Package Kit) is essentially a ZIP file that contains an application's code and needed resources. It is possible to examine DEX files by extracting the APK's contents. 
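
Since an APK is a ZIP archive, its DEX files can be pulled out with Python's standard zipfile module. The following is a small sketch; the helper name read_dex_files and the in-memory stand-in for an APK are assumptions made for the example.

```python
import io
import zipfile

DEX_MAGIC = b"dex\n"  # every DEX file starts with "dex\n" followed by a version


def read_dex_files(apk):
    """Return {entry name: raw bytes} for every DEX entry in an APK.

    `apk` may be a file path or a binary file-like object.
    """
    dex_files = {}
    with zipfile.ZipFile(apk) as archive:
        for name in archive.namelist():
            if name.endswith(".dex"):
                data = archive.read(name)
                if data[:4] == DEX_MAGIC:
                    dex_files[name] = data
    return dex_files


# Demo on an in-memory stand-in for an APK (the contents are made up).
fake_apk = io.BytesIO()
with zipfile.ZipFile(fake_apk, "w") as archive:
    archive.writestr("classes.dex", b"dex\n035\x00" + bytes(8))
    archive.writestr("AndroidManifest.xml", b"<manifest/>")

print(list(read_dex_files(fake_apk)))
```

With a real sample, the path to the APK would be passed instead of the in-memory buffer.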

DEX files are divided into several sections, including a header, string table, class definitions, method code, and other data. Most sections are divided into chunks of equal size that hold multiple values to define the items in the section. To show how common concepts in Java such as classes or strings are translated in a DEX file, we will use the class_defs section as an example.

Figure 4: Illustration of DEX file sections and items

Classes

The class_defs section is composed of class_def_items, one for every class in the application, each 32 bytes long. The name of the class is stored in the following way: a class_def_item holds an index (class_idx) to an item in the type_ids section, which in turn holds an index (descriptor_idx) to another item in the string_ids section.

The value under the string_id_item is an offset from the start of the file, which points to the start of a string_data_item that contains the actual class name string (data), preceded by its length (utf16_size).

Figure 5: Class name from class_def_item
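
The lookup chain above can be followed programmatically. The sketch below assumes the standard DEX header layout, in which string_ids_off, type_ids_off, and class_defs_off are stored at file offsets 0x3C, 0x44, and 0x64, and it assumes that utf16_size is ULEB128-encoded and that the string data is terminated by a zero byte. These details come from the DEX format documentation rather than from the text above, the helper names are hypothetical, and decoding MUTF-8 as UTF-8 is a simplification that holds for typical class names.

```python
import struct


def read_uleb128(data, offset):
    """Decode one ULEB128 value; return (value, offset after it)."""
    result = shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, offset
        shift += 7


def read_string(dex, string_idx):
    """Follow a string_id_item to its string_data_item and return the text."""
    string_ids_off = struct.unpack_from("<I", dex, 0x3C)[0]
    # Each string_id_item is a single 4-byte offset from the start of the file.
    string_data_off = struct.unpack_from("<I", dex, string_ids_off + 4 * string_idx)[0]
    _utf16_size, start = read_uleb128(dex, string_data_off)
    end = dex.index(b"\x00", start)
    return bytes(dex[start:end]).decode("utf-8")  # simplification of MUTF-8


def class_name(dex, class_def_index):
    """class_def_item.class_idx -> type_id_item.descriptor_idx -> string."""
    type_ids_off = struct.unpack_from("<I", dex, 0x44)[0]
    class_defs_off = struct.unpack_from("<I", dex, 0x64)[0]
    # class_def_items are 32 bytes long and start with class_idx.
    class_idx = struct.unpack_from("<I", dex, class_defs_off + 32 * class_def_index)[0]
    # Each type_id_item is a single 4-byte descriptor_idx.
    descriptor_idx = struct.unpack_from("<I", dex, type_ids_off + 4 * class_idx)[0]
    return read_string(dex, descriptor_idx)
```

Calling class_name(dex_bytes, 0) on the contents of a DEX file would return the descriptor of the first class in type-descriptor form, for example a string of the shape "Lcom/example/Name;".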

The class_def_item has another member (class_data_off), an offset to a class_data_item that represents the data associated with the class. It contains information about the direct and virtual methods of the class, the static and instance fields of the class, and matching encoded_method and encoded_field items for each method and field.

Methods

The direct_methods and virtual_methods lists each hold a sequence of encoded_method items. In the first encoded_method item of each list, the method_idx_diff value holds the index of the matching item in the method_ids section.

In subsequent items, however, this value is the difference from the index of the previous item, so the method_ids index is calculated by adding the difference to the index computed for the previous item (that is, to the sum of all preceding method_idx_diff values in the list).

Figure 6: Calculation of method_id_item index
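
In other words, the method_ids indices are a running sum of the method_idx_diff values. A minimal sketch, assuming the diffs have already been decoded from the class_data_item and using made-up values:

```python
from itertools import accumulate


def method_indices(method_idx_diffs):
    """Turn a list of method_idx_diff values into method_ids indices.

    The first diff is the index itself; each later index is the previous
    index plus the current diff.
    """
    return list(accumulate(method_idx_diffs))


# Made-up diffs for three encoded_method items in one list.
print(method_indices([5, 1, 3]))  # [5, 6, 9]
```

The direct_methods and virtual_methods lists are handled separately, so the running sum restarts for each of them.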

Finally, the method's name in the method_id_item is stored under name_idx similarly to the class name in the type_id_item, and the string value of the method name is retrieved using a string_id_item index.

Figure 7: Method name retrieval from encoded_method item

Each method in an Android application has a preface (or a code_item) that specifies information about the method's size, input and output arguments, and exception handling data. The offset of this preface in the DEX file is stored in the code_off value of the previously mentioned encoded_method item.

The first two bytes of the preface represent registers_size, the number of registers used by the bytecode, followed by the word sizes of the incoming and outgoing arguments (ins_size and outs_size), while the last four bytes of the preface hold the bytecode size (insns_size).

The bytecode size is counted in 16-bit instruction units, meaning that to calculate the number of total bytes (8-bit units) in the bytecode, this value has to be multiplied by two. The method's Dalvik bytecode starts directly after the preface.

Figure 8: Method preface and bytecode
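
The preface can be read with Python's struct module. This sketch assumes the standard 16-byte code_item layout of registers_size, ins_size, outs_size, tries_size, debug_info_off, and insns_size; the tries_size and debug_info_off field names come from the DEX format documentation rather than from the text above, and the function name is hypothetical.

```python
import struct
from collections import namedtuple

CodeItemHeader = namedtuple(
    "CodeItemHeader",
    "registers_size ins_size outs_size tries_size debug_info_off insns_size",
)

HEADER_FORMAT = "<HHHHII"  # four 16-bit values, then two 32-bit values
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 16 bytes


def parse_code_item(dex, code_off):
    """Return (header, bytecode) for the code_item at code_off."""
    header = CodeItemHeader(*struct.unpack_from(HEADER_FORMAT, dex, code_off))
    insns_start = code_off + HEADER_SIZE
    # insns_size counts 16-bit units, so the byte length is twice that.
    bytecode = bytes(dex[insns_start:insns_start + header.insns_size * 2])
    return header, bytecode
```

The code_off argument is the value stored in the method's encoded_method item.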

Strings

So far, we have seen two examples of string_id_items being used to fetch class and method names from the strings table in the DEX file. But a string_id_item is also important in Dalvik bytecode, and it is referred to when using string values in the application code itself. 

For example, the following bytecode sequence returns the "sampleValue" string, where 0xABCD is the index of the "sampleValue" string_id_item in the string_ids section, stored in little-endian byte order as CD AB (an overview of the Dalvik bytecode and its opcode set is available in the Android documentation).

1A 00 CD AB            # const-string v0, "sampleValue" [string@ABCD]

11 00                  # return-object v0
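
The same six bytes can be assembled for any string index. A minimal sketch with a hypothetical helper name; it always uses register v0 and rejects indices that do not fit in the 16-bit operand of const-string:

```python
import struct

CONST_STRING = 0x1A   # const-string vAA, string@BBBB
RETURN_OBJECT = 0x11  # return-object vAA


def return_string_bytecode(string_idx):
    """Build 'const-string v0, string@idx; return-object v0' as raw bytes."""
    if not 0 <= string_idx <= 0xFFFF:
        raise ValueError("const-string only encodes 16-bit string indices")
    # The index is stored little-endian, so 0xABCD becomes CD AB.
    return struct.pack("<BBH", CONST_STRING, 0x00, string_idx) + bytes([RETURN_OBJECT, 0x00])


print(return_string_bytecode(0xABCD).hex())  # 1a00cdab1100
```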

This means that one obstacle to patching the bytecode of the malicious sample is that the decoded strings which the obfuscated methods should return are not present in the DEX file's string table. Instead, they have to be added to the file after being decoded in order to have a matching string_data_item and a string_id_item index that can be referenced by the code.

Naturally, adding those strings causes changes to the file's section sizes, indices, and offsets. This creates another obstacle as there are multiple dependencies between different items in the previously shown DEX file, and changing the indices or offsets they reference will cause the items to be parsed improperly or have incorrect member values. This is why when patching the methods, it is necessary to make sure that the rest of the DEX file remains intact.
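
To give an idea of what a newly added string looks like on disk, the sketch below builds a string_data_item. It assumes that utf16_size is ULEB128-encoded and that the data is terminated by a zero byte, as described in the DEX format documentation, and it restricts itself to ASCII text so that the length in UTF-16 units equals the number of bytes. It only builds the item; it does not insert the item into a file or fix any indices or offsets.

```python
def encode_uleb128(value):
    """Encode a non-negative integer as ULEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def string_data_item(text):
    """Build the bytes of a string_data_item for an ASCII string."""
    data = text.encode("ascii")  # ASCII-only assumption of this sketch
    return encode_uleb128(len(text)) + data + b"\x00"


print(string_data_item("sampleValue"))  # b'\x0bsampleValue\x00'
```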

Patching

To accomplish this, we created dexmod, a Python helper tool that patches DEX files according to the deobfuscation logic specified by the user. In addition to patching, the tool supports operations such as looking up methods by a bytecode pattern and adding strings. Documentation of this tool can be found in the Appendix.

For obfuscated methods in the Nexus sample to return decoded strings, the strings have to be decoded and added to the file with the help of dexmod. Afterwards, the string-returning bytecode sequence shown earlier is placed at the start of each obfuscated method's bytecode with the corresponding string_id_item index. Any remaining bytes in the method can be replaced with 0x00 (NOP) for additional code cleanup, but this is not necessary.

Each method's preface needs to be updated as well to reflect those changes: the register size is decreased to 1, as only one register (v0) is used, and the bytecode size is updated to 3, given that the bytecode now consists of only three 16-bit code units (6 bytes). The rest of the values in the preface can remain unchanged since the items they represent were not affected.

Figure 9: Patched bytecode
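
The patching steps above can be sketched as a single function operating on the DEX contents in a bytearray. This is an illustration and not dexmod's implementation: the function name is hypothetical, the 16-byte preface layout is taken from the DEX format documentation, and as a precaution of this sketch it refuses methods that have try blocks or more than one incoming argument word.

```python
import struct

HEADER_FORMAT = "<HHHHII"  # registers, ins, outs, tries, debug_info_off, insns_size
HEADER_SIZE = 16


def patch_method(dex: bytearray, code_off: int, string_idx: int) -> None:
    """Make the method at code_off return the string with index string_idx."""
    _regs, ins, _outs, tries, _debug, insns_size = struct.unpack_from(
        HEADER_FORMAT, dex, code_off
    )
    # const-string v0, string@idx ; return-object v0
    new_code = struct.pack("<BBH", 0x1A, 0x00, string_idx) + b"\x11\x00"
    old_len = insns_size * 2
    if old_len < len(new_code):
        raise ValueError("method body is too small for the patch")
    if tries != 0 or ins > 1:
        raise ValueError("sketch only handles methods without try blocks")

    insns_start = code_off + HEADER_SIZE
    # New bytecode first, then NOP (0x00) padding over the old instructions.
    dex[insns_start:insns_start + old_len] = new_code + b"\x00" * (old_len - len(new_code))
    struct.pack_into("<H", dex, code_off, 1)                        # registers_size
    struct.pack_into("<I", dex, code_off + 12, len(new_code) // 2)  # insns_size = 3
```

The file keeps its length because the old instructions are overwritten in place, so no other offsets in the DEX file move as a result of this step.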

The checksum and SHA-1 signature values in the DEX file's header have to be updated too; otherwise, the verification of the file content will fail. After these steps are implemented using dexmod, we can reexamine the DEX file using jadx, and the once-obfuscated functions will now have all the dead code removed and instead return the decoded strings:

Figure 10: Patched methods returning decoded strings
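
The header update mentioned above can be sketched with the standard library. The sketch assumes the layout from the DEX format documentation: an 8-byte magic, a 4-byte Adler-32 checksum at offset 8, and a 20-byte SHA-1 signature at offset 12. The checksum covers everything after the checksum field, including the signature, so the signature has to be written first. The function name is hypothetical.

```python
import hashlib
import struct
import zlib

CHECKSUM_OFF = 8     # 4-byte Adler-32 checksum
SIGNATURE_OFF = 12   # 20-byte SHA-1 signature
SIGNED_DATA_OFF = 32 # first byte after the signature


def fix_dex_header(dex: bytearray) -> None:
    """Recompute the SHA-1 signature and Adler-32 checksum in place."""
    # The signature covers everything after the signature field.
    dex[SIGNATURE_OFF:SIGNED_DATA_OFF] = hashlib.sha1(bytes(dex[SIGNED_DATA_OFF:])).digest()
    # The checksum covers everything after the checksum field.
    checksum = zlib.adler32(bytes(dex[SIGNATURE_OFF:])) & 0xFFFFFFFF
    struct.pack_into("<I", dex, CHECKSUM_OFF, checksum)
```

Running this as the last step, after all strings and bytecode patches are in place, keeps the two values consistent with the final file contents.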

Since the obfuscated methods in the Nexus sample are called by another method rather than directly, another possibility is to patch the caller method and return a string to skip the obfuscated one entirely. Doing so saves researchers repetitive jumps between methods during their analysis.

Takeaways

This case study shows how useful Dalvik bytecode patching can be for researchers, and how it can be achieved with free, open-source tools. Similar to the problems faced by other deobfuscation solutions, packers and obfuscation techniques are updated frequently, and it is unfortunately difficult to come up with a patching solution that will work for a large number of applications over a long period of time. In addition, although searching an application's bytecode is efficient for identifying code patterns, attempting to modify a DEX file without corrupting certain parts of it can be a challenge. Nevertheless, we are releasing this blog post along with the dexmod code for the sample we inspected, in the hopes that it will inspire and assist others in exploring malicious Android applications.

Appendix: Code

DexMod

The dexmod tool contains the following scripts:

  • dexmod.py
    Main module, accepts a DEX file name as an argument and calls methods from editBytecode.py to patch the file
  • getMethodObjects.py
    Creates method objects with the attributes:
    - methodIdx: the method_idx value, which is referenced in the Dalvik bytecode to call the method
    - offset: the file offset of the method's bytecode
    - name: the method's name
    - bytecode: the method's bytecode
  • searchBytecode.py
    Looks for a bytecode pattern in the DEX file and returns matching method objects
  • editStrings.py
    Adds strings to the DEX file
  • editBytecode.py
    Intended for the implementation of custom patching logic; contains empty methods
  • example/editBytecodeCustom.py
    Implements the patching logic for the case study in this blog post
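
To illustrate the idea behind looking up methods by a bytecode pattern, the following sketch filters method records with a byte-level regular expression. It is not the code in searchBytecode.py: the Method record is a hypothetical stand-in that only mirrors the attribute names listed above, and the sample methods are made up.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the method objects described above.
Method = namedtuple("Method", "methodIdx offset name bytecode")


def find_methods(methods, pattern):
    """Return the methods whose bytecode contains the byte pattern.

    `pattern` is a bytes regular expression; with re.DOTALL a '.' matches
    any single byte and can serve as a wildcard, e.g. for a register number.
    """
    regex = re.compile(pattern, re.DOTALL)
    return [m for m in methods if regex.search(m.bytecode)]


# Made-up methods for illustration.
methods = [
    Method(0, 0x100, "decodeString", bytes.fromhex("1A00CDAB1100")),
    Method(1, 0x140, "doNothing", bytes.fromhex("0E00")),
]

# const-string <any register>, string@ABCD, followed by return-object
matches = find_methods(methods, rb"\x1a.\xcd\xab\x11")
print([m.name for m in matches])
```

A byte-level search like this can also match operand bytes that merely look like opcodes, so the results are best treated as candidates to confirm in a decompiler.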

The dexmod tool makes use of dexterity, an open-source library that parses DEX files and assists in adding strings to the DEX file while fixing references to the affected string IDs and other sections' offsets. The dexterity library has some limitations: for one, it does not fix the string indices referenced in the bytecode. Some changes were therefore applied to its code during this case study so that strings are added properly.