Analyzing Malicious Office Documents with OLEDUMP

Microsoft Office documents are a common vehicle used by malware authors to deliver malware. Documents used for malicious purposes are commonly referred to as maldocs. While they have been abused in a variety of ways, one of the more prevalent is through macros. Macros are written in Visual Basic for Applications (VBA), which is well documented on the Microsoft Developer Network (MSDN). VBA allows malware authors to hook into life-cycle events of a document, such as AutoOpen, AutoClose and AutoExit (MSDN), in order to achieve code execution with minimal interaction from the user. While a variety of security protections are now offered through the Office suite, maldocs continue to plague both enterprise and home users. In this post we’ll be looking into oledump, a Python-based tool produced by Didier Stevens that can be used to extract and inspect macro streams.

Oledump: Help and Usage

Oledump is a Python-based program, so all you need installed on your analysis system is Python and a single dependency, OleFileIO_PL (found here). You can also grab a copy of REMnux, a preconfigured malware analysis distribution of Ubuntu Linux. This distro has oledump already set up and mapped to your path, so all you have to do is open up a terminal and go! I will be using the latest version of REMnux at the time of this writing.

To use oledump, open a terminal and type the following command. The “-h” argument will display help and usage information:

> oledump.py -h

Your results should be similar to the following:

Finding Macros

Now we’re ready to analyze a malicious document – I’ll be using this sample if you want to follow along. The first step is usually to determine if the document contains macros. Running oledump and providing the path to the maldoc as the single argument will produce what I consider to be a table of contents. Office documents use the OLE format for storing and organizing their content (MSDN). When you view the content of a maldoc, you are looking at the streams and storages of that document. Oledump helps to identify those streams that contain macros by adding an uppercase or lowercase M next to the index. Run the following command to see the structure of our malicious document.

> oledump.py demo.doc

In this document, streams 11, 12 and 13 have been identified to contain macro code. In fact, streams 24, 25, 26 and 27 also contain macro code but are not displayed in this screenshot. Before we get to the difference between the upper and lowercase M, let’s look at the rest of the information being displayed.

The first column is an index; we’ll use this value later to reference and decode any macro stream we’re interested in analyzing. Next is the macro indicator: the character ‘M’ marks a macro stream. After that comes the size of the stream in bytes; sometimes the size of a stream can provide insight into functionality within the document. Finally, the last column is the name and/or type of the stream. In this example, stream 13 is named “Module2”, stream 12 “Module1” and so on. It is common for malware authors to obfuscate these names, giving them meaningless or random values.
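The table of contents can also be processed by a script. The following is a minimal sketch, assuming each line carries the index, an optional M/m marker, the size and a quoted name as described above. The sample listing is invented for illustration (its sizes and full stream paths are not taken from the real sample), and parse_listing and macro_indexes are hypothetical helper names.

```python
import re
from collections import namedtuple

# Invented listing in the column layout described above; sizes and the
# "Macros/VBA/..." paths are placeholders, not values from the real sample.
SAMPLE_LISTING = """\
 10:       412 'Macros/PROJECT'
 11: M    2048 'Macros/VBA/ThisDocument'
 12: M    4096 'Macros/VBA/Module1'
 13: M    1024 'Macros/VBA/Module2'
"""

Entry = namedtuple("Entry", "index marker size name")

# index, optional macro marker, size in bytes, quoted stream name
LINE_RE = re.compile(r"^\s*(\d+):\s+([Mm]?)\s*(\d+)\s+'(.*)'\s*$")


def parse_listing(text):
    """Turn listing text into Entry tuples, skipping lines that do not fit."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            index, marker, size, name = match.groups()
            entries.append(Entry(int(index), marker, int(size), name))
    return entries


def macro_indexes(entries):
    """Return the indexes of all streams flagged with an M or m marker."""
    return [entry.index for entry in entries if entry.marker]


print(macro_indexes(parse_listing(SAMPLE_LISTING)))
```

The marker is kept as-is in each Entry, so uppercase and lowercase streams can still be told apart later.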

Dumping Macro Streams

When you are ready to view the contents of a macro stream, you have to provide some additional information to oledump through command line arguments: the index of the stream you want to view and an instruction to decompress it. Macro streams are compressed within the document; if you don’t instruct oledump to decompress them, you will not be able to view their content. The two arguments that we now need are “-s” and “-v”. The first argument selects the stream we want to investigate, while the second instructs oledump to decompress that stream. The decompression argument only works with the compressed macro streams.

Let’s look at the code within stream 12. Type the following command:

> oledump.py -s 12 -v demo.doc

I also piped the output through the more utility to make it easier to view:

> oledump.py -s 12 -v demo.doc | more

Your output should look like the following:

Just as we piped the output to the more utility, we can also redirect it to a file. This allows us to use a text editor/IDE to continue the analysis, clean up the code through formatting and refactoring, and trace other functionality.

> oledump.py -s 12 -v demo.doc > stream12
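When several streams need to be dumped, the same command can be repeated from a script. This is a minimal sketch that only uses the “-s” and “-v” arguments shown above; it assumes oledump.py is on your path (as it is on REMnux), and dump_command and dump_streams are hypothetical helper names.

```python
import subprocess


def dump_command(doc_path, index):
    """Build the argument list that dumps one decompressed macro stream."""
    return ["oledump.py", "-s", str(index), "-v", doc_path]


def dump_streams(doc_path, indexes, runner=subprocess.run):
    """Write each stream to a file named stream<index> in the current directory.

    Equivalent to running: oledump.py -s <index> -v <doc> > stream<index>
    """
    written = []
    for index in indexes:
        out_name = f"stream{index}"
        with open(out_name, "wb") as out_file:
            # check=True raises if oledump exits with an error
            runner(dump_command(doc_path, index), stdout=out_file, check=True)
        written.append(out_name)
    return written


# Example call (not executed here): dump_streams("demo.doc", [12, 13])
print(" ".join(dump_command("demo.doc", 12)))
```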

Macro Analysis

Now that you have the ability to analyze the macros, the next step is usually to find the entry point. As previously discussed, malware authors will tie into one of several document life-cycle events to accomplish this. While open events are common, I have also encountered several that use close events. The difference is when the macro code will be executed: code that runs on close may avoid analysis in a sandbox environment or be overlooked by an analyst.
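Searching dumped macro text for these life-cycle events is easy to script. Below is a minimal sketch, assuming the macro has already been dumped to text; the list of event names is not exhaustive, the sample macro is a made-up stand-in, and find_entry_points is a hypothetical helper name.

```python
import re

# Common auto-executing procedure names; deliberately not a complete list.
AUTO_EXEC_NAMES = [
    "AutoOpen", "AutoClose", "AutoExit", "AutoExec", "AutoNew",
    "Auto_Open", "Auto_Close",
    "Document_Open", "Document_Close",
    "Workbook_Open", "Workbook_BeforeClose",
]

# Matches "Sub Name" or "Function Name", optionally Public/Private.
ENTRY_RE = re.compile(
    r"^\s*(?:(?:Public|Private)\s+)?(?:Sub|Function)\s+("
    + "|".join(AUTO_EXEC_NAMES)
    + r")\b",
    re.IGNORECASE,
)

# Made-up stand-in for a dumped macro stream.
SAMPLE_MACRO = """\
Attribute VB_Name = "ThisWorkbook"
Private Sub Workbook_Open()
    SendLastPik
End Sub
"""


def find_entry_points(vba_source):
    """Return (line_number, procedure_name) for each auto-exec procedure."""
    hits = []
    for number, line in enumerate(vba_source.splitlines(), start=1):
        match = ENTRY_RE.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


print(find_entry_points(SAMPLE_MACRO))
```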

For this document, we only need to inspect streams that contain an uppercase M – a lower case m indicates that the stream contains attributes only. Here is an example from stream 26:

Viewing the other macro streams you’ll eventually find that stream 27 utilizes the function Workbook_Open:

Now begins the task of tracing functionality – of course, this assumes that you want to analyze the macros in detail. Workbook_Open calls SendLastPik, which isn’t defined in this stream. Other candidates are streams 12 and 13; dump those macros and you’ll find this function in stream 12. For more prolonged analysis, I tend to dump the streams into files and move to a text editor. For this post, I’ll be using Microsoft’s Visual Studio Code (found here). Another benefit of using an editor such as Code is that many offer language-specific syntax highlighting, which can be seen in the screenshot below.

There are several prevalent obfuscation techniques used by malware authors to complicate this analysis – we didn’t encounter this on the open function but see it now with the function SendLastPik. Some common obfuscation attempts include (this is by no means an exhaustive list):

Adding unnecessary instructions: basic arithmetic, loops, functions, object creation – anything that slows down analysis.
Hiding and obfuscating strings: Strings are an important source of information for what the macros will do, hiding these makes it more difficult to trace functionality.
Hiding and obfuscating objects: Similar to strings, objects are also important to identify. VBA uses objects to write to the file system, execute commands, download content from the internet – hiding these objects makes it more difficult to determine macro purpose.
Nonsensical naming: the use of confusing or randomly generated names also makes it more difficult to trace functionality. Refactoring (i.e. renaming) functions and variables as you perform your analysis can greatly enhance your ability to understand code behavior.
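As one illustration of undoing string hiding, a frequent trick is to build strings from Chr() calls joined with &. The sketch below folds such expressions back into a single literal. It is not taken from this sample (which stores its strings in a user form), the regular expressions only cover simple cases, and fold_chr_calls is a hypothetical helper name.

```python
import re

# Chr(65), ChrW(65) or Chr$(65) with a plain decimal argument
CHR_CALL = re.compile(r"\bChrW?\$?\(\s*(\d+)\s*\)", re.IGNORECASE)

# Two adjacent string literals joined by & or + ("" is an escaped quote in VBA)
CONCAT = re.compile(r'"((?:[^"]|"")*)"\s*[&+]\s*"((?:[^"]|"")*)"')


def fold_chr_calls(line):
    """Replace Chr(n) calls with literals, then merge adjacent literals."""

    def to_literal(match):
        char = chr(int(match.group(1)))
        return '"' + char.replace('"', '""') + '"'

    line = CHR_CALL.sub(to_literal, line)
    # Merge pairs repeatedly until the line stops changing.
    while True:
        merged = CONCAT.sub(lambda m: '"' + m.group(1) + m.group(2) + '"', line)
        if merged == line:
            return line
        line = merged


print(fold_chr_calls('x = Chr(104) & Chr(116) & "tp"'))
```

Running this over a dumped stream line by line, before renaming variables, tends to make the remaining code much easier to read.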

Tracing From the Entry Point

At the beginning of SendLastPik is the creation of a variable – vatafak_1, which is assigned the value from a call to CreateObject. There are two important aspects to this statement. The first is that no matter how much obfuscation a malware author employs, they still have to adhere to the confines of the language to get the functionality they need. Objects are required to make HTTP requests, execute scripts, interact with the underlying file system and other behaviors. While it’s not immediately obvious what object is being created, it’s important to trace these objects to determine what the code is up to.

The second aspect is to understand where malware authors will store strings and other data that they need. For this example, the arguments used for the calls to CreateObject are actually stored in a user form. This isn’t immediately obvious and will be the topic of a follow-on post. Suffice it to say, the strings used for this function call are stored in the properties of the user form object label1 and label4.

There are a couple of ways to get this data – for this post we’ll continue to analyze the code and deduce what the objects are. We could also use dynamic analysis in the form of the Office Integrated Development Environment (IDE) debugger. Similar to user forms, a more in-depth discussion about using the Office debugger will be the topic for another post.

After those objects are created, there is a GoTo statement that directs execution to the label xbee1. After this label is a call to xbee_ensureMessageID, which is where our analysis needs to go next.

Doing a search for the string “xbee_ensureMessageID” reveals that this function is defined within the same macro stream that we’re currently analyzing. This function uses the same technique as the previous one: a GoTo statement. In this case, the label it will transfer execution to is xbee3. This means all of the code between the GoTo statement and the label is just junk!
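Removing the code that a GoTo skips can be scripted once the streams are dumped to text. This is a minimal sketch under a strong assumption: nothing else jumps into the skipped region, so it should only be used after checking that by eye. The sample procedure is invented (only the label name xbee3 comes from the document) and strip_goto_junk is a hypothetical helper name.

```python
import re

# An unconditional "GoTo label" that sits alone on its line
GOTO = re.compile(r"^\s*GoTo\s+(\w+)\s*$", re.IGNORECASE)
PROC_END = re.compile(r"^\s*End\s+(?:Sub|Function)\b", re.IGNORECASE)

# Invented procedure body for illustration.
SAMPLE = """\
Sub Demo()
    GoTo xbee3
    junk1 = 1 + 2
    junk2 = junk1 * 3
xbee3:
    RealWork
End Sub"""


def strip_goto_junk(source):
    """Drop lines between a GoTo and its label within the same procedure."""
    lines = source.splitlines()
    kept = []
    i = 0
    while i < len(lines):
        kept.append(lines[i])
        match = GOTO.match(lines[i])
        if match:
            label = re.compile(
                r"^\s*" + re.escape(match.group(1)) + r":\s*$", re.IGNORECASE
            )
            target = None
            for j in range(i + 1, len(lines)):
                if PROC_END.match(lines[j]):
                    break  # label is not ahead of us in this procedure
                if label.match(lines[j]):
                    target = j
                    break
            # Resume at the label line if found, otherwise at the next line.
            i = target if target is not None else i + 1
            continue
        i += 1
    return "\n".join(kept)


print(strip_goto_junk(SAMPLE))
```

The GoTo and the label are kept in the output so the cleaned code still behaves the same way as the original.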

Jumping Modules

There are two function calls that we can now explore: RudiknTest and Module2.RudiknTest2. Since we have no particular reason to analyze one over the other, it’s a good strategy to continue to analyze the code in sequential order. Searching for RudiknTest in the current macro stream does not yield any results. This means that the function is defined in another stream. Use oledump to re-examine the structure of the document and extract any additional streams that contain macros (streams with a capital M). You’ll find this function defined in stream 13.
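With several streams dumped, finding which one defines a function can be automated. Here is a minimal sketch that searches dumped text for a Sub or Function declaration; the stream contents below are invented stand-ins (only the function names and stream numbers come from the document), and find_definition is a hypothetical helper name.

```python
import re


def find_definition(name, streams):
    """Return the stream indexes whose text declares a Sub/Function `name`.

    `streams` maps a stream index to that stream's dumped VBA source.
    """
    pattern = re.compile(
        r"^\s*(?:(?:Public|Private|Friend)\s+)?(?:Static\s+)?"
        r"(?:Sub|Function)\s+" + re.escape(name) + r"\b",
        re.IGNORECASE | re.MULTILINE,
    )
    return [index for index, text in sorted(streams.items()) if pattern.search(text)]


# Invented stand-ins for the dumped streams.
dumped = {
    12: "Sub SendLastPik()\n    RudiknTest\n    Module2.RudiknTest2\nEnd Sub\n",
    13: "Public Function RudiknTest()\nEnd Function\n",
}

print(find_definition("RudiknTest", dumped))
```

A call such as the one in stream 12 is not reported, because only declarations match; the trailing word boundary also keeps RudiknTest from matching RudiknTest2.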

While this function contains a fair amount of obfuscation, with time and experience you can figure out exactly what it is up to in relatively short order. The variable vatafak_1 uses the member function Open. This is commonly used with XMLHTTP objects to make HTTP requests. Looking at this object on MSDN, the first argument is the HTTP method, while the second is the URL. The URL is represented by the variable vatafak_11, which is built up by concatenation in the loop above the call to Open. Typically during this type of analysis, we’re interested in discovering indicators of compromise (IOCs) such as the URLs used, so this becomes an interesting area to analyze.

If we want to figure out what the URL is, we just need to figure out the logic of the loop. The array contains values that do not represent valid ASCII characters. Instead, the loop does some trivial modifications to generate the real values. The approach is relatively straightforward: each array value has (11 * 46) subtracted from it, then a further 1001 subtracted, for a total of 1507. The numeric result is converted to a character through the use of the chr function. A little Python and the URL is revealed…

obj_chars = [1611, 1623, 1623, 1619, 1565, 1554, 1554, 1619, 1608, 1621,
             1609, 1608, 1606, 1623, 1552, 1613, 1608, 1626, 1608, 1615,
             1615, 1608, 1621, 1628, 1553, 1606, 1618, 1553, 1624, 1614,
             1554, 1555, 1564, 1624, 1563, 1611, 1562, 1561, 1609, 1554,
             1561, 1560, 1609, 1610, 1561, 1562, 1617]

url = ""
for value in obj_chars:
    # Undo the macro's arithmetic, then convert the result to a character
    url += chr(value - (11 * 46) - 1001)

print(url)

Wrapping Up

While there is still more to analyze, you’re now well on your way to getting started with performing maldoc analysis. This document is meant to begin an attack: it downloads the next stage from the URL above and executes it on the host. It’s these latter stages that contain the ultimate payload – that is, the functionality that the malware author ultimately intends (banking trojan, ransomware, etc.).