Analyzing PDF and Office Documents Delivered Via Malspam

We have all grown used to the presence of spam. Despite the many detection and verification techniques available to users, unsolicited email is still a major nuisance.

Malspam is a specific type of spam that is designed to infect potential victims with malware via email. This delivery method most often leverages techniques that are also used for phishing attacks with the goal of exploiting the victim’s computer. One of the more popular forms of malware included in malspam is ransomware, which encrypts or locks the data on a victim’s computer and demands a payment to restore access to the files.

From an attacker’s point of view, email remains one of the most successful, reliable and relatively inexpensive delivery methods for malware. As such, it’s unlikely to go away anytime soon.

Malspam can come in a variety of forms, but it is usually designed to look like a legitimate message. This can be done by impersonating the look and feel of important documents or by using spoofed email addresses. The malware is then attached to the email or embedded as a hyperlink in the body of the message. For our purposes, we’ll focus on spam that delivers malicious PDF files or Microsoft Office attachments to infect victims.

Basic Protection Against Malspam

A basic level of protection against this type of threat is already available via email filtering at the gateways, endpoint protection and system hardening. User awareness campaigns can also help reduce the success rate of phishing attacks. Unfortunately, these measures only provide limited protection — determined and resourceful attackers will almost always be able to bypass these defenses.

Another approach is to use automated incident response processes to analyze the malware attached to emails, extract indicators of compromise (IoCs) and then update your filtering devices with this new information. The analysis of attachments will almost always be done in a sandbox. This approach is usually more than sufficient, but a few cases require an understanding of how to analyze attachments outside a sandbox. For example, sandboxes sometimes miss behavioral characteristics that you need in order to understand the full scope or intent of the malware. In addition, some malware is designed to evade sandbox detection.

The SANS Institute offers a great course on reverse engineering malware that includes an entire day focused on analyzing malicious web and document files.

Analyzing Static Properties of Malicious Documents

Before diving into the code of the document, the first step is to conduct basic research on the static properties of the file. This allows you to investigate the nature of the file without having to execute it.

The most obvious properties to inspect are the file hashes. Under Linux, you can use the md5sum or sha256sum commands to create an MD5 or SHA256 hash. On Windows systems, you can use the certutil command line tool.
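Where a scripted alternative to md5sum or sha256sum is more convenient, for example inside an automated triage pipeline, the same hashes can be computed with Python’s standard hashlib module. The following is a minimal sketch; the function names hash_bytes and hash_file are illustrative and do not come from any of the tools mentioned in this article.

```python
import hashlib


def hash_bytes(data):
    """Return the MD5 and SHA-256 hex digests of a bytes object."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def hash_file(path):
    """Hash a sample on disk without opening it in any viewer.

    The file is read in one go, which is fine for typical email
    attachments; very large files would be better read in chunks.
    """
    with open(path, "rb") as handle:
        return hash_bytes(handle.read())
```

Calling hash_file("malicious.pdf") returns a dictionary with both digests, which can then be looked up in a sandbox or threat intelligence platform.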

Next, look up that hash value in a public malware sandbox. If there’s a match, you can further investigate the basic behavior of the sample via the reports generated by the sandbox. Public sandboxes allow you to upload your samples for further analysis, but you should only do this if you are certain that the sample does not contain any sensitive information and is not targeted toward your organization. Uploading the sample to a sandbox basically means telling the whole world, including attackers, that you’re analyzing the file and that the malicious campaign has been detected.

A match gives you a starting point when it comes to understanding the behavior and intent of the malware. Some sandboxes also provide screenshots to show how the document looks once it is opened. These screenshots can help analysts determine whether a sample is indeed targeted specifically toward your organization without having to open it.

A word of caution before you proceed with analyzing potentially malicious files: Do not do this on a production system or a system connected to a production network. You should analyze these files on isolated lab systems. Some malicious documents require an active internet connection to detonate their payload. Make sure you either emulate the network connections or conduct this analysis on a network connection that cannot be linked back to your organization. If the malware calls back to its author in an abnormal way or during an unusual time frame, this might tip off the attacker that their campaign has been detected.

Analyzing PDF Documents for Malspam

A PDF document is nothing more than a collection of elements that describe the document structure and provide rendering and, in some cases, execution instructions. Some elements can reference other elements within the same PDF. We are primarily interested in the elements that allow the execution of code.

For example, PDFs can:

  • Execute Flash files with the keyword /RichMedia.
  • Execute JavaScript with the keywords /JS, /JavaScript or /XFA (forms).
  • Start an external application via /Launch.
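As a first triage step, the keywords listed above can be counted directly in the raw bytes of the file. The sketch below is a simplified illustration, not a replacement for a real PDF parser: it assumes the keywords appear literally in the file, so it will miss names that are hex-escaped or hidden inside compressed object streams. The names SUSPICIOUS_KEYWORDS and count_pdf_keywords are hypothetical.

```python
import re

# Keywords from the list above that can lead to code execution.
SUSPICIOUS_KEYWORDS = ("/JS", "/JavaScript", "/XFA", "/RichMedia", "/Launch")


def count_pdf_keywords(data, keywords=SUSPICIOUS_KEYWORDS):
    """Count literal occurrences of each keyword in the raw PDF bytes."""
    counts = {}
    for keyword in keywords:
        # Require that the keyword is not followed by another letter or
        # digit, so that a longer name starting with the same characters
        # is not counted by mistake.
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, data))
    return counts
```

A non-zero count does not prove that a document is malicious; it only tells you which objects deserve a closer look.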

JavaScript in PDF files

To find the JavaScript included in a PDF file, start by searching the body (not the displayed content) of the PDF for the keywords that allow it to execute JavaScript (/JavaScript, /JS). This can be done with a tool designed by security researcher Didier Stevens called pdf-parser. Searching for JavaScript is done via the following command:

pdf-parser malicious.pdf --search JavaScript

This search command will return a list of all elements or objects that contain the keyword “JavaScript.”

Next, extract the object data for each found element. The pdf-parser tool allows you to immediately decode the content of the object (with --filter) and store it to a file (with -d) for further analysis:

pdf-parser malicious.pdf --object 21 --filter --raw -d object21.txt

The output file (object21.txt) in the above case will contain the raw JavaScript content extracted from object 21 of the PDF. You can then further analyze this extracted JavaScript code using the developer tools in your browser or a standalone JavaScript engine, such as SpiderMonkey.

The further exploitation via JavaScript is limited by the creativity of the malware author. The JavaScript code can redirect the user to a URL, which can then be used to download and execute additional malware. Alternatively, the JavaScript code can be used to execute binary code that is included in the PDF file. This binary code can also be triggered by the exploitation of a vulnerability in the PDF reader.
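When the extracted script redirects to or downloads from a URL, those URLs are useful indicators of compromise. The sketch below pulls plain-text URLs out of an extracted script without executing it. It is an assumption-laden illustration: the pattern and the function name extract_urls are made up for this example, and URLs that are split, encoded or built at runtime will not be found until the script has been deobfuscated.

```python
import re

# Matches http(s) URLs up to the first quote, whitespace, bracket or backslash.
URL_PATTERN = re.compile(r"""https?://[^\s'"<>()\\]+""", re.IGNORECASE)


def extract_urls(script_text):
    """Return the unique URLs found in the text, in order of first appearance."""
    seen = []
    for url in URL_PATTERN.findall(script_text):
        if url not in seen:
            seen.append(url)
    return seen
```

For example, the text read from object21.txt can be passed to extract_urls to build a candidate list for your filtering devices.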

Binary Data in PDF Files

The binary data in PDF files is represented in encoded content, most often base64 encoding. One way to extract binary data from a PDF file is by looking for those base64-encoded strings. You can do this with the base64dump tool, which also supports other encodings, using the following command:

base64dump malicious.pdf

To extract a full decoded string from stream 10 in the PDF file, you should use:

base64dump -s 10 -S malicious.pdf
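The idea of hunting for base64-encoded strings can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular tool works internally: it only looks for contiguous runs of base64 characters of a minimum length, and it skips runs that do not decode cleanly. The function name find_base64_strings and the min_length default are hypothetical.

```python
import base64
import binascii
import re


def find_base64_strings(data, min_length=16):
    """Return (offset, decoded_bytes) for each base64 run found in data."""
    # A run of base64 alphabet characters, optionally followed by padding.
    pattern = re.compile(rb"[A-Za-z0-9+/]{%d,}={0,2}" % min_length)
    results = []
    for match in pattern.finditer(data):
        candidate = match.group(0)
        # Valid base64 text always has a length that is a multiple of four.
        if len(candidate) % 4 != 0:
            continue
        try:
            decoded = base64.b64decode(candidate, validate=True)
        except binascii.Error:
            continue
        results.append((match.start(), decoded))
    return results
```

Decoded results that start with a known file signature, such as MZ for a Windows executable, are good candidates for further analysis. Because ordinary text can also look like base64, expect some false positives.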

Examining Potentially Malicious Office Documents

Microsoft Office files support Visual Basic for Applications (VBA) macros. These macros provide attackers with a powerful tool to interact with the system. Make sure that you take sufficient precautions to prevent macros from being executed automatically.

Two Types of Office Documents

There are two formats of Microsoft Office documents. The older, binary format is OLE2, which resembles a container that holds different files and folders. The modern format is in XML and the container is a ZIP file structure. Both formats can contain macros, but in the case of the XML-based format, the file extension needs to end in “m.”
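The two container formats can be told apart by their first bytes, regardless of the file extension an attacker chose. The sketch below checks for the OLE2 signature and the ZIP local file header. For ZIP-based files it also looks for a member named vbaProject.bin, which is where macro-enabled Office files keep their VBA project; this detail goes beyond what is described above and is included as background. The function names are illustrative only.

```python
import io
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # start of every OLE2 file
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header


def identify_office_format(data):
    """Classify raw file bytes as 'ole2', 'zip' or 'unknown'."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def has_vba_project(data):
    """Return True if a ZIP-based document contains a vbaProject.bin member."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return any(
            name.lower().endswith("vbaproject.bin")
            for name in archive.namelist()
        )
```

A result of "zip" only means the file is a ZIP container; it still needs to be confirmed as an Office document, for example by the presence of a [Content_Types].xml member.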

VBA Editor

The easiest way to inspect the macro code in an Office document is by opening it in the VBA editor that is included in Microsoft Office. There are some downsides to this approach, however. For example, you must use the targeted application, including its possible vulnerabilities, to conduct the analysis, and password-protected files might not be accessible. For the latter, you can switch to LibreOffice.

Extracting Macros From Office Documents

The extraction of macros from Office documents is most easily done with the tools designed by Didier Stevens. The approach for analyzing Office documents is similar to the process of examining PDF files: Search for possible malicious elements and then extract and decode those elements for further analysis.

To see an overview of the existing elements in an Office document, enter the following command:

oledump myword.docm

The output will show you an overview of the streams, their sizes and their names. If there is a letter “M” or “m” next to the stream number, this indicates that there is a macro included in the stream (oledump uses a lowercase “m” when the module contains only attribute statements).

The next step is to extract that stream using the following command:

oledump -s 10 -v myword.docm

This will display the decompressed content of stream 10 on the screen. Obviously, it is best to redirect the output to a file for future processing.

Rich text format (RTF) supports text style formatting and images. Although an RTF itself doesn’t support macros, an RTF document can include malicious OLE1 objects.
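A quick way to check whether an RTF file carries embedded objects is to look for the RTF control words that introduce them, such as \object and \objdata. The sketch below counts these control words in the raw file. It is a rough triage aid under the assumption that the control words are written literally; malicious RTF files are often obfuscated, so a count of zero is not proof that a file is clean. The names RTF_OBJECT_WORDS and count_rtf_object_words are hypothetical.

```python
import re

# RTF control words related to embedded objects.
RTF_OBJECT_WORDS = ("object", "objemb", "objdata", "objclass", "objupdate")


def count_rtf_object_words(data):
    """Count object-related control words in the raw bytes of an RTF file."""
    counts = {}
    for word in RTF_OBJECT_WORDS:
        # A control word is a backslash followed by letters; the lookahead
        # makes sure a longer control word is not counted as a shorter one.
        pattern = rb"\\" + word.encode("ascii") + rb"(?![a-z])"
        counts[word] = len(re.findall(pattern, data))
    return counts
```

Any hit on objdata means the file contains hex-encoded object data that should be extracted and examined in the same way as the other embedded content described here.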

The steps outlined above are designed to help you locate possible suspicious code and then extract this code from the file. However, sometimes it is still necessary to deobfuscate the extracted code. With JavaScript, especially, you may need to remove HTML code or declare elements that the JavaScript code expects to exist in the document container. For the binary that is extracted, it will be necessary to transform the code into executable code for further observation or debugging. You may even need to disassemble the code.

Practice Makes Perfect

Analyzing malspam is not always hard, but it does require some practice. It’s important to read and understand the reports from your sandbox, and identify the complete chain of infection and the most common techniques being used. Additionally, you should regularly read public reports of analyses conducted by other researchers and organizations, such as the SANS Internet Storm Center and Malware Traffic Analysis.

Once you feel more confident in understanding malspam samples, start building your own lab to conduct in-depth analysis. Finally, don’t forget to share your findings, either by compiling a report or sharing the indicators via a threat intelligence platform.

Koen Van Impe

Security Analyst

Koen Van Impe is a security analyst who worked at the Belgian national CSIRT and is now an independent security researcher. He has a Twitter feed (@cudeso) and a personal blog. Koen is passionate about computer security, incident handling, network analysis, honeypots, Linux, log management and web technologies. He is responsible for the follow-up and coordination of computer security incidents and gives security advice to customers.