Thursday, June 15, 2017

Analyzing Documents

I've noticed over time that a lot of the write-ups that get posted online regarding malware or downloaders delivered via email attachments (i.e., spear phishing campaign) focus on what happens after the malicious payload is activated...the URL reached to, the malware downloaded, etc.  However, few seem to dig into the document itself, and there's a great deal can be gleaned from those documents, that can add to the threat intel picture.  If you're not looking at everything involved in the incident, if you're not (as Jesse Kornblum said) using all the parts of the buffalo, then you're very likely missing critical elements of the threat intel picture.

Here's an example from MS...much of the information in the post focuses on the embedded macro and the subsequent decrypted Cerber executable file, but there's nothing available regarding the document itself.

Keep in mind that different file formats (LNK, OLE, etc.) will contain different information.  And what I'm referring to here isn't about running through analysis steps that take a great deal of time; rather, what I'm going to show you are a few simple steps you can use to derive even more information from the attachment/wrapper documents themselves.

I took a look at a couple of documents (Doc 1, Doc 2) recently, and wanted to share my process and see if others might find it useful.  Both of these OLE-format documents have hashes available (or you can download and compute the hashes yourself), and they were also found on VirusTotal:

VirusTotal analysis for Doc 1
VirusTotal analysis for Doc 2

The VT analysis for both files includes a comment that the file was used to deliver PupyRAT.

Tools I'll be using for this analysis include my own and

Doc 1 Analysis
Running against the file, we see:

Fig. 1: Doc 1 output

That's a lot of streams in this OLE file.  So, one of the first things we see is the dates for the Root Entry and the 'directories' (MS has referred to the OLE file format as "a file system within a file", and they're right), which is 1 Jan 2017.  According to VT, the first time this file was submitted was 1 Jan 2017, at approx. 20:29:43 what that tells us is that it's likely that one of the first folks to receive the document submitted it less than 14 hrs after the file was modified.

Continuing with, we can view the contents of the various streams in a hex dump format, but we see that stream number 20 contains a macro.  Using with the argument "-d 20", we can view the contents of the stream in hex dump format.  In the output we see what appear to be 2 base64-encoded Powershell commands, one that downloads PupyRAT to the system, and another that appears to be shell code.  Copying and decoding both of the streams gives us the command that downloads PupyRAT, as well as a second command that appears to be some form of shell code.  Some of the variable names ($Qsc, $zw5) appear to be unique, so searching for those via Google leads us to this Hybrid-Analysis write-up, which provides some insight into what the shell code may do.

Interestingly enough, the same search reveals that, per this Reverse.IT link, both encoded Powershell commands were used in another document, as well.

Moving on, here's an excerpt of the output from, when run against this document:

Summary Information
Title        :
Subject      :
Authress     : Windows User
LastAuth     : Windows User
RevNum       : 2
AppName      : Microsoft Office Word
Created      : 01.01.2017, 06:51:00
Last Saved   : 01.01.2017, 06:51:00
Last Printed :

Notice the dates...they line up with the previously-identified dates (see fig.1)

Doc 2 Analysis
Following the same process we did with doc 1, we can see very similar output from with doc 2:

Fig. 2: Doc 2 output

One of the first things we can see is that this document was created within about 24 hrs of doc 1.

In the case of doc 2, stream 16 contains the data we're looking for...extracting and decoding the base64-encoded Powershell commands, we see that the commands themselves (PupyRAT download, shell code) are different.  Conducting a Google search for the variables used in the shell code command, we find this Hybrid-Analysis write-up, as well as this one from Reverse.IT.

Here's an excerpt of the output from, when run against this document:

Summary Information
Title        : HealthSecure User Registration Form
Subject      : HealthSecure User Registration Form
Authress     : ArcherR
LastAuth     : Windows User
RevNum       : 2
AppName      : Microsoft Office Word
Created      : 02.01.2017, 06:49:00
Last Saved   : 02.01.2017, 06:49:00
Last Printed : 20.06.2013, 06:27:00

Document Summary Information
Organization : ACC

Remember, this is a sample pulled down from VirusTotal, so there's no telling what happened with the document between the time it was created and submitted to VT.  I made the 'authress' information bold, in order to highlight it.

While this analysis may not appear to be of significant value, it does form the basis for developing a better intelligence picture, as it goes beyond the more obvious aspects of what constitutes most analysis (i.e., the command to download PupyRAT, as well as the analysis of the PupyRAT malware itself) in phishing cases.  Valuable information can be derived from the document format used to deliver the malware itself, regardless of whether it's an MSOffice document, or a Windows shortcut/LNK file.  When developing an intel picture, we need to be sure to use all the parts of the buffalo.