Monday, September 02, 2013

Data Structures, Revisited

A while back, I wrote this article regarding understanding data structures.  The importance of this topic has not diminished with time; if anything, it deserves much more visibility.  Understanding data structures provides analysts with insight into the nature and context of artifacts, which in turn provides a better picture of their overall case.

First off, what am I talking about?  When I say, "data structures", I'm referring to the stuff that makes up files.  Most of us probably tend to visualize files on a system as being either lines of ASCII text (*.txt files, some log files, etc.), or an amorphous blob of binary data.  We may sometimes even visualize these blobs of binary data as text files, because of how our tools present the information found in those blobs.  However, as we've seen over time, there are parts of these blobs that can be extremely meaningful to us, particularly during an examination.  For example, in some of these blobs, there may be an 8-byte sequence that is the FILETIME format time stamp that represents when a file was accessed, or when a device was installed on a system.
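To make the FILETIME example concrete, here is a minimal Python sketch of decoding such an 8-byte sequence.  A FILETIME is a 64-bit little-endian count of 100-nanosecond intervals since January 1, 1601 (UTC).  The function names and the idea of passing in a blob and an offset are just for illustration; where the time stamp actually lives depends entirely on the structure you're parsing.

```python
import struct
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond intervals since 1601-01-01 00:00:00 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(raw: bytes) -> datetime:
    """Decode an 8-byte little-endian FILETIME into a UTC datetime."""
    if len(raw) != 8:
        raise ValueError("a FILETIME is exactly 8 bytes")
    (ticks,) = struct.unpack("<Q", raw)
    # Convert to whole microseconds with integer math to avoid float rounding.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


def filetime_at(blob: bytes, offset: int) -> datetime:
    """Hypothetical helper: decode the FILETIME found at 'offset' in a blob."""
    return filetime_to_datetime(blob[offset:offset + 8])
```

The point is that the same 8 bytes look like noise in a hex editor or a strings listing, and only become a time stamp once you know the structure they belong to.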

A while back, as an exercise to learn more about the format of the IE (version 5 - 9) index.dat file, I wrote a script that would parse the file based on the contents of the header, which includes a directory table that points to all of the valid records within the file, according to information available on the ForensicsWiki (thanks to Joachim Metz for documenting the format, the PDF of which can be found here).  Again, this was purely an exercise for me, and not something monumentally astounding...I'm sure that we're all familiar with pasco.  Using what I'd learned, I wrote another script that I could use to parse just the headers of the index.dat as part of malware detection, the idea being that if a user account such as "Default User", LocalService, or NetworkService has a populated index.dat file, this would be an indication that malware on the system is running with System-level privileges and communicating off-system via the WinInet API.  I've not only discussed this technique on this blog and in my books, but I've also used this technique quite successfully a number of times, most recently to quickly identify a system infected with ZeroAccess.
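As a rough illustration of the header-only approach (this is not the script I mentioned above), the following Python sketch pulls a few fields out of an index.dat header.  The field offsets are assumptions based on my reading of the published format documentation: a 28-byte signature string, followed by 32-bit little-endian values for the file size, the offset of the first hash table, the total number of blocks, and the number of allocated blocks.  Verify them against the format document before relying on them.

```python
import struct

# Assumed layout (check against the published format document):
#   0x00  28 bytes  signature, e.g. "Client UrlCache MMF Ver 5.2" + NUL
#   0x1C  uint32    file size
#   0x20  uint32    offset of the first hash table
#   0x24  uint32    total number of blocks
#   0x28  uint32    number of allocated blocks
SIGNATURE_PREFIX = b"Client UrlCache MMF Ver "


def parse_index_dat_header(data: bytes) -> dict:
    """Return selected header fields from the start of an index.dat file."""
    if len(data) < 44 or not data.startswith(SIGNATURE_PREFIX):
        raise ValueError("data does not start with an index.dat signature")
    version = data[24:27].decode("ascii", errors="replace")
    file_size, hash_offset, total_blocks, allocated_blocks = struct.unpack_from(
        "<4I", data, 28
    )
    return {
        "version": version,
        "file_size": file_size,
        "hash_table_offset": hash_offset,
        "total_blocks": total_blocks,
        "allocated_blocks": allocated_blocks,
    }
```

What counts as "populated" is a judgment call; comparing the values against the same file from a known-clean system of the same version is one way to decide.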

More recently, I was analyzing a user's index.dat, as I'd confirmed that the user was using IE during the time frame in question.  I parsed the index.dat with pasco, and did not find any indication of a specific domain in which I was interested.  I tried my script again...same results.  Exactly.  I then mounted the image as a read-only volume and ran strings across the user's "Temporary Internet Files" subfolders (with the '-o' switch), looking specifically for the domain name...that command looked like this:

C:\tools>strings -o -n 4 -s | find "domain" /i

Interestingly enough, I got 14 hits for the domain name in the index.dat file.  Hhhhmmmm....that got me to thinking.  Since I had used the '-o' switch in the strings command, the output included the offsets within the file to the hits, so I opened the index.dat in a hex editor and manually scrolled on down to one of the offsets; in the first case, I found full records (based on the format specification that Joachim had published).  In another case, there was only a partial record, but the string I was looking for was right there.  So, I wrote another script that would parse through the file, from beginning to end, and locate records without using the directory table.  When the script finds a complete record, it will parse it and display the record contents.  If the record is not complete, the script will dump the bytes in a hex dump so that I could see the contents.  In this way, I was able to retrieve 10 complete records that were not listed in the directory table (and were essentially deleted), and 4 partial records, all of which contained the domain that I was looking for.
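The approach of walking the file from beginning to end can be sketched in Python along these lines.  This is a simplified illustration, not the script described above, and it rests on assumptions drawn from the published format: records begin on 0x80-byte block boundaries, start with a 4-byte signature such as "URL ", "REDR" or "LEAK", and are followed by a 32-bit count of the blocks the record occupies.  The function names are made up for this example.

```python
import struct

BLOCK_SIZE = 0x80  # assumed record alignment and block size
SIGNATURES = (b"URL ", b"REDR", b"LEAK")


def scan_records(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield (offset, signature, record_bytes, complete) for each candidate.

    The directory and hash tables are ignored entirely; every block
    boundary in the file is checked for a record signature.
    """
    for offset in range(0, len(data) - 7, block_size):
        sig = data[offset:offset + 4]
        if sig not in SIGNATURES:
            continue
        (n_blocks,) = struct.unpack_from("<I", data, offset + 4)
        size = n_blocks * block_size
        # A record is treated as complete if its declared size fits in the file.
        complete = n_blocks > 0 and offset + size <= len(data)
        end = offset + size if complete else min(offset + block_size, len(data))
        yield offset, sig.decode("ascii"), data[offset:end], complete


def hex_dump(chunk: bytes, base: int = 0, width: int = 16) -> str:
    """Format bytes as offset, hex and printable-ASCII columns."""
    lines = []
    for i in range(0, len(chunk), width):
        row = chunk[i:i + width]
        hexpart = " ".join(f"{b:02x}" for b in row)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{base + i:08x}  {hexpart:<{width * 3}} {text}")
    return "\n".join(lines)
```

Complete records would then be handed to a record parser, and anything else dumped with hex_dump() for manual review.  A sketch like this finds every record it can, so identifying the ones that are not listed in the directory table still means comparing the offsets against what the normal parse returns.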

Microsoft refers to the compound file binary file format as a "file system within a file", and if you dig into the format document just a bit, you'll start to see why...the specification details sectors of two sizes, not all of which are necessarily allocated.  This means that you can have strings and other data buried within the file that are not part of the file when viewed through the appropriate application.

CFB Format

The Compound File Binary Format document available from MS specifies the use of a sector allocation table, as well as a small sector allocation table. For Jump Lists in particular, these structures specify which sectors are in use; mapping the ones that are in use, and targeting just those sectors within the file that are not in use can allow you to recover potentially deleted information.
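Mapping the sectors that are not in use can be sketched roughly as follows.  This Python example reads the sector allocation table (FAT) using only the first 109 table locations stored in the header, and lists the file offsets of sectors marked free (0xFFFFFFFF).  It is a simplification under stated assumptions: the header offsets come from the CFB specification, but larger files that chain additional DIFAT sectors, and the small sector (mini FAT) table, are not handled here.  The function names are illustrative only.

```python
import struct

CFB_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
FREESECT = 0xFFFFFFFF      # FAT entry value for an unallocated sector
MAX_REGSECT = 0xFFFFFFFA   # highest value that is a regular sector number


def sector_offset(sector: int, sector_size: int) -> int:
    """Sector 0 starts one sector-sized slot into the file, after the header."""
    return (sector + 1) * sector_size


def read_fat(data: bytes):
    """Return (sector_size, fat_entries) using the header's 109 DIFAT slots."""
    if len(data) < 512 or data[:8] != CFB_MAGIC:
        raise ValueError("not a compound file")
    (sector_shift,) = struct.unpack_from("<H", data, 0x1E)
    (num_fat_sectors,) = struct.unpack_from("<I", data, 0x2C)
    difat = struct.unpack_from("<109I", data, 0x4C)
    sector_size = 1 << sector_shift
    fat = []
    for fat_sector in difat[:min(num_fat_sectors, 109)]:
        if fat_sector > MAX_REGSECT:
            break  # unused DIFAT slot
        start = sector_offset(fat_sector, sector_size)
        chunk = data[start:start + sector_size]
        if len(chunk) < sector_size:
            break  # truncated file
        fat.extend(struct.unpack(f"<{sector_size // 4}I", chunk))
    return sector_size, fat


def free_sector_offsets(data: bytes) -> list:
    """File offsets of sectors present in the file but marked free in the FAT."""
    sector_size, fat = read_fat(data)
    sectors_present = len(data) // sector_size - 1  # exclude the header slot
    return [
        sector_offset(i, sector_size)
        for i, entry in enumerate(fat[:sectors_present])
        if entry == FREESECT
    ]
```

Each offset returned is a candidate location for residual data; the sector contents still have to be examined and interpreted.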

MS Office documents no longer use this file format specification by default, but it is used in *.automaticDestinations-ms Jump Lists on Windows 7 and 8. The Registry is similar, in that the various "cells" that comprise a hive file can allow for a good bit of unallocated or "deleted" data...either deleted keys and values, or residual information in sectors that were allocated to the hive file as it continued to grow in size.  MS does a very good job of making the Windows XP/2003 Event Log record format structure available; as such, not only can Event Logs from these systems be parsed on a binary basis (to not only locate valid records within the .evt file that are "hidden" by the information in the header), but records can also be recovered from unallocated space and other unstructured data.  MFT records have been shown to contain useful data, particularly as a file moves from being resident to non-resident (specific to the $DATA attribute), and that can be particularly true for systems on which MFT records are 4K in size (rather than the 1K that most of us are familiar with).
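For the Event Log case, parsing "on a binary basis" can look something like the following Python sketch, which carves Windows XP/2003 event records out of any blob of data (a .evt file, unallocated space, a pagefile).  It relies on the documented EVENTLOGRECORD layout: a 32-bit record length, the "LfLe" signature, the record number, the generated and written times (seconds since 1970), the event ID, and a copy of the length as the last 4 bytes of the record.  The function name and the returned fields are my own choices for illustration, and records that wrap around the end of a .evt file are not handled.

```python
import struct
from datetime import datetime, timezone

EVT_MAGIC = b"LfLe"
MIN_RECORD = 0x38  # size of the fixed portion of an event record


def carve_evt_records(data: bytes) -> list:
    """Locate event records by signature, validated by the trailing length."""
    records = []
    pos = data.find(EVT_MAGIC)
    while pos != -1:
        start = pos - 4  # the length field precedes the signature
        if start >= 0:
            (length,) = struct.unpack_from("<I", data, start)
            end = start + length
            # The 0x30-byte file header also carries "LfLe" but is too short
            # to pass the minimum-size check, so it is skipped.
            if length >= MIN_RECORD and end <= len(data):
                (trailer,) = struct.unpack_from("<I", data, end - 4)
                if trailer == length:
                    rec_num, generated, written, event_id = struct.unpack_from(
                        "<4I", data, start + 8
                    )
                    records.append({
                        "offset": start,
                        "length": length,
                        "record_number": rec_num,
                        "generated": datetime.fromtimestamp(generated, tz=timezone.utc),
                        "written": datetime.fromtimestamp(written, tz=timezone.utc),
                        "event_id": event_id & 0xFFFF,  # low word is the event code
                    })
        pos = data.find(EVT_MAGIC, pos + 1)
    return records
```

Because the search does not consult the file header at all, it will return records that the header's offsets no longer point to, as well as records sitting in unstructured data.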

Understanding data structures can help us develop greater detail and additional context with respect to the available data during an examination.  We can recover data from within files that is not "visible" in a file by going beyond the API.  Several years ago, I was conducting a PCI forensic audit, and found several potential credit card numbers "in" a Registry hive...an understanding of the structures within the file, and a bit of a closer look, revealed that what I was seeing wasn't part of the Registry structure, but instead part of the sectors allocated to the hive file as it grew...they simply hadn't been overwritten with key and value cells yet.  This information had a significant impact on the examination.  In another instance, I was trying to determine which files a user had accessed, and found that the user did not have a RecentDocs key within their NTUSER.DAT; I found this to be odd, as even a newly-created profile will have a RecentDocs key.  Using regslack.exe, I was able to retrieve the deleted RecentDocs key, as well as several subkeys and values.

Understanding the nature of the data that we're looking at is critical, as it directs our interpretations of that data. This interpretation will not only direct subsequent analysis, but also significantly impact our conclusions. If we don't understand the nature of the data and the underlying data structures, our interpretation, and everything that follows from it, can be wrong. Is that credit card number, which we found via a search, actually stored in the Registry as value data? Just because our search utility located it within the physical sectors associated with a particular file name, do we understand enough about the file's underlying data structures to understand the true nature and context of the data?
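One way to start answering that question for a Registry hive is to check where a search hit falls relative to the hive's cells.  The Python sketch below walks the cells in each hbin and reports whether a given file offset lies in an allocated cell, a free cell, or outside any cell.  It assumes the commonly documented hive layout: a 0x1000-byte base block, hbins that begin with "hbin" and a 0x20-byte header with the bin size at offset 8, and cells that begin with a signed 32-bit size, negative when allocated and positive when free.  The function names are hypothetical, and this is not how regslack.exe is implemented.

```python
import struct

BASE_BLOCK = 0x1000   # size of the "regf" base block
HBIN_HEADER = 0x20    # size of each hbin header


def walk_cells(hive: bytes):
    """Yield (offset, size, allocated) for every cell in every hbin."""
    pos = BASE_BLOCK
    while pos + HBIN_HEADER <= len(hive) and hive[pos:pos + 4] == b"hbin":
        (bin_size,) = struct.unpack_from("<I", hive, pos + 8)
        if bin_size < HBIN_HEADER:
            break  # corrupt bin size; stop rather than loop forever
        bin_end = min(pos + bin_size, len(hive))
        cell = pos + HBIN_HEADER
        while cell + 4 <= bin_end:
            (raw,) = struct.unpack_from("<i", hive, cell)
            size = abs(raw)
            if size < 8 or cell + size > bin_end:
                break  # unexpected size; stop walking this bin
            yield cell, size, raw < 0  # negative size means allocated
            cell += size
        pos += bin_size


def classify_hit(hive: bytes, hit_offset: int) -> str:
    """Describe where a search hit at 'hit_offset' sits within the hive."""
    for offset, size, allocated in walk_cells(hive):
        if offset <= hit_offset < offset + size:
            return "allocated cell" if allocated else "free cell"
    return "outside any cell"
```

A hit in a free cell, or outside any cell, is not part of the live Registry structure, which is exactly the distinction that mattered in the PCI example above.  A hit in an allocated cell still needs to be tied to a specific key or value before drawing conclusions.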


Joe said...

"do we understand enough about the file's underlying data structures to understand the true nature and context of the data?”

Another question might be: With all the data structures out there, is it even possible to truly understand them all?

Furthermore, how do you determine if you understand it? Years ago, most would say you truly understood the Registry, then regslack was discovered. We have little hope of not missing evidence from data structures when you miss evidence in the Registry. ;)

Computers are complex, and if history is any indication of the future, we’re still likely missing a lot of important evidence. Look at timeline and memory analysis improvements over the years.

You've given great examples as to why it's important to understand data structures, but what about the times you, and everyone else, miss those critical artifacts-- and never know about it?

H. Carvey said...

Another question might be: With all the data structures out there, is it even possible to truly understand them all?

Good question...without the vendor's assistance, no, I don't believe that this is an absolute.

However, the fact is that many data structures are, in fact, documented, at least to some degree. Again, Joachim Metz and others have done a great job with this. Even when the structures are documented, there's still a "so what" factor...analysts may not be able to process the information in a manner in which they can incorporate it into their day-to-day work. The first step to this is awareness...if you don't know that something is out there, and why it's important and valuable to _you_, there's really no point.

No one can know everything...this is why we have references. However, references are no good if they aren't used. The ForensicsWiki is a great site for keeping and maintaining this sort of information; however, I don't think that analysts use it. This could be because they can't find what they're looking for, so instead of asking for assistance, they simply stop using the site.

The fact is that in many cases, the information is out there. If you're aware of it, but can't follow it or don't understand how to use it, and you've read what's available, ask. I've met analysts who've spent weeks and in some cases months "noodling something over", whereas if they'd asked someone, they would've received a pretty complete education in 20 minutes.

Can we possibly know every data structure? No. Can we better understand the ones that are known? Yes. Can we ask for assistance and direct attention to data structures that were not previously known? Of course.

...but what about the times you, and everyone else, miss those critical artifacts-- and never know about it?

Exactly. If the "critical artifacts" are unknown, how do you know that you missed them?

Joe said...

"Exactly. If the "critical artifacts" are unknown, how do you know that you missed them?"

I guess we don’t. If security professionals can only mitigate risk, maybe forensic analysts can only mitigate doubt. No security professional should say they’re 100% secure. No forensic analyst should say they’re 100% certain, especially when dealing with negative evidence. It’s up to the analyst to understand the evidence, tools, and limits as best as possible.

Troy said...

Learning to identify and interpret the "critical artifacts" is one of those things that turns the novice into a master and separates the hack from the expert.

There are those who constantly dig into platforms and applications, who run tools to document system and program behavior, and who, when they see unexpected results in their work, chase down the reasons why. And there are those who don't.

To succeed in digital forensics, you need to accept a level of uncertainty, all the while working to reduce that uncertainty. You can't know everything, but you certainly can work to know more than you now know. Digital forensics is pretty much like any other profession in that regard.

If you approach forensics as mostly a collection of tips and tricks to be mastered, then you will probably fail at some point or work at a lower level of proficiency. If you view forensics as an evolving, dynamic subject matter that has to be continually examined and mastered, you will be better served. Good forensics takes attention and work.

Good forensics involves learning the fundamentals of the platform one will be examining. If you examine Windows systems, then you should absorb Windows Internals (the books), and be intimately familiar with TechNet, MSDN, etc. Absorb what the professional community already knows, by reading blogs such as this, and reading the best of forensics books.

Constantly challenge what you think you know. As you study forensics, train, go about your work, or look at the forensic import of a new OS or program, consider whether whatever you are focused on is "stateful," i.e., whether it remembers prior states, or reflects current states of a feature or program. Stateful suggests potential artifacts. Using techniques of behavior analysis (e.g., process monitor, registry differencing, etc.) identify all the files and places (registry keys, databases, log files, scratch files, etc.) that the feature or program touches--especially writes. Those are your artifacts.

Interpreting artifacts requires 1) being able to "read" an artifact and then 2) being able to determine its significance. In the Windows world, there are a finite number of file formats. In fact, Windows relies heavily on a handful of file types and data structures--file headers. Learn to identify these on sight. When you recognize a file type or data structure, you can apply the correct tool or algorithm to make sense of it. Thanks to an active, hard working community, you can find multiple parsers for almost all Microsoft file formats.

Determining the significance of data in an artifact will require additional thought and testing. The question you want to answer is, "what does the artifact prove?" You also have to consider what the artifact does not prove. It is as important to understand the limits of the evidence offered by an artifact as it is to understand what it may prove. (For example, pictures in thumbcache files don't mean that those pictures were ever opened by a user.)

Then, rinse and repeat.