Wednesday, July 06, 2011

Structure Adds Context

A while ago, I was talking to Cory Altheide and he mentioned something about timeline analysis that sort of clarified an aspect of the analysis technique for me...he said that creating a timeline from multiple data sources added context to the data that you were looking at.  This made a lot of sense to me, because rather than just using file system metadata and displaying just the MACB times of the files and directories, if we added Event Log records, Prefetch file metadata, Registry data, etc., we'd suddenly see more than just that a file was created or accessed.  We'd start to see things like, user A had logged in, launched an application, and the result of those actions was the file creation or modification in which we were interested.
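
As a rough illustration of the idea, here is a minimal Python sketch that merges events from several sources into one timeline. The events, source tags, and file paths are made up for the example; in practice each list would come from its own parser (file system metadata, Event Logs, Prefetch files, and so on).

```python
from datetime import datetime

# Hypothetical, hand-written events: (timestamp, source tag, description).
file_system = [
    (datetime(2011, 7, 6, 14, 3, 12), "FILE", "MACB ...B  C:\\Users\\A\\report.doc"),
]
event_log = [
    (datetime(2011, 7, 6, 14, 1, 5), "EVT", "user A logged in"),
]
prefetch = [
    (datetime(2011, 7, 6, 14, 2, 40), "PREF", "WINWORD.EXE last run"),
]


def build_timeline(*sources):
    """Combine the events from every source and order them by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


timeline = build_timeline(file_system, event_log, prefetch)
for when, source, description in timeline:
    print(f"{when:%Y-%m-%d %H:%M:%S}  {source:<5} {description}")
```

Read on its own, the file system entry only says that a file was created. Read alongside the other two sources, it shows a login, an application launch, and then the file creation.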

Lately, I've been looking at a number of data structures used by Windows systems...for example, the DestList stream within Windows 7 jump lists.  What this got me thinking about is this...as analysts, we have to understand the structure in which data is stored, and correspondingly, how it's used by the application.  We need to understand this because the structure of the data can provide context to that data.

Let's look at an example...once, in a galaxy far, far away, I was working on a PCI forensic assessment, which included scanning every acquired image for potential credit card numbers (CCNs).  When the scan had completed, I found that I had a good number of hits in two Registry hive files.  So my analysis can't stop there, can it?  After all, what does that mean, that I found CCNs in the Registry?  In and of itself, that statement is lacking context.  So, I need to ask:

Are the hits key names?  Value names?  Value data, or embedded in value data?  Or, are the hits located in unallocated space within the hive files?

The answers to any of these questions would significantly impact my analysis and the findings that I report.
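
To make those questions a bit more concrete, here is a hedged Python sketch that takes the file offset of a search hit and reports what kind of hive cell it falls in. It assumes the commonly documented hive layout (a 4096-byte base block, then "hbin" bins made up of cells, where a negative cell size means the cell is allocated). The function name and labels are my own, and a real examination should rely on a tested Registry parser rather than this.

```python
import struct

HBIN_START = 0x1000   # hive bins begin after the 4096-byte base block
HBIN_HEADER = 32      # size of each "hbin" header


def classify_hit(hive, hit_offset):
    """Return a rough label for the hive cell that contains hit_offset."""
    pos = HBIN_START
    while pos + HBIN_HEADER <= len(hive) and hive[pos:pos + 4] == b"hbin":
        bin_size = struct.unpack_from("<I", hive, pos + 8)[0]
        if bin_size == 0:
            break                      # corrupt bin header, stop walking
        bin_end = pos + bin_size
        cell = pos + HBIN_HEADER
        while cell + 4 <= min(bin_end, len(hive)):
            raw_size = struct.unpack_from("<i", hive, cell)[0]
            size = abs(raw_size)
            if size < 8:
                break                  # no further valid cells in this bin
            if cell <= hit_offset < cell + size:
                if raw_size > 0:       # positive size: cell is free
                    return "unallocated cell"
                signature = hive[cell + 4:cell + 6]
                if signature == b"nk":
                    return "key node (holds the key name)"
                if signature == b"vk":
                    return "value record (holds the value name)"
                return "allocated cell (possibly value data)"
            cell += size
        pos = bin_end
    return "not inside a recognised cell"
```

Calling classify_hit() with the raw bytes of a hive file and the offset reported by the CCN scan gives a first indication of whether the hit is a key name, a value name, possibly value data, or sitting in unallocated space.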

Here's another example...I remember talking with someone a while back who'd "analyzed" a Windows PE file by running strings on it, and found the name of a DLL.  I don't remember the exact conclusions that they'd drawn from this, but what I do remember is thinking that had they done some further analysis, they might have had different conclusions.  After all, finding a string in a 200+ KB file is one thing...but what if that DLL had been in the import table of the PE header?  Wouldn't that have a different impact on the analysis than if the DLL was instead the name of the file where stolen data was stored before being exfil'd?

So, much like timeline analysis, understanding the structure in which data is stored, and how that data is used by an application or program, can provide context to the data that will significantly impact your analysis and findings.

Addendum, 7 July
I've been noodling this over a bit more, and another thought I had was that this concept applies not just to DF analysis, but also to work that often goes on beyond analysis, particularly in the LE field: developing intelligence.

In many cases, and particularly for law enforcement, there's more to DF analysis than simply running keyword searches or finding an image.  In many instances, the information found in one examination is used to develop intelligence for a larger investigation, either directly or indirectly.  So, it's not just about, "hey, I found an IP address in the web logs", but what verb was used (GET, POST, etc.), what were the contents of the request, who "owns" the IP address, etc.
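
As a small sketch of what that looks like in practice, the Python below pulls the IP address, verb, and requested URI out of a single web log entry. It assumes the log is in Common Log Format, which may not match the server in your case; the sample line and the pattern and function names are made up for illustration.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<verb>[A-Z]+) (?P<uri>\S+)(?: (?P<protocol>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_hit(line):
    """Return the fields of one log entry as a dict, or None if it doesn't parse."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample entry using a documentation-range IP address.
sample = '192.0.2.10 - - [06/Jul/2011:14:03:12 -0400] "POST /login.php HTTP/1.1" 200 512'
fields = parse_hit(sample)
print(fields["ip"], fields["verb"], fields["uri"], fields["status"])
```

The same IP address means something quite different attached to a POST to a login page than to a GET for a static image, and that difference only shows up once the entry is broken into its fields.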

So how is something like this implemented?  Well, let's say you're using Simson's bulk_extractor, and you find that a particular email address that's popped up in your overall investigation was located in an acquired image.  Just the fact that this email address exists within the image may be a significant finding, but at this point, you don't have much in the way of context, beyond the fact that you found it in the image.  It could be in an executable, or part of a chat transcript, or in another file.  Regardless, where the email address is located within the image (i.e., which file it's located in) will significantly impact your analysis, your findings, and the intel you derive from these.

Now, let's say you take this a step further and determine, based on the offset within the image where the email address was located, that the file that it is located in is an email.  Now, this provides you with a bit more context, but if you really think about it, you're not done yet...how is the email-address-of-interest used in the file?  Is it in the To:, CC:, or From: fields?  Is it in the body of the message?  Again, where that data is within the structure in which it's stored can significantly impact your analysis, and your intel.
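
Here is a minimal sketch of that last step, using Python's standard email module. It assumes you've already recovered the message as text and that it is a simple, single-part message; the function name and the sample addresses are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses


def locate_address(raw_message, address):
    """List the parts of the message (From, To, Cc, body) that contain address."""
    msg = message_from_string(raw_message)
    target = address.lower()
    found = []
    for header in ("From", "To", "Cc"):
        values = msg.get_all(header, [])
        # getaddresses() splits header values into (display name, address) pairs
        if any(addr.lower() == target for _, addr in getaddresses(values)):
            found.append(header)
    body = msg.get_payload()
    # Assumption: only plain, single-part bodies are checked in this sketch
    if isinstance(body, str) and target in body.lower():
        found.append("body")
    return found


raw = (
    "From: alice@example.com\n"
    "To: Bob <bob@example.com>\n"
    "Cc: carol@example.com\n"
    "Subject: hi\n"
    "\n"
    "Please forward this to dave@example.com\n"
)
print(locate_address(raw, "bob@example.com"))
print(locate_address(raw, "dave@example.com"))
```

An address that only ever appears in the body of other people's messages tells a different story than one that appears in the From: field.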

Consider how your examination might be impacted if the email address were found in unallocated space or within the pagefile, as opposed to within an email.

6 comments:

Pete Cap said...

I definitely agree. First of all, at its most basic, any kind of "analysis" involves extracting significant chunks of "evidence" from raw data. This is true regardless of subject matter (traffic or protocol analysis, reversing malware, doing forensics on a drive, etc.)--there is a whole needle/haystack, chaff/wheat challenge going on. Context (relationship with other data points in some kind of schema) is what you need to turn a data point into a piece of evidence.

In my experience, the most frequently-used analyst tools for establishing that context are things like timelines, concept/mind maps, process flows, and relationship maps. Any "analyst" should be able to create these without the use of special software.

Unfortunately, I would say that the vast majority of "analysis" done in our field is incomplete because it does NOT have context and because the analyst does NOT use those tools to tell the story. How often do you see long technical reports that ultimately only resemble a list of dry facts devoid of context or meaning? Such information is valuable but only insofar as it can be used as fodder for additional analysis! The job's not done at that point.

Keydet89 said...

Pete,

I see what you are saying, but what I was trying to convey was something along the lines of when you do extract that chunk of "evidence" from the raw data, if you know how that information was used within the "raw data", you already have some context. I tried to communicate this in the Registry example.

Thanks for your comments.

Anonymous said...

I know this may be blasphemy, but I've always had an issue with the way programs like log2timeline extract date/times from a case and don't allow you to review them in context. So much so that I've written and posted my own EnScript to GSI which allows you to see 'super time line' dates as bookmarks. I know it only works under EnCase (v6) and so may not appeal to all ;)

It's still a work in progress (and probably always will be), but I find this really helps to see related artefacts that you would have missed if you had reviewed the data in Excel.

Hope it helps,
James

Keydet89 said...

James,

Could you elaborate a bit on what you mean? I mean, log2timeline is open source...you can dig into it and find out what it is you want to know.

I guess I'm a little unclear as to what you mean by, "...extract date/times from a case and don't allow you to review them in context."

Thanks.

Anonymous said...

Sorry Harlan, what I was trying to say is that programs like log2timeline extract the date/time info and related meta-data into files that are later analysed in programs such as Excel, so you can't see the data surrounding the date/time hit, and seeing that data can be of great help.

I have found it extremely helpful during investigations (PCI and LE) to be able to see my timeline of various events generated in EnCase AND be able to reference the hex around the hit to look for other related artefacts (I'll send you a screen cap).

Don't get me wrong, I think log2timeline is a great tool but I think it's missing a trick by not being able to see the hit in context.

Regards,
James

Anonymous said...

I was watching "Through the Wormhole" a while ago and thought it was interesting how spacetime is related to timeline analysis. In the past, analysts' view of space has been limited to the filesystem. Now we've discovered another dimension of space and are branching out to include other data sources.