Wednesday, June 22, 2011

Defining "Forensic Value"

Who defines "forensic value" when it comes to data?  How does a string of 1s and 0s become "valuable", "evidence" or "intelligence"?

These are questions I've been asking myself lately.  I've recently seen purveyors of forensic analysis applications indicate that a particular capability has been added (or is in the process of being added) to their application/framework, without an understanding of the value of the data being presented, or of how it would be useful to a practitioner. Sure, it's great that you've added that functionality, or that you will be doing so at some point in the very near future, but what is the value of the data that the capability provides, and how can it be used?  Do your users recognize the value of the data that you're providing?  If not, do you have a way of educating your users?

I was also thinking about these questions during my presentation at OSDFC...I was talking about extending RegRipper into more of a forensic scanner, and found myself looking out across a sea of blank stares.  In fact, at one point I asked the audience if what I was referring to made sense, and the only person to react was Cory.  ;-)  As a practitioner, I believe that there is significant value in preserving and sharing the collective knowledge and experience of a group of practitioners.  I believe that being able to quickly determine the existence (or absence) of a number of artifacts and removing that "low hanging fruit" (i.e., things we've seen before) is and will be extremely valuable.  Based on the reaction of the attendees, it appears that Cory and I may be the only ones who see the value of something like this.  Does something like this have value?
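Since that sea of blank stares suggests the idea may not have come across, here's a minimal sketch of what I mean by a "forensic scanner": a framework that runs small checks (plugins) against a mounted image, with each check encoding something an analyst has seen before. The checks and paths below are made-up examples for illustration only; this is not RegRipper or actual scanner code.

    import os

    # Each "plugin" is just a function that inspects the mounted image root and
    # returns a list of findings; both checks below are hypothetical examples.
    def check_suspicious_temp_exe(image_root):
        # flag a file path an analyst has seen used by malware before
        suspect = os.path.join(image_root, "Windows", "Temp", "svchost.exe")
        return ["Found: " + suspect] if os.path.exists(suspect) else []

    def check_empty_prefetch(image_root):
        # the *absence* of an expected artifact can be a finding, too
        pf_dir = os.path.join(image_root, "Windows", "Prefetch")
        if not os.path.isdir(pf_dir) or not os.listdir(pf_dir):
            return ["Prefetch folder is missing or empty"]
        return []

    PLUGINS = [check_suspicious_temp_exe, check_empty_prefetch]

    def scan(image_root):
        # run every check and report the "low hanging fruit" automatically
        for plugin in PLUGINS:
            for finding in plugin(image_root):
                print("[%s] %s" % (plugin.__name__, finding))

    scan(r"F:\mount")   # hypothetical mount point for the acquired image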

Also, at the conference, there were a number of academics and researchers in attendance (and speaking), along with a number of practitioners.  In speaking with some of the practitioners between sessions and after the conference, I heard a common desire for more practical information, and possibly even for separate tracks for practitioners and developers/academics.  There seemed to be a common feeling that while developing applications to parse data and run on multiple cores was definitely a good thing, this only solved a limited number of issues and did not address the issues on the plates of most practitioners right now.  It would be safe to say that many of the practitioners (those that I spoke with) didn't see the value in some of the presentations.

One example of this is bulk_extractor (previous version described here), which Simson L. Garfinkel discussed during the conference.  This is a tool (Windows EXE/DLLs available) that can be run against an image file, and it will extract a number of items by default, including credit card numbers and CCN track 2 data, along with the offset to where within the image file the data was found.  Something like this may seem valuable to those performing PCI forensic exams, but one of the items required for such exams is the name of the file in which the credit card number/track data were located.  As such, where a tool like bulk_extractor might have the most value during a PCI forensic exam is when it's run against the pagefile and unallocated space extracted from the image.  Even so, using three checks (Luhn formula, length, and BIN) only gives you the possibility that you've found a CCN...we found that there are a lot of MS DLLs with embedded GUIDs that appear to be Visa CCNs, even passing all three checks.  In this case, there is some value in what Simson discussed, although perhaps not at its face value.
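For reference, the Luhn check itself is trivial to implement; here's a minimal sketch (the length and BIN checks would be layered on top of this, and passing all three still only means you might have a CCN):

    def luhn_check(digits):
        # digits: a string of ASCII digits, e.g. a candidate PAN
        total = 0
        for i, ch in enumerate(reversed(digits)):
            d = int(ch)
            if i % 2 == 1:       # double every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    # 16-digit test number that passes the checksum (not a live CCN)
    print(luhn_check("4111111111111111"))   # True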

As a side note, another thing you might want to do before running the tool is to contact Simson and determine which CCNs the tool searches for, to ensure that all of the CCNs covered by PCI are addressed.  When I was doing this work, we had an issue with a commercial tool that wasn't covering all the bases, so to speak...so we rolled our own solution.

Recently, I began looking at Windows 7 Jump Lists, and quickly found some very good information about the structure of both the automatic and custom "destinations" files.  One thing I could not find, however, was information regarding the structure of the DestList stream located in the automatic destinations file; to me, this seemed to be of particular value, as the numbered streams follow the MS-SHLLINK file format and contain MAC time stamps for the target file, but nothing about whatever activity led to the creation of the stream in the first place.  Looking at the contents of the DestList stream in a hex editor, and noticing a number of familiar data structures (FILETIME, etc.), it occurred to me that the DestList stream might act like a most recently used (MRU) or most frequently used (MFU) list.  More research is needed, but at this point, I think I may have figured out some of the basic elements of the DestList structure; so far, my parsing code is consistent across multiple DestList streams, including streams from multiple systems.  As a practitioner, I can see the value in parsing the Jump List numbered streams, and I believe that there may be more value in the contents of the DestList stream, which is why I pursued examining this structure.  But again...who determines the value of something like this?  Is there any value to this information, or is it just an academic exercise?  Simply because I look at some data and, as a practitioner, believe that it is valuable, is that a universal assignment, or is it solely my own province?
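I'm not going to publish offsets I haven't verified, but to illustrate the kind of spelunking involved, here's a minimal sketch that brute-forces a binary blob (say, an extracted DestList stream) for 8-byte values that decode to plausible FILETIMEs; the file name is hypothetical, and nothing here should be taken as the actual DestList structure:

    import struct
    from datetime import datetime, timedelta

    def filetime_to_datetime(ft):
        # FILETIME: 64-bit count of 100-nanosecond intervals since 1601-01-01 UTC
        return datetime(1601, 1, 1) + timedelta(microseconds=ft // 10)

    def scan_for_filetimes(data, lo=datetime(2005, 1, 1), hi=datetime(2012, 1, 1)):
        # check every 8-byte window and keep values that land in a sane date range
        hits = []
        for off in range(len(data) - 7):
            (ft,) = struct.unpack_from("<Q", data, off)
            try:
                dt = filetime_to_datetime(ft)
            except OverflowError:
                continue
            if lo <= dt <= hi:
                hits.append((off, dt))
        return hits

    # hypothetical file containing an extracted DestList stream
    with open("DestList.bin", "rb") as f:
        for off, dt in scan_for_filetimes(f.read()):
            print("offset 0x%x -> %s" % (off, dt))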

Who decides the forensic value of data?  Clearly, during an examination the analyst would determine the relative value of data, perhaps based on the goals of the analysis.  But when not involved in an examination, who decides the potential or relative value of data?

14 comments:

Corey Harrell said...

> Who defines "forensic value" when it comes to data?

Great post and very thought provoking. Tools can add different functionality, or people can present on different tools/artifacts, but I think practitioners finally realize the forensic value of data once it has an impact on their case or they can see how it applies to their work. Take timelines, for example. There were a lot of people saying how valuable the timeline analysis technique was, but I didn't fully realize the power of timelines until I started using them to answer questions. Now I find myself saying how valuable timelines are, and some examiners don't see the value of them.

There are different learning styles, and maybe seeing the forensic value of data falls more into the category of learning by doing instead of listening and watching.

Little Mac said...

Very interesting and thoughtful, Harlan. I tend to think that value is in the eye of the beholder (or hands of the practitioner, as it were). It's very subjective. If you are working on something and think it has value, then it does. Just as with Corey's example of timelines, not everyone may agree; over time, overall opinion may change regarding community-wide value.

At the same time I can see where it could be frustrating to spend time and energy researching, testing, and presenting information - only to be met by blank stares. Not saying you were, but I know that sometimes positive feedback is needed to fuel continued work in an area.

If someone's having to sell the value of what they're doing in order to justify their existence, I still think value is subjective. The difficulty then becomes the necessity to have particular people drink the koolaid. ;)

Clintonian said...

Great post Harlan. First time caller, long time listener. Sorry in advance if I digress a little...

As a member of LE in Canada doing tech crime work, I know Value to the Case is something that we reinforce with junior examiners. The baseline for us starts with looking at the charges, and what the elements of the offence are, thus what type of records of activity need to be uncovered. The assumption being that the investigator has seized the computer with grounds making that evidence potentially relevant - but we'll make that leap.

Working in that direction, an understanding of the case and the elements to be proven is very important. That, combined with a thorough knowledge of the technology and of where relevant artifacts lie, will allow the examiner to more directly identify those artifacts with a higher forensic value. I believe "Forensic Value" can best be determined when you have a strong understanding of the high-level elements you are trying to prove.

On your forensic scanner idea: Something like this is needed badly for a couple of reasons.

1) It would be nice if there was a way (we are exploring this in-house) that initial artifacts could be pulled off the digital evidence in a forensically sound manner even at the scene (field triage) to hand off to the investigator for interview purposes immediately. A tool that can accomplish this needs to be a closed loop tool (so they don't hang themselves) usable by field investigators with minimal but requisite training. I would classify Helix and all the many other boot discs as open-ended tools.

2) Being able to peel off evidence via initial triage prior to the main examination will also serve to move the forensic examiner's starting point that much farther ahead. So much of what we do as forensic examiners in the initial steps of a forensic exam could be automated.

Bottom line, if low hanging fruit are there to be had, why not create a process to auto-gather that based on the overall goals of the case? I think that is what your intent is with the scanner. (I stand to be corrected) I liken this to discussions people must have had way back about whether it's "worth it" to put the effort into adding power steering to a car.

Anyway, just my 2c.

@Clintonian on Twitter...

troy said...

Yes, the DestList is the thing that arranges the listed items in either an MRU or MFU format. That is why I told you and others on the win4n6 list to study the DestList. (I will check my slide deck, published through CTIN, and make sure that is clear--if not I will revise them.) While I have worked out the DestList internals, I used internal sources and have no public sources I can point you to. Unfortunately, I therefore cannot legally disclose the DestList portion of my Jump List slides.

I will say that if you force changes in the listing order by clicking on different items in a Jump List, and then compare before and after versions of the DestList, some of the internals should become clear. You could also use process monitor to monitor Jump list activity, particularly file I/O at offset locations in the Jump List. I know this isn't pretty, but it is basically how I start.

In addition to figuring out the DestList, it would be good for us forensics folk to begin aggregating applicationIDs, which identify the application that a particular Jump List pertains to. The applicationIDs are based on name, path and command arguments, so like prefetch hashes, the same application, in the same path, with the same arguments, will have the same applicationID on different systems. If we can develop a good reference of applicationIDs, investigators can quickly select the Jump Lists that are most interesting to their cases.
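Even a simple shared lookup table would be a start: the automatic destinations files are named for the AppID (<appid>.automaticDestinations-ms), so mapping the file names in a user's Recent\AutomaticDestinations folder against a reference list quickly shows which Jump Lists are worth a look. A rough sketch (the AppID values here are placeholders, not a verified reference):

    import os

    # Placeholder AppID -> application mappings; a real reference would be
    # built up and shared by examiners over time.
    KNOWN_APPIDS = {
        "0123456789abcdef": "Hypothetical App A",
        "fedcba9876543210": "Hypothetical App B",
    }

    def map_jump_lists(autodest_dir):
        # automatic destinations files are named <AppID>.automaticDestinations-ms
        for name in os.listdir(autodest_dir):
            if name.lower().endswith(".automaticdestinations-ms"):
                appid = name.split(".")[0].lower()
                print("%s -> %s" % (name, KNOWN_APPIDS.get(appid, "unknown AppID")))

    # path as extracted/exported from an image
    map_jump_lists(r"C:\Users\user\AppData\Roaming\Microsoft\Windows\Recent\AutomaticDestinations")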

Re forensic value:

In law school, we were taught to think through legal issues in terms of elements--as in, the elements of a contract are . . . In some sense, this form of breaking things down can work for forensics. It requires explicit definition of terms, and then thinking through the elements that constitute sufficient evidence or facts to support an inference. In some instances, the work is partially done, in that the elements can be defined in terms of corporate policy or law: e.g., the crime of possession of CP requires 1) CP, 2) a computer user 3) who downloaded the CP, and so on. It may appear to be harder to do this for "intrusion", but think it through. I have, but I am too tired and old to remember right now.

Timelines are one of the best tools for forensics examinations and can be quite useful in determining the forensic value of facts. I was trained to use them back in the bad old days of lawyering, and continued to use them in forensics. Lawyers--aka our clients--understand their value. So, in addition to being a good thinking tool, they are also a good way to give an attorney confidence in your work.

H. Carvey said...

Thanks, everyone, for your comments thus far...this has been somewhat illuminating. I feel another blog post coming on... ;-)

I think that what your responses are illustrating to me is that value is both subjective and relative. In my experience, I would think that both are predicated, in part, on the knowledge, experience, and training of the analyst.

Timelines seem to be a very good illustrative point, so I'll continue using it. I've been creating timelines from multiple data sources for a number of years now, and found considerable value in doing so as part of an exam, when appropriate. I've spoken to others (and continue to do so) about creating timelines, including describing how they can be used to increase relative confidence in data, as well as provide increased context to the data. I've even demonstrated how I've used timelines to discern an issue when no other technique worked.

However, there are still those who have yet to "drink the koolaid", as it were. Why is that? Is the value not recognized? Or is the bar for entry set too high?

Okay, enough of that...back to the value of data. When it comes to timelines, IMHO, there are two basic camps...the "kitchen sink" camp, and the minimalist camp. I'm in the minimalist camp...my thought on timelines is to build them from specific data sources, applying overlays one layer at a time. I've built valuable timelines from just selected web server logs and file system metadata (SQLi attack). I've also developed "micro-timelines", using only selected data sources (or subsets thereof) in order to answer a particular question. For example, I once parsed all of the available logon events from the Security Event Log and produced a timeline from just those entries.
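To illustrate what I mean by a "micro-timeline": once the selected data sources are parsed, building the timeline is little more than normalizing the entries into a common layout and sorting on time. The events below are made-up examples, and the pipe-delimited layout is just one way to do it:

    from datetime import datetime

    # made-up, already-parsed events: (timestamp, source, system, user, description)
    events = [
        (datetime(2011, 6, 20, 14, 3, 12), "EVT", "HOST1", "jdoe", "Logon type 10"),
        (datetime(2011, 6, 20, 14, 5, 47), "FILE", "HOST1", "-", "MACB C:\\Temp\\a.exe"),
        (datetime(2011, 6, 20, 13, 58, 2), "EVT", "HOST1", "jdoe", "Logon type 3"),
    ]

    # the micro-timeline is just the selected events, normalized and sorted
    for ts, source, system, user, desc in sorted(events):
        print("|".join([ts.strftime("%Y-%m-%d %H:%M:%S"), source, system, user, desc]))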

The "kitchen sink" approach is to put everything available into the timeline, and allow the analyst to sift (no pun intended) that for the value. The reasoning for this approach has been that the value of specific data or data sources may not be known until that data is viewed in the context of other data (i.e., "I don't know what I need, so I want everything so I can decide...").

My personal opinion (and this is just my opinion, I'm not trying to push this on anyone) is that this approach is somewhat cumbersome and requires a considerable amount of time and effort to sift through and determine the actual data of value.

I hope this illustrates the point I'm trying to make, which is that the value of data, at this point, appears to be subjective. Given a CP case, for example, I think (based on my experience) that once the requirements of the federal statute have been met, my focus would be on determining who placed the images on the system, when (and how), when they were viewed, etc. And yet, I continue to meet LE analysts who appear to be caught off guard by the "Trojan Defense"...

Phil Rodokanakis said...

Regarding JumpLists, I haven't come across many Windows 7 systems yet as the corporate world is slow in upgrading. However, I've been experimenting with ProDiscover and noticed that it has the option to analyze JumpLists. Have you talked to Chris Brown about this?

H. Carvey said...

Phil,

I'm a user of ProDiscover, and I have been since version 3.0. I have seen that recent updates to PD include the ability to parse jump lists...and as I'm sure you've seen, they do not parse the DestList stream.

Thanks.

dk said...

Hi Harlan.

The audience at OSDFC 2011 was definitely quiet but I wouldn't misinterpret that as them not seeing the value in what you presented. Your presentation was quite good and flowed with the standard high energy and passion that we expect and appreciate from you!

I think that most people were listening, absorbing, and for me, waiting to head back to my computer to try some of this stuff out or write down some ideas. Some people just don't get it but for myself, I find trying it out, and seeing the practical application of the tool(s), to be of most value. I think your plugin system to extend the Forensic Scanner is great and whether others see the value in it or not, who cares? You have to scratch your own itch. If you see value in it then IMO it doesn't matter if others find it useful as well. If they do, great, if not, no big deal.

In regards to the CCN track 2 data that `bulk_extractor' extracts, we worked on this with Simson and it's relatively sound (as of bulk_extractor-1.0.0) with regard to the data it extracts. The track 2 format is fairly specific, which reduces the number of false positives, i.e., credit_card_number==MMYY\d{3}security_data. For anyone doing economic crimes files, skimmers, and other credit card fraud, it has proven to be fairly valuable information. The track 2 regex definitely needs more testing, as I'm sure there is some fine tuning that can be done there to make it even better!
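Roughly speaking, the approach is a track-2-style pattern with the Luhn check layered on top of any match; the sketch below is only an illustration of that kind of pattern, not the actual expression shipped in `bulk_extractor':

    import re

    # Illustrative only: PAN, '=' separator, expiry field, service code, then
    # discretionary data. The real bulk_extractor regex is more refined than this.
    TRACK2_LIKE = re.compile(r"\b(\d{12,19})=(\d{4})(\d{3})(\d+)\b")

    sample = "noise 4111111111111111=15121010000000000000 more noise"
    for match in TRACK2_LIKE.finditer(sample):
        pan, expiry, service, discretionary = match.groups()
        print("candidate PAN: %s  expiry field: %s" % (pan, expiry))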

Also, in regards to `bulk_extractor' and how to find the actual filenames that an offset refers to: Simson includes an `identify_filenames.py' Python script in the 'python' directory of the latest `bulk_extractor'. This script can be run against the `bulk_extractor' output, in combination with a `fiwalk' XML output file, and it will identify which file on the filesystem contains that offset. We are going to write a simple script which will identify the filenames without requiring the `fiwalk' XML output, as this is a bit of a barrier right now (IMO), since many people don't have the `fiwalk' output. Once this is done I'll submit it to Simson and hopefully he'll roll it into his tarball.

Keep up the good work!

troy said...

Phil,

I try to send Chris Brown any new discoveries or research that I want to see incorporated into tools. He has always been very good about building capability to handle new Windows artifacts. Check out what he has done with shadow copies, for example.

More on forensic value:

Philosophically and practically, the foundation of forensic value starts with the concepts embedded in our use of "known good" file hashes.

Stepping up, I have two big buckets for thinking about cases--they are usually more or less content focused (e.g., eDiscovery) or more or less activity related (determining who did what or what has happened).

I would posit that forensics value is not so much subjective, but rather context driven. The questions that need to be answered determine the facts that are relevant.

Jump Lists, for example, might not be important in investigating something like stuxnet. They could be very relevant in an IP theft case or a CP case. Moreover, the DestList would be relevant for determining the frequency or recency of file usage.

H. Carvey said...

> I would posit that forensics value is not so much subjective, but rather context driven. The questions that need to be answered determine the facts that are relevant.

I would further suggest that the value of data is then subject to the training, knowledge and experience of the examiner, and is again, subjective.

Given two arbitrary analysts, with the same case, data, and goals, you're going to get two different results.

Simson said...

bulk_extractor comes with a script called identify_filenames.py which provides the file name for every discovered feature.

Simson said...

bulk_extractor comes with a script called identify_filenames.py which tells you the files that the features were found in.

H. Carvey said...

another blogger with spelling problems,

Any thoughts on the content?

Thanks.

H. Carvey said...

Another thought on data value being subjective...

Consider an exam involving possible malware (I'm using this as an example as I get a lot of these...)...when performing these exams, I have a checklist of things I run through. I've talked to others in the past about NTFS alternate data streams (ADSs)...when I've presented on this topic, I get a lot of blank stares, but on a recent exam, of the 60 ADSs on the system, I did find one that wasn't part of normal user activity. It turned up in the timeline during the same time frame as artifacts for a possible keylogger installation.

Another thing I check for during these exams, particularly the ones where the alleged malware isn't known, is MBR infectors. I have a script that I run against the image file and that quickly lets me see enough information to determine the likelihood of an MBR infector being installed. I've talked to other analysts about this, and their candid answers have been that, no, they don't check for this.
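I'm not posting my script, but the basic idea is simple enough to sketch out: read the first sector, verify the 0x55AA signature, hash the boot code so it can be compared against known-good baselines or other systems, and dump the partition table to see if anything looks odd. Something along these lines (the image name is just an example):

    import hashlib
    import struct

    def check_mbr(image_path):
        with open(image_path, "rb") as f:
            mbr = f.read(512)
        print("Boot signature present:", mbr[510:512] == b"\x55\xaa")
        # hash the boot code (first 440 bytes) for comparison against baselines
        print("Boot code MD5:", hashlib.md5(mbr[:440]).hexdigest())
        # four 16-byte partition table entries begin at offset 446
        for i in range(4):
            entry = mbr[446 + i * 16:446 + (i + 1) * 16]
            boot_flag, ptype = entry[0], entry[4]
            start_lba, num_sectors = struct.unpack_from("<II", entry, 8)
            print("Partition %d: type 0x%02x active=%s start=%d sectors=%d"
                  % (i, ptype, boot_flag == 0x80, start_lba, num_sectors))

    check_mbr("image.dd")   # raw/dd image; example file name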

My point is that the value of data is subjective...not just in the context of the case or exam, but also in the context of the analyst or examiner. I often find that the absence of an artifact where one is expected is in itself an artifact...NOT finding something where I would expect it can be as valuable as, or more valuable than, finding it.