Monday, May 30, 2016

What's the value of data, and who decides?

A question that I've been wrestling with lately is, for a DFIR analyst, what is the value of data?  Perhaps more importantly, who decides the value of available data?

Is it the client, when they state what their goals, what they're looking for from the examination?  Or is it the analyst who interprets both the goals and the data, applying the latter to the former?

Okay, let me take a step back...this isn't the only time I've wrestled with this question. In fact, if you look here, you'll see that this is a question that has popped up in this blog before.  There have been instances over the past almost two decades of doing infosec work that I, and others, have tussled with the question, in one form or another.  And I do think that this is an important question to turn to and discuss time and again, not specifically to seek an answer from one person, but for all of us to share our thoughts and hear what others have to say and offer to the discussion.

Now, back to the question...who determines the relative value of data during an examination?  Let's take a simple example; a client has an image of an employee's laptop (running Windows 7 SP1), and they have a question that they would like answered.  That question could be, "Is/was the system infected with malware?", or "...did the employee perform actions in violation of acceptable use policies?", or " there evidence that data (PII, PHI, PFI, etc.) had been exfiltrated from the system?"  The analyst receives the image, and runs through their normal in-processing procedures, and at that point, they have a potential wealth of information available to them; Prefetch files, Registry data, autoruns entries, the contents of various files (hosts, OBJECTS.DATA, qmgr0.dat, etc.), Windows Event Log records, the hibernation file, etc.

Just to be clear...I'm not suggesting an answer to the question.  Rather, I'm putting the question out there for discussion, because I firmly believe that it's important for us, as a profession, to return to this question on a regular basis. Whether we're analyzing individual images, or performing enterprise incident response, I tend to think that sometimes we can get caught up in the work itself, and every now and then it's a good idea to take a moment and do a level set.

Data Interpretation
An issue that I see analysts struggling with is the interpretation of the data that they have available.  A specific example is what is referred to as the Shim Cache data.  Here are a couple of resources that describe what this data is, as well as the value of this data:

Mandiant whitepaper, 2012
Mandiant Presentation, 2013
FireEye blog post, 2015

The issue I've seen analysts at all levels (new examiners, experienced analysts, professional DFIR instructors) struggling with is in the interpretation of this data; specifically, updates to clients (as well as reports of analysis provided to a court of law) will very often refer to the time stamp associated with the data as indicating the date of execution of the resource.  I've seen reports and even heard analysts state that the time stamp associated with a particular entry indicates when that file was executed, even though there is considerable documentation readily available, online and through training courses, that states that this is, in fact, NOT the case.

Data interpretation is not simply an issue with this one artifact.  Very often, we'll look at an artifact or indicator in isolation, outside and separate from its context with respect to other data "near" it in some manner.  Doing so can be extremely detrimental, leading an analyst down the wrong road, down a rabbit hole and away from the real issue at hand.

The question then becomes, if we, as a community and a profession, do not have a solid grasp of the value and correct interpretation of the data that we do have available to us now, is it necessarily a good idea to continue adding even more data for which we may not have even a passing understanding?

Lately, there has been considerable discussion of shell items on Windows systems.  Eric's discussed the topic on his BinForay blog, and David Cowen recently conducted a SANS webcast on the topic.  Now, shell items are not a new topic at all...they've been discussed previously within the community, including within this blog (here, and here).  Needless to say, it's been known within the DFIR community for some time that shell items are the building blocks (in part, or in whole) for a number of Windows artifacts, including (but not limited to) Windows shortcut/*.lnk files, Jump Lists, as well as a number of Registry values.

Now, I'm not suggesting that we stop discussing shell items; in fact, I'm suggesting the opposite, that perhaps we don't discuss this stuff nearly enough, as a community or profession.

Circling back to the original premise for this post, how valuable is ALL the data available from shell items?  Yes, we know that when looking at a user's shellbag artifacts, we can potentially see a considerable number of time stamps associated with a particular MRU time, several DOSDATE time stamps, and maybe even an NTFS MFT sequence number.  All, or most, of this can be available along with a string that provides the path to an accessed resource.  Further, in many cases, this same information can be derived from other data sources that are comprised of shell items, such as Windows shortcut files (and by association, Jump Lists), not to mention a wide range of Registry values.

Many analysts have said that they want to see ALL of the available data, and make a decision as to its relative value.  But at what point is ALL the data TOO MUCH data for an analyst?  There has to be some point where the currently available data is not being interpreted correctly, and adding even more misunderstood/misinterpreted data is detrimental the analyst, to the case, and most importantly, to the client.

Let's look at another example; a client comes to you with a Windows server system, says that the system appears to have been infected with ransomware a week prior, and wants to know the source of the infection; how did the ransomware get on the system in the first place?  At this point, you have what the client's looking for, and you also have a time frame on which to focus your examination. During your analysis, you determine the initial infection vector (IIV) for the ransomware, which appeared to have been placed on the system by someone who'd subverted the system's remote access capability.  However, during your examination, you also notice that 9 months prior to the ransomware infection, another bit of malware seemed to have infected the system, possibly due to a user's errant web surfing.  And you also see that about 5 months prior to that, there were possible indications of yet another malware infection of some kind.  However, having occurred over a year ago, the IIV and any impact of the infection is indeterminate.

The question is now, do you provide all of this to the client?  If the client asked a specific question, do you potentially bury that answer in all of your findings?  Perhaps more importantly, when you do share all of your findings with them, do you then bill them for the time it took to get to that point?  What if the client comes back and says, "...we asked you to answer question A, which you did; however, you also answered several other questions that we didn't ask, and we don't feel that we should pay for the time it took to do that analysis, because we didn't ask for it."

If a client asks you a specific question, to determine the access vector of a ransomware infection, do you then proceed to locate and report all of the potential malware infections (generic Trojans, BHOs, ect.) you could find, as well as a list of vulnerable, out-of-date software packages?

Again, I'm not suggesting that any of what I've described is right or wrong; rather, I'm offering this up for discussion.


43nsicbot said...

Hi Harlan
In my opinion i feel it is a little bit of both, the client\business and analyst's responsibility to put value to the data. i think that when the client or business (if working in an enterprise) provide a set of questions that they feel should be answered it is because they feel it is important to them. Now to us as analyst perhaps we may encounter questions that seem a little far fetch perhaps due to lack of technical understanding and I think as analysts we should help them in providing guidance and scope with their questions.

In regards to interpreting the data, i feel an analyst should be responsible in understanding the data as best as possible. But this is more of a personal feeling, not everyone will seek further understanding of the data/artifacts but i don't think that should stop further release of research or information on new data/artifacts. If anything i think that is an indicator of analysts that maybe go after "shiny new things".

As far as reporting, for an enterprise besides the questions set by the business i feel an analyst should also set their own questions and goals and report on additional findings if time permits. Consulting is different and i cannot fully offer an opinion given the lack of experience in this realm. But perhaps providing a good statement of work or a conversation with the client stating what should happen if an analyst finds hints of additional suspicious activity besides what they are being asked to investigate would suffice? and additional charges? maybe? hopefully some one else can interject their opinion on this.

H. Carvey said...


Thanks for your comments...very illuminating.

... i feel an analyst should be responsible in understanding the data as best as possible. But this is more of a personal feeling, not everyone will seek further understanding of the data/artifacts...

I'm afraid you're right.

...i don't think that should stop further release of research or information on new data/artifacts.

I agree that we shouldn't stop releasing new research...not that there's a large group of analysts doing this, really...but I am concerned about the quality of work as new tools (released as part of that research) are being used by analysts, and the (misunderstood and misinterpreted) data is being reported on.