Monday, July 22, 2013

HowTo: Add Intelligence to Analysis Processes


How many times do we launch a tool to parse some data, and then sit there looking at the output, wondering how someone would see something "suspicious" or "malicious" in the output?  How many times do we look at lines of data, wondering how someone else could easily look at the same data and say, "there it is...there's the malware"?  I've done IR engagements where I could look at the output of a couple of tools and identify the "bad" stuff, after someone else had spent several days trying to find out what was going wrong with their systems.  How do we go about doing this?

The best and most effective way I've found to get to this point is to take what I learned on one engagement and roll it into the next.  If I find something unusual...a file path of interest, something particular within the binary contents of a file, etc...I'll attempt to incorporate that information into my overall analysis process and use it during future engagements.  Anything that's interesting, as a result of either direct or ancillary analysis will be incorporated into my analysis process.  Over time, I've found that some things keep coming back, while other artifacts are only seen every now and then.  Those artifacts that are less frequent are no less important, not simply because of the specific artifacts themselves, but also for the trends that they illustrate over time.

Before too long, the analysis process includes, "access this data, run this tool, and look for these things..."; we can then make this process easier on ourselves by taking the "look for these things" section of the process and automating it.  After all, we're human, get tired from looking at a lot of data, and we can make mistakes, particularly when there is a LOT of data.  By automating what we look for (or, what we've found before), we can speed up those searches and reduce the potential for mistakes.

Okay, I know what you're going to say..."I already do keyword searches, so I'm good".  Great, that's fine...but what I'm talking about goes beyond keyword searches.  Sure, I'll open up a lot of lines of output (RegRipper output, web server logs) in UltraEdit or Notepad++, and search for specific items, based on information I have about the particular analysis that I'm working on (what are my goals, etc.).  However, more often than not, I tend to take that keyword search one step further...the keyword itself will indicate items of interest, but will be loose enough that I'm going to have a number of false positives.  Once I locate a hit, I'll look for other items in the same line that are of interest.
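As a rough illustration of taking a keyword hit one step further, here is a minimal sketch (in Python rather than Perl) that first matches a deliberately loose keyword and then checks the same line for other items of interest.  The keyword, the secondary patterns, and the sample lines are hypothetical placeholders, not values from any particular case.

```python
import re

# Hypothetical values for illustration only; real ones come from the case goals.
LOOSE_KEYWORD = "temp"
SECONDARY_PATTERNS = [
    re.compile(r"\.exe\b", re.IGNORECASE),  # an executable on the same line
    re.compile(r"\.dll\b", re.IGNORECASE),  # a DLL on the same line
]

def refine_hits(lines):
    """Yield lines that contain the loose keyword AND a secondary item."""
    for line in lines:
        if LOOSE_KEYWORD not in line.lower():
            continue  # no keyword hit at all
        if any(p.search(line) for p in SECONDARY_PATTERNS):
            yield line

# Made-up sample lines standing in for tool output
sample = [
    r"C:\Users\bob\AppData\Local\Temp\notes.txt",
    r"C:\Users\bob\AppData\Local\Temp\a1b2.exe",
    r"C:\Windows\System32\kernel32.dll",
]
hits = list(refine_hits(sample))
```

Here only the second sample line survives both checks; the first matches the keyword alone and the third never matches the keyword.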

For example, let's take a look at Corey Harrell's recent post regarding locating an injected iframe.  This is an excellent, very detailed post where Corey walks through his analysis process, and at one point, locates two 'suspicious' process names in the output of a volatile data collection script.  The names of the processes themselves are likely random, and therefore difficult to include in a keyword list when conducting a search.  However, what we can take away from just that section of the blog post is that executable files located in the root of the ProgramData folder would be suspicious, and potentially malicious.  Therefore, a script that parses the file path and looks for that condition would be extremely useful, and written in Perl, might look something like this:

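# $filepath is assumed to hold a full Windows path taken from the tool output
my @path = split(/\\/,$filepath);
my $len = scalar(@path);
# Alert on executables located directly in the root of the ProgramData folder
if ($len >= 2 && lc($path[$len - 2]) eq "programdata" && lc($path[$len - 1]) =~ m/\.exe$/) {
  print "Suspicious path found: ".$filepath."\n";
}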

Similar paths of interest might include "AppData\Local\Temp"; we see this one and the previous one in one of the images that Corey posted of his timeline later in the blog post, specifically associated with the AppCompatCache data output.
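One way to cover both paths in a single check is sketched below, in Python rather than Perl.  Treating an executable anywhere beneath AppData\Local\Temp as suspicious (rather than only directly inside it) is an assumption on my part, and the function name is purely illustrative.

```python
from pathlib import PureWindowsPath

def is_suspicious(filepath):
    """Return True for an .exe in the ProgramData root or beneath the user Temp folder."""
    p = PureWindowsPath(filepath)
    if p.suffix.lower() != ".exe":
        return False
    # Lowercased folder components, excluding the file name itself
    parents = [part.lower() for part in p.parts[:-1]]
    # Executable sitting directly in the root of ProgramData
    if parents and parents[-1] == "programdata":
        return True
    # Executable anywhere beneath AppData\Local\Temp (assumption: any depth)
    needle = ["appdata", "local", "temp"]
    return any(parents[i:i + 3] == needle for i in range(len(parents) - 2))
```

Because the comparison is done on path components rather than on substrings, a folder such as "ProgramData\Vendor" does not trigger the first rule.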

Java *.idx files
A while back, I posted about parsing Java deployment cache index (*.idx) files, and incorporating the information into a timeline.  One of the items I'd seen during analysis that might indicate something suspicious is the last modified time embedded in the server response being relatively close (in time) to when the file was actually sent to the client (indicated by the "date:" field).  As such, I added a rule to my own code, and had the script generate an alert if the "last modified" field was within 5 days of the "date" field; this value was purely arbitrary, but it would've thrown an alert when parsing the files that Corey ran across and discussed in his blog.
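A rule like that can be sketched in a few lines.  The Python below is not the code from my parser; it assumes the two field values have already been extracted from the *.idx file as HTTP-style date strings that both carry a time zone (e.g., "GMT"), and the function name is made up for illustration.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

ALERT_WINDOW = timedelta(days=5)  # purely arbitrary threshold, as noted above

def idx_alert(last_modified, date):
    """Return True if the two HTTP-style date strings fall within ALERT_WINDOW."""
    # Assumption: both strings include a time zone such as "GMT"
    modified = parsedate_to_datetime(last_modified)
    sent = parsedate_to_datetime(date)
    return abs(sent - modified) <= ALERT_WINDOW
```

For example, idx_alert("Sat, 20 Jul 2013 08:00:00 GMT", "Mon, 22 Jul 2013 10:00:00 GMT") would alert, as the two values are only a couple of days apart.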

Adding intel is generally difficult to do with third-party, closed source tools that we download from someone else's web site, particularly GUI tools.  In such cases, we have to access the data in question, export that data out to a different format, and then run our analysis process against that data.  This is why I recommend that DFIR analysts develop some modicum of programming skill...you can either modify someone else's open source code, or write your own parsing tool to meet your own specific needs.  I tend to do this...many of the tools I've written and use, including those for creating timelines, will incorporate some modicum of alerting functionality.  For example, RegRipper version 2.8 incorporates alerting functionality directly into the plugins. This alerting functionality can greatly enhance our analysis processes when it comes to detecting persistence mechanisms, as well as illustrating suspicious artifacts as a result of program execution.

Writing Tools
I tend to write my own tools for two basic reasons:

First, doing so allows me to develop a better understanding of the data being parsed or analyzed.  Prior to writing the first version of RegRipper, I had written a Registry hive file parser; as such, I had a very deep understanding of the data being parsed.  That way, I'm better able to troubleshoot an issue with any similar tool, rather than simply saying, "it doesn't work", and not being able to describe what that means.  Around the time that Mandiant released their shim cache parsing script, I found that the Perl module used by RegRipper was not able to parse value "big data"; rather than contacting the author and saying simply, "it doesn't work", I was able to determine what about the code wasn't working, and provide a fix.  A side effect of having this level of insight into data structures is that you're able to recognize which tools work correctly, and select the proper tool for the job.

Second, I'm able to update and make changes to the scripts I write in pretty short order, and don't have to rely on someone else's schedule to allow me to get the data that I'm interested in or need.  I've been able to create or update RegRipper plugins in around 10 - 15 minutes, and when needed, create new tools in an hour or so.

We don't always have to get our intelligence just from our own analysis. For example, this morning on Twitter, I saw a tweet from +Chris Obscuresec indicating that he'd found another DLL search order issue, this one on Windows 8 (application looked for cryptbase.dll in the ehome folder before looking in system32); as soon as I saw that, I thought, "note to self: add checking for this specific issue to my Win8 analysis process, and incorporate it into my overall DLL search order analysis process".
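A check like that might start out as nothing more than a watch list of DLL names and the folder each is expected to load from.  The Python sketch below is only an illustration; the folder string in the watch list and the function name are assumptions, and a real list would need to account for other legitimate locations (such as SysWOW64) as well.

```python
from pathlib import PureWindowsPath

# DLL name -> folder it is expected to load from (lowercase).
# The cryptbase.dll entry reflects the issue mentioned above; the exact
# folder string is an assumption for illustration.
WATCH_LIST = {
    "cryptbase.dll": r"c:\windows\system32",
}

def misplaced_dlls(filepaths):
    """Return paths whose file name is on the watch list but whose folder is unexpected."""
    flagged = []
    for fp in filepaths:
        p = PureWindowsPath(fp)
        expected = WATCH_LIST.get(p.name.lower())
        if expected is not None and str(p.parent).lower() != expected:
            flagged.append(fp)
    return flagged
```

Run against a file listing or timeline, this would flag a copy of cryptbase.dll found in the ehome folder while ignoring the one in system32.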

The key here is that no one of us knows everything, but together, we're smarter than any one of us.

I know that what we've discussed so far in this post sounds a lot like the purpose behind the OpenIOC framework.  I agree that there needs to be a common framework or "language" for representing and sharing this sort of information, but it would appear that some of the available frameworks may be too stringent, not offer enough flexibility, or are simply internal to some organizations.  Or, the issue may be as Chris Pogue mentioned during the 2012 SANS DFIR Summit..."no one is going to share their secret sauce."  I still believe that this is the case, but I also believe that there are some fantastic opportunities being missed because so much is being incorporated under the umbrella of "secret sauce"; sometimes, simply sharing that you're seeing something similar to what others are seeing can be a very powerful data point.

Regardless of the reason, we need to overcome our own (possibly self-imposed) roadblocks for sharing those things that we learn, as sharing information between analysts has considerable value.  Consider this post...who had heard of the issue with imm32.dll prior to reading that post?  We all become smarter through sharing information and intelligence.  This way, we're able to incorporate not just our own intelligence into our analysis processes, but we're also able to extend our capabilities by adding intelligence derived and shared by others.

9 comments:

Anonymous said...

An open minded approach to tools is essential to an impartial examiner. I would also add peer review is critical be it analysis, coding or report writing.

Joachim Metz said...

Re: Writing Tools.

A little bit more nuance is in order here. Roughly these tools could be categorized as: 1. scripts that automate basic analysis tasks/slice-and-dice data (even one-offs), 2. components that can be reused across investigations (e.g. libraries), 3. frameworks. You're referring to the first one (1), and yes I agree learn how to write and build/run those. Besides understanding the formats you are parsing it will also provide insight in what your other tooling might be doing different/wrong.

The other two (2 and 3) are more complex and take far more than "one hour" to build, test and deal with edge-cases and corruption scenarios. IMO it is better to do that in a shared effort and report to the author you've found a possible edge case that does not work.

The last part is documentation. Some file/data formats are horrendous regarding edge cases, again sharing your findings regarding a file format are valuable to others as well. If you have not documented the format you're actually parsing how can you and others determine what you're actually parsing. Especially if constantly new findings are done regarding a certain file/data format e.g. Shell Items. Thus if you think the source is the documentation make sure to make it readable.

Harlan Carvey said...

@Joachim,

IMO it is better to do that in a shared effort and report to the author you've found a possible edge case that does not work.

I agree, although I honestly don't think that most DFIR analysts do that. After all, what would they do? Simply send an email or tweet, saying, "..it doesn't work..."? Anything else would require either troubleshooting, or some sort of sample data...

I also agree with you regarding the documentation.

Joachim Metz said...

Re: I honestly don't think that most DFIR analysts do that.

I don't think this is limited to DFIR analysts ;)

Re: tweeting "..it doesn't work..."?

For me the situation is a bit more nuanced. In my experience you have people that are constructive about "it doesn't work" and those that just need to vent their frustration (none-structive). My note to the latter if you're tweeting/mailing/posting about an issue you have with a project that I produced it's unlikely that I'll be reading it and thus it will not get fixed/changed.

Just to make clear, it isn't a bad idea to ask on those media if someone else might have dealt with the same problem as you (which I see more as a constructive approach). Again it is unlikely that I'll be reading this. So feel free to blog, write a wiki article about it so you can help your fellow analysts in not having to solve the same problems as you had to.

So if the problem persists my advice is to use the "official" channels of the project. Not that all of the authors/maintainers will answer. Some are very busy people or have abandoned the project and you'll luck out. Also don't expect every author to be receptive to these requests as well. Different people have different motivations to open up their tools.

Now if you get a hold of one of the receptive ones, the more you are prepared the better. Building/running/deploying software can be tricky but if you don't address it in the right way, to the right people it does not get changed. Filing a bug request because something is not working is often a good idea. Please do your homework and be prepared to provide detailed information; test files, debug logs, source code, etc. If both parties are willing to put effort into fixing the issue, it will provide for the tooling you would like to have in the first place. And sometimes you catch the person at the wrong moment and you just have to be persistent.

Now there is also this group of people that it is very difficult to communicate with. Often due to a language/cultural barrier but e.g. sometimes this also happens with a highly technical person talking to a less technical person (which could be considered as speaking a different language ;). These will often start with "it does not work", but that will often resolve itself if both sides are patient enough; which can be difficult when time is pressing.

Now there is this small group that will remain none-structive and unclear whatever you try. For those don't bother with reporting an issue. Find a place where you can vent your frustration, preferably as far away as possible from me and keep doing so to at least keep the people around you safe ;)

Harlan Carvey said...

@Joachim,

Going back to a statement in your first comment...

If you have not documented the format you're actually parsing how can you and others determine what you're actually parsing.

Honestly, I don't think that the vast majority of DFIR analysts really care. Many times, you can see comments where someone will state that they want the newly available tool so that they can run it to validate the output of another tool.

IMHO, this is the wrong approach to take, but it seems to be what most analysts, at least those who make comments on the Internet, tend to do.

I agree documentation of file formats can be useful and valuable, but I also believe that is the case for a relatively small number of individuals.

Joachim Metz said...

> Honestly, I don't think that the vast majority of DFIR analysts really care.
> Many times, you can see comments where someone will state that they want the
> newly available tool so that they can run it to validate the output of another tool.

Alas this seems to be the current situation. Although I have the impression that this is slowly changing (in contrast to a couple of years ago) e.g. exposing the Java idx format and more attention for it in various recent articles and posts; at least I hope so because formats and systems keep becoming more-and-more complex and thus the gap between those that do and don't will only increase. If you're a believer in the survival of the fittest theory, this will solve itself ;)

An interesting question would be what do these "DFIR analysts" do when both tools provide them with significantly different results? How are they able to judge the results on correctness and determine whether the output is fact or a fabrication? I always was under the impression that is what the F in DFIR stands for.

Personally I think having a basic understanding of the majority of formats you're analyzing is a minimum requirement.
Understanding the format will allow you to utilize important insights in the analysis approach you take.
It allows you to understand how to (likely) interpret the date and time value you're looking at or maybe that you need to figure out why that file creation time of a Word Document in the Outlook Secure Temp folder is not what you expect it to be. Or utilizing the timestamp in the Word Binary format to determine whether the timestamps of the OLE Compound File were tampered with. Or what about the timestamps in the MAPI conversation index?

Understanding format has helped me many times in the past not to waste time on processing and analysis techniques that were not relevant to the case, hence I would say adding "Common Sense to Analysis Processes".

The attitude is part personality, but also has to do with the type of cases you're working on and the environment in which you do so. If you're never triggered to need to automate your analysis process or to understand a file format, or keep up with recent findings, then why bother? It is much more time and energy effective not to do so ;)

But luckily there are posts like yours and those of others, people that are pushing more format documentation out there, open source tooling, etc. For those that do I would say keep up the good work and focus on the people that do seem to care, not on those that don't.

Old Chinese proverb says: "You cannot teach a person that doesn't want to learn."

Harlan Carvey said...

@Joachim,

Although I have the impression that this is slowly changing...

I can see what you're referring to (Java *.idx file format), but I'm not sure that a small handful of people pursuing the parsing really constitutes a "change", simply because once the tools were posted, that was it. No more discussion, and most folks were back to doing nothing more than downloading the tools.

I always was under the impression that is what the F in DFIR stands for.

You and me both, brother, but I don't really think that's the case.

I see what you're saying about the tools...for example, right now there are several tools available that parse shellbag artifacts and completely miss one type of shell item. Yet, many analysts simply don't care...as long as they see what they need, I've been told, that's all that matters.

Personally I think having a basic understanding of the majority of formats you're analyzing is a minimum requirement.

I agree with you, in part, but within the current DFIR framework for most folks, I don't think it's practical. I do think that analysts should have the ability to do minimal troubleshooting, however...I still get "it doesn't work" emails with respect to some of the tools I've written, and these emails are not accompanied with even the most basic troubleshooting (ie, open the Registry hive file in a hex editor and see if it even contains any information...).

Harlan Carvey said...

@Joachim,

If you're a believer in the survival of the fittest theory, this will solve itself ;)

I don't think it will...because what you or I might understand as "fittest" isn't what customers are looking for.

When I was in graduate school, in the military, I was required to take courses in acquisition management. The first day of the first course, the Army LtCol teaching the course made it clear that regardless of the regulations and intent of the military/DoD acquisition process, it all came down to one thing...lowest initial cost. Hence, the adage that everything that soldiers, Marines, sailors, and airmen take into combat is made by the lowest bidder.

When I was with the IBM ISS ERS team and we were certified to perform PCI forensic audits, we were on a list with 5 or 6 other vendors...and we'd get calls all the time where a "victim" would be shopping around, looking for who could provide the service at the lowest cost.

As long as DFIR analysis and response is viewed purely as a cost center and has no other value associated with it, those who feel that they need this service will always look to the least expensive vendor, while those who delve into and develop a deep understanding of formats, and write documentation and tools, will be relegated to posting to blogs, ForensicsWiki, and giving presentations. ;-)

Joachim Metz said...

> No more discussion, and most folks were back to doing nothing more than downloading the tools.

Maybe not regarding idx, but then again you can discuss a topic only so far before it becomes no longer effective. Although there have been some interesting findings regarding MSIE and Java white listing since then. Apparently discussing these things is not something the majority finds relevant enough to do so and the people that do are likely busy with other things.

I was largely referring to the fact that a couple of years ago these discussions (in my impression) only happened at certain conferences, e.g. DFRWS, but now seem to be more common.

> You and me both, brother, but I don't really think that's the case.

Alas, in the general case, I don't think so either ;)

> I see what you're saying about the tools...for example, right now there are several tools available that parse
> shellbag artifacts and completely miss one type of shell item.

And the ones that don't likely don't parse several of the edge cases, e.g. the junction extension block. Let alone the edge cases that are still unknown.

> I don't think it will...because what you or I might understand as "fittest" isn't what customers are looking for.

By "fittest" I mean the most desirable trait at a certain given moment. If this is low costs at one point, it will be low costs. If this is quality at another point, this will be quality. With all pros and cons it will have.

I totally agree low cost is for most organisations a very important selling point. Which often continues up to the point that it is too late and serious things happen due to the cost concessions. Isn't that why (DF)IR exists largely in the first place?

> will be relegated to posting to blogs, ForensicsWiki, and giving presentations. ;-)

That or they'll come up with a disruptive business model and/or technology ;)