Tuesday, September 20, 2005

Issues with timeline analysis

I've been doing some searches regarding timeline analysis, delving deeper into this. I still don't think that scatter plots and histograms are the way to report on and present this sort of information...there is just too much information that can and needs to be presented, and too many possible ways that it can be viewed.

In my Googling, I'm finding a good many references to "timeline analysis", as well as to the term "reconstruction". A lot of this is being offered as a service. There are also products available that can assist you in your timeline development and analysis, but many seem to be limited strictly to file MAC times. Given the various sources for timeline data that are available on a Windows system, relying solely on file MAC times is not doing anyone any good.

So, I think that we're doing pretty well when considering sources of information, and now the question is, how do we present this information so that it's understandable? Is there an "access view" that looks at last access times of files, where a "modified view" would look at last modification times of files, as well as Registry key LastWrite times? What about things like times maintained in the document properties of OLE docs (ie, Word docs, Excel spreadsheets, etc.)? At what point is there too much information? How does one winnow out the valid, normal, usual stuff to get to the interesting stuff?

There's definitely some room for thought and development here. At this point, my thinking is that there'd be some sort of database involved, along with perhaps a Java interface for issuing queries and displaying the information. I haven't written Java GUIs in a while, but perhaps a zoomable, scalable depiction of a timeline would be a good start...the investigator could pick a time of interest, and the initial queries would display a view of the timeline of activity, plus and minus a certain amount of time (determined by the investigator). Perhaps an interface into the database that shows which information is available, and lets the investigator select items to add to and subtract from the view, would be helpful. Add to that color coding of event records from the Event Log, and some other graphical representations of event severity, and we may have something useful.
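Just to make the "pick a time, show the surrounding activity" idea a bit more concrete, here's a rough sketch in Python (SQLite standing in for whatever database ends up being used; the table and column names are made up purely for illustration):

    import sqlite3
    from datetime import datetime, timedelta

    # Hypothetical schema: one normalized row per timestamped artifact,
    # regardless of source (file MAC times, Registry key LastWrite times,
    # Event Log records, OLE document properties, etc.)
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE events (
            ts      TEXT,   -- ISO 8601 timestamp
            source  TEXT,   -- 'file_mac', 'registry', 'evtlog', ...
            item    TEXT,   -- path, key, or record identifier
            detail  TEXT    -- e.g. 'last_access', 'LastWrite', event description
        )""")

    def activity_window(conn, time_of_interest, delta_minutes=30, sources=None):
        """Return everything within +/- delta of the time of interest,
        optionally limited to the sources the investigator selected."""
        lo = (time_of_interest - timedelta(minutes=delta_minutes)).isoformat()
        hi = (time_of_interest + timedelta(minutes=delta_minutes)).isoformat()
        sql = "SELECT ts, source, item, detail FROM events WHERE ts BETWEEN ? AND ?"
        args = [lo, hi]
        if sources:
            sql += " AND source IN (%s)" % ",".join("?" * len(sources))
            args += list(sources)
        return conn.execute(sql + " ORDER BY ts", args).fetchall()

    # e.g., everything within 15 minutes of a suspect event
    hits = activity_window(conn, datetime(2005, 9, 20, 14, 30),
                           delta_minutes=15, sources=["registry", "evtlog"])

Zooming in or out would then just be a matter of re-running the query with a different delta.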

I mentioned Java above so that the whole thing would be cross-platform, but I can easily see where HTML or XML would work as well. With a MySQL database, and the necessary filters to parse out whatever information is available to the investigator and get it into the database, I think we may have a pretty serious tool on our hands.
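As a rough idea of what one of those "filters" might look like, this sketch just normalizes file MAC times into the same kind of rows (in practice it would parse a file listing exported from the image rather than stat'ing a live file system):

    import os
    from datetime import datetime

    def file_mac_filter(root):
        """Yield normalized (timestamp, source, item, detail) rows for the
        modified/accessed/created times of every file under 'root'."""
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue
                for detail, secs in (("modified", st.st_mtime),
                                     ("accessed", st.st_atime),
                                     ("created",  st.st_ctime)):
                    yield (datetime.fromtimestamp(secs).isoformat(),
                           "file_mac", path, detail)

    # Rows from any filter (file MACs, Registry, Event Logs, ...) all go into
    # the same events table:
    # conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)",
    #                  file_mac_filter("C:/case1/export"))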

Thoughts? I know for my own part, I still have a lot of thinking to do regarding such things as anomaly detection and anti-forensics. However, I think that this may be best handled by discussion within the community.

Addendum 21 Sept: As I think about this more and more, and even go so far as to draw out diagrams on a sheet of paper, trying to conceptualize what a "timeline" should look like, I'm even more convinced that a scatter plot isn't the way to go. Why? Well, a horizontal line (representing a time scale) with a bunch of little dots is meaningless...it has no context. Even if you gave different sources separate icons, and even color-coded them, without more information, it can be useless. Let's say you have a cluster of events around a particular time...Registry key LastWrite times, file last access times, and event records. Well, this could be a boot event. But you don't know that until you dig into things and take a look at the event records from the Event Logs.

Somehow I think that a scatter plot in which each of the dots has some identifying information would be just too much...the graph would be far too busy.

Something that may be of value in this effort is something like fe3d. Yes, I know that it's intended to provide visualization for nmap scans, but I think that something like this, with modifications, would be of value. Take a look at some of the screenshots and try to imagine how to map timeline information to this sort of representation. Of course, there are other freeware visualization tools out there...something else may be easier to use.

I will say this...one of the things that caught my eye with fe3d is the different nodes. Given information dumped into a database, one could include sources from other systems, such as Event Log records from a file server, or data from an IDS, or even firewall logs or syslog data...and have that timeline information represented on a different node, but correlated at the same time scale as the system(s) you're investigating.
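If each of those rows also carried a host (or node) column, the cross-system correlation becomes just another query against the same time scale...something along these lines (again, only a sketch against the hypothetical schema above):

    # Assuming the events table gains a 'host' column, a cross-node view of a
    # window of interest is a single query ordered on the shared time scale:
    def multi_node_window(conn, lo, hi):
        sql = """SELECT host, ts, source, item, detail
                 FROM events
                 WHERE ts BETWEEN ? AND ?
                 ORDER BY ts, host"""
        return conn.execute(sql, (lo, hi)).fetchall()

    # ...each host is then rendered as its own node/track against the same
    # time axis as the system(s) being examined.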

3 comments:

Anonymous said...

One thing I've always wanted, and have yet to find, in any timeline analysis is an easy way to do statistical analysis. Things like deviance from a moving average or other indicators that might hint to "unusual" activity. For things like incident response, I imagine a tool that could quickly scan a dataset for outliers or large deviances could increase an examiner's ability to quickly isolate the significant events. However, the amount of time necessary to analyze the results needs to be a lot less than the amount of time to analyze the data manually.

If you were to aggregate every timestamp in a computer's forensically available life into a single dataset, how strong would the correlation with a regression line be? Would things like malicious incidents show a strong deviance from the line? Or, if the dataset can't be fitted, is there a certain class of events that causes the correlation to fail?

I'd almost like to see something not done in HTML. Something like that would mean you'd need some sort of web server to process the requests and interface with the scripting language. I like the idea of Java better. Or even C/C++...contrary to popular belief, they can be very platform-independent languages; it just takes quite a bit of extra work and experience to know what is and isn't a portable method/library (the Mozilla products are good examples). This is why I think Java might be a better language, at least to prototype in.

Requiring a web server be installed on the examiner's workstation just means one more application, as well as one more point of failure should the web server bind to an externally facing interface and be inadequately firewalled.

H. Carvey said...

Ryan,

Things like deviance from a moving average or other indicators that might hint to "unusual" activity.

"Moving average" of what?

I'm interested in your idea, but I'm having some trouble conceptualizing it.

If you were to aggregate every timestamp in a computer's forensically available life into a single dataset, how strong would the correlation with a regression line be?

Let's say you image a machine, and you know that the corporate user has access to a file server, and uses Word and Excel. The last access time on the applications is just that...the last time the application was accessed. You'll get MAC times from the various Word docs and Excel spreadsheets, but I'm not seeing where you'd be able to develop a deviance or a moving average...because you don't have any historical data from which to create an average.

Can you elaborate? Perhaps provide a concrete example? I think you may be on to something here...

Anonymous said...

Sorry, it's always more clear in my head. :)

Let's make it simple and only take the last accessed times on a system. For simplicity's sake, convert them to seconds, or 10ths of a second if you want to use a 64-bit value. Sort the list from least to greatest and then convert each entry to a 2-tuple, with its index number as the major axis:

tuple_N = (N, time_N)

Now, if the system was constantly being used, you'd expect that new files would be touched, and eventually re-touched after a given period. However, over that period it might be reasonable to expect the dataset of (x, y) coordinates to fit an x=y linear regression line. That is to say, as you progress through time, files are getting accessed evenly over the period...every 5 minutes you see another 10 files being accessed. Averaged out, this could fit to some linear regression line.
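In code it might look something like this (plain Python, just to illustrate; 'times' is the sorted list of last accessed times in seconds):

    def fit_line(times):
        """Least-squares fit of the sorted timestamps against their index:
        the points are (i, times[i]) for i = 0..n-1. Returns slope, intercept."""
        n = len(times)
        mean_x = (n - 1) / 2.0
        mean_t = sum(times) / float(n)
        sxx = sum((i - mean_x) ** 2 for i in range(n))
        sxt = sum((i - mean_x) * (t - mean_t) for i, t in enumerate(times))
        slope = sxt / sxx
        intercept = mean_t - slope * mean_x
        return slope, intercept

    def residuals(times):
        """Distance of each point from the fitted line; large residuals mark
        the bursts or gaps of activity worth a closer look."""
        slope, intercept = fit_line(times)
        return [t - (slope * i + intercept) for i, t in enumerate(times)]

    # times = sorted(last_access_times_in_seconds)
    # outliers = [i for i, r in enumerate(residuals(times)) if abs(r) > threshold]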

Now, let's say the computer isn't being constantly used, since people (in a work environment) obviously don't do much at night other than a virus scan. Instead of fitting to a linear x=y line, it might behave more like a periodic function. Instead of getting into things like quadratic or quartic regressions, I thought another way to model the "normal" data would be with a moving average. Let's say during the day you access lots of files; this would cause your moving average to go up. Then at night, when your access patterns are really low, your moving average would likely go down.

Now, here's how this all might be able to help: let's say one night someone's nightly moving average skyrockets to three times its normal value. Obviously something unusual is going on with the computer. This could be counted as a "statistical anomaly" needing further investigation. It could be that this was the night the IT team was updating people's machines, or it could be that the computer was rooted.
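A rough sketch of that idea: count events per bucket, keep a moving average over the previous buckets, and flag any bucket that jumps well above it (the one-hour bucket, 24-bucket window, and 3x threshold below are just placeholder numbers):

    from collections import Counter

    def flag_spikes(timestamps, bucket_secs=3600, window=24, factor=3.0):
        """Count events per bucket (default: per hour), track a moving average
        over the previous 'window' buckets, and flag any bucket whose count
        exceeds 'factor' times that average."""
        counts = Counter(int(t) // bucket_secs for t in timestamps)
        if not counts:
            return []
        flagged, history = [], []
        for bucket in range(min(counts), max(counts) + 1):
            count = counts.get(bucket, 0)
            if history:
                avg = sum(history) / float(len(history))
                if avg > 0 and count > factor * avg:
                    flagged.append((bucket * bucket_secs, count, avg))
            history.append(count)
            if len(history) > window:
                history.pop(0)
        return flagged

    # flag_spikes(all_timestamps_in_seconds) -> [(bucket_start, count, avg), ...]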

With things like files you might have a shorter moving average of only a day, but with other sources of historical data you might be able to use a longer moving average of a month.

As you point out, you're right, you won't get a lot of historical data from the last accessed times of Word documents or Excel spreadsheets. However, you can get lots of historical data from temporary files...say, auto-saves from Word, or Temporary Internet Files, or, getting away from files, Event Log dates. When you pool all the dates and times available on the system, I'm curious what sort of patterns can be established. We are creatures of habit, and I think those habits will extend into the forensic information; with the right tool, we can make deviations from that pattern stand out as clearly as black vs. white.

I hope that clarified things a little bit, otherwise let me know and I’ll see if I can’t explain it out a little more.