Well, now the topic of tool validation within DFIR has popped up again, so maybe it's time to address it yet again.
So, early on in my involvement in the industry, yes, running two or more tools against a data source was considered by some to be a means of tool validation. However, over time and as experience and knowledge grew, the fallacy of this approach became more and more apparent.
When I first left active duty, one of my first roles in the private sector was performing vulnerability assessments. For the technical aspect (we did interviews and reviewed processes, as well) we used ISS's Internet Scanner, which was pretty popular at the time. When I started, my boss, a retired Army Colonel, told me that it would take me "about 2 to 3 years of running the tool to really understand it." Well, within 6 months, I was already seeing the need to improve upon the tool and started writing an in-house scanner that was a vast improvement over the commercial tool.
The reason for this was that we'd started running into questions about how the tool did it's "thing". In short, it would connect to a system and run a query, process the results, and present a "decision" or finding. We started to see disparities between the tool findings and what was actually on the system, and as we began to investigate, we were unable to get any information regarding how the tool was making its decisions...we couldn't see in to the "black box".
For example, we scanned part of an infrastructure, and the tool reported that 21 systems had AutoAdminLogon set. We also scanned this infrastructure with the tool were developing, and found that one system had AutoAdminLogon set; the other 20 systems had the AutoAdminLogon value, but it was set to "0", and the admin password was not present in the Registry (in plain text). So, technically, AutoAdminLogon was not set on all 21 systems, and the customer knew it. Had we delivered the report based solely on the tool output, we would've had a lot of 'splainin' to do.
When I was working for a consulting company and waiting for a gubmint clearance to come through, I heard a senior analyst tell one of the junior folks that Nessus would determine the version of Windows it was running against by firing of nmap-like scans, and associating the responses it received to the drivers installed in various versions of Windows. I thought this was fascinating and wanted to learn more about it, so I downloaded Nessus and started opening and reading the plugins. It turned out that Nessus determined the version of Windows via a Registry query, which meant that if Nessus was run without admin credentials, it wasn't going to determine the version.
While I was part of the IBM ISS X-Force ERS team, and we were conducting PCI forensic response, Chris and I found that our process for scanning for credit card numbers had a significant flaw. We were using EnCase at the time, which used a built-in function named "IsValidCreditCard()". We had a case where JCB and Discover credit cards had reportedly been used, but we weren't seeing any in the output of our scanning process. So, we obtained test data from the brands, and ran our scan process across just that data, and still got no results. It turned out that even with track 1 and track 2 data, the IsValidCreditCard() function did not determine JCB and Discover cards to be "valid". So, we worked with someone to create a replacement function to override the built-in one; our new function had 7 regular expressions, and even though it was slower than the built-in function, it was far more accurate.
Finally, in the first half of 2020, right as the pandemic kicked off, I was working with a DFIR consulting team that used a series of open source tools to collect, parse, and present DFIR data to analysts. MS had published a blog post on human-operated ransomware, and identified a series of attacks where WMI was used for persistence; however, when the team encountered such attacks, they were unable to determine definitively if WMI was used for persistence, as the necessary data source had been collected, but the open source middleware that parsed the data did not include a module for parsing that particular data source. As such, relying on the output of the tool left gaps that weren't recognized.
So, what?
So, what's the point of reliving history? The point is that tool validation is very often not about running two (or more) different tools and seeing which presents "correct" results. After all, different tools may use the same API or the same process under the hood, so what's the difference? Running two different tools that do the same thing isn't running two different tools, and it's not tool validation. This is particularly true if you do not control the original data source, which is how most DFIR analysis works. 
As alluded to in the examples above, tool validation needs to start with the data source, not the tool. What data source are you working with? A database? Great, which "style"? mySQL? SQLite? EDB? How does the tool you're using work? Is it closed source, or open source? If it's open source, and written in Python or Perl or some other interpreted language, what are the versions of the libraries on which the tool is based? I've seen cases where a script will not throw an error, but did present different data based on the library used.
This is also why I strongly recommend that analysts validate their findings. Very often there are direct or indirect artifacts in other data sources that are part of an artifact constellation that serves to validate a finding; if you run your tool and see something "missing" or added, you can then look to the overall constellation and circumstances to determine why that might be.

Andreas,
ReplyDeleteThanks. I'll take a look shortly...