Windows Incident Response: Consistency

I've worked a lot of places over the years, all for varying lengths of time. While this worked against me in the early days, with potential employers wondering why I didn't stay longer at my previous employer, and wondering how long I'd potentially stay with them, this became less of an issue later in my career.

During my career in the private sector, I've run vulnerability assessments and spent over 26 years in digital forensics and incident response, some in FTE roles, and much more in consultant roles.

In 2006, I started in a DFIR consulting role at ISS, which evolved 6 months later when IBM's purchase of the company was completed. I then became a "plank owner" of the IBM ISS X-Force ERS team, one of the original members of the team, even before we expanded. When I started, I was provided with a complete outfitting of equipment, including (but not limited to) write-blockers, cabling, laptops, and dongles for EnCase 4.22, EnCase 6.19, and FTK. As our team grew in size, everyone received similar (albeit updated, in some cases) equipment. When I started at ISS, I was one of 4 responders, and we each did our own thing when it came to analysis. There was little in the way of cross-pollination, sharing of experiences, etc. As the new team began to grow, it was a while before some of us saw the need for consistency across the team.

In 2007, members of our team became certified to conduct PCI forensic investigations, which were subject to very stringent (and somewhat arbitrary) timelines. As part of this, Chris and I developed a process that we shared with all of the team members, using EnCase to conduct all of the searches required by Visa (driving the whole PCI effort at the time), not just the mandatory credit card number searches. The idea was, in part, to remove the need for individual analysts to figure out what to do, by giving them a common, documented step-by-step process for completing all of the required activities in a consistent manner. This way, if issues arose, they were easier to troubleshoot. More importantly, having a consistent process meant that there was less room for guesswork, and we had confidence that as long as the process was followed, we'd be able to meet our obligations regarding timeliness. This also left more time for analysts to uncover things like initial access, and other pertinent information, because the guesswork of "what to do next" in order to meet Visa's requirements was no longer something analysts needed to concern themselves with.

In 2013, I started at <company>, on a team that was already well-established. This team was responsible for developing and actively employing the EDR technology used by the company, and this was used to drastically reduce the scoping of "targeted threat actor" incidents, at which point, triage or full forensics of specific systems could take place. For example, one incident involved 15,000 endpoints in a global infrastructure, and we found that the threat actor had "touched" 8 of the systems, and "been on" only 2. Few members of the team, at the time, had actual hands-on experience with truly in-depth DF work, and there was very little in the way of sharing of tools, techniques, and processes between analysts. There was no documentation, little cross-pollination, and some issues with consistency in the use of the analysis framework. Every now and then, someone might share a tidbit here and there, but different analysts had different ways of using the framework. For example, one analyst might tag something they hadn't seen before as "unknown", where another would tag it as definitely "malicious", with the thought of going back and researching those items a bit more...and often, they didn't. They remained marked the way they were, so when those same indicators showed up on another engagement for another analyst, it was pretty much guaranteed that you'd see a mix of "unknown", "malicious", and "benign" from previous engagements.

In 2020, I started at a DF/IR consulting company, and spent my first week on-boarding at headquarters. During that time, I made an effort to start engaging with DF analysts, in part to see what tools they were using, and how they were using them. What I found was that almost all of the analysts had at least 4 dongles/licenses to commercial products, and in some cases, a total of 5 (or more). And every analyst had their own way of doing things; one analyst described what they did as an "art". I sat with one analyst as they walked me through how they used one of the tools, selecting various artifacts to be parsed. They'd been doing this for some time, and for every case, they'd go in and manually select each item, from memory. We spent a few minutes working on it together, and we found that the application had a way for analysts to save a set of items in a profile, so that the same items were selected every time.

Unfortunately, I returned home at the end of that first week, just as the pandemic lockdown was kicking off, and just shy of three months into my tenure, I was laid off, along with others. It took time, but I found an amazing role as the director of the internal SOC for a large consulting company, one that had three levels of SOC analyst. The highest, L3, were responsible for deeper forensic analysis of incidents that had been escalated up by the prior two levels. During my time there, while there were some tools discussed, what became clear to me was that each of the analysts had their own way of doing things, and there was very little in the way of documentation (case notes, etc.) or cross-pollination, and there was limited effort to develop a "best practices" approach to analysis, or just something that was employed consistently across the team. As a result, when a "finding" was added to a ticket or report, there was no clear understanding as to how that finding was developed, and when the "how" was dug into, often the results would be different depending upon the analyst.

I had seen time and again in the tickets that analysts were collecting active memory from reported endpoints, and running strings on the memory dump in an effort to locate indications of the use of IP addresses. I wrote up a message to the team, describing why this was an incorrect approach, and that what was expected is that they'd use Volatility and bulk_extractor to get the necessary information. I also provided a technical description as to why they needed to do this, and why they needed to use both methods.
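The technical description itself isn't reproduced here, but one commonly cited reason is easy to illustrate: network structures and packets in memory hold IPv4 addresses as 4 packed bytes, not as dotted-quad text, so a strings-style search never sees them. The sketch below is only an illustration of that point, not the message I sent; the "dump" is a hypothetical handful of bytes, and the function name is made up for the example.

```python
import re
import socket

# Hypothetical fragment of a memory dump: one address appears as ASCII text
# (e.g. in a log line), the other only as 4 packed bytes, the way it would
# appear in a network structure or packet header.
ascii_ip = "192.168.1.10"
packed_only_ip = "10.1.2.3"
dump = (
    b"GET /index.html from " + ascii_ip.encode()
    + b"\x00\x00" + socket.inet_aton(packed_only_ip) + b"\x00"
)

def strings_style_ips(data: bytes) -> set:
    """Mimic 'strings | grep' by matching dotted-quad ASCII text only."""
    pattern = re.compile(rb"(?:\d{1,3}\.){3}\d{1,3}")
    return {m.group().decode() for m in pattern.finditer(data)}

found = strings_style_ips(dump)

# The packed address is present in the data, but not as printable text.
packed_present = socket.inet_aton(packed_only_ip) in dump
```

Here `found` contains only the ASCII address, even though `packed_present` is true. Tools that parse the relevant structures or carve packets are built to recover the second kind; a text search is not.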

What's common across both organizations and time is that without some modicum of consistency in processes, what you're left with is an inefficient, error-prone mess that, thankfully, cannot be replicated. Regardless of whether it's DF or SOC work, if everyone has their own way of doing things, and there are no common, consistent processes, and no way to develop and operationalize corporate/institutional knowledge across the entire analyst base, then engagement or ticket completion times will vary, accuracy will vary and be difficult to assess, some analysts will spend a great deal of time chasing down rabbit holes, and other analysts will move very quickly to incorrect answers and findings. In some cases, you'll notice that analysts will spackle over gaps in analysis with guesses and assumptions.

The other thing you'll notice about these organizations, and ones like them, is that where there's little consistency, there's also little to no oversight. That is, there's no one monitoring output, no one doing "QA" or "QC". For many customer organizations, this means that what they receive is also going to be inconsistent, and the overall strategic issues, the things that could really protect them, will be missed.

Having consistency and developing processes isn't about restricting what analysts can do. Rather, it's about taking those aspects of analysis that appear regularly, finding the "best" way of conducting the parsing, and then, if possible, automating not just the parsing, but also the enrichment and decoration of that data. After all, these are things you're going to do anyway, right? If you find an IP address, you're going to look it up on some platform, such as VirusTotal, so why not automate it? Why not use CPU cycles to parse out Windows Event Logs, extract IP addresses from login (and other) events, do some form of look-up, and then tag that entry in some way with what you find? Doing all of this automatically doesn't take anything away from an analyst, other than a great deal of very manual work, and provides them with additional time to focus on actual analysis.
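As a minimal sketch of that extract, look-up, and tag idea, assume the event records have already been parsed into dictionaries with a `message` field. Everything named here is hypothetical: `KNOWN_IPS` is a local table standing in for a real look-up platform, and the addresses come from documentation ranges, not from any real case.

```python
import ipaddress
import re

# Hypothetical local reputation table standing in for a real look-up
# service; the entries are illustrative only.
KNOWN_IPS = {"203.0.113.7": "malicious", "198.51.100.20": "benign"}

# RFC 1918 ranges, listed explicitly so the "internal" tag is unambiguous.
INTERNAL_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_ips(text: str) -> list:
    """Return the valid IPv4 addresses in a message, in order of appearance."""
    ips = []
    for candidate in IP_PATTERN.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # e.g. 999.1.1.1 looks like an address but is not one
        ips.append(candidate)
    return ips

def decorate(event: dict) -> dict:
    """Return a copy of the event with a tag for each IP address found."""
    tags = {}
    for ip in extract_ips(event["message"]):
        addr = ipaddress.IPv4Address(ip)
        if any(addr in net for net in INTERNAL_NETS):
            tags[ip] = "internal"
        else:
            tags[ip] = KNOWN_IPS.get(ip, "unknown")
    return {**event, "ip_tags": tags}

events = [
    {"event_id": 4624, "message": "Logon from 203.0.113.7 to host 10.0.0.5"},
    {"event_id": 4625, "message": "Failed logon from 192.0.2.44"},
]
decorated = [decorate(e) for e in events]
```

An address with no entry is tagged "unknown" rather than guessed at, so every analyst who sees it later starts from the same place.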

Next Steps
So here's what you do...write down the steps you take. Make it repeatable. Make it so that 6 months from now, you or someone else can follow the same steps on the same data and get the same results. Repeatability leads to consistency, and if it's written down, you don't forget steps. You don't forget to collect Prefetch files because the last 6 systems you looked at were servers and didn't have Prefetch files. You don't forget to parse the AmCache.hve file.

Once you've written it down, you have some place to start. Now, notice that I said, "write it down", but not that it's written in stone, not that it's immutable. You have a baseline now, and you can deviate from it, in part by adding justification for that deviation to your process. It's easy..."I collected and parsed the PCA files because this was the first Windows 11 system I'd seen in X months." Boom. Done.
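A written baseline with justified deviations can be as simple as a list plus a log. The following is a rough sketch, not a tool I'm describing; the class and method names are made up for illustration, and the two baseline steps are just the examples mentioned above.

```python
from dataclasses import dataclass, field

# Baseline steps taken from the examples in the text; a real list is longer.
BASELINE = ("collect Prefetch files", "parse AmCache.hve")

@dataclass
class CaseProcess:
    """The written-down baseline plus any justified deviations for one case."""
    steps: list = field(default_factory=lambda: list(BASELINE))
    deviations: list = field(default_factory=list)

    def _require(self, justification: str) -> str:
        # A deviation without a written reason is not allowed.
        if not justification.strip():
            raise ValueError("a deviation needs a written justification")
        return justification.strip()

    def add_step(self, step: str, justification: str) -> None:
        reason = self._require(justification)
        self.steps.append(step)
        self.deviations.append(("added", step, reason))

    def skip_step(self, step: str, justification: str) -> None:
        reason = self._require(justification)
        self.steps.remove(step)  # raises ValueError if the step is not listed
        self.deviations.append(("skipped", step, reason))

case = CaseProcess()
case.add_step("collect and parse PCA files", "first Windows 11 system seen in months")
case.skip_step("collect Prefetch files", "server with no Prefetch files present")
```

The baseline itself never changes for a single case; the deviation log records what was different and why, which is what someone needs 6 months later.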

With it written down, how much can you automate? If you mount the acquired image, or put the triage data in a specific folder structure, is this something you can easily do every time? Yes? Okay, so then, can you automate your parsing process? If the data is in the same location every time, can you automate the process so that a single command or button press can get everything parsed?
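One way to picture the "single command" is a table that maps each artifact to its expected location under the mount point or triage folder, and a loop that runs a parser over each one. This is a sketch under assumptions: the `ARTIFACTS` table and `run_all` function are hypothetical names, and `placeholder_parser` only counts files where a real parser would go.

```python
from pathlib import Path

# Assumed fixed layout, relative to the mounted image or triage folder.
ARTIFACTS = {
    "prefetch": "Windows/Prefetch",
    "amcache": "Windows/AppCompat/Programs/Amcache.hve",
    "event_logs": "Windows/System32/winevt/Logs",
}

def placeholder_parser(path: Path) -> int:
    """Stand-in for a real parser: count the files that would be parsed."""
    if path.is_file():
        return 1
    return sum(1 for p in path.rglob("*") if p.is_file())

def run_all(root: Path) -> dict:
    """Run every parser against the fixed layout and report what was missing."""
    results = {}
    for name, relative in ARTIFACTS.items():
        target = root / relative
        if target.exists():
            results[name] = {"status": "parsed", "files": placeholder_parser(target)}
        else:
            # Record the gap explicitly instead of silently skipping it.
            results[name] = {"status": "missing", "files": 0}
    return results
```

Calling `run_all(Path(...))` with the folder you always use gives the same set of results every time, and a "missing" entry is a visible prompt to note why the artifact wasn't there.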

At one point in my career, I'd thought that having a lab intake process was the way to go...acquired images, laptops, or just hard drives could be sent to the lab, where the intake technician would take care of connecting the device/data, running through the parsing, extraction, enrichment and decoration process, and then inform the analyst when the results were ready. At that point, the analyst would have access to the parsed data, as well as the original "evidence", if needed. But they wouldn't have to do the processing themselves, and there wouldn't be a bunch of missed steps, and gaps in the reporting.

Let's assume that the automation is complete, to some extent. It doesn't have to be fancy, and it doesn't have to be everything at this point. Now that we've got stuff written down, we have something we can build on, that we can add to and extend. Not only can we add other data sources, but we can add things like enrichment and decoration, from external sources or from previous incidents. Now that we've got some modicum of automation, a lot of the manually-intensive "heavy lifting" is being done by the computer. We have a more efficient, less error-prone process that frees up a LOT of time for things like actual analysis, lessons learned development, and baking those things we learn back into the process.

Saturday, June 27, 2026
