Windows Incident Response: April 2023

Thursday, April 27, 2023

Program Execution

By now, I hope you've had a chance to read and consider the posts I've written discussing the need for validation of findings (third one here). Part of the reason for this series was a pervasive over-reliance on single artifacts as a source of findings that I and others have seen within the community over the past 2+ decades. One of the most often repeated examples of this is relying on ShimCache or AmCache artifacts as evidence of program execution.

ShimCache
ShimCache, or AppCompatCache (the name of the Registry value where the data is found) is often looked to as evidence of program execution when really what it demonstrates is that the file was on the system.

From this blog post from Mandiant:

It is important to understand there may be entries in the Shimcache that were not actually executed.

There you go. That's from 2015. And this is why we need to incorporate artifacts such as the ShimCache into an overall constellation, rather then viewing artifacts such as these in isolation. This 13Cubed video provides a clear explanation regarding the various aspects of the ShimCache artifact as it relates to Windows 10; note that the title of the video includes "the most misunderstood artifact".

AmCache
AmCache is another one of those artifacts that is often offered up as "evidence of program execution", as seen in this LinkedIn post. However, the first referenced URL in that post belies the fact that this artifact is "evidence of program execution", as well as other statements in the post (i.e., that AmCache is "populated after system shutdown"). From the blog post:

During these tests, it was found that the Amcache hive may have artifacts for executables that weren’t executed at all.

A bit more extensive treatment of the AmCache artifact can be found here. While you may look at the PDF and think, "TL;DR", the short version is that an entry in the AmCache does not explicitly mean, by itself, that the file was executed.

The point is that research demonstrates that, much like the ShimCache artifact, we cannot simply look at an entry and state, "oh, that is evidence of program execution". Even if you don't want to take the time reach and digest either the blog post or the PDF, simply understand that by itself, an AmCache entry does not demonstrate evidence of program execution.

So, again...let's all agree to stop looking just to ShimCache or just to AmCache as evidence of program execution, and instead look to multiple data sources and to artifact constellations to establish whether a program was executed or not.

For some insight as to how ShimCache and AmCache can be used together, check out this blog post from WithSecure.

Keep in mind that even when combining these two artifacts, it still doesn't provide clear indications that the identified executable was launched, and successfully executed. We need to seek other artifacts (Windows Event Log, Registry, etc.) to determine this aspect of the executable.

PCA
Earlier this year, AboutDFIR.com published a blog post regarding a new artifact (new to Windows 11) that appears to demonstrate evidence of program execution. Much like other artifacts (see above), this one has nuances or conditions, in that you cannot look to it to demonstrate execution of all programs; rather this one seems to apply to either GUI programs, or CLI program launched via a GUI. This is important to remember, whether you see an application of interest listed in one of the artifacts, or if you don't...context matters.

The blog post provides insight into the artifacts, as well as images of the artifacts, and samples you can download and examine yourself. This YouTube video mentions another associated artifact; specifically, Windows Event Log records of interest. Adding the artifacts to a timeline is a pretty trivial exercise; the text-based artifacts are easy to script, and the process for adding Windows Event Logs to a timeline is something that already exists.

Monday, April 24, 2023

New Events Ripper Plugins

I recently released four new Events Ripper plugins, mssql.pl, scm7000.pl, scm7024.pl and apppopup26.pl.

The mssql.pl plugin primarily looks for MS SQL failed login events in the Application Event Log. I'd engaged in a response where we were able to validate the failed login attempts first in the MS SQL error logs, but then I learned that the events are also listed in the Windows Event Log, specifically the Application Event Log, and I wanted to provide that insight to the analyst.

The plugin lists the usernames attempted and the frequency of each, as well as the source IP address of the login attempts and their frequency. In one instance, we saw almost 35000 failed login attempts, from 4 public IP addresses, three of which were all from the same class C subnet. This not only tells a great deal about the endpoint itself, but also provides significant information that the analyst can use immediately, as well as leverage as pivot points into the timeline. The plugin does not yet list successful MS SQL logins because, by default, that data isn't recorded, and I haven't actually seen such a record.

The plugin also looks for event records indicating settings changes, and lists the settings that changed. Of specific interest is the use of the xp_cmdshell stored procedure.

So, why does this matter? Not long ago, AhnLab published an article stating that they'd observed attacks against MS SQL servers resulting in the deployment of Trigona ransomware.

The scm7000.pl plugin locates "Service Control Manager/7000" event records, indicating that a Windows service failed to start. This is extremely important when it comes to validation of findings; just because something (i.e., something malicious) is listed as a Windows service does not mean that it launches and runs every time the endpoint is restarted. This is just as important to understand, alongside Windows Error Reporting events, AV events, application crash events, etc. This is why we cannot treat individual events or artifacts in isolation; events are in reality composite objects, and provide (and benefit from) context from "nearby" events.

The scm7024.pl plugin looks for "Service Control Manager/7024" records in the System Event Log, which indicate that a service terminated.

The apppopup26.pl plugin looks for "Application Popup/26" event records in the Application Event Log, and lists the affected applications, providing quick access to pivot points for analysis. If an application of interest to your investigation is listed, the simplest thing to do is pivot into the timeline to see what other events occurred "near" the event in question. Similar to other plugins, this one can provide indications of applications that may have been on the system at one point, and may have been removed.

Events Ripper has so far proven to be an extremely powerful and valuable tool, at least to me. I "see" something, document it, add context, analysis tips, reference, etc., and it becomes part of an automated process. Sharing these plugins means that other analysts can benefit from my experiences, without having to have ever seen these events before.

The tool is described here, with usage information available here, as well as via the command line.

On Validation, pt III

From the first two articles (here, and here) on this topic arises the obvious question...so what? Not
validating findings has worked well for many, to the point that the lack of validation is not recognized. After all, who notices that findings were not verified? The peer review process? The manager? The customer? Given just the fact how pervasive training materials and processes are that focus solely on single artifacts in isolation should give us a clear understanding that validating findings is not a common practice. That is, if the need for validation is not pervasive in our industry literature, and if someone isn't asking the question, "...but how do you know?", then what leads us to assume that validation is part of what we do?

Consider a statement often seen in ransomware investigation/response reports up until about November 2019; that statement was some version of "...no evidence of data exfiltration was observed...". However, did anyone ask, "...what did you look at?" Was this finding (i.e., "...no evidence of...") validated by examining data sources that would definitely indicate data exfiltration, such as web server logs, or the BITS Client Event Log? Or how about indirect sources, such as unusual processes making outbound network connections? Understanding how findings were validated is not about assigning blame; rather, it's about truly understanding the efficacy of controls, as well as risk. If findings such as "...data was not exfiltrated..." are not validated, what happens when we find out later that it was? More importantly, if you don't understand what was examined, how can you address issues to ensure that these findings can be validated in the future?

When we ask the question, "...how do you know?", the next question might be, "...what is the cost of validation?" And at the same time, we have to consider, "...what is the cost of not validating findings?"

The Cost of Validation

In the previous blog posts, I presented "case studies" or examples of things that should be considered in order to validate findings, particular in the second article. When considering the 'cost' of validation, what we're asking is, why aren't these steps performed, and what's preventing the analyst from taking the steps necessary to validate the findings?

For example, why would an analyst see a Run key value and not take the steps to validate that it actually executed, including determining if that Run key value was disabled? Or parse the Shell-Core Event Log and perhaps see how many times it may have executed? Or parse the Application Event Log to determine if an attempt to execute the program pointed to resulted in an application crash? In short, why simply state that program execution occurred based on nothing more than observing the Run key value contents?

Is it because taking those steps is "too expensive" in terms of time or effort, and would negatively impact SLAs, either explicit or self-inflicted? Does it take too long do so, so much so that the ticket or report would not be issued in what's considered a "timely" manner?

Could you issue the ticket or report in order to meet SLAs, make every attempt to validate your findings, and then issue an updated ticket when you have the information you need?

The Cost of Not Validating
In our industry, an analyst producing a ticket or report based on their analysis is very often well abstracted from the final effects, based on decisions made and resources deployed due to their findings. What this means is that whether in an internal/FTE or consulting role, the SOC or DFIR analyst may not ever know the final disposition of an incident and how that was impacted by their findings. That analyst will likely never see the meeting where someone decides either to do nothing, or to deploy a significant staff presence over a holiday weekend.

Let's consider case study #1 again, the PCI case referenced in the first post. Given that it was a PCI case, it's likely that the bank notified the merchant that they were identified as part of a common point of purchase (CPP) investigation, and required a PCI forensic investigation. The analyst reported their findings, identifying the "window of compromise" as four years, rather than the three weeks it should have been. Many merchants have an idea of the number of transactions they send to the brands on a regular basis...for smaller merchants, it may be a month, and for larger vendors, a week. They also have a sense of the "rhythm" of credit card transactions; some merchants have more transactions during the week and fewer on the weekends. The point is that when the PCI Council needed to decide on a fine, they take the "window of compromise" into account.

During another incident in the financial sector, a false positive was not validated, and was reported as a true positive. This led to the domain controller being isolated, which ultimately triggered a regulatory investigation.

Consider this...what happens when you tell a customer, "OMGZ!! You have this APT Umpty-Fratz malware running as a Windows service on your domain controller!!", only to later find out that every time the endpoint is restarted, the service failed to start (based on "Service Control Manager/7000" events, or Windows Error Reporting events, application crashes, etc.)? The first message to go out sounds really, REALLY bad, but the validated finding says, "yes, you were compromised, and yes, you do need a DFIR investigation to determine the root cause, but for the moment, it doesn't appear that the persistence mechanism worked."

Conclusion
So, what's the deal? Are you validating findings? What say you?

Sunday, April 16, 2023

On Validation, pt II

My first post on this topic didn't result in a great deal of engagement, but that's okay. I wrote the first post with part II already loaded in the chamber, and I'm going to continue with this topic because, IMHO, it's immensely important.

I've see more times than I care to count findings and reports going out the door without validation. I saw an analyst declare attribution in the customer's parking lot, as the team was going on-site, only to be proven wrong and the customer opting to continue the response with another team. Engagements such as this are costly to the consulting team through brand damage and lost revenue, as well as costly to the impacted organization, through delays and additional expenses to reach containment and remediation, all while a threat actor is active on their network.

When I sat down to write the first post, I had a couple more case studies lined up, so here they are...

Case Study #3
Analysts were investigating incidents within an organization, and as part of the response, they were collecting memory dumps from Windows endpoints. They had some information going into the investigations regarding C2 IP addresses, based on work done by other analysts as part of the escalation process, as well as from intel sources and open reporting, so they ran ASCII string searches for the IP addresses against the raw memory dumps. Not getting any hits, declared in the tickets that there was no evidence of C2 connections.

What was missing from this was the fact that IP addresses are not employed by the operating system and applications as ASCII strings. Yes, you may see an IP address in a string that starts with "HTTP://" or "HTTPS://", but by the time the operating system translates and ingests the IP address for use, it's converted to 4 bytes, and as part of a structure. Tools like Volatility provide the capability to search for certain types of structures that include IP addresses, and bulk_extractor searches for other types of structures, with the end result being a *.pcap file.

In this case, as is often the case, analyst findings are part of an overall corporate-wide process, a process that includes further, follow-on findings such as "control efficacy", identifying the effectiveness of various controls and solutions within the security tech stack to address situations (prevent, detect, respond to) incidents, and simply stating in the ticket that "no evidence of communication with the C2 IP address was found" is potentially incorrect, in addition to not addressing how this was determined. If no evidence of communications from the endpoint was found, then is there any reason to submit a block for the IP address on the firewall? Is there any reason to investigate further to determine if a prevention or detection control failed?

In the book Investigating Windows Systems, one of the case studies involves both an image and a memory dump, where evidence of connections to an IP address were found in the memory dump that were not found in application logs within the image, using the tools mentioned above. What this demonstrates is that it's entirely possible for evidence to be found using entirely different approaches, and that not employing the full breadth of what an analyst has available to them is entirely insufficient.

Case Study #4
Let's look at another simple example - as a DFIR analyst, you're examining either data collected from an endpoint, or an acquired image, and you see a Run key value that is clearly malicious; you've seen this one before in open reporting. You see the same path/file location, same file name.

What do you report?

Do you report, "...the endpoint was infected with <malicious thing>...", or do you validate this finding?

Do you:
- determine if the file pointed to by the value exists
- determine if the Run key value was disabled <-- wait, what??
- review the Microsoft-Windows-Shell-Core/Operational Event Log to see if the value was processed
- review the Application Event Log, looking for crash dumps, WER or Application Popup records for the malware
- review the Security Event Log for Process Creation events (if enabled)
- review Sysmon Event Log (if available)
- review the SRUM db for indications of the malware using the network

If not, why? Is it too much of a manual process to do so? Can the playbook not be automated through the means or suite you have available, or via some other means?

But Wait, There's More...
During my time as a DFIR analyst, I've seen command lines used to created Windows services, followed by the "Service Control Manager/7045" record in the System Event Log indicating that a new service was installed. I've also seen those immediately followed by a "Service Control Manager/7009" or "Service Control Manager/7011" record, indicating that the service failed to start, rather than the "Service Control Manager/7036" record you might expect. Something else we need to look for, going beyond simply "a Windows service was installed", is to look for indications of Windows Error Reporting events related to the image executable, application popups, or application crashes.

I've seen malware placed on systems that was detected by AV, but the AV was configured to "take no action" (per AV log messages), so the malware executed successfully. We were able to observe this within the acquired image by validating the impacts on the file system, Registry, Windows Event Log, etc.

I've seen threat actors push malware to multiple systems; in one instance, the threat actor pushed their malware to six systems, but it only successfully executed on four of those systems. On the other two, the Application Event Log contained Windows Error Reporting records indicating that there was an issue with the malware. Further examination failed to reveal the other impacts of the malware that had been observed on the four systems that had been successfully infected.

I worked a PCI case once where the malware placed on the system by the threat actor was detected and quarantined by AV within the first few hours it was on the system, and the threat actor did not return to the system for six weeks. It happened that that six weeks was over the Thanksgiving and Christmas holidays, during a time of peak purchasing. The threat actor returned after Christmas, and placed a new malware executable on the system, one that was not detected by AV, and the incident was detected a week later. In the report, I made it clear that while the threat actor had access to the system, the malware itself was not running and collecting credit card numbers during those six weeks.

Conclusion
In my previous post, I mentioned that Joe Slowik referred to indicators/artifacts as 'composite objects', which is something that, as an industry, we need to understand and embrace. We cannot view artifacts in isolation, but rather we need to consider their nature, which includes both being composite objects, as well as their place within a constellation. We need to truly embrace the significance of an IP address, a Run key value, or any other artifact what conducting and reporting on analysis.

Friday, April 07, 2023

Deriving Value From Open Reporting

There's a good bit of open reporting available online these days, including (but not limited to) the annual reports that tend to be published around this time of year. All of this open reporting amounts to a veritable treasure trove of information, either directly or indirectly, that can be leveraged by SOC and DFIR analysts, as well as detection engineers, to extend protections, as well as detection and response capabilities.

Sometimes, open reporting will reference incident response activities, and then focus solely on malware reverse engineering. In these cases, information about what would be observed on the endpoint needs to be discerned through indirect means. However, other open reporting, particularly what's available from TheDFIRReport, is much more comprehensive and provides much clearer information regarding the impact of the incident and the threat actor's activities on the endpoint, making it much easier on SOC and DFIR analysts to pursue investigations.

Let's take a look at some of what's shared in a recent write-up of a ransomware incident that started with a "malicious" ISO file. Right away, we get the initial access vector from the title of the write-up!

Before we jump in, though, we're not going to run through the entire article; the folks at TheDFIRReport have done a fantastic job of documenting what they saw six ways to Sunday, and there's really no need to run through everything in the article! Also, this is not a criticism, nor a critique, and should not be taken as such. Instead, what I'm going to do here is simply expand a bit on a couple of points of the article, nothing more. What I hope you take away from this is that there's a good bit of value within write-ups such as this one, value beyond just the words on paper.

The incident described in the article started with a phishing email, delivering a ZIP archive that contained an ISO file, which in turn contained an LNK file. There's a lot to unravel, just at this point. First off, the email attachment (by default) will have the MOTW attached to it, and MOTW propagation to the ISO file within the archive will depend up on the archival tool used to open it.

Once the archive is opened, the user is presented with the ISO file, and by default, Windows systems allow the user to automatically mount the disk image file by double-clicking it. However, this behavior can be easily modified, for free, while still allowing users to access disk image files programmatically, particularly as part of legitimate business processes. In the referenced Huntress blog post, Dray/@Purp1eW0lf provided Powershell code that you can just copy out of the blog post and execute on your system(s), and users will be prevented from automatically mounting disk image files by double-clicking on them, while still allowing users to access the files programmatically, such as mounting VHD files via the Disk Manager.

Next, Microsoft issued a patch in Nov 2022 that enables MOTW propagation inside mounted disk images files; had the system in this incident been patched, the user would have been presented with a warning regarding launching the LNK file. The section of the article that addresses defense evasion states, "These packages are designed to evade controls such as Mark-of-the-Web restrictions." This is exactly right, and it works...if the archival tool used to open the zip file does not propagate MOTW to the ISO file, then there's nothing to be propagated to from the ISO file to the embedded LNK file, even if the patch is installed.

Let's take a breather here for a second...take a knee. We're still at the initial access point of an incident that resulted in the domain-wide deployment of ransomware; we're at the desk of that one user who received the phishing email, and the malicious actions haven't been launched yet...and we've identified three points at which we could have inhibited (archiver tool, patched system) or obviated (enable programmatic disk image file access only) the rest of the attack chain. I bring this up because many times we hear how much security "costs", and yet, there's a free bit of Powershell that can be copied out of a blog post, that could have been applied to all systems and literally stopped this attack cycle that, according to the timeline spanned 5 days, in its tracks. The "cost" of running Dray's free Powershell code versus the "cost" of an infrastructure being encrypted and ransomed...what do those scales look like to you?

Referencing the malicious ISO file, the article demonstrates how the user mounting the disk image file can be detected via the Windows Event Log, stating that the "activity can be tracked with Event 12 from Microsoft-Windows-VHDMP/Operational" Event Log. Later, in the "Execution" section of the article, they state that "Application crashes are recorded in the Windows Application event log under Event ID 1000 and 1001", as a result of...well...the application crashing. Not only can both of these events be extracted as analysis pivot points using Events Ripper, but the application crashes observed in this incident serve to make my point regarding validation, specifically with respect to analysts validating findings.

The article continues illustrating the impact of the attack chain on the endpoint, referencing several other Windows Event Log records, several of which (i.e., "Service Control Manager/7045" events) are also covered/addressed by Events Ripper.

Conclusion
Articles like this one, and others from TheDFIRReport, are extremely valuable to the community. Where a good bit of open reporting articles will include things like, "...hey, we had 41 Sobinokibi ransomware response engagements in the first half of the year..." but then do an in-depth RE of one sample, with NO host-based impact or artifacts mentioned, articles such as this one do a great job of laying the foundation for artifact constellations, so that analysts can validate findings, and then use that information to help develop protections, detections, and response procedures for future engagements. Sharing this kind of information means that it's much easier for detect incidents like these much earlier in the attack cycle, with the goal of obviating file encryption.

Wednesday, April 05, 2023

Unraveling Rorschach

Checkpoint recently shared a write-up on some newly-discovered ransomware dubbed, "Rorschach". The write-up was pretty interesting, and had a good bit of content to unravel, so I thought I'd share the thoughts that had developed while I read and re-read the article.

From the article, the first things that jumped out at me were:

Check Point Research (CPR) and Check Point Incident Response Team (CPIRT) encountered a previously unnamed ransomware strain...

...and...

While responding to a ransomware case...

So, I'm reading this, and at this point, I'm anticipating some content around things like initial access, as well as threat actor "actions on objectives", as they recon and prepare the environment for the ransomware deployment.

However, there isn't a great deal stated in the article about how the ransomware got on the system, nor about how the threat actor gained access to the infrastructure. The article almost immediately dives into the malware execution flow, with no mention of how the system was compromised. We've seen this before; about 3 yrs ago, one IR consulting firm posted a 25-page write-up (which is no longer available) on Sobinokibi ransomware. The write-up started off by saying that during the first half of the year, the firm had responded to 41 Sobinokibi ransomware cases, and then dove into reverse engineering and analysis of one sample, without ever mentioning how the malware got on the system. As you read through Checkpoint's write-up, one of the things they point out (spoiler alert!!) is the speed of the encryption algorithm...if this is something to be concerned about, shouldn't we look to those threat actor activities that we can use to inhibit or obviate the remaining attack chain, before the ransomware is deployed?

Let's take a look at some other interesting statements from the article...

The ransomware is partly autonomous, carrying out tasks that are usually manually performed during enterprise-wide ransomware deployment...

Looking at the description of the actions performed by the ransomware executable, this is something we very often see in RaaS offerings. In June 2020, I read a write-up of a RaaS offering that included commands using "net stop" to halt 156 Windows services, taking something of a "spray-and-pray" approach; there was not advance recon that determined that those 156 services were actually running in the environment. Checkpoint's list of services the ransomware attempts to stop is much shorter, but similarly, there doesn't seem to be any indication that the list is targeted, that it's based on prior recon of the environment. In short, spray-and-pray, take the "shotgun" approach.

However, a downside of this is that while you may be able to detect it (parent process will by cy.exe, running the "net stop" commands), by that point, it may be too late. You'd need to have software-base response in place, with rules that state, "if these conditions are met on the endpoint, kill the process on the endpoint." Sending an alert to a SOC will be too late; by the time the alert makes it to the SOC console, files on the endpoint will already be encrypted.

The ransomware was deployed using DLL side-loading of a Cortex XDR Dump Service Tool, a signed commercial security product, a loading method which is not commonly used to load ransomware.

While I can't report seeing this used with ransomware specifically, DLL side-loading via a known good application is a technique that has been used extensively. Even going back a decade or more, I remember seeing legit Kaspersky, McAfee, and Symantec apps dropped in a ProgramData subfolder along with a malicious DLL, and launched as a Windows service, or via a Scheduled Task. The question I had at the time, and one that I still have when I see this sort of tactic used is, does anyone notice the legit program? What I mean is, when I've heard an analyst say that they found PlugX launched via DLL side-loading using a legit Kaspersky app, I've asked, "...were Kaspersky products used in the environment?" Most often, this doesn't seem to be a question that's asked. In the case of the Rorschach ransomware, were Palo Alto software products common in the environment, or was this Cortex tool completely new to the environment? Could something like this be used as a preventive or detective technique? After all, if a threat actor takes a tailored approach to the legit application used, deploying something that is common in the target environment and vulnerable to DLL side-loading, this would indicate a heightened level of situational awareness, rather than just "...I'll use this because I know it works."

At one point in the article, the authors state that the ransomware "...clears the event logs of the affected machines...", and then later state, "Run wevutil.exe to clear the the following Windows event logs: Application, Security, System and Windows Powershell." Okay, so we go from "clears the event logs" (implying that all Windows Event Logs are cleared) to stating that only four specific Windows Event Logs are cleared; that makes a difference. The command to enumerate and clear all Windows Event Logs is a pretty simple one-liner, where there are clearly four instances of wevtutil.exe launched, per figure 2 in the article. And why those four Windows Event Logs? Is it because the threat actor knows that their activities will appear in those logs, or is it because the threat actor understands that most analysts focus on those four Windows Event Logs, based on their training and experience? Is the Powershell Event Log cleared because Powershell is used at some point during the initial access or recon/prep phases of the attack, or are these Windows Event Logs cleared simply because the malware author believes that they are the files most often sought by SOC and DFIR analysts?

One article based on Checkpoint's analysis states, "After compromising a machine, the malware erases four event logs (Application, Security, System and Windows Powershell) to wipe its trace", implying that the malware author was aware of the traces left by the malware, and was trying to inhibit response; clearing the Windows Event Logs does not "erase" the event records, but does require extra effort on the part of the responder.

Conclusion

Even though this ransomware/file encrypting executable was found as a result of at least one response engagement, while the analysis of the malware itself is interesting, there's really very little (if any) information in the article regarding how the threat actor gained access to the environment, nor any of the steps taken by the threat actor to recon and prepare the environment prior to deploying the ransomware. From the malware analysis we know a few things that are interesting and useful, but little in the way of what we can do to detect this threat actor early in their attack cycle, allowing defenders to prevent, or detect and respond to the attack.

Monday, April 03, 2023

On Validation

I've struggled with the concept of "validation" for some time; not the concept in general, but as it applies specifically to SOC and DFIR analysis. I've got a background that includes technical troubleshooting, so "validation" of findings, or the idea of "do you know what you know, or are you just guessing", has been part of my thought processes going back for about...wow...40 years.

Here's an example...when setting up communications during Team Spirit '91 (military exercises in South Korea), my unit had a TA-938 "hot line" with another unit. This is exactly what it sounds like...it was a directly line to that other unit, and if one end was picked up, the other end would automatically ring. Yes, a "Bat phone". Just like that. Late one evening, I was in the "SOC" (our tent with all of the communications equipment) and we got a call that the hot line wasn't working. We checked connections, checked and replaced the batteries in the phone (the TA-938 phones took 2 D cell batteries, both facing the same direction), etc. There were assumptions and accusations thrown about as to why the phone wasn't working, as my team and I worked through the troubleshooting process. We didn't work on assumption; instead, we checked, rechecked, and validated everything. In the end, we found nothing wrong with the equipment on our end; however, the following day, we did find out what the issue was - at the other end, there was only one Marine in the tent, and that person had left the tent for a smoke break during the time of the attempted calls.

We could have just said, "oh, it's the batteries...", and replaced them...and we'd have the same issue all over again. Or, we could have just stated, "...the equipment on the other end was faulty/broken...", and we would not have made a friend of the maintenance chief from that unit. There were a lot of assumptions we could have made, conclusions we could have jumped to...and we'd have been wrong. We could have made stated findings that were trusted, and resulted in decisions being made, assets and resources being allocated, etc., all for the wrong reason. The end result is that my team and I (especially me, as the officer) would have lost credibility, and the trust and confidence of our fellow team members, and our commanding officer. As it was, validating our findings led to the right decisions being made, which were again validated during the exercise after action meetings.

Okay, so jump forward 32 years to present day...how does this idea of "validation" apply to SOC and DFIR analysis? I mean, this seems like such an obvious thing, right? Of course we validate our findings...but do we, really?

Case Study #1
A while back, I attended a conference during which one of the speakers walked through a PCI investigation they'd worked on. As the speaker walked through their presentation, they talked about how they'd used a single artifact, a ShimCache entry for the malware, to demonstrate program execution. This single artifact was used as the basis of the finding that the malware had been on the system for four years.

For those readers not familiar with PCI forensic investigations, the PCI Council specifies a report format and "dashboard", where the important elements of the report are laid in a table at the top of the report. One of those elements is "window of compromise", or the time between the original infection and when the breach was identified and remediated. Many merchants track the number of credit card transactions they process on a regular basis, including not only during periods of "regular" spending habits, but also off-peak and peak/holiday seasons, and as a result, the "window of compromise" can give the merchant, the bank, and the brand an approximate number of potentially compromised credit card numbers. As you'd imagine, given any average, the number of compromised credit card numbers would be much greater over a four year span than it would for, say, a three week "window of compromise".

As you'd expect, analysts submitting reports rarely, if ever, find out the results of their work. I was a PCI forensic analyst for about three and a half years, and neither I nor any of my teammates (that I'm aware of) heard what happened to a merchant after we submitted our reports. Even so, I cannot imagine that a report with a "window of compromise" of four years was entirely favorable.

But that begs the question - was the "window of compromise" really four years? Did the analyst validate their finding using multiple data sources? Something I've seen multiple times is that malware is written to the file system, and then "time stomped", often using time stamps retrieved from a native system file. This way, the $STANDARD_INFORMATION attribute time stamps from the $MFT record for the file appear to indicate that the file is "long lived", and has existed on the system for quite some time. This time stomping occurs before the Application Compatibility functionality of the Windows operating system creates an entry for the file, and the last modification time that's recorded for the entry is the one that's "time stomped". As a result, a breach that occurred in May 2013 and was discovered three weeks later ends up having the malware itself being reported as placed on the system in 2009. What impact this had, or might have had on a merchant, is something that we'll never know.

Misinterpreting ShimCache entries has apparently been a time-honored tradition within the DFIR community. For a brief walk-through (with reference links) of ShimCache artifacts, check out this blog post.

Case Study #2
In the spring of 2021, analysts were reporting, based solely on EDR telemetry, that within their infrastructure threat actors were using the Powershell Set-MpPreference module to "disable Windows Defender". This organization, like many others, was tracking such things as control efficacy (the effectiveness of controls) in order to make decisions regarding actions to take, and where and how to allocate resources. However, these analysts were not validating their findings; they were not checking the endpoints themselves to determine if Windows Defender had, in fact, been disabled, and if the threat actor's attempts had actually impacted the endpoints. As it turns out, that organization had a policy at the time of disabling Windows Defender on installation, as they had chosen another option for their security stack. As such, stating in tickets that threat actors were disabling Windows Defender, without validating these findings, led to quite a few questions, and impacted the credibility of the analysts

Artifacts As Composite Objects
Joe Slowik spoke at RSA in 2022, describing indicators, or technical observables, as "composite objects". This is an important concept in DFIR and SOC analysis, as well, and not just in CTI. We cannot base our findings on a single artifact, treating it as a discrete, atomic indicator, such as an IP address just being a location, or tied to a system, or a ShimCache entry denoting time of execution. We cannot view a process command line within EDR telemetry, by itself, as evidence of program execution. Rather, we need to recognize that artifacts are, in fact, composite objects; in his talk, Joe references Mandiant's definition of indicators of compromise, which can help us understand and visualize this concept.

Composite objects are made up of multiple elements. An IP address is not just a location, as the IP address is an observable with context. Where was the IP address observed, when was it used, and how was it used? Was it the source of an RDP, or a type 3 login? If the IP address was the source of a successful login, what was the username used? Was the IP address the source of a connection seen in web server or VPN logs? Is it the C2 address?

If we consider a ShimCache entry, we have to remember that (a) the entry itself does NOT explicitly demonstrate program execution, and that (b) the time stamp is mutable. That is, what we see could have been modified before we saw it. For example, we often see analysts hold up a ShimCache entry as evidence of program execution, often as the sole indicator. We have to understand and remember that the time stamp associated with a ShimCache entry is the last modification time for the entry, taken from the $STANDARD_INFORMATION attribute within the MFT. I've seen several instances where the file is placed on the system and then time stomped (the time stamp is easily mutable) before the entry was added to the Application Compatibility database. This is all in addition to understanding that an entry in the ShimCache does NOT mean that the file was executed. Note that the same is true for AmCache entries, as well.

We can validate indicators of compromise by including them in constellations, including them alongside other associated indicators, as doing so increases fidelity and brings valuable context to our analysis. We see this illustrated when performing searches for PCI data within acquired images; if you just search for a string of 16 characters starting with "4", you're going to get a LOT of results. If you look for strings of characters based on a bank ID number (BIN), length of the string, and if it passes the Luhn check, you're still going to get a lot of results, but not as many. If you also search for the characteristics associated with track 1 and track 2 data, your search results are going to be a smaller set, but with much higher fidelity because we've added layers of context.

Cost
So the question becomes, what is the cost of validating something versus not validating it? What is the impact or result of either? This seems on the surface like it's a silly question, maybe even a trick question. I mean, it looks that way when I read back over the question after typing it in, but then I think back to all the times I've seen when something hasn't been validated, and I have to wonder, what prevented the analyst from validating their finding, rather than simply basing their finding on a single artifact, out of context?

Let's look at a simple example...we receive an alert that a program executed, based on SIEM data or EDR telemetry. This alert can be based on elements of the command line, process parentage, or a combination thereof. Let's say that based on a number of factors and reliable sources, we believe that the command line is associated with malicious activity.

What do you report?

Do you report that this malicious thing executed, or do you investigate further to see if the malicious thing really did execute, and executed successfully? How would be go about investigating this, what data sources would be look to?

As you're thinking about this, as you're walking through this exercise, something I'd like you to keep in mind is that question, what would prevent you from actually examining those data sources you identify? Is there some "cost" (effort, time, other resources) that prevent you from doing so?