Zen and the Art of SMM Bug Hunting | Finding, Mitigating and Detecting UEFI Vulnerabilities

It’s been almost a full year since we published the last part of our UEFI blog posts series. During that period, the firmware security community has been was more active than ever and produced several high-quality publications. Notable examples include the discovery of new UEFI implants such as MoonBounce and ESPecter, and the recent disclosure of no less than 23 high-severity BIOS vulnerabilities by Binarly.

Here at SentinelOne, we haven’t been sitting idle either. In the past year, we tried our hand at hunting down and exploiting SMM vulnerabilities. After spending several months doing so, we noticed some repetitive anti-patterns in SMM code and developed a pretty good intuition regarding the potential exploitability of bugs. Eventually, we managed to conclude 2021 after having disclosed 13 such vulnerabilities, affecting most of the well-known OEMs in the industry. In addition, several more vulnerabilities are still moving through the responsible disclosure pipeline and should go public soon.

In this blog post, we would like to share the knowledge, tools, and methods we developed to help uncover these SMM vulnerabilities. We hope that by the time you finish reading this article, you too will be able to find such firmware vulnerabilities yourselves. Please note that this article assumes a solid knowledge of SMM terminology and internals, so if your memory needs a refresher we highly recommend reading the articles in the Further Reading section before proceeding. And now, let’s get started.

Classes of SMM Vulnerabilities

While in theory SMM code is isolated from the outside world, in reality, there are many circumstances in which non-SMM code can trigger and even affect code running inside SMM. Because SMM has a complex architecture with lots of “moving parts” in it, the attack surface is pretty vast and contains among other things data passed in communication buffers, NVRAM variables, DMA-capable devices, and so on.

In the following section, we will go through some of the more common SMM security vulnerabilities. For each vulnerability type, we will provide a brief description, some recommended mitigations as well as a strategy for detecting it while reversing. Note that the list of vulnerabilities is not exhaustive and contains only vulnerabilities that are specific to the SMM environment. For that reason, it will not include more generic bugs such as stack overflows and double-frees.

SMM Callouts

The most basic SMM vulnerability class is known as an “SMM callout”. This occurs whenever SMM code calls a function located outside of the SMRAM boundaries (as defined by the SMRRs). The most common callout scenario is an SMI handler that tries to invoke a UEFI boot service or runtime service as part of its operation. Attackers with OS-level privileges can modify the physical pages where these services live prior to triggering the SMI, thus hijacking the privileged execution flow once the affected service is called.

Mitigation

Besides the obvious approach of not writing such faulty code in the first place, SMM callouts can also be mitigated at the hardware level. Starting from the 4th generation of the Core microarchitecture (Haswell) Intel CPUs support a security feature called SMM_Code_Chk_En. If this security feature is turned on, the CPU is prohibited from executing any code located outside the SMRAM region once it enters SMM. One can think of this feature as the SMM equivalent of Supervisor Mode Execution Prevention (SMEP).

Querying for the status of this mitigation can be done by executing the smm_code_chk module from CHIPSEC.

Figure 2 – Using chipsec to query for the hardware mitigation against SMM callouts

Detection

Static detection of SMM callouts is pretty straightforward. Given an SMM binary, we should analyze it while looking for SMI handlers that have some execution flow that leads to calling a UEFI boot or runtime service. This way, the problem of finding SMM callouts is reduced to the problem of searching the call graph for certain paths. Luckily for us, no additional effort is required at all since this heuristic is already implemented by the excellent efiXplorer IDA plugin.

As we mentioned in previous posts in the series, efiXplorer is a one-stop-shop and serves as the de-facto standard way of analyzing UEFI binaries with IDA. Among other things, it takes care of the following:

Locating and renaming known UEFI GUIDs
Locating and renaming SMI handlers
Locating and renaming UEFI boot/runtime services
Recent versions of efiXplorer use the Hex-Rays decompiler to improve analysis. One such feature is the ability to assign the correct type to interface pointers passed to methods such as LocateProtocol() or its SMM counterpart SmmLocateProtocol().

A note to Ghidra users: We also want to add that the Ghidra plugin efiSeek takes care of all the changes in the list above. However, it doesn’t include the UI elements like the protocols window and the vulnerability detection capabilities offered by efiXplorer.

After analysis of the input file is complete, efiXplorer will move on to inspect all calls carried out by SMI handlers, which yields a curated listing of potential callouts:

Figure 4 – sub_7F8 is reachable from an SMI handler but still calls a boot service located outside of SMRAM

For the most part, this heuristic works great, but we’ve encountered several edge cases where it might generate some false positives as well. The most common one is caused due to the usage of EFI_SMM_RUNTIME_SERVICES_TABLE. This is a UEFI configuration table that exposes the exact same functionality as the standard EFI_RUNTIME_SERVICES_TABLE, with the only significant difference being that, unlike its “standard” counterpart, it resides in SMRAM and is therefore suitable to be consumed by SMI handlers. Many SMM binaries often re-map the global RuntimeServices pointer to the SMM-specific implementation after completing some boilerplate initialization tasks:

Figure 5 – Remapping the global RuntimeService pointer to the SMM-compatible implementation

Calling runtime services via the re-mapped pointer yields a situation that appears to be a callout at first glance, though a closer examination will prove otherwise. To overcome this, analysts should always search the SMM binary for the GUID identifying EFI_SMM_RUNTIME_SERVICES_TABLE. If this GUID is found, chances are that most of the callouts involving UEFI runtime services are false positives. This does not apply to callouts involving boot services, though.

Figure 6 – A false positive caused by calling `GetVariable()` via the re-mapped RuntimeService pointer

Another source of potential false positives is various wrapper functions which are “dual-mode”, meaning they can be called from both SMM and non-SMM contexts. Internally, these functions dispatch a call to an SMM service if the caller is executing in SMM, and dispatches a call to the equivalent boot/runtime service otherwise. The most common example we’ve seen in the wild is FreePool() from EDK2, which calls gSmst->SmmFreePool() if the buffer to be freed resides in SMRAM, or calls gBs->FreePool() otherwise.

Figure 7 – The `FreePool()` utility functions from EDK2 is a common source of false positives

As this example demonstrates, bug hunters should be aware of the fact that static code analysis techniques are having a hard time determining that certain code paths won’t be executed in practice, and as such are likely to flag this as a callout. Some tips and tricks for identifying this function in compiled binaries will be conveyed in the Identifying Library Functions section.

Low SMRAM Corruption

Description

Under normal circumstances, the communication buffer used to pass arguments to the SMI handler must not overlap with SMRAM. The rationale for this restriction is quite simple: if that wasn’t the case, any time the SMI handler would write some data into the comm buffer — for example, in order to return a status code to the caller — it would also modify some portion of SMRAM along the way, which is undesirable.

Figure 8 – This situation should not occur

In EDK2, the function responsible for checking whether or not a given buffer overlaps with SMRAM is called SmmIsBufferOutsideSmmValid(). This function gets called on the communication buffer upon each SMI invocation in order to enforce this restriction.

Figure 9 – EDK2 forbids the comm buffer from overlapping with SMRAM

Alas, since the size of the communication buffer is also under the attacker’s control this check on its own is not enough to guarantee sound protection and some additional responsibilities lay on the shoulders of the firmware developers. As we will see shortly, many SMI handlers fail here and leave a gap attackers can exploit to violate this restriction and corrupt the bottom portion of SMRAM. To understand how, let’s take a closer look at a concrete example:

Above we have a real-life, very simple SMI handler. We can divide its operation into 4 discrete steps:

Sanity checking the arguments.
Reading the value of the MSR_IDT_MCR5 register into a local variable.
Computing a 64-bit value out of it, then writing the result back to the communication buffer.
Return to the caller.

The astute reader might be aware of the fact that during step 3, an 8-byte value is written to the Comm Buffer, but nowhere during step 1 does the code check for the prerequisite that the buffer is at least 8 bytes long. Because this check is omitted, an attacker can exploit this by:

Placing the Comm Buffer in a memory location as adjacent as possible to the base of SMRAM (say SMRAM – 1).
Set the size of the Comm Buffer to a small enough integer value, say 1 byte.
Trigger the vulnerable SMI. Schematically, the memory layout would look as follows:

Figure 11 – Memory layout at the time of SMI invocation

As far as SmmEntryPoint is concerned, the Comm Buffer is just 1 byte long and does not overlap with SMRAM. Because of that, SmmIsBufferOutsideSmmValid() will succeed and the actual SMI handler will be called. During step 3, the handler will blindly write a QWORD value into the Comm Buffer, and by doing so it will unintentionally write over the lower 7 bytes of SMRAM as well.

Figure 12 – Memory layout at the time of corruption

Based on EDK2, the bottom portion of TSEG (the de-facto standard location for SMRAM), contains a structure of type SMM_S3_RESUME_STATE whose job is to control recovery from the S3 sleep state. As can be seen below, this structure contains a plethora of members and function pointers whose corruption can benefit the attacker.

Figure 13 – Definition for the SMM_S3_RESUME_STATE object, source: EDK2