Deep Hooks: Monitoring native execution in WoW64 applications – Part 3
Last time (part 1, part 2) we demonstrated several different methods for injecting 64-bit modules into WoW64 processes. This post picks up where we left off and describes how the ability to execute 64-bit code in such processes can be leveraged to hook native x64 APIs. To accomplish this task, the injected DLL must host a hooking engine capable of operating in the native region of a WoW64 process. Unfortunately, none of the hooking engines we inspected can manage this out of the box, so we had to modify one of them to suit our needs.
Creating the injected DLL
Choosing the hooking engine to adapt
Hooking is a well-established technique in the field of computer security, used extensively by defenders and attackers alike. Since the publication of the seminal article Detours: Binary Interception of Win32 Functions back in 1999, many different hooking libraries have been developed. Most of them apply concepts similar to the ones presented in that article, but differ in other aspects such as their support for various CPU architectures, support for transactions, etc. Out of these libraries, we had to choose the one that best fit our requirements:
- Support inline hooking of x64 functions.
- Open-sourced and free-licensed – so that we could modify it legally.
- Preferably, the hooking engine should be relatively minimal so as to require as few modifications as possible.
After taking all of these requirements into consideration, we chose MinHook as our go-to engine. What eventually tipped the scales in favor of it was its small codebase, making it relatively easy to use in a PoC. All of the modifications presented later were done on top of it, and might be slightly different if another hooking engine is used instead.
The complete source code of our modified hooking engine is available here.
Look ma’, no dependencies!
In part 1 we briefly mentioned that not every 64-bit module can be easily loaded into a WoW64 process. Most DLLs tend to use (both implicitly and explicitly) various functions found in common Win32 subsystem DLLs, such as kernel32.dll, user32.dll, etc. However, the 64-bit versions of these modules are not loaded by default into WoW64 processes, as they are not required for the WoW64 subsystem to operate. Furthermore, due to some limitations imposed by the address space layout, forcing the process to load any of these is somewhat difficult and unreliable.
To avoid unnecessary complications, we chose to modify our hooking engine and the DLL that hosts it so that they would only rely on native 64-bit modules found normally in WoW64 processes. Basically, this left us with just the native NTDLL, as the DLLs comprising the WoW64 environment don’t usually contain functions which are beneficial to us.
In a more practical sense, to force the build environment to only link against NTDLL, we specify the /NODEFAULTLIB flag in the linker settings, and explicitly add “ntdll.lib” to the list of additional dependencies:
Figure 19 – linker configuration for the host DLL
The first and most noticeable effect of this change is that higher-level Win32 API functions are no longer at our disposal and have to be re-implemented using their NTDLL counterparts. As demonstrated in figure 20, for every Win32 API used by MinHook, we introduced a replacement function which has the same public interface and implements the same core functionality, while internally using only NTDLL facilities.
Most of the time these “translations” were rather straightforward (for example calls to VirtualProtect() can be replaced almost directly with calls to NtProtectVirtualMemory()). In other, more complex cases the mapping between the Win32 API functions and their native counterparts is not as clear, so we had to resort to some reverse engineering or peeking inside the ReactOS sources.
Figure 20 – private implementation of VirtualProtect()
After re-implementing all Win32 API calls in MinHook, we were still left with a bunch of errors:
Figure 21 – a bunch of errors
Luckily, solving most of these errors only requires slight configuration changes to the project. As can be seen in the figure, most of the errors take the form of an unresolved external symbol normally exported from the CRT (which is not available). They can be settled by changing a few compiler and linker settings:
- Disable basic runtime checks (remove the /RTC flag from the command line)
- Disable buffer security checks (/GS- flag)
- The entry point must be explicitly specified as DllMain since DllMainCRTStartup is not linked in.
- Additionally, memcpy() and memset() have to be implemented manually, or replaced with calls to RtlCopyMemory() and RtlFillMemory() exported from NTDLL.
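As a rough illustration of the last item, byte-wise replacements could look like the sketch below. The names nocrt_memcpy() and nocrt_memset() are hypothetical and do not come from the modified MinHook sources; they were chosen to avoid clashing with compiler intrinsics. Note also that some optimizers recognize such loops and turn them back into calls to memcpy() or memset(), so a real build may need additional care (this is an assumption about the toolchain, not something the original project documents).

```c
#include <stddef.h>

/* Hypothetical CRT-free replacement for memcpy(): copies n bytes from src
   to dst, one byte at a time. The regions must not overlap. */
void *nocrt_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical CRT-free replacement for memset(): fills n bytes at dst
   with the low 8 bits of value. */
void *nocrt_memset(void *dst, int value, size_t n)
{
    unsigned char *d = (unsigned char *)dst;

    while (n--)
        *d++ = (unsigned char)value;
    return dst;
}
```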
After applying all of these changes, we successfully created a custom 64-bit DLL that contains no dependencies other than NTDLL:
Figure 22 – a minimalistic DLL with only a single import descriptor (NTDLL)
Hooking the native NTDLL
With the hooking engine modified to work within all the aforementioned limitations, we can take a closer look at the hooking mechanism itself. The hooking technique employed by MinHook, as well as by the vast majority of such libraries, is dubbed “inline hooking”. The inner workings of this technique are rather well-documented, but here is a simplified description of the steps it includes:
- Allocate a “trampoline” in the process’ address-space and copy into it the prolog of the function that will eventually be hooked.
- Place a JMP instruction in the trampoline, right after the copied prolog. This JMP should point back to the instruction following the prolog of the original function.
- Place another JMP instruction in the trampoline, right before the copied prolog. This JMP should point into a detour function (usually found in the DLL we previously injected into the process).
- Overwrite the hooked function prolog with a JMP instruction pointing to the trampoline.
Figure 23.1 – a general illustration of what an inline hook looks like.
Figure 23.2 – a view of the trampoline. Marked in red is the jump to the detour function, and marked in green are the instructions copied from the hooked function and the jump back into it.
This hooking method works by modifying the prolog of the hooked function, so whenever it is called by the application, the detour function will be called instead. The detour function can then execute any code before, after or instead of the original function.
In 64-bit mode, most hooking engines use two different types of jumps for the hooked function and the trampoline:
- The jump from the hooked function to the trampoline is a relative jump encoded as “E9 <4 byte offset>”. Since this instruction operates on a DWORD-sized operand, the trampoline must be at a distance of no more than 2GB away from the hooked function. This form of jump is usually chosen for this step since it only takes up 5 bytes and so it’s compact enough to fit neatly inside the function’s prolog.
- The jumps from the trampoline into the detour function and back to the hooked function, shown in figure 23.2, are indirect, RIP-relative jumps encoded as “FF25 <4 byte offset>” (mnemonic form: JMP qword ptr [rip+offset]). This instruction will jump into a 64-bit absolute address stored in the location pointed to by RIP plus the offset.
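To make the 2GB limit of the first jump type concrete, here is a minimal sketch that encodes an “E9 <4 byte offset>” jump and rejects targets that are out of reach. The function name encode_rel_jmp() is hypothetical and is not part of MinHook; the only assumption about the encoding is the standard one, namely that the offset is measured from the end of the 5-byte instruction.

```c
#include <stdint.h>

/* Hypothetical helper: encodes "E9 rel32" for a jump placed at address
   `from` and targeting address `to`. Returns 1 on success, or 0 when the
   target cannot be expressed as a signed 32-bit offset. */
int encode_rel_jmp(uint8_t out[5], uint64_t from, uint64_t to)
{
    /* rel32 is relative to the address of the next instruction (from + 5). */
    int64_t delta = (int64_t)(to - (from + 5));
    uint32_t rel;

    if (delta < INT32_MIN || delta > INT32_MAX)
        return 0;                      /* more than 2GB away */

    rel = (uint32_t)(int32_t)delta;
    out[0] = 0xE9;                     /* JMP rel32 */
    out[1] = (uint8_t)(rel);
    out[2] = (uint8_t)(rel >> 8);
    out[3] = (uint8_t)(rel >> 16);
    out[4] = (uint8_t)(rel >> 24);
    return 1;
}
```

A caller would typically fall back to a different jump form whenever this function returns 0.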
When running in native 64-bit processes, hooking engines employing this technique work just fine. As can be expected, the trampoline is allocated a short distance from the target function (up to 2GB away), thus allowing for successful binary instrumentation.
However, some recent changes to the memory layout of WoW64 processes guarantee that this technique cannot be applied to the native NTDLL without some additional changes. As Alex Ionescu demonstrated in his blog, in recent Windows versions (starting from Windows 8.1 update 3), the native NTDLL has been relocated: instead of being loaded into the lower 4GB of the address space together with the rest of the process’ modules, it is now loaded to a much higher address.
Figure 24 – base address of 64-bit NTDLL on Windows 10 (left) and Windows 7 (right).
The rest of the address space above the 4GB boundary (with the exception of the native NTDLL and the native CFG bitmap) is protected with a SEC_NO_CHANGE VAD and thus cannot be accessed, allocated or freed by anyone. This means that the trampoline will always be allocated in the lower 4GB of the address space. Since the total user-mode address space in 64-bit systems is 128TB, the distance between the native NTDLL and the trampoline is bound to be much greater than 2GB. That makes the JMP emitted by most hooking engines inadequate.
Figure 25 – illustration showing the control transfers required for inline hooks in WoW64 processes on Windows 8.1 and up. Note that in Windows 10 RS4 preview (build 17115) the SEC_NO_CHANGE VADs don’t seem to exist anymore and memory can be allocated anywhere in the process address space.
An alternative form of JMP
To overcome this issue, we had to replace the relative JMP with a different instruction, capable of passing a distance of up to 128TB. When searching for alternatives, we bumped into a post by Gil Dabah listing a few possible options. After disqualifying every option that “dirties” a register, we were left with only a couple of viable choices. Initially, we attempted to replace the relative JMP with an indirect, RIP-relative JMP similar to the ones used in the trampoline:
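A common way to encode such a jump, and the one assumed in this sketch, is “FF25 00000000” followed immediately by the 8-byte absolute target, for a total of 14 bytes. The helper name encode_abs_jmp() is hypothetical, and the exact byte layout used in the modified engine may differ.

```c
#include <stdint.h>

/* Hypothetical helper: writes a 14-byte absolute jump of the form
     FF 25 00 00 00 00    JMP qword ptr [rip+0]
     <8-byte target>      pointer read by the JMP
   The zero offset makes the CPU read the pointer stored right after
   the instruction itself. */
void encode_abs_jmp(uint8_t out[14], uint64_t target)
{
    int i;

    out[0] = 0xFF;
    out[1] = 0x25;
    out[2] = out[3] = out[4] = out[5] = 0x00;      /* offset = 0 */

    for (i = 0; i < 8; i++)                        /* little-endian target */
        out[6 + i] = (uint8_t)(target >> (8 * i));
}
```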
This instruction performed well on Windows 10, and provided us with a way to instrument various native API functions in WoW64 processes. But when testing the modified code on earlier Windows versions such as Windows 8.1 and Windows 7, it failed to create the hooks entirely. As it turns out, NTDLL functions in these Windows versions are shorter than their counterparts in Windows 10, and usually do not contain enough space to accommodate our chosen JMP instruction, which takes up 14 bytes.
Figure 26 – implementation of ZwAllocateVirtualMemory() in Windows 10 RS2 (left) and Windows 8.1 (right).
To make our DLL universal across all Windows versions, we had to find a shorter instruction, still capable of branching to the trampoline. Eventually, we came up with a solution that actually takes advantage of the trampoline’s location: since the trampoline must be allocated in the lower 4GB of the address space, the upper 4 bytes of its 8-byte address are zeroed out. This lets us use the following option, which only takes up 6 bytes:
This method works because in x64 code the PUSH instruction, when supplied with a 4-byte operand, actually pushes an 8-byte value onto the stack. The upper 4 bytes are filled by sign extension, meaning that as long as the 4-byte address is below 2GB (its most significant bit is clear), they will be zeroed. An address between 2GB and 4GB would instead be extended with ones and point somewhere else entirely.
We then use a RET instruction, which pops an 8-byte address from the stack and jumps to it. Since we have just pushed the address of the trampoline onto the top of the stack, that would be our return address.
Figure 27 – NtAllocateVirtualMemory() containing our modified hook. Notice the first two instructions, which push the address of the trampoline onto the stack and immediately “return” to it.
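The two instructions described above can be sketched as follows. The helper names are hypothetical; the only assumptions are the standard encodings “68 <4 byte immediate>” for PUSH and “C3” for RET, which add up to 6 bytes.

```c
#include <stdint.h>

/* Models what the CPU pushes for "PUSH imm32" in 64-bit mode:
   the immediate is sign-extended to 64 bits. */
uint64_t push_imm32_value(uint32_t imm)
{
    return (uint64_t)(int64_t)(int32_t)imm;
}

/* Hypothetical helper: writes "PUSH imm32; RET" targeting `trampoline`.
   Returns 1 on success, or 0 if the address would not survive the sign
   extension (i.e. it is not below 2GB). */
int encode_push_ret(uint8_t out[6], uint64_t trampoline)
{
    if (trampoline > 0x7FFFFFFFu)
        return 0;

    out[0] = 0x68;                              /* PUSH imm32 */
    out[1] = (uint8_t)(trampoline);
    out[2] = (uint8_t)(trampoline >> 8);
    out[3] = (uint8_t)(trampoline >> 16);
    out[4] = (uint8_t)(trampoline >> 24);
    out[5] = 0xC3;                              /* RET */
    return 1;
}
```

The range check mirrors the sign-extension rule: an immediate with its top bit set would be pushed with the upper 4 bytes set to ones, so RET would not land on the trampoline.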
There is only one problem left with this method, caused by CFG. As mentioned in part 2 of this series, all private memory allocations in WoW64 processes – including the trampolines used for our hooks – are marked exclusively in the WoW64 CFG bitmap.
Whenever we wish to execute the original API function from the detour, we first need to call the trampoline in order to run that function’s prolog. But, if our DLL is compiled with CFG, it will attempt to validate the trampoline’s address against the native CFG bitmap before calling it. Due to this mismatch, the validation will fail, resulting in the termination of the process.
The solution to this problem is rather straightforward – having control over the DLL’s configuration, we can simply compile it without enabling CFG. This is done by removing the /guard:cf flag from the compiler’s command line.
Preventing infinite recursions
The last issue to consider when adapting a hooking engine is infinite recursions. After the hooks are placed, whenever a call is made to a hooked function, this call will reach our detour instead. But our detour functions also execute their own code, which might itself make calls to hooked functions, leading us back to our detour. Unless handled carefully, this can lead to infinite recursions.
Figure 28 – infinite recursions when the hook on LdrLoadDll attempts to load another DLL
Normally there exists a simple solution for this problem: declaring a thread local variable, which counts the “depth” of the recursion we’re in, and only executing the code inside the detour function the first time (counter == 1):
Figure 29 – a thread local variable counting the depth of the recursion
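In a setting where thread local variables are available, the counter approach could be sketched roughly as below. All names are hypothetical: detour_api() stands for any detour function, call_original() stands in for calling the original function through the trampoline, and monitor() represents instrumentation code that happens to call a hooked API itself. The sketch uses C11 _Thread_local; other compilers offer equivalent keywords.

```c
/* Per-thread recursion depth of the detour. */
static _Thread_local int g_hook_depth;

/* Number of times the monitoring logic actually ran (for illustration). */
static int g_monitor_runs;

int detour_api(int arg);

/* Stand-in for invoking the original function via the trampoline. */
static int call_original(int arg)
{
    return arg + 1;
}

/* Instrumentation code; it calls a hooked API and so re-enters the detour. */
static void monitor(int arg)
{
    g_monitor_runs++;
    detour_api(arg);
}

int detour_api(int arg)
{
    int result;

    g_hook_depth++;
    if (g_hook_depth == 1)      /* only the outermost call is monitored */
        monitor(arg);
    result = call_original(arg);
    g_hook_depth--;
    return result;
}

int monitor_runs(void)
{
    return g_monitor_runs;
}
```

Without the depth check, monitor() and detour_api() would keep calling each other until the stack is exhausted.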
Unfortunately, we cannot use thread local variables in our DLL, for two reasons:
- Implicit TLS (__declspec(thread)) relies heavily on the CRT, which is not available to us.
- Explicit TLS APIs (TlsAlloc() / TlsFree(), etc.) are implemented in their entirety in kernel32.dll, whose 64-bit version is not loaded into WoW64 processes.
In spite of these constraints, wow64.dll does use TLS storage, as can be verified by looking at the output of the “!wow64exts.info” command:
Figure 30 – TLS variables used by the WoW64 DLLs
As it turns out, wow64.dll does not dynamically allocate TLS slots at runtime but rather uses hardcoded locations in the TlsSlots array accessible directly from the TEB (already instantiated on a per-thread basis).
Figure 31 – Wow64SystemServiceEx writes a thread local variable into a hardcoded location in the TlsSlots array
After some empirical testing, we discovered that most TLS slots in the 64-bit TEB are never used by WoW64 DLLs, so for the purpose of this PoC we can just pre-allocate one of them to store our counter. There is no guarantee that this slot will remain unused in future Windows versions, so production-grade solutions would probably look into some other available members of the TEB.
Figure 32 – using an unused member of the TEB to count the “depth” of our recursion.
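A minimal sketch of this idea follows. It takes the slot array as a parameter so that it stays self-contained; inside the injected DLL the argument would presumably be NtCurrentTeb()->TlsSlots. The slot index HOOK_DEPTH_SLOT and the function names are hypothetical, since the post does not state which slot was used.

```c
#include <stdint.h>

/* Hypothetical slot index, picked arbitrarily for illustration. A real
   implementation must verify that the slot is unused on every supported
   Windows version. */
#define HOOK_DEPTH_SLOT 40

/* Increments the per-thread depth stored in the slot. Returns 1 if this
   is the outermost entry into a detour on this thread, 0 otherwise. */
int hook_enter(void **tls_slots)
{
    uintptr_t depth = (uintptr_t)tls_slots[HOOK_DEPTH_SLOT] + 1;

    tls_slots[HOOK_DEPTH_SLOT] = (void *)depth;
    return depth == 1;
}

/* Decrements the per-thread depth; must be paired with hook_enter(). */
void hook_leave(void **tls_slots)
{
    uintptr_t depth = (uintptr_t)tls_slots[HOOK_DEPTH_SLOT] - 1;

    tls_slots[HOOK_DEPTH_SLOT] = (void *)depth;
}
```

A detour would call hook_enter() first, run its monitoring code only when the function returns 1, and call hook_leave() before returning.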
This concludes the third and final part of our “Deep Hooks” series. In these three posts we presented several different ways to inject a 64-bit DLL into a WoW64 process and then use it to hook API functions in the 64-bit NTDLL. Hopefully, this option will benefit security products by allowing them to gain better visibility into WoW64 processes and making them more resilient to bypasses such as “Heaven’s Gate”.
The methods presented throughout this series still have their limitations, found in the form of new mitigation options such as dynamic code restrictions, CFG export suppression and code integrity guard. When enabled, these might prevent us from creating our hooks or thwart our injection altogether, but more on that in some future post.