Deep Hooks: Monitoring native execution in WoW64 applications – Part 2

Where we left off

In the first part of this series we presented several injection methods capable of injecting 64-bit DLLs into WoW64 processes, with the intention of eventually using this DLL to hook 64-bit API functions in the process.

We finished the post by presenting injection via APC, and saw that, when tested against CFG-aware processes, it failed to inject the DLL and crashed the process instead. To understand why that happens, we have to dive into some of the implementation details of CFG.

A brief introduction to CFG

CFG (Control Flow Guard) is a relatively new exploit mitigation feature, first introduced in Windows 8.1 Update 3 and later enhanced in Windows 10. It is a compiler-enabled mitigation meant to combat memory corruption vulnerabilities by preventing indirect calls to non-legitimate targets. Right before every indirect function call, the compiler inserts an additional call to a dedicated validation routine located in NTDLL. This routine receives the call target and checks whether or not it is the start address of a function, at an 8-byte granularity. If it is not, a security check failure (int 0x29) is raised and the process is forcibly terminated.


Figure 8 – Mimikatz, compiled with (right) and without (left) CFG.
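Conceptually, the instrumentation turns every indirect call into "validate the target, then call it". The sketch below is a portable model of that idea only: a hypothetical guard_check() and a small table of legitimate targets stand in for the NTDLL validation routine and its bitmap, and abort() stands in for the int 0x29 fast-fail. It is not how MSVC or NTDLL actually implement the check.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef int (*handler_fn)(int);

int double_it(int x) { return x * 2; }
int negate_it(int x) { return -x; }

/* Hypothetical set of legitimate call targets; real CFG consults a bitmap. */
static const handler_fn valid_targets[] = { double_it, negate_it };

/* Returns 1 if target is a registered function start, 0 otherwise. */
int is_valid_target(handler_fn target)
{
    for (size_t i = 0; i < sizeof valid_targets / sizeof valid_targets[0]; i++)
        if (valid_targets[i] == target)
            return 1;
    return 0;
}

/* Models the check the compiler inserts; abort() stands in for int 0x29. */
void guard_check(handler_fn target)
{
    if (!is_valid_target(target)) {
        fputs("invalid indirect call target\n", stderr);
        abort();
    }
}

/* What an instrumented `fn(arg)` conceptually becomes. */
int guarded_call(handler_fn fn, int arg)
{
    guard_check(fn);
    return fn(arg);
}
```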

To make this validation easy and efficient, CFG utilizes a new memory region added especially for that purpose, called a CFG bitmap. In this bitmap, each bit represents the status of 8 bytes in the process’ address space and marks whether or not they constitute a valid call target. Because of this mapping ratio, the bitmap has to be 1/64 of the total virtual address space of the process, which can get quite large – 2TB in 64-bit processes, whose total address space is 128TB.
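The arithmetic behind the 1/64 ratio can be sketched as follows. It follows the simplified model used in this post (one bit per 8 bytes); the actual Windows 10 encoding is commonly described as two bits per 16-byte slot, which gives the same ratio. The helper names are hypothetical.

```c
#include <stdint.h>

/* Simplified model from the text: one bitmap bit per 8 bytes. */
#define CFG_BYTES_PER_BIT 8u
#define BITS_PER_BYTE     8u

/* Index of the bit describing the 8-byte slot that contains addr. */
uint64_t cfg_bit_index(uint64_t addr)
{
    return addr / CFG_BYTES_PER_BIT;
}

/* Offset of the bitmap byte holding that bit (i.e. addr / 64). */
uint64_t cfg_byte_offset(uint64_t addr)
{
    return cfg_bit_index(addr) / BITS_PER_BYTE;
}

/* Position of that bit within its byte (0..7). */
unsigned cfg_bit_in_byte(uint64_t addr)
{
    return (unsigned)(cfg_bit_index(addr) % BITS_PER_BYTE);
}

/* Bitmap size needed to cover an address space of the given size. */
uint64_t cfg_bitmap_size(uint64_t address_space_size)
{
    return address_space_size / (CFG_BYTES_PER_BIT * BITS_PER_BYTE);
}
```

For a 128TB address space, cfg_bitmap_size(128ULL << 40) evaluates to 2ULL << 40, i.e. the 2TB quoted above.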

Obviously, in 64-bit processes most of this bitmap is uncommitted, as only a very small portion of the process’ address space is actually in use. Only when a new executable page is introduced (either by directly allocating virtual memory, mapping a view of a section object or otherwise changing page protection to executable) does the kernel commit the relevant part of the bitmap and set the bits corresponding to that page.
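A toy simulation shows what "setting the bits corresponding to a page" amounts to under the same one-bit-per-8-bytes model. The simulated address space, page size and function names are hypothetical, and every slot of the page is marked as a simplification. Note that one 4KB page maps to 64 bitmap bytes, so a single 4KB page of bitmap covers 256KB of address space and only a small slice of the bitmap has to be committed for each new executable region.

```c
#include <stdint.h>
#include <string.h>

/* Toy parameters for a simulated address space (hypothetical values). */
#define SIM_PAGE_SIZE   0x1000u
#define SIM_SLOT_SIZE   8u                       /* bytes per bitmap bit */
#define SIM_SPACE_SIZE  (64u * SIM_PAGE_SIZE)    /* 256KB simulated space */
#define SIM_BYTES_PER_BITMAP_BYTE (SIM_SLOT_SIZE * 8u)   /* 64 */

static uint8_t sim_bitmap[SIM_SPACE_SIZE / SIM_BYTES_PER_BITMAP_BYTE];

/* Marks every 8-byte slot of the page containing addr as a valid target.
   One page maps to 4096 / 64 = 64 whole bitmap bytes, so memset suffices. */
void mark_exec_page(uint32_t addr)
{
    uint32_t base = addr & ~(SIM_PAGE_SIZE - 1u);
    if (base >= SIM_SPACE_SIZE)
        return;                                  /* outside the simulation */
    memset(&sim_bitmap[base / SIM_BYTES_PER_BITMAP_BYTE], 0xFF,
           SIM_PAGE_SIZE / SIM_BYTES_PER_BITMAP_BYTE);
}

/* Returns 1 if the slot containing addr is marked as a valid target. */
int is_marked(uint32_t addr)
{
    if (addr >= SIM_SPACE_SIZE)
        return 0;
    uint32_t bit = addr / SIM_SLOT_SIZE;
    return (sim_bitmap[bit / 8u] >> (bit % 8u)) & 1u;
}
```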

CFG in WoW64 processes

In his blog post, titled ‘Closing “Heaven’s Gate”’, Alex Ionescu described some of the unique characteristics of CFG in WoW64 processes. As shown there, CFG-aware WoW64 processes have not one, but two separate CFG bitmaps:  

  • A native bitmap, marking valid call targets for 64-bit code in the process. Since this bitmap has to be inaccessible to 32-bit code, it resides above the 4GB boundary, usually next to the native NTDLL.
  • A WoW64 bitmap, marking valid call targets for 32-bit code in the process. Its reserved size is 32MB as it only covers the lower 4GB of the address space, where 32-bit code can live. Obviously, it is always located below the 4GB boundary, usually next to the main image.

Since WoW64 processes have two CFG bitmaps and two versions of NTDLL loaded into them, there are, naturally, two versions of the validation function as well. The 32-bit version of the function checks the supplied address against the WoW64 bitmap, while the 64-bit version checks it against the native bitmap.


Figure 9 – a snapshot of the virtual address space of a 32-bit notepad.exe, running on Windows 10 x64

As mentioned earlier, the kernel sets bits in the CFG bitmap whenever a new executable page is introduced. This raises the question of which of the two bitmaps is affected in the case of WoW64 processes. As Ionescu pointed out, the answer lies in the MiSelectCfgBitMap() and MiSelectBitMapForImage() functions, called by the memory manager whenever a change to the CFG bitmap is required.

Figure 10 – Partial call stack showing the call to MiSelectCfgBitMap() and MiSelectBitMapForImage() when executable memory is mapped into a process

The pseudocode of these two functions is presented below:

Figure 11.1 – pseudocode of MiSelectCfgBitMap() as seen on Windows 10 x64 RS3

Figure 11.2 – pseudocode of MiSelectBitMapForImage() as seen on Windows 10 x64 RS3

A few conclusions can be drawn from these two functions:

  1. As can be expected, all 32-bit modules are marked in the WoW64 bitmap.
  2. All 64-bit modules are marked in the native bitmap, including those which are mapped into the lower 4GB of the address space. This is essential, as otherwise the native NTDLL would not be able to interoperate with the native DLLs comprising the WoW64 environment. For example, NTDLL would not even be able to load these modules as the call to their entry point would fail the CFG validation logic.
  3. All private memory allocations below the 4GB boundary are marked in the WoW64 bitmap, regardless of who allocated them or for what purpose. As the astute reader has probably noticed already, in the example shown in Figure 9 all of the address space above the 4GB boundary is reserved (with the exception of the native NTDLL and the native CFG bitmap). Since memory cannot be allocated from reserved regions, this effectively means that all private memory allocations will be marked exclusively in the WoW64 bitmap.
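Since the pseudocode of Figures 11.1 and 11.2 is not reproduced here in text form, the three conclusions above can be restated as a small decision function. This is a hypothetical model of the observed behaviour, not decompiled kernel code; in particular, the treatment of private memory above 4GB is an assumption, since such memory is effectively unavailable in practice.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BITMAP_NATIVE, BITMAP_WOW64 } cfg_bitmap;
typedef enum { KIND_IMAGE32, KIND_IMAGE64, KIND_PRIVATE } mem_kind;

#define FOUR_GB 0x100000000ULL

/* Hypothetical restatement of conclusions 1-3; not the real MiSelectCfgBitMap(). */
cfg_bitmap select_bitmap(bool is_wow64, mem_kind kind, uint64_t addr)
{
    if (!is_wow64)
        return BITMAP_NATIVE;            /* native processes have one bitmap */

    switch (kind) {
    case KIND_IMAGE32:
        return BITMAP_WOW64;             /* conclusion 1 */
    case KIND_IMAGE64:
        return BITMAP_NATIVE;            /* conclusion 2, even below 4GB */
    case KIND_PRIVATE:
    default:
        /* conclusion 3; the above-4GB branch is an assumption */
        return addr < FOUR_GB ? BITMAP_WOW64 : BITMAP_NATIVE;
    }
}
```

A private allocation below 4GB in a WoW64 process, such as an APC adapter thunk, falls into the last branch and yields BITMAP_WOW64.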

It now becomes clear why the previously shown technique for DLL injection using APC is bound to fail: although the “adapter thunk” contains 64-bit code, it is a private memory allocation and as such it will be marked in the WoW64 bitmap. However, the function responsible for the initial dispatching of APCs is the 64-bit version of KiUserApcDispatcher(), which will attempt to validate the thunk’s address against the native bitmap, where it is not marked, and so the check fails.

So, if we wish to maintain our APC injection capabilities, we must somehow modify our technique to overcome the CFG validation problem.

Back to APC injection

Having some prior knowledge regarding CFG implementation details, one might suggest simply marking the adapter thunk as a valid call target by calling NtSetInformationVirtualMemory() with the VmCfgCallTargetInformation information class. This option, albeit promising, won’t actually solve the problem. The reason is that internally, NtSetInformationVirtualMemory() relies on MiSelectCfgBitMap() to decide which of the two bitmaps should be affected. For the same reasons described earlier, MiSelectCfgBitMap() will still return the WoW64 bitmap when supplied with the address of the adapter thunk, thus leaving the native bitmap untouched.

Lemma 12 – NtSetInformationVirtualMemory() will only affect the WoW64 bitmap as it internally relies on MiSelectCfgBitMap(). Proof: Alex Ionescu says so. Q.E.D.

After disqualifying this solution, the next option that comes to mind is finding a way to somehow “trick” MiSelectCfgBitMap() into returning the native bitmap, right when memory for the adapter thunk is allocated.

“Nativising” a WoW64 process

When looking at the pseudocode of MiSelectCfgBitMap() presented in Figure 11.1, it is clearly visible that the native bitmap will always be returned for “true” 64-bit processes. This is expected, as 64-bit processes should only have a single, native CFG bitmap. Therefore, if we somehow manage to “nativise” the WoW64 process, the adapter thunk will be marked in the native bitmap and so the APC dispatching should succeed as planned.

The kernel’s way of telling whether a given process is a native one or not is by probing the WoW64Process member of the EPROCESS structure. If this member is set to NULL, the process is considered to be a native one, otherwise it is treated as a WoW64 process.

Figure 12 – a (very) partial view of the EPROCESS structure as seen in Windows 10 RS3. Notice the WoW64Process pointer at offset 0x428.

With that in mind, we can apply a DKOM-based solution, in which WoW64Process is zeroed out right before allocating memory for the adapter thunk and restored to its original value afterwards.

Figure 13 – pseudocode for making the adapter thunk occupy the native bitmap
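The idea behind Figure 13 can be illustrated with a user-mode simulation. Everything here is hypothetical: fake_eprocess is a stand-in whose layout bears no relation to the real, undocumented EPROCESS (where, on RS3, the pointer sits at offset 0x428), and the bitmap selector is reduced to the one branch that matters. It is not the driver code from Appendix B.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { BITMAP_NATIVE, BITMAP_WOW64 } cfg_bitmap;

/* Hypothetical stand-in for EPROCESS; only the WoW64Process pointer matters. */
typedef struct {
    uint64_t other_fields[4];
    void    *WoW64Process;      /* NULL => process is treated as native */
} fake_eprocess;

/* Reduced selector for private memory (see conclusion 3). */
cfg_bitmap select_for_private(const fake_eprocess *p, uint64_t addr)
{
    if (p->WoW64Process == NULL)
        return BITMAP_NATIVE;
    return addr < 0x100000000ULL ? BITMAP_WOW64 : BITMAP_NATIVE;
}

/* Zero WoW64Process, "allocate" the thunk (i.e. pick the bitmap it will be
   marked in), then restore the original pointer. */
cfg_bitmap allocate_thunk_nativised(fake_eprocess *p, uint64_t thunk_addr)
{
    void *saved = p->WoW64Process;
    p->WoW64Process = NULL;
    cfg_bitmap chosen = select_for_private(p, thunk_addr);
    p->WoW64Process = saved;
    return chosen;
}
```

The window between zeroing and restoring the pointer is where the multithreading hazards could arise: anything the kernel does on behalf of the process during that window would also see it as native.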

This solution, presented in Appendix B, makes our APC injection succeed in CFG-aware WoW64 processes, and was tested on Windows 10 RS3.

Although simple, this method does have some significant downsides. First, the EPROCESS structure that has to be modified is largely undocumented and changes often between Windows releases. Therefore, the offset of WoW64Process inside that structure cannot be relied on to remain constant and has to be searched for heuristically during runtime. Second, zeroing out the WoW64Process member can have some unexpected side effects and dangers, especially in cases where the process contains several threads.

To conclude, this is a valid option for making the APC injector work in CFG-aware processes, but it is rather unstable and unreliable, and should be used with extreme caution. Taking these downsides into consideration, we wanted to find a more reliable solution to the problem, preferably one that does not rely on private, executable memory allocations at all.

Thunkless APC injection

When initializing an APC, it is possible to set the APC routine to point to any function of our choice, whether it is an existing function or one created by us especially for that purpose. This means that – theoretically at least – we can inject a DLL by creating an APC that will directly call the native LdrLoadDll(), without going through the adapter thunk at all. Obviously, LdrLoadDll() is a valid call target for 64-bit code, and thus it can serve as an APC target without triggering a CFG violation.

However, at the binary level there seems to be a problem: the prototypes of LdrLoadDll() and KNORMAL_ROUTINE don’t match up. While LdrLoadDll() expects four arguments, functions of type KNORMAL_ROUTINE only seem to receive three:

Figure 14.1 – the prototype of LdrLoadDll()

Figure 14.2 – the prototype of KNORMAL_ROUTINE

Still, one should consider the x64 calling convention (on x64, the __fastcall keyword is accepted but ignored, as there is a single convention): the first four integer or pointer arguments of every function are passed to it via registers RCX, RDX, R8 and R9, so when LdrLoadDll() is called by KiUserApcDispatcher(), whatever value is currently held by R9 will be interpreted as the fourth parameter. According to the prototype presented above, the fourth parameter received by LdrLoadDll() is declared as “_Out_ PHANDLE ModuleHandle”. This means that for LdrLoadDll() to succeed, R9 must contain a valid pointer to a writable memory location capable of holding pointer-sized data.

Unfortunately, as the standard APC procedure takes only three parameters, there is obviously no way to specify a value for a fourth one during APC initialization time. As a result, the value held by R9 upon entry to the APC routine is basically unknown. So the question arises: can we somehow guarantee that R9 will hold a valid pointer that satisfies all of LdrLoadDll()’s requirements? Surprisingly enough, the answer to this question is positive, but how can we be sure of that?

Figure 15 – prototypes of KeInitializeApc and KeInsertQueueApc. The user-controlled parameters for the user-mode APC routine (NormalRoutine) are highlighted in red.

In his post exploring some of the internal aspects of APC dispatching, Skywing demonstrated that the 64-bit KiUserApcDispatcher() actually sends the APC routine a fourth, “hidden” argument, pointing to a CONTEXT structure. This structure holds the CPU state that is to be restored via NtContinue() when the APC dispatching process is finished. Although this post is rather old, looking at the implementation of KiUserApcDispatcher() in newer systems such as Windows 10 shows that this still holds:

Figure 16 – Part of the implementation of KiUserApcDispatcher() from native NTDLL in Windows 10 RS3. Notice that RSP, pointing to the CONTEXT structure, is moved into R9.
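The register-level argument can be modelled in portable C. In the sketch below, arg_regs, fake_ldr_load_dll() and dispatch_apc() are hypothetical stand-ins: the dispatcher places the three APC parameters in RCX, RDX and R8 and, as in Figure 16, loads R9 with the address of the saved CONTEXT, while the stand-in for LdrLoadDll() reads its fourth argument from R9 and stores an 8-byte handle through it. The three APC parameters map positionally onto LdrLoadDll()’s first three parameters.

```c
#include <stddef.h>
#include <stdint.h>

/* The four integer/pointer argument registers of the x64 convention. */
typedef struct { uint64_t rcx, rdx, r8, r9; } arg_regs;

/* Hypothetical stand-in for LdrLoadDll(): RCX, RDX and R8 would carry the
   path, flags and module name; R9 is ModuleHandle, an 8-byte output slot. */
int32_t fake_ldr_load_dll(const arg_regs *r, uint64_t loaded_base)
{
    uint64_t *module_handle = (uint64_t *)(uintptr_t)r->r9;
    if (module_handle == NULL)
        return -1;                 /* the real call would fault or fail */
    *module_handle = loaded_base;  /* pointer-sized write through R9 */
    return 0;
}

/* Stand-in for the relevant part of KiUserApcDispatcher(): the three APC
   parameters go to RCX/RDX/R8, and R9 receives RSP, i.e. the CONTEXT. */
int32_t dispatch_apc(uint64_t normal_context, uint64_t sysarg1,
                     uint64_t sysarg2, void *context_on_stack,
                     uint64_t loaded_base)
{
    arg_regs r = { normal_context, sysarg1, sysarg2,
                   (uint64_t)(uintptr_t)context_on_stack };
    return fake_ldr_load_dll(&r, loaded_base);
}
```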

So, we can conclude that in this scenario, the value received by LdrLoadDll() as ModuleHandle will always point to a writable memory block which holds a CONTEXT structure, thus allowing for a successful injection. However, overwriting members of the CONTEXT structure might be risky: if any important information is trashed, the thread might crash when attempting to resume its execution after NtContinue() is called. Since ModuleHandle is declared as a PHANDLE, LdrLoadDll() only writes 8 bytes (the size of a pointer on x64) to the memory location it points to, so it would only overwrite the first member of the CONTEXT structure, which happens to be P1Home:

Figure 17 – the first 0x34 bytes of the CONTEXT structure. The parameters passed to the APC routine are stored in offsets 0, 0x8 and 0x10 of that context, while the address of the APC routine is at offset 0x18.

Luckily, the first four members of the CONTEXT structure are actually used to store the arguments for KiUserApcDispatcher() and are no longer required once the APC routine itself is executed. To make sure that overwriting P1Home is indeed safe, it is enough to take a look at the prolog of KiUserApcDispatcher(), presented in Figure 16. Reviewing it carefully, we can see that KiUserApcDispatcher() has a somewhat unusual calling convention: the top of the stack points to the aforementioned CONTEXT structure, which, in addition to a CPU state, also encapsulates the address of the APC routine and the values of the three parameters that will be passed to it.

By correlating the offsets of this structure, shown in Figure 17, with the arguments’ offsets presented in Figure 16, we can conclude that:

  • P1Home holds NormalContext
  • P2Home holds sysarg1
  • P3Home holds sysarg2
  • P4Home holds NormalRoutine, which is the address of the APC routine that will be called from KiUserCallForwarder().

Figure 18 – the first 0x30 bytes of the CONTEXT structure when the APC routine is called
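The layout described above can be checked with a partial mirror of the x64 CONTEXT structure. context_head is a hypothetical stand-in that reproduces only the leading fields, up to MxCsr, in the order declared in winnt.h; the comments record the roles listed above, and only_p1home_changed() simulates LdrLoadDll()’s 8-byte write to confirm that nothing past P1Home is touched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Leading fields of the x64 CONTEXT (winnt.h order); the rest is omitted. */
typedef struct {
    uint64_t P1Home;        /* 0x00: NormalContext  */
    uint64_t P2Home;        /* 0x08: sysarg1        */
    uint64_t P3Home;        /* 0x10: sysarg2        */
    uint64_t P4Home;        /* 0x18: NormalRoutine  */
    uint64_t P5Home;        /* 0x20 */
    uint64_t P6Home;        /* 0x28 */
    uint32_t ContextFlags;  /* 0x30: first field after the home slots */
    uint32_t MxCsr;         /* 0x34 */
} context_head;

/* Writes an 8-byte handle at offset 0 (as LdrLoadDll() does through
   ModuleHandle) and returns 1 if every byte after P1Home is unchanged. */
int only_p1home_changed(context_head *ctx, uint64_t handle)
{
    context_head before = *ctx;
    memcpy(ctx, &handle, sizeof handle);
    return memcmp((const unsigned char *)&before + sizeof handle,
                  (const unsigned char *)ctx + sizeof handle,
                  sizeof before - sizeof handle) == 0;
}
```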

Since members P1Home to P4Home were never used to hold any CPU-related data, they will not be used by NtContinue() to restore the context. Knowing that, we can assume there is no harm in overwriting P1Home from the APC routine. We can now recreate our injector (shown in Appendix C) to inject a native module into any WoW64 process by queueing an APC that directly calls LdrLoadDll(), without causing the notorious CFG violation error.

Conclusion

This brings to an end the second part of the series. In these first two posts, we demonstrated the ability to inject 64-bit DLLs into WoW64 processes using several different methods. Obviously, more methods exist for doing so, but finding them is left as an exercise to the interested reader.

Up next: adapting an x64 hooking engine to support hooking the native NTDLL.

Appendixes

 

Appendix B – nativising a WoW64 process

 

Appendix C – thunkless APC injection