Racing round and round: The little bug that could

Author

Head of X-Force Offensive Research (XOR)

IBM

The little bug that could: CVE-2024-30089 is a subtle kernel vulnerability I used to exploit a fully updated Windows 11 machine (with all Virtualization Based Security and hardware security mitigations enabled) and scored my first win at Pwn2Own this year.

In this article, I outline my straightforward approach to bug hunting: picking a starting point and intuitively following a path until something catches my attention. This bug is interesting because it can be reliably triggered due to a logic error. The error occurs in a specific state within an inter-process communication system, which then causes a use-after-free. Finding the bug required comparing the program’s code paths across its various possible states, a process I describe in detail. Equally intriguing is the bug’s origin and Microsoft’s approach to patching it. These topics are also covered in this post.

Hunting for 0-days: Where to start?

A common question I receive about vulnerability research is how to get started. In fact, picking a target and sticking to it might be one of the most difficult steps of the research process. The vulnerability discussed here is in the Microsoft Kernel Streaming Service (mskssrv.sys). Check out this blog post to get a general overview of the subsystem. In that post, I pointed out some characteristics of the MSKSSRV subsystem that might make it a good attack surface, specifically its inter-process communications (IPC) mechanism.

The code base of MSKSSRV is pretty small, and the last vulnerability in this subsystem I discovered was also independently exploited in the wild as a 0-day. I also heard about additional efforts from other researchers and companies to audit this driver. Because of this, I initially fell into the common trap of assuming there are no more bugs left to find in this attack surface. But, because I had suggested it in my previous blog post, I chose to trust my instincts and continue looking.

Lock lock: Who’s there?

A great way to get new research ideas is by staying informed on current research. I read an excellent blog post by k0shl that sparked the inspiration to hunt for a particular type of bug. In the vulnerability found by k0shl, an object’s reference count is initialized and incremented without proper locking, creating a use after free window. Despite k0shl’s bug being a userland bug and not in the kernel, the coding style of the vulnerable library reminded me of when I previously audited the MSKSSRV driver.

The MS KS Server (MSKSSRV) interacts with a userland process via FSStreamReg and FSContextReg objects. FSStreamReg and FSContextReg are both derived from the base FSRegObject class. I noticed that FSContextReg does not implement a locking vtable function, and the base class (FSRegObject) implementation is simply a nop instruction. This means no locking mechanism is actually implemented for FSContextReg objects. Conversely, FSStreamReg implements a locking function that utilizes a mutex. The locking mechanism is used when accessing the objects for cleanup, in the function FSRendezvousServer::Close:

Code Block 1: FSRendezvousServer::Close, locking and unlocking path for FSRegObjects

There are two objects derived from the FSRegObject base class, but one object type implements a proper locking mechanism and the other doesn’t. This was suspicious to me. To test my theory, I tried to trigger undefined behavior using references to the unprotected FSContextReg object. However, due to the locking protections on the global FSRendezvousServer object, which holds the pointers to the lists of FSRegObjects, I couldn’t manage to trigger anything interesting, despite the lack of lock protection on FSContextReg objects. Still, I could sense there was something “fishy” about the reference counting system for FSRegObjects. I just didn’t know what yet.

IPC in MS KS server

As I mentioned in my last blog post, the inter-process object sharing aspect of MSKSSRV is an interesting avenue for vulnerabilities, so I decided to focus on it further. The IPC mechanism of the subsystem is illustrated in the following diagram:

Diagram 1: Inter-process Communication in MS KS Server

Opening a file handle to the MSKSSRV device, via CreateFile, creates a FILE_OBJECT that corresponds to that handle. Using that handle, a process can initialize a new stream or a context object by sending the device an IOCTL using the DeviceIoControl function. The initializing process designates which remote process can register the object by specifying the process ID via the lpInBuffer argument. The remote process, using a new file handle to the MSKSSRV device, can now register the object via a device IOCTL. The pointer for the FSRegObject is stored in Irp->CurrentStackLocation->FileObject->FsContext2.

The same pointer to the FSRegObject object is stored twice in FsContext2, once in the FILE_OBJECT used by the initializing process and once in the FILE_OBJECT used by the registering process. In this way, references to the FSRegObject object can be shared across processes. For example, using an FSStreamReg object, multiple processes can have access to stream frame buffer, as shown in Diagram 1. An FSRegObject’s reference count is initialized to 1 and then incremented again after initialization is complete. Registering the FSRegObject increases its reference count again, for a total of three references per object.

Diagram 2: Initializing and Registering FSContextReg Objects
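The reference counting described above can be sketched as a toy model. This is a minimal illustration of the counts stated in the text (1 at creation, +1 once initialization completes, +1 on registration), not the driver's code; the class and function names are hypothetical.

```python
class RegObject:
    """Toy stand-in for an FSRegObject; only the reference count is modelled."""

    def __init__(self):
        self.refcount = 1          # the count starts at 1 when the object is created
        self.freed = False

    def add_ref(self):
        assert not self.freed, "reference taken on a freed object"
        self.refcount += 1

    def release(self):
        assert not self.freed, "release on a freed object"
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True      # the real driver would destruct and free here


def initialize_and_register():
    obj = RegObject()              # initialization IOCTL creates the object
    obj.add_ref()                  # incremented again once initialization completes
    obj.add_ref()                  # registration IOCTL from the remote process
    return obj


obj = initialize_and_register()
print("references after initialize + register:", obj.refcount)
```

Under this model, exactly three releases free the object, so any fourth release would necessarily touch freed memory.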

Finding a bug

I decided to take another look at the code I had previously audited for locking vulnerabilities. I noticed that the function FSRendezvousClose, which calls FSRendezvousServer::Close, is called within the driver’s dispatch cleanup and close routines:

Code Block 2: DispatchCleanup and DispatchClose routines for MS KS Server Driver

A dispatch routine handles one or more types of IRPs, which are packaged I/O requests. In Windows, when all handle references to a file have been closed, the corresponding file system driver for the file receives an IRP_MJ_CLEANUP and an IRP_MJ_CLOSE request, which are handled by the driver’s DispatchCleanup and DispatchClose function routines.

In MSKSSRV, FSRendezvousClose is called within both the driver’s DispatchCleanup and DispatchClose function routines, if the pointer stored in Irp->CurrentStackLocation->FileObject->FsContext2 is not NULL. Something that stood out to me in FSRendezvousServer::Close, which I had analyzed before, was the set of checks on the caller’s process ID. Process context matters because all kernel mode code operates within a single kernel address space, which is separate from user-mode address spaces. Each process has its own user-mode memory context, and the process context in which a kernel thread executes determines which process’ user-mode address space will be accessed if the thread accesses user addresses. The following code checks if the calling process is the initializing or registering process:

Code Block 3: FSRendezvousServer::Close, Process checks on FSRegObjects

Process-specific information is stored in the FSRegObject referenced by Irp->CurrentStackLocation->FileObject->FsContext2 at the time of initialization or registration. The driver determines which process-specific resources (EPROCESS objects, event objects, and so on) it needs to release by checking the caller’s process ID. If the process is the initializing or registering process, some additional cleanup is done for those process-specific resources.

This stood out to me because generally, all Dispatch routines execute in an arbitrary process context, with some exceptions. In other words, the system picks a thread to do the Dispatch work; what thread it picks is arbitrary. I discovered that DispatchCleanup is called in the process context of the process that closed the final handle. So, in this case, the process ID checks make sense. However, DispatchClose is called from an arbitrary process context. That means that if FSRendezvousServer::Close is called as a result of IRP_MJ_CLOSE request, it would be from an arbitrary process context, defeating the purpose of the process ID checks. This was a big clue that something was wrong here.

Additionally, as a feature of the Windows OS, handles can be shared with other processes (by child process inheritance or using the DuplicateHandle API function). Via the shared file handle, the other process can also interact with the same FILE_OBJECT.

Diagram 3: Sharing MS KS Server Device Handle

Due to this, it is also possible that the process context during DispatchCleanup is neither the initialization process nor the registration process. If either of those handles is duplicated and shared with another process, and that process is the last to close the handle, DispatchCleanup will be called within the context of that foreign process (which is not the initializing or registering process).

Diagram 4: Foreign Process Closing Last Handle to a FILE_OBJECT
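The handle-sharing scenario above can be sketched with a toy model. It assumes only what the text states: cleanup runs in the context of whichever process closes the final handle to a FILE_OBJECT. The class, the helper function and the process IDs are hypothetical.

```python
INIT_PID, FOREIGN_PID = 100, 300   # hypothetical process IDs


class FileObject:
    """Toy FILE_OBJECT that only tracks which processes hold a handle to it."""

    def __init__(self, opener_pid):
        self.handle_owners = [opener_pid]
        self.cleanup_pid = None    # context in which DispatchCleanup would run

    def duplicate_handle(self, target_pid):
        # Stands in for DuplicateHandle or child-process inheritance.
        self.handle_owners.append(target_pid)

    def close_handle(self, pid):
        self.handle_owners.remove(pid)
        if not self.handle_owners:
            # The last handle is gone: cleanup runs in the closer's context.
            self.cleanup_pid = pid


def cleanup_context(close_order):
    f = FileObject(INIT_PID)
    f.duplicate_handle(FOREIGN_PID)
    for pid in close_order:
        f.close_handle(pid)
    return f.cleanup_pid


print(cleanup_context([INIT_PID, FOREIGN_PID]))   # foreign process closes last
print(cleanup_context([FOREIGN_PID, INIT_PID]))   # initializing process closes last
```

When the foreign process is the last closer, the process ID checks in FSRendezvousServer::Close match neither the initializing nor the registering process.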

I noticed that the function FSRendezvousServer::Close could be called twice when a file handle is closed (once for the IRP_MJ_CLEANUP request and once for the IRP_MJ_CLOSE request, Code Block 2). Within that function there are two possible calls to FSRegObject::Release (Code Block 4), which dereferences the object and frees its memory if the reference count drops to 0. That means FSRegObject::Release could be called up to four times for a handle. Additionally, there are two file handles for which the FILE_OBJECT points to the same FSRegObject via the FsContext2 pointer. That means a possibility of calling FSRegObject::Release up to eight times on the same object, four for each handle. If the original reference count of the object is only three, there could be a possible use after free triggered by too many dereferences. That was my line of thinking, anyway. I knew the program structure would probably prevent hitting the theoretical maximum number of dereferences, but maybe not quite enough. At this point I felt I was on to something and decided to investigate this further.

Code Block 4: FSRegObject::Release can be called twice from FSRendezvousServer::Close

In general, more dereferences than there are references on an object is not the only way a use after free can occur. However, in this case, we can be sure that if an FSRegObject has been freed, its reference count has dropped to zero. The last time a valid FSRegObject is accessed during the IRP_MJ_CLEANUP/CLOSE IRP requests is in a call to FSRegObject::Release. So, if a use-after-free is possible, a call to FSRegObject::Release will always occur after the object has already been freed. During the call, the object will be once again dereferenced. For that reason, counting the number of dereferences is a good heuristic to find use-after-frees for this particular case.

The only thing left to do was to trace out the possible states of the program, taking note of when the object is freed and accessed. I did this by mentally emulating the program logic during IRP_MJ_CLEANUP/CLOSE requests, each beginning with the corresponding Dispatch functions (Code Block 2), for each of the possible cases.

Shown below are the states based on which process closes the final reference to a handle. Each entry represents the number of dereferences of the FSContextReg object that occur if the corresponding process closes the final handle. Note: there is no functional difference between HANDLE #1 (initializing handle) and HANDLE #2 (registering handle), as the FileObject->FsContext2 field points to the same memory in both FILE_OBJECTs represented by the corresponding handles.

FSContextReg dereferences for each of the possible MSKSSRV IPC states

Success! The last state results in four dereferences: two by the foreign process and two by the initializing (and also registering) process, while only having three references initially, meaning a use after free is possible! I also repeated the same exercise with FSStreamReg objects, but due to a memory leak bug in the code, it’s actually not possible to ever free an FSStreamReg object after it’s been registered.

The vulnerability

While doing the virtual machine brain exercise outlined above, I found the problem. If the process is the initializing process or registering process, the appropriate cleanup happens, and the pointer stored in Irp->CurrentStackLocation->FileObject->FsContext2 is set to NULL. The code block below shows where this occurs in the case the caller is the initializing process:

Code Block 5: FSRegObject::CloseInitProcess sets FsContext2 pointer to NULL

This means FSRendezvousClose will not be called again during the completion of the IRP_MJ_CLOSE request in DispatchClose (SrvDispatchClose, Code Block 2).

However, if the calling process is a foreign process, no cleanup occurs and FSRegObject::Release is called once near the end of the function. Since FileObject->FsContext2 is not NULL, FSRendezvousServer::Close is called again during the subsequent IRP_MJ_CLOSE request and another call to FSRegObject::Release occurs.

Now, if the second handle is closed by a process that both initialized and registered the FSContextReg object, the object will clean up all its stored process resources, making it empty. This causes FSRegObject::Release to be called twice within FSRendezvousServer::Close (Code Block 4). The extra dereference serves to account for the extra initializing reference once the object is empty.

This makes for a total of four dereferences on a single FSContextReg object, indicating that a use after free occurs.

Diagram 5: CVE-2024-30089 depicted

The reader following along might wonder why a foreign process can’t simply be the last one to close both handles, since this would seemingly also lead to four dereferences. Before the object is destructed and freed, it is unlinked from a list stored in the global FSRendezvousServer object. At the beginning of FSRendezvousServer::Close, the pointer in FsContext2 is checked to be a valid member of the list. In this case, the object is freed in the second call to FSRegObject::Release at the end of the function. During the fourth call to FSRendezvousServer::Close, the object has already been unlinked from the list, making it an invalid object, so it cannot be used. In order to trigger a use-after-free, it must occur after the object has been retrieved and validated in FSRendezvousServer::Close. The code snippet below shows the use-after-free primitive that can be obtained by the vulnerability:

Code Block 6: UAF primitive path
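Separately from the decompiled path in Code Block 6, the reasoning in this section can be replayed with a toy model. The sketch below is one reading of the prose, not the driver's logic: it assumes a single process both initializes and registers the object, that the "empty" check fires only in the call that performs cleanup, and that IRP_MJ_CLOSE can be modelled with the same caller as IRP_MJ_CLEANUP. All class names, function names and process IDs are hypothetical.

```python
INIT_PID, FOREIGN_PID = 100, 300   # hypothetical process IDs


class ContextReg:
    def __init__(self, init_pid, reg_pid):
        self.refcount = 3                     # 1 + 1 at initialization, + 1 at registration
        self.init_pid, self.reg_pid = init_pid, reg_pid
        self.freed = False
        self.releases = 0                     # every Release call, valid or not
        self.uaf = False

    def release(self):
        self.releases += 1
        if self.freed:
            self.uaf = True                   # Release reached an already freed object
            return
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


class Server:
    def __init__(self):
        self.objects = []                     # global list of valid objects

    def _release(self, obj):
        was_freed = obj.freed
        obj.release()
        if obj.freed and not was_freed:
            self.objects.remove(obj)          # unlinked from the list when freed

    def close(self, file, caller_pid):
        obj = file["FsContext2"]
        if obj not in self.objects:           # validation at the start of Close
            return
        cleaned = False
        if caller_pid == obj.init_pid:        # initializing-process cleanup
            obj.init_pid, cleaned = None, True
            file["FsContext2"] = None
        if caller_pid == obj.reg_pid:         # registering-process cleanup
            obj.reg_pid, cleaned = None, True
            file["FsContext2"] = None
        if cleaned and obj.init_pid is None and obj.reg_pid is None:
            self._release(obj)                # extra release once the object is empty
        self._release(obj)                    # release at the end of the function


def close_last_handle(server, file, caller_pid):
    # IRP_MJ_CLEANUP, then IRP_MJ_CLOSE; Close runs only while FsContext2 is set.
    for _irp in ("IRP_MJ_CLEANUP", "IRP_MJ_CLOSE"):
        if file["FsContext2"] is not None:
            server.close(file, caller_pid)


def run(first_closer, second_closer):
    server = Server()
    obj = ContextReg(INIT_PID, INIT_PID)
    server.objects.append(obj)
    file1, file2 = {"FsContext2": obj}, {"FsContext2": obj}
    close_last_handle(server, file1, first_closer)
    close_last_handle(server, file2, second_closer)
    return obj.releases, obj.uaf


print(run(FOREIGN_PID, INIT_PID))      # the vulnerable sequence from the text
print(run(FOREIGN_PID, FOREIGN_PID))   # foreign process closes both handles
```

In this model the first sequence reaches four Release calls, the last one on a freed object, while the foreign-closes-both sequence stops at three because the fourth call to Close fails the list validation.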

Attack complexity

In the security update guide for this vulnerability, the exploitability assessment is “Exploitation More Likely” and the CVSS attack complexity is “Low”. While Microsoft does not provide detailed explanations for its scoring, I have noted some patterns while patch diffing other vulnerabilities. The vulnerability likely received this score because it stems from a logic error, making it reliably triggerable. By following the steps outlined in Diagram 5, an attacker can consistently trigger the use-after-free scenario depicted in Code Block 6. However, this doesn’t mean that exploiting it in practice is straightforward. A detailed walkthrough of the exploitation steps will be covered in the next part of this series.

A retrospective

Understanding how a bug occurred is important for cultivating a proactive approach to secure development practices. To pinpoint how the vulnerability was introduced, I analyzed previous versions of the driver obtained from Winbindex and looked for any differences in logic in the FSRendezvousServer::Close function.

In the vulnerability section, I mentioned that the ultimate cause of this bug was not setting Irp->CurrentStackLocation->FileObject->FsContext2 to NULL if the calling process is a foreign process. To my surprise, I saw this exact line of code in an early version of mskssrv.sys:

Code Block 7: Early version of FSRendezvousServer::Close, FsContext2 is set to NULL

In the code block shown above, FileObject->FsContext2 is explicitly set to NULL, regardless of the result of the preceding process ID checks. This prevents FSRendezvousServer::Close from being called again in the subsequent IRP_MJ_CLOSE request, so the extra dereference cannot occur. Weird. So why was this line of code taken out? Let’s take a look at a later version of the function, where the bug was first introduced:

Code Block 8: FsContext2 set to NULL within a feature flag check

Shown in the code block above is a check for the feature flag Feature_Servicing_TeamsUsingMediaFoundationCrashes. Feature flags are a component of Windows that toggle various functionality and experiments, though there is not much public information about them. In this previous blog post, we discuss how feature flags have been used for vulnerability patches. Feature flags are sometimes used to test out a functionality before it is officially adopted. In this case, if the Feature_Servicing_TeamsUsingMediaFoundationCrashes feature is enabled, FileObject->FsContext2 is not set to NULL, introducing the vulnerability. This feature was observed to be enabled by default on Windows 10 installations. In Windows 11, as shown in Code Block 1, this feature flag conditional is not present and the pointer is not set to NULL, making it vulnerable as well.
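The effect of the three variants on a foreign process can be sketched as follows. This is a toy model of the behavior described above, not the driver's code; the variant labels ("early", "flagged", "windows11") and both function names are hypothetical.

```python
def clears_pointer(variant, feature_enabled=False):
    """Whether cleanup sets FsContext2 to NULL for a foreign caller."""
    if variant == "early":
        return True                     # unconditional NULL assignment
    if variant == "flagged":
        return not feature_enabled      # NULL only when the feature is off
    if variant == "windows11":
        return False                    # the assignment is absent
    raise ValueError(variant)


def close_calls_for_foreign_process(clear_fscontext2):
    """Count Close invocations when a foreign process closes the last handle."""
    file = {"FsContext2": object()}     # placeholder for the FSRegObject pointer
    calls = 0
    for _irp in ("IRP_MJ_CLEANUP", "IRP_MJ_CLOSE"):
        if file["FsContext2"] is None:
            continue                    # the dispatch routine skips FSRendezvousClose
        calls += 1
        if clear_fscontext2:
            file["FsContext2"] = None
    return calls


for variant, enabled in [("early", False), ("flagged", False),
                         ("flagged", True), ("windows11", False)]:
    calls = close_calls_for_foreign_process(clears_pointer(variant, enabled))
    print(variant, "feature enabled" if enabled else "feature disabled", calls)
```

A second invocation for the same handle is what supplies the extra dereference, so only the variants that leave the pointer in place are affected.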

Due to the name of the feature, I looked into the functionality of Microsoft Teams, the video conferencing software. I confirmed that the application can use MSKSSRV functionality to share media streams across processes. It is possible that stream handle sharing was causing Teams to crash. An interesting topic for further research would be to examine how Teams shares MSKSSRV device handles across processes, and why performing proper pointer cleanup could cause the application to crash.

The patch

This part of the series is focused on the vulnerability itself, which includes its patch. I was particularly interested in examining the patch for this bug, since my proposed fix seemed to trigger crashes in Microsoft Teams. It’s important to mention that at this point I have yet to examine how a real application uses the MSKSSRV driver in practice. Not having this context introduces blind spots into the understanding of why a system is designed the way it is. A complete patch for this bug would require some base code restructuring and could reveal more details about how the IPC system is intended to function. I was also hoping to glean some insight into secure coding practices from Microsoft developers.

To my disappointment, the logic error that caused the vulnerability, which was patched in the June 2024 Security updates, was not addressed directly. Instead, an access token check was added before the vulnerable code paths. See the code below for the initialize context IOCTL, handled by the function FSRendezvousServer::InitializeContext:

Code Block 9: IOCTL function begins with a feature flag check and checks if calling process is a frame server

The function above begins by checking if a feature is enabled. This is likely the feature flag corresponding to the patch. If the feature is enabled, KsIsCurrentProcessFrameServer must return TRUE, otherwise NTSTATUS value STATUS_ACCESS_DENIED is returned.

Let’s take a look at KsIsCurrentProcessFrameServer:

Code Block 10: KsIsCurrentProcessFrameServer performing a SID check on the calling thread’s access token

This function checks the calling thread’s token against two specific security identifiers (SIDs). The SIDs correspond to a token in group NT SERVICE\FrameServer. If either of the SIDs is enabled in the calling thread’s access token, then the vulnerable function code can execute.
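The shape of the patch can be sketched as a gate in front of unchanged logic. This is a minimal illustration under stated assumptions: the two SID strings are placeholders (the actual values are not listed here), the token is modelled as a list of (SID, attributes) pairs, and the Python function names are hypothetical stand-ins for KsIsCurrentProcessFrameServer and FSRendezvousServer::InitializeContext.

```python
STATUS_SUCCESS = 0x00000000
STATUS_ACCESS_DENIED = 0xC0000022
SE_GROUP_ENABLED = 0x00000004          # group attribute bit: SID is enabled

# Placeholder values; the real SIDs are not given in this post.
FRAME_SERVER_SIDS = ("S-1-5-80-<placeholder-1>", "S-1-5-80-<placeholder-2>")


def is_current_process_frame_server(token_groups):
    """token_groups: iterable of (sid_string, attributes) pairs."""
    return any(
        sid in FRAME_SERVER_SIDS and bool(attrs & SE_GROUP_ENABLED)
        for sid, attrs in token_groups
    )


def initialize_context(token_groups, patch_feature_enabled=True):
    # The gate: with the patch feature on, only frame server tokens pass.
    if patch_feature_enabled and not is_current_process_frame_server(token_groups):
        return STATUS_ACCESS_DENIED
    # The original, unmodified initialization logic would run from here.
    return STATUS_SUCCESS


ordinary_user = [("S-1-5-32-545", SE_GROUP_ENABLED)]            # BUILTIN\Users
frame_server = [(FRAME_SERVER_SIDS[0], SE_GROUP_ENABLED)]
print(hex(initialize_context(ordinary_user)))
print(hex(initialize_context(frame_server)))
```

The gate narrows who can reach the code, but nothing behind it changes, so any caller whose token carries one of the SIDs still reaches the same reference counting logic.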

After seeing this, I suspected there likely was an Administrator to Kernel bug still present. Ultimately, the memory corruption problems were not addressed at all. I confirmed this by making a slight modification to my original exploit: An administrator user can start the FrameServer service, open a handle to the service and create the exploit process using the handle. I was able to obtain a full kernel R/W primitive on a fully patched system.

While Microsoft does not consider Administrator to Kernel to be a security boundary, similar bugs have been used by threat actors to gain a kernel R/W primitive and use it for EDR blinding and rootkit operations. If you’re interested in what kind of things can be done with this primitive, check out my BlackHat talk alongside FuzzySec.

Conclusion and next steps

This post focused on the vulnerability research part of my Pwn2Own endeavor, which consisted of finding a 0-day kernel vulnerability that can be exploited for privilege escalation. This post outlines the journey: getting inspired by other research, failing to find a bug, picking a new angle, finding something suspicious, and then finally pinpointing where the vulnerability lives. Now that a bug has been identified and there’s a use-after-free primitive, the rest should be straightforward, right? Microsoft seems to think so: they rated this bug “Exploitation More Likely” with attack complexity “Low”. Are they right? I’ll cover that, the exploitation strategy, and unveil the meaning of the series title, in the next part!

Acknowledgements

Andréa Piazza, for the amazing diagrams

Emma Kirkpatrick, for patiently explaining the Windows security model to me
