Page MenuHomeFreeBSD

kern: Introduce a low-cost, conditional execution mechanism
Needs ReviewPublic

Authored by mvlaic on Aug 20 2024, 1:25 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Oct 8, 1:57 PM
Unknown Object (File)
Thu, Oct 3, 11:19 PM
Unknown Object (File)
Thu, Oct 3, 9:29 PM
Unknown Object (File)
Tue, Oct 1, 7:11 PM
Unknown Object (File)
Tue, Oct 1, 3:41 AM
Unknown Object (File)
Sat, Sep 21, 1:51 PM
Unknown Object (File)
Sat, Sep 21, 10:40 AM
Unknown Object (File)
Fri, Sep 20, 12:10 PM

Details

Summary

This patch adds a low-cost, conditional execution mechanism based on hot-patching.
Its main goal is to provide a way of optimize infrequently used if-statements in performance-sensitive code.
Instead of repeatedly evaluating an if statement, a single branch
direction is "baked in" at compile time. This is achieved by an unconditional
jump or nop using the using 'asm goto' feature. When the branch direction needs
to be changed, the appropriate instruction is patched at runtime.

Previous code blocks of form:

bool flag = false;

if(flag) {
	action1();
}

if(!flag) {
	action2();
}

would be migrated to the new interface as:

DEFINE_ZCOND_FALSE(flag);

if(zcond_true(flag)) {
	action1();
}

if(zcond_false(flag)) {
	action2();
}

This revision also modifies SDT to use this mechanism, both as a proof of concept and example of use.
This revision introduces support for the amd64 architecture only.
Support for all other architectures will be added if the overall approach is deemed acceptable.

This work was sponsored by the GSoC '24 program.

Test Plan

Core functionality was tested by manually instantiating a couple of zconds and toggling them
through sysctls. SMP safety was tested using a script which attempts to toggle a single zcond
from multiple threads at a high number of iterations.
The SDT migration was tested by running DTrace snippets by hand. Additionaly the DTrace test suite
was used.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

From the perspective of a reviewer, this patch is doing at least two things: adding a new mechanism to modify kernel text, and adding a new interface to define patch points. They should really be considered separately.

Since this kernel code patching is central to this mechanism, special care was taken to mitigate any potential security risks.
Rather than explicitly turning certain hardware protections off (e.g. disable_wp()), this patch uses a separate
kernel page table to perform kernel code patching. This separate page table is initalized by copying parts of the kernel pmap during boot.

Why is this more secure than toggling WP?

sys/amd64/amd64/zcond_machdep.c
117

Why is it necessary to map the VA?

172

Why do you need a separate pmap just to map a single page?

sys/sys/zcond.h
65

These names don't match the definitions below.

I'd like to briefly weigh in on a couple of points.

Since this kernel code patching is central to this mechanism, special care was taken to mitigate any potential security risks.
Rather than explicitly turning certain hardware protections off (e.g. disable_wp()), this patch uses a separate
kernel page table to perform kernel code patching. This separate page table is initalized by copying parts of the kernel pmap during boot.

Why is this more secure than toggling WP?

Dropping WP will expose the whole kernel image, whereas this approach exposes only one 0-order page.
Avoiding WP toggling by remapping the underlying page elsewhere in the kernel's address space does not eliminate the possibility of that mapping being used from other cores.

The separate "patching" page table acts as an additional hardening step, and has two features that set it apart from the other solutions:

  1. tearing down the mapping does not require a TLB shootdown,
  2. the temporary "patch" mappings are never disclosed to other cores.

From the perspective of a reviewer, this patch is doing at least two things: adding a new mechanism to modify kernel text, and adding a new interface to define patch points. They should really be considered separately.

Thank you for pointing this out. I will separate the two and then open a separate review for the kernel text modification mechanism.

sys/amd64/amd64/zcond_machdep.c
117

The zcond_qenter function expects a valid page table entry at the zcond_patch_va address.

I'd like to briefly weigh in on a couple of points.

Since this kernel code patching is central to this mechanism, special care was taken to mitigate any potential security risks.
Rather than explicitly turning certain hardware protections off (e.g. disable_wp()), this patch uses a separate
kernel page table to perform kernel code patching. This separate page table is initalized by copying parts of the kernel pmap during boot.

Why is this more secure than toggling WP?

Dropping WP will expose the whole kernel image, whereas this approach exposes only one 0-order page.
Avoiding WP toggling by remapping the underlying page elsewhere in the kernel's address space does not eliminate the possibility of that mapping being used from other cores.

The separate "patching" page table acts as an additional hardening step, and has two features that set it apart from the other solutions:

  1. tearing down the mapping does not require a TLB shootdown,
  2. the temporary "patch" mappings are never disclosed to other cores.

In practice, hot-patching occurs in a context where the initiating core has interrupts disabled (by virtue of holding smp_ipi_mtx), and all other cores are spinning in the rendezvous handler. So, I'm not sure what the new approach is buying us. What is the threat model? Are we, for example, considering the case where an attacker already has some limited control and is trying to overwrite read-only data using hot-patching via ROP or something like that?

To be clear, I don't mean to be negative, I would just like to better understand the justification for the approach taken here.

In practice, hot-patching occurs in a context where the initiating core has interrupts disabled (by virtue of holding smp_ipi_mtx), and all other cores are spinning in the rendezvous handler. So, I'm not sure what the new approach is buying us. What is the threat model? Are we, for example, considering the case where an attacker already has some limited control and is trying to overwrite read-only data using hot-patching via ROP or something like that?

That is correct, this was designed with a more strict threat model in mind, one where an attacker gained some limited ability to write into kernel memory through another vulnerability. Keeping WP on under this model limits the potential impact of the attacker's writing abilities, so this leaves us with the problem of figuring out how to patch the kernel while keeping WP on, which is where the "patching page table" approach comes in.

In practice, hot-patching occurs in a context where the initiating core has interrupts disabled (by virtue of holding smp_ipi_mtx), and all other cores are spinning in the rendezvous handler. So, I'm not sure what the new approach is buying us. What is the threat model? Are we, for example, considering the case where an attacker already has some limited control and is trying to overwrite read-only data using hot-patching via ROP or something like that?

That is correct, this was designed with a more strict threat model in mind, one where an attacker gained some limited ability to write into kernel memory through another vulnerability. Keeping WP on under this model limits the potential impact of the attacker's writing abilities, so this leaves us with the problem of figuring out how to patch the kernel while keeping WP on, which is where the "patching page table" approach comes in.

Sorry, I still don't really follow. An attacker which can abuse ROP to call sdt_tracepoint_patch() can overwrite an arbitrary 5-byte memory location. Similarly, an attacker which can call zcond_before_patch() can map an arbitrary page of kernel memory with RW permissions, so can be subsequently written, and zcond_after_patch() will restore CR3. Then, what's preventing an attacker from abusing these functions? Is there a fundamental gap between the two approaches, from the perspective of an attacker who has some limited code execution capabilities?

Switching CR3 like this is a low-level trick that affects the entire kernel. To justify such a trick, there should be a more formal explanation of the security properties of this approach.

sys/amd64/amd64/zcond_machdep.c
172

I understand a bit better now, but I don't understand how this code works at all. On amd64, after boot, virtual_avail will be somewhere above KERNBASE (0xffffffff80000000), but kernel_vm_end is the end of the kernel map (somewhere between VM_MIN_KERNEL_ADDRESS and VM_MAX_KERNEL_ADDRESS, updated by pmap_growkernel()). In particular, virtual_avail > kernel_vm_end, so the subtraction here will underflow.

Even if your intention is to copy the kernel map, this code is making a lot of assumptions about the system, e.g., that the kernel stack is included in your map, that nothing will access the direct map while CR3 is switched, etc..

Sorry, I still don't really follow. An attacker which can abuse ROP to call sdt_tracepoint_patch() can overwrite an arbitrary 5-byte memory location. Similarly, an attacker which can call zcond_before_patch() can map an arbitrary page of kernel memory with RW permissions, so can be subsequently written, and zcond_after_patch() will restore CR3. Then, what's preventing an attacker from abusing these functions? Is there a fundamental gap between the two approaches, from the perspective of an attacker who has some limited code execution capabilities?

Right, ignoring the current iteration where an arbitrary page can get mapped, this patching approach is only intended for kernel text.
The routines should've enforced this by checking the patch VAs before re-mapping the underlying pages.
However, if such checks are in place, we've limited the attacker's range from an arbitrary 5-byte location to the range where .text resides.

sys/amd64/amd64/zcond_machdep.c
172

Thank you for pointing this out. We are trying to come up with a proper way to copy the kernel_pmap now.

In terms of assumptions, this code assumes that only a single memcpy() will be called between zcond_before_patch() and zcond_after_patch().

sys/sys/zcond.h
65

I'm sorry but could you please expand a bit? I feel like the two match.

sys/sys/zcond.h
65

This comment describes ZCOND_DEFINE_TRUE() etc.. The macros below are called DEFINE_ZCOND_TRUE() etc..

sys/sys/zcond.h
65

Okay that is clearly true :) I will edit the comment to match the code.

db_write_bytes doesn’t bother with any of this, instead disabling WP on amd64 and using the DMAP on arm64…

However, if such checks are in place, we've limited the attacker's range from an arbitrary 5-byte location to the range where .text resides.

This is equally true regardless of whether you do page table switching, disable WP or use the DMAP. Which is only true if you consider ROP/JOP attacks that use the function entry point, which isn’t the reality. At the end of the day they can jump into whatever part of the function they like to get useful gadgets, and you have no idea what that might be in the compiled code, but has a good chance of letting you fully control what you map, at which point it’s still a write primitive. So I don’t see the benefit of this approach, it’s just a lot of complexity to pretend there’s additional security. Not to mention all the other write-anywhere-in-the-DMAP gadgets that surely already exist in the kernel. If you can ROP/JOP attack the kernel you’ve already won.

db_write_bytes doesn’t bother with any of this, instead disabling WP on amd64 and using the DMAP on arm64…

While it's true that one can use the DMAP on arm64, that's hardly a desirable state of affairs. Assuming that the portion of the DMAP covered by the kernel text will be readonly, on arm64 and other platforms we'd instead need to temporarily relax protections on the 4KB kernel map page containing the target address, modify the text, and re-apply RO protections. If the first step involves demoting 2MB mappings, the last step should be careful to recreate them; I think this last piece is missing (see pmap_change_prot() on amd64 and arm64), but the use of large mappings for kernel text is important for performance. Perhaps that might be a more fruitful direction to consider.

db_write_bytes doesn’t bother with any of this, instead disabling WP on amd64 and using the DMAP on arm64…

While it's true that one can use the DMAP on arm64, that's hardly a desirable state of affairs. Assuming that the portion of the DMAP covered by the kernel text will be readonly, on arm64 and other platforms we'd instead need to temporarily relax protections on the 4KB kernel map page containing the target address, modify the text, and re-apply RO protections. If the first step involves demoting 2MB mappings, the last step should be careful to recreate them; I think this last piece is missing (see pmap_change_prot() on amd64 and arm64), but the use of large mappings for kernel text is important for performance. Perhaps that might be a more fruitful direction to consider.

In light of the whole discussion @bnovkov and I agreed to abandon the separate pmap approach. I will refactor the code to use disable_wp() and then update this revision.
Once we pass this review I will focus on other architectures and see if anything can be improved regarding pmap_change_prot() on arm64.

I would like to thank everyone for their input and time.

mvlaic edited the summary of this revision. (Show Details)

The mechanism is now refactored to use disable_wp() instead of the separate pmap scheme.