Page MenuHomeFreeBSD

Make kernel allocations be non-executable on some platforms
ClosedPublic

Authored by jtl on Jun 7 2018, 4:04 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Nov 26, 12:32 PM
Unknown Object (File)
Mon, Nov 18, 10:04 PM
Unknown Object (File)
Wed, Nov 13, 7:08 PM
Unknown Object (File)
Tue, Nov 12, 9:50 AM
Unknown Object (File)
Wed, Nov 6, 2:15 PM
Unknown Object (File)
Oct 29 2024, 3:10 AM
Unknown Object (File)
Oct 27 2024, 8:12 AM
Unknown Object (File)
Sep 24 2024, 10:19 PM

Details

Summary

Most kernel memory that is allocated after boot does not need to be executable. Make UMA and kernel malloc() return non-executable memory.

(Note that a side effect of r316767 caused this behavior to be true for allocations on amd64 that followed the "small allocation" path in UMA.)

Allocations that do need executable memory have various choices. They can map the memory themselves. Or, they can call kmem_malloc() with the new M_EXEC flag.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 17188
Build 17038: arc lint + arc unit

Event Timeline

Incorporate review feedback from @jhb.

kib added inline comments.
sys/vm/vm_kern.c
188

(flags & M_EXEC) != 0

This revision is now accepted and ready to land.Jun 7 2018, 5:08 PM
markj added inline comments.
share/man/man9/malloc.9
290

This should note that some platforms may not be able to support the request and will return executable memory anyway.

Overall, I think that this is a good idea, but the implementation has the following problem. The allocation of one executable page will block the promotion of the surrounding pages to a superpage mapping. In short, the executable mappings should be segregated from the normal mappings. One possible approach is to create a separate arena for allocation of kernel virtual addresses that are executable And, that arena imports from the current arena at a superpage granularity.

Given how rarely executable memory is probably used (just bpf JIT currently), I'm not sure how big of an impact the fine-grained PG_NX permissions will be?

share/man/man9/malloc.9
290

I had explicitly asked Jonathan to alter the wording to be stronger as I don't want developers depending on the accidental X permission and wanted to explicitly mention M_EXEC somewhere. Perhaps it could be adjusted a bit further to say something like:

"The memory that these allocation calls return be treated as non-executable.
To allocate executable memory, use
....
flag.
Note that some platforms do not support non-executable memory and will return executable memory anyway."

FWIW, we don't have a caveat / IMPLEMENTATION NOTE about that in the mmap(2) manpage (but we probably should?)

In D15691#331801, @alc wrote:

Overall, I think that this is a good idea, but the implementation has the following problem. The allocation of one executable page will block the promotion of the surrounding pages to a superpage mapping.

This is easy enough to do. I wonder, however, if there is a concern about allocating a superpage of kernel address space for this, particularly on 32-bit systems. For example, if we used a different arena, a single use of M_EXEC memory on i386 would need to allocate 21 bits/2MB (PAE) or 22 bits/4MB (non-PAE) of kernel address space to that arena. I wonder if that would put unnecessary pressure on the available kernel address space?

Given the current use of these (just for the BPF just-in-time compiler), it may be that the performance penalty of non-superpage allocations is acceptable in this circumstance?

In D15691#332057, @jtl wrote:
In D15691#331801, @alc wrote:

Overall, I think that this is a good idea, but the implementation has the following problem. The allocation of one executable page will block the promotion of the surrounding pages to a superpage mapping.

This is easy enough to do. I wonder, however, if there is a concern about allocating a superpage of kernel address space for this, particularly on 32-bit systems. For example, if we used a different arena, a single use of M_EXEC memory on i386 would need to allocate 21 bits/2MB (PAE) or 22 bits/4MB (non-PAE) of kernel address space to that arena. I wonder if that would put unnecessary pressure on the available kernel address space?

Given the current use of these (just for the BPF just-in-time compiler), it may be that the performance penalty of non-superpage allocations is acceptable in this circumstance?

I understand the generic nature of your concern, for 32bit architectures. i386 in fact does not have a problem with KVA which would make it prohibitively costly to allocate 4M just for executable allocations. And I still did not merged PAE/non-PAE.

In D15691#331991, @jhb wrote:

Given how rarely executable memory is probably used (just bpf JIT currently), I'm not sure how big of an impact the fine-grained PG_NX permissions will be?

Here is my other concern with this change, and this affects all of the kmem arena. Since we are using vmem and not vm map entries to track what is allocated versus free, we have no record of which 4KB pages are executable and which ones are not at the machine-independent layer. This could have consequences.

Suppose that I attempt to optimize kmem_back_domain() by reducing the number of pmap_enter() calls that it performs by using pmap_enter(..., psind=1). Specifically, I remove the pmap_enter() call from the current allocation loop, and I introduce a second loop that first checks to see if the first allocated page (and any later page at the start of a superpage-sized region) comes from a full reservation. If it does, I do a single pmap_enter(..., psind=1) call for the entire superpage-sized region. Moreover, for kmem_back_domain(), there is no good reason why we are performing the pmap_enter() calls while the kernel object is locked.

However, if kmem_back_domain() can't be certain that there are no executable pages within the superpage-sized region, it can't use pmap_enter(..., psind=1). It has to always perform 4KB-sized pmap_enter() calls and rely on automatic promotion to do the right thing.

In D15691#332057, @jtl wrote:
In D15691#331801, @alc wrote:

Overall, I think that this is a good idea, but the implementation has the following problem. The allocation of one executable page will block the promotion of the surrounding pages to a superpage mapping.

This is easy enough to do. I wonder, however, if there is a concern about allocating a superpage of kernel address space for this, particularly on 32-bit systems. For example, if we used a different arena, a single use of M_EXEC memory on i386 would need to allocate 21 bits/2MB (PAE) or 22 bits/4MB (non-PAE) of kernel address space to that arena. I wonder if that would put unnecessary pressure on the available kernel address space?

There are other possible solutions besides creating an arena for the executable memory allocations. See for example kern/link_elf_obj.c. If we believe that the need for executable memory is so rare that it's not worth an arena, then I would argue that we should simply remove executable permissions from the pmap_enter() calls in vm_kern.c without introducing the M_EXEC flag and rewrite the memory allocation code in BPF to use other VM interfaces besides kmem_malloc().

Given the current use of these (just for the BPF just-in-time compiler), it may be that the performance penalty of non-superpage allocations is acceptable in this circumstance?

To be clear, I'm not really concerned that BPF doesn't benefit from superpage mappings, but rather that it does collateral damage to other nearby data that would have otherwise been accessed through superpage mappings.

Numerous updates:

  • Address @alc's concerns by adding a new arena for allocations with non-standard permissions. Import a 2MB-aligned address block at a time into the arena. Release space back to the parent arena when able. But, only do this for architectures with superpages.
  • Plumb M_EXEC through malloc(9).
  • Fix BPF by reverting most of rS317072.
  • Add a note to the manpage that not all architectures will enforce execution permissions.

I confirmed that this works correctly with BPF's JIT compiler. I also confirmed that the space from the new arena is returned to the parent arena once the allocations are freed.

This revision now requires review to proceed.Jun 11 2018, 11:38 PM

Get the correct version of sys/vm/vm_kern.c (with the troubleshooting stuff removed).

Okay, really without the cruft this time. (Hopefully...)

jtl marked 3 inline comments as done.Jun 12 2018, 12:02 AM
share/man/man9/zone.9
382

.Dv M_EXEC

This revision is now accepted and ready to land.Jun 12 2018, 12:11 PM

Update the zone(9) manpage.

This revision now requires review to proceed.Jun 12 2018, 12:42 PM
jtl marked an inline comment as done.Jun 12 2018, 12:43 PM
This revision is now accepted and ready to land.Jun 13 2018, 5:20 AM
markj added inline comments.
sys/vm/vm_pagequeue.h
90

Shouldn't this be #error?

Note that we also require opt_vm.h, in case VM_NRESERVLEVEL was overridden in the config. I would be somewhat inclined to just always define vmd_kernel_rwx_arena, and have its value be NULL if NRESERVLEVEL is 0.

sys/vm/vm_pagequeue.h
90

(I think you could actually avoid some ifdefs by aliasing vmd_kernel_rwx_arena to vmd_kernel_arena if NRESERVLEVEL == 0.)

sys/vm/vm_pagequeue.h
90

Shouldn't this be #error?

Yes, it should. Good catch.

Note that we also require opt_vm.h, in case VM_NRESERVLEVEL was overridden in the config. I would be somewhat inclined to just always define vmd_kernel_rwx_arena, and have its value be NULL if NRESERVLEVEL is 0.

Yes, I considered that, but didn't want to spend the extra bytes to store this variable if it wasn't necessary. However, it is much more important to have a consistent view of the structure than save a few bytes. So, I think I will make this change.

(I think you could actually avoid some ifdefs by aliasing vmd_kernel_rwx_arena to vmd_kernel_arena if NRESERVLEVEL == 0.)

Yes, I had that in an earlier iteration, but decided to abandon it because I didn't want to add the extra (and unnecessary) processing for machines without superpages. However, I am perfectly happy to rework this if there is a strong consensus to add the processing and avoid the ifdef soup.

sys/vm/vm_pagequeue.h
90

Yes, I had that in an earlier iteration, but decided to abandon it because I didn't want to add the extra (and unnecessary) processing for machines without superpages. However, I am perfectly happy to rework this if there is a strong consensus to add the processing and avoid the ifdef soup.

Fair enough. I don't feel strongly about this part either way, so please don't hold up the change over this detail.

Address @markj's feedback by always defining the vmd_kernel_rwx_arena member of the vm_domain struct.

This revision now requires review to proceed.Jun 13 2018, 4:47 PM
jtl marked an inline comment as done.Jun 13 2018, 4:47 PM
This revision is now accepted and ready to land.Jun 13 2018, 4:47 PM
This revision was automatically updated to reflect the committed changes.