Page MenuHomeFreeBSD

Extend mmap/mprotect API to specify the max page protections.
ClosedPublic

Authored by brooks on Jan 17 2019, 9:55 PM.
Tags
None
Referenced Files
F108149769: D18880.id52951.diff
Tue, Jan 21, 10:06 PM
Unknown Object (File)
Sat, Jan 18, 12:57 AM
Unknown Object (File)
Fri, Jan 17, 9:17 PM
Unknown Object (File)
Fri, Jan 17, 7:55 PM
Unknown Object (File)
Fri, Jan 17, 4:30 PM
Unknown Object (File)
Mon, Dec 23, 6:56 AM
Unknown Object (File)
Dec 4 2024, 11:49 PM
Unknown Object (File)
Dec 2 2024, 4:25 PM
Tokens
"Like" token, awarded by op.

Details

Summary

A new macro PROT_MAX() alters a protection value so it can be OR'd with a regular protection value to specify the maximum permissions. If present, these flags specify the maximum permissions.

While these flags are non-portable, they can be used in portable code with simple ifdefs to expand PROT_MAX() to 0.

This change allows (e.g.) a region that must be writable during run-time linking or JIT code generation to be made permanently read+execute after writes are complete. This complements W^X protections allowing more precise control by the programmer.

This change alters mprotect argument checking and returns an error when unhandled protection flags are set. This differs from POSIX (in that POSIX only specifies an error), but is the documented behavior on Linux and more closely matches historical mmap behavior.

In addition to explicit setting of the maximum permissions, an experimental sysctl vm.imply_prot_max causes mmap to assume that the initial permissions requested should be the maximum when the sysctl is set to 1. This behavior is known to break code that uses PROT_NONE reservations before mapping contents into part of the reservation. A final version this is expected to provide per-binary and per-process opt-in/out options and this sysctl will go away in its current form. As such it is undocumented.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 24900
Build 23630: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
  • Fix build and remove _PROT_ALL from namespace.
sys/compat/freebsd32/freebsd32_misc.c
474–475 ↗(On Diff #53089)

This is just a re-wrapping?

Could we add (just) a global sysctl for now, disabled by default, to commit this and allow further testing/experimentation?

  • Move the EXTRACT macros into the PROT_ namespace.
  • Make implying PROT_MAX values conditional on a sysctl.
  • Allow mprotect() to set maximum protections.

The latest diff simplifies the whole change to add a sysctl to enable implying PROT_MAX system wide. I've also added mprotect() support. The current code compiles, but is untested.

libexec/rtld-elf/map_object.c
233 ↗(On Diff #52951)

We're able to set breakpoints on mips64. I've not tested x86.

I believe that you should allocate a flag in the feature note and use it to opt-out (or opt-in ?) of max_prot.

sys/sys/mman.h
59

There should be spaces around '|'.

sys/vm/vm_mmap.c
208

I would write this as
`
max_prot = imply_prot_max ? prot : _PROT_ALL
`

634

I do not understand this.

Suppose you have a readonly shared mapping of the file, which was created using O_RDONLY file descriptor. Isn't this addition allows to add VM_PROT_WRITE to the corresponding vm_map_entry max_prot ? Then on the next step user can mprotect(2) the mapping to modify the file.

sys/vm/vm_mmap.c
380

Should we deny the request if (max_prot & cap_maxprot) != max_prot, so that user knows that the call was not really satisfied ?

brooks marked 2 inline comments as done.
  • Simplify setting initial max_prot.
  • style(9): spaces around '|'s.
brooks added inline comments.
sys/vm/vm_mmap.c
380

I'm leaning toward not. I think we'd only want to error when we'd explicitly passed a PROT_MAX(FOO) expression so we'd have to add state for that since we don't want to fail when max_prot defaults to _PROT_ALL or when it's implied. We'd also just add yet another inexplicable EINVAL for a rare case.

634

I think this does the right thing, but I had to read the implementation a couple times to convince myself. In vm_map_protect, when set_max is true, the new max_prot is checked against existing entries max_protection in the first loop over the region:

if ((new_prot & current->max_protection) != new_prot) {
  vm_map_unlock(map);
  return (KERN_PROTECTION_FAILURE);
}

This rejects attempts to upgrade max_prot.

I believe that you should allocate a flag in the feature note and use it to opt-out (or opt-in ?) of max_prot.

Yes, although I would start with just a global sysctl to start experimentation. If failures are relatively limited perhaps we can introduce only opt-out, otherwise we need both.

This revision is now accepted and ready to land.Jun 12 2019, 10:38 PM
  • Add some minimal docs for PROT_MAX.
This revision now requires review to proceed.Jun 12 2019, 11:33 PM
brooks retitled this revision from Extend mmap(2) API to specify the max page protections. to Extend mmap/mprotect API to specify the max page protections..Jun 12 2019, 11:52 PM
brooks edited the summary of this revision. (Show Details)
brooks edited the summary of this revision. (Show Details)
  • Fix a typo resulting in unbootable systems.
  • Don't imply max_prot when prot is PROT_NONE.

I've imported the update into my testing tree (after encountering the boot failure on the previous version) and it works fine.

sys/vm/vm_mmap.c
107

s/mincore_mapped/imply_prot_max/

Some straightforward ad-hoc testing looks good.

*** Check failed: /root/freebsd/tests/sys/vm/mmap_test.c:107: MAP_ANON with extra PROT flags succeeded
*** Check failed: /root/freebsd/tests/sys/vm/mmap_test.c:107: shm fd with garbage PROT succeeded

I'll start on updating the tests for this change, but as it is initially disabled by default (after correcting the copy-pasteo, at least) IMO it can go in now.

sys/vm/vm_mmap.c
197

This has the effect of masking invalid flags (e.g. PROT_READ | PROT_WRITE | 0x100000 or 0xffff in tests/sys/vm/mmap_test.c).

*** Check failed: /root/freebsd/tests/sys/vm/mmap_test.c:107: MAP_ANON with extra PROT flags succeeded
*** Check failed: /root/freebsd/tests/sys/vm/mmap_test.c:107: shm fd with garbage PROT succeeded

I'll start on updating the tests for this change, but as it is initially disabled by default (after correcting the copy-pasteo, at least) IMO it can go in now.

The current code ignores invalid flags which isn't right and we should fix that.

  • Fix cut-n-pasto
  • Add a check that no invalid prot flags have been passed.
brooks added inline comments.
sys/vm/vm_mmap.c
615

This check is interesting. Linux documents this behavior (EINVAL for invalid protections), but POSIX seems to endorse our silent stripping. I'm inclined to think Linux is more correct here.

sys/vm/vm_mmap.c
108

sysctl description strings usually don't contain a period.

615

We should probably document this error in the man page too.

637

You can avoid a goto by initializing vm_error to KERN_SUCCESS and skipping the final vm_map_protect() if vm_error != KERN_SUCCESS.

sys/vm/vm_mmap.c
199

Don't we also need to verify that prot is a subset of max_prot?

  • Document new error case in mprotect(2).
  • No period at the end of sysctl description.
  • Verify that prot isn't larger than max_prot.
  • Avoid the need for a goto in mmprotect().
  • Add a KASSERT that cap_maxprot is contained in max_prot.
brooks added inline comments.
sys/vm/vm_mmap.c
380

I've added a KASSERT of this because I'm convinced that with the (prot & max_prot) == prot check above that this will always be safe (cap_maxprot ends up set to something derived directly from prot).

lib/libc/sys/mprotect.2
100

Or the prot bits are not a subset of max_prot bits. mmap() should document that too.

sys/vm/vm_mmap.c
372

Shouldn't it be (max_prot & cap_maxprot) == max_prot? Otherwise userland can trigger this assertion by requesting a PROT_MAX(PROT_READ) | PROT_READ mapping using a descriptor with read and write rights.

This seems like a perfectly reasonable extension to mmap(2) and mprotect(2).

Are the "max protection" semantics what you would wish for or just convenient to implement? Do you have further extensions in mind, for example, don't allow deallocation of a region until address space termination?

lib/libc/sys/mmap.2
126

"values" instead of "value"

lib/libc/sys/mprotect.2
72

I think "which" would be better than "e.g.".

79

"values" instead of "value"

sys/vm/vm_mmap.c
617

While not strictly necessary for correctness, I would advocate changing this to

prot = PROT_EXTRACT(prot);

Testing in my wipbsd branch worked on Cirrus-CI for the basic QEMU smoke test but failed on packet.net with the full system:

Trying to mount root from ufs:/dev/ufsid/5bf12e8de29db717 [rw]...
WARNING: WITNESS option enabled, expect reduced performance.
panic: cap_maxprot (7) contains permissions not in max_prot (1)
cpuid = 1
time = 1560874019
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0051e2e960
vpanic() at vpanic+0x19d/frame 0xfffffe0051e2e9b0
panic() at panic+0x43/frame 0xfffffe0051e2ea10
kern_mmap() at kern_mmap+0x5b8/frame 0xfffffe0051e2eab0
sys_mmap() at sys_mmap+0x2a/frame 0xfffffe0051e2ead0
amd64_syscall() at amd64_syscall+0x288/frame 0xfffffe0051e2ebf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0051e2ebf0
--- syscall (477, FreeBSD ELF64, sys_mmap), rip = 0x801065e3a, rsp = 0x7fffffffd2d8, rbp = 0x7fffffffd3e0 ---
KDB: enter: panic
[ thread pid 25 tid 100077 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db>

With @markj's KASSERT change the panic at startup is fixed and now the existing mmap_test tests pass (with and without imply_prot_max).

root@test:/usr/tests # sysctl vm.imply_prot_max=0
vm.imply_prot_max: 1 -> 0
root@test:/usr/tests # kyua debug sys/vm/mmap_test:mmap__bad_arguments
sys/vm/mmap_test:mmap__bad_arguments  ->  passed
root@test:/usr/tests # sysctl vm.imply_prot_max=1
vm.imply_prot_max: 0 -> 1
root@test:/usr/tests # kyua debug sys/vm/mmap_test:mmap__bad_arguments
sys/vm/mmap_test:mmap__bad_arguments  ->  passed
  • Make 'values' plural here.
  • Grammer fixes from @alc.
  • Document EINVAL when (prot & max_prot) != prot.
  • Drop an assertion that is wrong when CAPABILITIES is no defined.
In D18880#446910, @alc wrote:

Are the "max protection" semantics what you would wish for or just convenient to implement? Do you have further extensions in mind, for example, don't allow deallocation of a region until address space termination?

The extensions aren't ideal. This came out of our work on CheriABI (http://cheri-cpu.org) where mmap returns fat-pointers (hardware capabilities) with their own permissions. I wanted to set those permissions based on prot, but rtld depends on the ability to reserve a region with PROT_NONE. PROT_MAX() was an easy way to extend that reservation request to say "reserve some address space mapped PROT_NONE, but given me an RWX capability". In this case I needed something I could shoehorn into mmap. (FWIW, in CheriBSD we run with the equivalent of imply_prot_max=1 with one-line patches to rtld and libvmm.)

In the mitigations call we decided that this change was useful by itself since it lets you do things like make malloc return memory that can never be executable.

One current flaw is that PROT_MAX(PROT_NONE) doesn't work. I've contemplated using an additional bit to indicate that PROT_MAX() is set, but have not done so yet.

On the CHERI front we have another issue. We have two more page permissions load-capability and store-capability. We don't currently expose them to user space because most pages need them (otherwise you can't load and store pointers) and we don't want to have to audit every mmap() call. I've thus far been unable to convince myself of a way to make PROT_MAX() work with negative permissions (e.g. PROT_NO_CAP_READ, PROT_NO_CAP_WRITE). Thinking out loud: we could do something like allocating bits for PROT_CAP_READ, PROT_CAP_WRITE, and PROT_CAP_NONE where we imply PROT_CAP_READ | PROT_CAP_WRITE when none of the bits are set.

sys/vm/vm_mmap.c
372

My belief was that this can't happen because of the way fget_mmap is implemented, but rereading, it looks like that's only true when CAPABILITIES is defined (in which case cap_maxprot == prot so this is safe because prot must be a subset of max_prot per the check above). In the non-CAPABILITIES case cap_maxprot is VM_PROT_ALL so this will fail. Dropping the assertion may be the best fix.

sys/vm/vm_mmap.c
372

Sorry, I don't see why that assertion worked even with CAPABILITIES defined. fget_mmap() returns the maximum protections permitted by capabilities on the descriptor.

brooks added inline comments.
sys/vm/vm_mmap.c
372

On further reading, you're correct.

Interesting, NetBSD has a related change from 2017: https://github.com/NetBSD/src/commit/6508f5143a1028fc68b4de2151c3a33f65eece53
They list "extra" perms that may be added later rather than the full list.

Example from their test case: map = mmap(NULL, page, PROT_MPROTECT(PROT_EXEC)|PROT_WRITE|PROT_READ, MAP_ANON, -1, 0);

As far as I can tell there's only one non-test use in the NetBSD base system userland:

external/bsd/llvm/dist/llvm/lib/Support/Unix/Memory.inc
110:#if defined(__NetBSD__) && defined(PROT_MPROTECT)
111:  Protect |= PROT_MPROTECT(PROT_READ | PROT_WRITE | PROT_EXEC);
This revision is now accepted and ready to land.Jun 19 2019, 8:34 PM

Interesting, NetBSD has a related change from 2017: https://github.com/NetBSD/src/commit/6508f5143a1028fc68b4de2151c3a33f65eece53
They list "extra" perms that may be added later rather than the full list.

Example from their test case: map = mmap(NULL, page, PROT_MPROTECT(PROT_EXEC)|PROT_WRITE|PROT_READ, MAP_ANON, -1, 0);

As far as I can tell there's only one non-test use in the NetBSD base system userland:

external/bsd/llvm/dist/llvm/lib/Support/Unix/Memory.inc
110:#if defined(__NetBSD__) && defined(PROT_MPROTECT)
111:  Protect |= PROT_MPROTECT(PROT_READ | PROT_WRITE | PROT_EXEC);

It wouldn't be too hard to implement that instead if it seems like the right thing to do. It would do what I need for CHERI. Unfortunately, it lacks a way to downgrade max_prot later in mprotect(). We don't have any examples of the latter, but it does seem like something that could be useful in a JIT that avoids flipping back and forth between W and X.

It wouldn't be too hard to implement that instead if it seems like the right thing to do. It would do what I need for CHERI. Unfortunately, it lacks a way to downgrade max_prot later in mprotect(). We don't have any examples of the latter, but it does seem like something that could be useful in a JIT that avoids flipping back and forth between W and X.

The latter case does seem like a useful one. Since there seems to be very few extant users of this I think it doesn't matter much and we should proceed with this patch. I sent email to Joerg and Maxime V. at NetBSD for input though.

One last thing before I commit this version. I'm trying to decide if I care about being able to detect PROT_MAX(PROT_NONE) and mostly leaning towards "no, use MAP_GUARD instead". Any alternative views?

One last thing before I commit this version. I'm trying to decide if I care about being able to detect PROT_MAX(PROT_NONE) and mostly leaning towards "no, use MAP_GUARD instead". Any alternative views?

IMO "use MAP_GUARD" is fine and if we really need to we can revisit after we have some experience with arbitrary software.

One last thing before I commit this version. I'm trying to decide if I care about being able to detect PROT_MAX(PROT_NONE) and mostly leaning towards "no, use MAP_GUARD instead". Any alternative views?

IMO "use MAP_GUARD" is fine and if we really need to we can revisit after we have some experience with arbitrary software.

I think it's also something we can easily add later, just swipe another bit from the top half.

I think it's also something we can easily add later, just swipe another bit from the top half.

Precisely.

From another NetBSD commit:

commit 0c6ae7843ba21d87ac1d2a533ae8989265238de2
Author: kre <kre@NetBSD.org>
Date:   Fri Sep 1 16:27:02 2017 +0000

    Use PROT_MPROTECT() (which would have been better had it been called
    PROT_MAXPROTECT or PROT_ALLOWPROTECT or something) on the mmap() call
    which specifies PROT_NONE, and which we later want to change to PROT_READ,
    otherwise when PAX is enabled, the mprotect() will fail.

I agree with the parenthetical comment - there seems to be very very few PROT_MPROTECT consumers today for us to aim for source compatibility; our PROT_MAX is fine.