Page MenuHomeFreeBSD

Add support for Intel userspace protection keys feature on Skylake Xeons.
ClosedPublic

Authored by kib on Jan 18 2019, 6:09 PM.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kib added a comment.Jan 18 2019, 6:09 PM

The patch is currently being tested by Peter Holm, more detailed description will come after the testing finishes. Also libc functions warrant manpages.

kib edited the summary of this revision. (Show Details)Jan 22 2019, 5:36 PM
markj added inline comments.Jan 22 2019, 6:09 PM
lib/libc/x86/sys/pkru.c
88 ↗(On Diff #52998)

Extra newline.

sys/amd64/amd64/pmap.c
4134 ↗(On Diff #52998)

This var is write-only.

4888 ↗(On Diff #52998)

Write-only.

5088 ↗(On Diff #52998)

The same condition may occur in pmap_enter_object().

Shouldn't this condition also block superpage promotion?

9076 ↗(On Diff #52998)

Is ppr leaked if rangeset_insert() fails?

9247 ↗(On Diff #52998)

Why not do the TLB invalidation in this function?

9258 ↗(On Diff #52998)

What if eva > VM_MAXUSER_ADDRESS?

9277 ↗(On Diff #52998)

If you split alloc/initialization and insertion of the pkru, you can use M_WAITOK for the allocation. I believe we generally try to avoid syscall failures due to transient ENOMEM conditions.

Hmm, I see that the rangeset operations may fail with ENOMEM anyway. This is a design flaw IMHO.

sys/amd64/amd64/trap.c
815 ↗(On Diff #52998)

Should we have a dedicated ucode for this case?

860 ↗(On Diff #52998)

What if a user application triggers a kernel mode protection key violation with copyin/out?

sys/x86/include/sysarch.h
115 ↗(On Diff #52998)

Extra newline.

kib marked 10 inline comments as done.Jan 23 2019, 10:09 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
4134 ↗(On Diff #52998)

It is used by PG_PTE_PROMOTE.

4888 ↗(On Diff #52998)

No, it is used by PG_PTE_PROMOTE definition.

5088 ↗(On Diff #52998)

Yes, I missed pmap_enter_object().

Promotion is handled by the updated definition of PG_PTE_PROMOTE.

9076 ↗(On Diff #52998)

Yes, should be fixed.

9258 ↗(On Diff #52998)

In fact nothing too bad would happen. But indeed the intent was to check the eva and not sva.

9277 ↗(On Diff #52998)

It is more complicated than just split of the new entry. Problem is that removal might need to split and then it also needs to allocate.

During the initial stages of the development, I tried to use per-pmap sx lock to protect the pkru rangeset, but it is too inconsistent and I decided that the only sane approach is to protect it with the same lock as the page tables, ie. pmap mutex.

I will think about providing pre-allocated elements for rangeset operations, but it even brief evaluation shows that rangeset interface will become extra-ugly. I do not like M_NOWAIT failures as well, sure.

sys/amd64/amd64/trap.c
815 ↗(On Diff #52998)

I do not have an opinion there. After all, it is page protection. Not adding MD code might be good.

860 ↗(On Diff #52998)

Such access must only occur with pcb_onfault set to non-NULL, and then the condition below would jump to the handler after return from the fault. If kernel is tricked to access usermode without pcb_onfault, then we panic as should.

kib updated this revision to Diff 53124.Jan 23 2019, 10:11 PM
kib marked 7 inline comments as done.

Handle all Mark' notes except M_NOWAIT.
Also the update contains fixes to the rangeset implementation made after Peter' testing, mostly to rangeset_remove(). Including the nice ASCII graphic.

kib updated this revision to Diff 53126.Jan 23 2019, 10:38 PM

Handle ENOMEM by wait and retry.

markj added a comment.Jan 25 2019, 4:00 PM

Do you think it would be useful to define a PKRU_SET flag that causes the operation to fail when the request range overlaps with an existing key range?

sys/amd64/amd64/pmap.c
2898 ↗(On Diff #53126)

Why is it NOFREE?

Extra braces.

2990 ↗(On Diff #53126)

From context it looks like we should be able to use M_WAITOK here.

Extra braces.

5120 ↗(On Diff #53126)

The comparison with VM_MAXUSER_ADDRESS is redundant.

9049 ↗(On Diff #53126)

ctx is __unused.

9062 ↗(On Diff #53126)

ctx is __unused.

9119 ↗(On Diff #53126)

Perhaps assert that eva < VM_MAXUSER_ADDRESS?

9258 ↗(On Diff #53126)

sva is modified during the loop, the original value needs to be saved.

5088 ↗(On Diff #52998)

Why not move the check into pmap_enter_pde()?

sys/amd64/amd64/trap.c
860 ↗(On Diff #52998)

Oops, right.

kib marked 10 inline comments as done.Jan 25 2019, 5:58 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
2898 ↗(On Diff #53126)

It does not need to, before ENOMEM restart I tried to minimize the chances of failing allocations.

What do you mean by extra braces? {} are optional by style, and I put them around multiline statements.

2990 ↗(On Diff #53126)

The flag is not used during rangeset_init(), it is saved and used during the rangeset operations like _remove() when we execute under pmap lock.

9119 ↗(On Diff #53126)

You mean, in addition to the sva check above ?

9258 ↗(On Diff #53126)

This was the reason why invalidate_range was done outside the function.

5088 ↗(On Diff #52998)

This is very good question, because pmap_enter_pde() might sleep waiting for new pdp page and then the pmap_pkru_same() check is invalidated. So the check must be moved to pmap_enter_pde() after pdp allocation, and pmap_enter_object() can avoid the call.

kib updated this revision to Diff 53208.Jan 25 2019, 6:00 PM
kib marked 5 inline comments as done.

Add AMD64_PKRU_EXCL.
Check pmap_pkru_same() after waiting for allocation of pdp page.
Other fixes pointed out by Mark.

kib updated this revision to Diff 53214.Jan 25 2019, 6:38 PM

Simplify rangeset_check_empty().

markj added inline comments.Jan 25 2019, 6:46 PM
sys/amd64/amd64/pmap.c
2898 ↗(On Diff #53126)

I would avoid using NOFREE for zones whose usage can be easily influenced by userspace. The fragmentation that they cause is quite problematic, and their use should be strongly discouraged IMO.

Hmm, ok. I thought we tried to strictly avoid redundant braces in VM/pmap code (even for multiline statements). I'll stop commenting on that.

9258 ↗(On Diff #53126)

In general it seems that pmap functions that operate on a range of VAs will handle invalidation themselves, so I suggested the change to be consistent.

I don't see why msva shouldn't just be called va.

5088 ↗(On Diff #52998)

Hrm. What if the user changes the keyid for the range during the sleep?

kib marked 3 inline comments as done.Jan 25 2019, 7:44 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
2898 ↗(On Diff #53126)

Yes, I already removed the NOFREE.

5088 ↗(On Diff #52998)

I moved pmap_pkru_get() down into pmap_enter_pde(). Then it allowed to remove the cleaning of the key bits for small page insertion.

kib updated this revision to Diff 53221.Jan 25 2019, 7:45 PM
kib marked 2 inline comments as done.

Move pmap_pkru_get() after all pmap lock relocks. Rename msva to va.

markj added inline comments.Jan 28 2019, 6:03 PM
sys/amd64/amd64/sys_machdep.c
361 ↗(On Diff #53221)

I wonder if we should use a write lock here. With the ENOMEM handling it's possible that multiple threads are racing to modify the same VA range. It might be preferable to use a dedicated sx lock instead of the vm map lock, to avoid blocking unrelated operations.

sys/vm/vm_fault.c
482 ↗(On Diff #53221)

Doesn't vm_fault_soft_fast() need the same treatment? Maybe this should be handled internally by pmap_enter().

sys/vm/vm_map.c
3449 ↗(On Diff #53221)

Why not use vm_map_unlock()?

kib marked 3 inline comments as done.Jan 28 2019, 7:02 PM
kib added inline comments.
sys/amd64/amd64/sys_machdep.c
361 ↗(On Diff #53221)

I should have added a comment there. The consistency of single VA is handled by the pmap lock. With the additional guarantee of the rangeset_remove() to not modify the rangeset if an error is returned, the ENOMEM retry still should be fine.

This lock prevents parallel vmspace_fork() from receiving inconsistent protection settings if one thread forks and another modifies pmap protection keys for some range.

sys/vm/vm_fault.c
482 ↗(On Diff #53221)

vm_fault_soft_fast() returns error to vm_fault() and then the fault handler executes 'slow' non-superpage pmap_enter().

First I thought that indeed inserting 4k entries in pmap_enter() and returning success would be fine, but really it is hard to distinguish other failures from the specific configuration (which is also not real failure). More, current contract for pmap_enter() is that the function that just does what directed; if pmap_enter(psind = 1) inserts non-superpage mapping it looks somewhat broken.

sys/vm/vm_map.c
3449 ↗(On Diff #53221)

For the same reason why plain sx_xunlock() is used at the end of the function, see comment (we cannot process deferred free entries until both maps are unlocked). That said, the code should call the processing function even if formally.

kib updated this revision to Diff 53338.Jan 28 2019, 7:04 PM
kib marked 3 inline comments as done.

Add comment about sync/w fork. Add formally required process_deferred() call (there is no deferred entries which curthread could create).

markj added inline comments.Feb 8 2019, 4:21 PM
sys/kern/subr_rangeset.c
48 ↗(On Diff #53338)

Style: missing newline.

55 ↗(On Diff #53338)

Why _NOFREE?

263 ↗(On Diff #53338)

I would call this rangeset_lookup(). To me at least, _get() implies that the range is known to be present, i.e., the lookup cannot fail. Same with pmap_pkru_get().

285 ↗(On Diff #53338)

I believe this will fail if an earlier attempt returned ENOMEM and we did a partial copy.

sys/vm/vm_fault.c
482 ↗(On Diff #53221)

Fair enough. I agree that it would be odd for pmap_enter(psind = 1) to implement a 4KB fallback.

kib marked 5 inline comments as done.Feb 9 2019, 2:49 AM
kib added inline comments.
sys/kern/subr_rangeset.c
285 ↗(On Diff #53338)

FIxed by removing everything on copy failure.

kib updated this revision to Diff 53720.Feb 9 2019, 2:51 AM
kib marked an inline comment as done.

Rename rangeset_get() to rangeset_lookup().
Remove one more _NOFREE.
Clear dst after failed copy.

markj accepted this revision.Feb 15 2019, 9:47 PM

The code changes look ok to me. Do you plan to upload man page changes here too?

This revision is now accepted and ready to land.Feb 15 2019, 9:47 PM
kib added a comment.Feb 16 2019, 7:26 AM

The code changes look ok to me. Do you plan to upload man page changes here too?

I plan to commit this review as is after Peter re-tests it. I do not want the grammar fixes to block the code.

I will work on the man page independently of the code commit.

This revision was automatically updated to reflect the committed changes.
kib reopened this revision.Feb 19 2019, 7:15 PM
This revision was not accepted when it landed; it landed in state Needs Review.Feb 19 2019, 7:17 PM
This revision was automatically updated to reflect the committed changes.