Page MenuHomeFreeBSD

PTI: Trap if we returned to userspace with kernel (full) page table still active.
ClosedPublic

Authored by kib on Jan 17 2018, 6:53 PM.

Details

Summary

Map userspace portion of VA in the PTI kernel-mode page table as non-executable. This way, if we ever miss reloading ucr3 into %cr3 on return to usermode, the process traps instead of executing in potentially vulnerable setup.

I peek this in some article about Linux implementation.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

markj added inline comments.
sys/amd64/amd64/pmap.c
2578 ↗(On Diff #38099)

This probably deserves a comment. Maybe, "Safety belt: ensure that we trap if kernel mode mappings leak into user mode"?

This revision is now accepted and ready to land.Jan 17 2018, 7:23 PM

This is a neat idea. I do, however, think that it needs a comment explaining why this is done, since it's not obvious.

Add comment, using Mark' wording.

This revision now requires review to proceed.Jan 17 2018, 7:37 PM
alc requested changes to this revision.Jan 17 2018, 7:53 PM

Maybe I'm confused. Is that wording actually correct? Shouldn't it be "Ensure that the return to user mode immediately traps if the kernel-mode page table is mistakenly left active."

This revision now requires changes to proceed.Jan 17 2018, 7:53 PM
In D13956#292703, @alc wrote:

Maybe I'm confused. Is that wording actually correct? Shouldn't it be "Ensure that the return to user mode immediately traps if the kernel-mode page table is mistakenly left active."

Well, it is not return to usermode which traps, but usermode itself after we already returned. At least, I hope there is no such bug in the CPUs.

In D13956#292712, @kib wrote:
In D13956#292703, @alc wrote:

Maybe I'm confused. Is that wording actually correct? Shouldn't it be "Ensure that the return to user mode immediately traps if the kernel-mode page table is mistakenly left active."

Well, it is not return to usermode which traps, but usermode itself after we already returned. At least, I hope there is no such bug in the CPUs.

Yeah, that's my reaction too. This change won't cause the iret itself to fault.

sys/amd64/amd64/pmap.c
2579 ↗(On Diff #38103)

It seems a bit odd to mix "usermode" and "user mode." Maybe s/usermode/userland/, since it's not user mode itself that traps.

Let me be more specific. I think that the phrase "kernel mode mappings leak into user mode" suffers from too much ambiguity.

How about this version?

"Make all user-space mappings in the kernel-mode page table no-execute so that we detect any programming errors that leave the kernel-mode page table active on return to user space."

This revision is now accepted and ready to land.Jan 17 2018, 10:50 PM

Out of curiously, how does the system behave when this sanity check trips?

In D13956#292839, @alc wrote:

Out of curiously, how does the system behave when this sanity check trips?

I initially modified the doiret path to not reload the %cr3, which panics the kernel. But after your question I realized that this is too simple because vm_fault() does not consider these faults as invalid and does nothing. As result, if I patch ucr3 reload from the syscall return path, system page faults but silently recovers if doreti correctly reloads ucr3.

So the fault from the failed PTI check must be treated the same as the fault from the SMAP and checked according to the #pf error code. In principle, it would be enough to verify that the fault is user-mode, for exec, and for the present page. But due to the spurious page fault bugs in CPUs, I am afraid to do this and instead only issue the diagnostic message if vm_fault() indeed was called and reported success. I do not call panic() because I believe it is possible to race with the fault handler by installing an executable entry after fault but before trap_pfault() rechecks.

I hope that my explanation above can be deciphered.

Add the heuristic check for PTI violation.

This revision now requires review to proceed.Jan 18 2018, 11:34 AM
In D13956#292918, @kib wrote:
In D13956#292839, @alc wrote:

Out of curiously, how does the system behave when this sanity check trips?

I initially modified the doiret path to not reload the %cr3, which panics the kernel. But after your question I realized that this is too simple because vm_fault() does not consider these faults as invalid and does nothing.

Yes, that was the reason for my question.

Are any of the ignored bits in CR3 preserved, i.e., when you set them upon write, a later read will return them? If so, then we could use one as a flag to distinguish a kernel- versus user-mode page table.

In D13956#293422, @alc wrote:

Are any of the ignored bits in CR3 preserved, i.e., when you set them upon write, a later read will return them? If so, then we could use one as a flag to distinguish a kernel- versus user-mode page table.

SDM states that writing any reserved bit into %cr3 causes GPF.

In D13956#293431, @kib wrote:
In D13956#293422, @alc wrote:

Are any of the ignored bits in CR3 preserved, i.e., when you set them upon write, a later read will return them? If so, then we could use one as a flag to distinguish a kernel- versus user-mode page table.

SDM states that writing any reserved bit into %cr3 causes GPF.

True, but there are several bits that are described as ignored, not reserved.

Although it may be that all of the ignored bits are consumed by PCID.

In D13956#293433, @alc wrote:

Although it may be that all of the ignored bits are consumed by PCID.

In edition 65 of SDM all unspecified bits of CR3 are grayed, and the picture annotation says that the greyed bits are reserved.

If we are going to read %cr3 to see if any reserved bit is set, we must do it before kcr3 is loaded into the register, and we must save the previous %cr3 content. But then we can compare saved value of %cr3 against kcr3/ucr3.

At the moment, I'm looking at Tables 4-12 and 4-13. Table 4-12 shows several ignored, as opposed to reserved, bits when PCID is inactive. This is from a version dated 12/2017.

In D13956#293443, @alc wrote:

At the moment, I'm looking at Tables 4-12 and 4-13. Table 4-12 shows several ignored, as opposed to reserved, bits when PCID is inactive. This is from a version dated 12/2017.

I see. I think that this is not a real option, at least due to PCID. IMO the implemented heuristic check gives enough hints to make me look at bugs.

In D13956#293443, @alc wrote:

At the moment, I'm looking at Tables 4-12 and 4-13. Table 4-12 shows several ignored, as opposed to reserved, bits when PCID is inactive. This is from a version dated 12/2017.

To be clear, I'm looking at just Volume 3. Hence, Chapter 4 is "Paging".

And, really, I don't care how we do it, I'm just arguing for implementing some simple way of telling whether a kernel- or user-mode page table was active on a fault, rather than using a heuristic check on tf_err.

Doesn't your proposed PCID patch have kernel- and user-mode page tables differ in bit 11? Suppose that you change that to one of the ignored bits when PCID isn't enabled. Then, regardless of whether PCID was enabled, you can test that bit to know whether a kernel- or user-mode page table was active at the time of the fault.

On page fault, put an efforts to provide the usermode %cr3 content to the trap_pfault, and explicitly detect a situation when userspace faulted because no-exec kernel page tables were active.

I verified that this version panics as expected if fast syscall exit does not reload %cr3 on return to usermode.

Does this, in fact, dovetail with the PCID patch?

In D13956#293475, @alc wrote:

Does this, in fact, dovetail with the PCID patch?

This is really an excerpt from my dev branch where PCID patch and some more fixes are stored.

In D13956#293479, @kib wrote:
In D13956#293475, @alc wrote:

Does this, in fact, dovetail with the PCID patch?

This is really an excerpt from my dev branch where PCID patch and some more fixes are stored.

Terrific! That's what I was hoping.

This revision is now accepted and ready to land.Jan 19 2018, 8:40 PM
This revision was automatically updated to reflect the committed changes.