Page MenuHomeFreeBSD

Set the RW/NX bits to protect kernel text and data loaded as part of the main kernel image
ClosedPublic

Authored by jtl on Feb 9 2018, 5:30 PM.

Details

Summary

This change protects the kernel text, data, and BSS by setting the RW/NX bits correctly for the data contained on each memory page.

This change is not comprehensive:

  • It doesn't do anything for modules.
  • It doesn't do anything for kernel memory allocated after the kernel starts running.
  • In order to avoid excessive memory inefficiency, it may let multiple types of data share a 2M page, and assigns the most permissions needed for data on that page.

There are several components to this change:

  1. Restore the binutils' LLD default for page sizes. This helps nudge the R/W segment to start on a 2M page.
  2. Add a variable to indicate the start of the R/W portion of the initial memory.
  3. A kernel option to set the default state of the feature, and a tunable to modify that default at boot-time.
  4. Stop detecting NX bit support for each AP. Instead, use the value from the BSP and, if supported, activate the feature on the other APs just before loading the correct page table. (Functionally, we already assumed that the BSP and all APs had the same support/non-support for the NX bit.)
  5. If the feature is activated, set the RW and NX bits correctly for the kernel text, data, and BSS (subject to the above caveats).

This may need to be documented someplace better than sys/amd64/conf/NOTES; however, I couldn't find an obvious place with the little searching I did.

Test Plan

I've already run this on a machine that supports the NX bit. It boots correctly both with and without the feature activated. As far as I can tell, it works correctly with the feature activated.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

jtl created this revision.Feb 9 2018, 5:30 PM
kib added inline comments.Feb 9 2018, 6:21 PM
sys/amd64/amd64/mpboot.S
235 ↗(On Diff #39092)

I do not think that you need a new variable for this. Compare pg_nx with 0 and set EFER_NXE bit if not zero.

sys/amd64/amd64/pmap.c
406 ↗(On Diff #39092)

kernelprot seems to be too generic. Might be, kernelnx ?

944 ↗(On Diff #39092)

Look at the fpuinit_bsp1, particularly CPUID_EXTSTATE_XSAVEOPT block.

1016 ↗(On Diff #39092)

I believe page tables never need exec perms.

sys/conf/options.amd64
8 ↗(On Diff #39092)

I do not see a need in the kernel option. This should work or not be done at all.

In fact, I do not see a need in the tunable. Perhaps only as the opt-out on machines where the feature appeared buggy. But it should be fixed.

jtl added inline comments.Feb 9 2018, 6:57 PM
sys/amd64/amd64/mpboot.S
235 ↗(On Diff #39092)

Good point! I'll change it.

sys/amd64/amd64/pmap.c
406 ↗(On Diff #39092)

I think kernelnx is too specific, since it also changes the way the RW bits are set. I agree that kernelprot is too generic, but I struggled to come up with a better name. I'm open to suggestions. (Other similar suggestions could be kernelrwx or kernelnxrw.) I'm really open for something better.)

944 ↗(On Diff #39092)

Thanks for pointing that out! I didn't know we actually had self-modifying code in the kernel.

Hmmm... My CPU supports XSAVE, but didn't fault on that. I'll need to dig deeper into why not.

1016 ↗(On Diff #39092)

If I read the Intel documentation correctly, setting NX on a PDE propagates the NX behavior to all pages mapped by PTEs referenced by the PDE. I don't think that's what we want.

sys/conf/options.amd64
8 ↗(On Diff #39092)

I decided we might need a tunable for debugging purposes. In particular, one way to set breakpoints is to modify instructions. Or, there might be a need to modify random memory as part of debugging.

I'm perfectly fine with removing the kernel option and activating it by default.

kib added inline comments.Feb 9 2018, 7:26 PM
sys/amd64/amd64/pmap.c
944 ↗(On Diff #39092)

It probably has XSAVE, but not XSAVEOPT.

1016 ↗(On Diff #39092)

This is mapping of the page tables itself, not the introduction of PDEs into the pt hierarchy. The KPTmap does not map kernel objects, but kernel *page tables*.

sys/conf/options.amd64
8 ↗(On Diff #39092)

I did not added the option when I set NX on DMAP, see r316767.

jhb added inline comments.Feb 12 2018, 5:56 PM
sys/amd64/amd64/pmap.c
408 ↗(On Diff #39092)

Tweaking the description here (it's not just read-only) might also help with the tunable name. Maybe "Map the kernel with fine-grained permissions rather than RWX."

You could in fact invert the option to be 'vm.pmap.kernel_rwx' where a 1 setting is the legacy behavior and 0 is the default which is the new behavior. If you go that route than the description would be "Map the entire kernel RWX rather than using fine-grained permissions."

944 ↗(On Diff #39092)

Does that imply we should defer the removal of write from kernel text until after doing any fixups? We could use a second pass to walk the page tables and clear W from executable pages?

sys/conf/options.amd64
8 ↗(On Diff #39092)

Agreed on only adding a tunable (and defaulting to new behavior) and not adding the kernel option.

jtl added a comment.Feb 12 2018, 6:26 PM

In addition to these changes, I would like to make the corresponding pages in the direct map be read-only. Does that seem acceptable?

sys/amd64/amd64/pmap.c
944 ↗(On Diff #39092)

This works because we don't set cr0.wp until a little bit later in the boot process. So, we aren't enforcing write protection at this point.

1016 ↗(On Diff #39092)

I'm fairly certain I'm interpreting this correctly; however, I might not have stated it well. So, let me try again.

(I know you know what is below. I'm merely restating it to pass on my understanding of what this particular structure does. This will let you identify the element of my thinking that is incorrect.)

The kernel page table is rooted at KPML4phys, which contains the PML4Es.

The PDPEs are in KPDPphys. The code puts entries for the PDPEs in KPML4phys starting at KPML4BASE.

The PDEs are in KPDphys. The code puts entries for the PDEs into KPDPphys. Some of the PDEs end up mapping 2M pages. Some of the PDEs point to PTEs.

The PTEs are in KPTphys. The code puts entries for the PTEs into KPDphys. (However, some get over-written with 2M pages.)

According to the Intel SDM (Vol. 3a, section 5.13.2, beginning on page 5-30), if "the execute-disable bit is set in any of the paging-structure entries used to map [a] page", execution will be denied. Table 5-5 illustrates this.

So, my understanding is that if I set the NX bit on entries in KPDphys, the processor will disable execution for any PTEs mapped using that PDE.

(Note that this is distinct from the recursive mapping found my mapping KPML4phys[PML4PML4I] to itself, which vtopte() and vtopde() follow. For those, I agree that we could set the NX bit; however, we could do that at the top level. I'll open a separate review for that, since that is a bit different from what I'm trying to do here.)

jtl updated this revision to Diff 39218.Feb 12 2018, 7:43 PM

Incorporate review feedback:

  • No kernel option.
  • Rework the tunable. It now only controls whether pages that contain kernel text are marked read-only. That's the part I think a developer is most likely going to want to disable for debugging purposes. (That also makes the name somewhat easier, maybe.)
  • Remove the efer_nxe variable and just use pg_nx to determine whether the BSP supports the NX bit.
jtl updated this revision to Diff 39221.Feb 12 2018, 8:44 PM

Apply the same read-only protection to the portions of the kernel kept in the direct map.

jtl added reviewers: alc, markj.Feb 12 2018, 8:45 PM
jtl marked 2 inline comments as done.
jhb added inline comments.Feb 12 2018, 10:18 PM
sys/amd64/amd64/pmap.c
944 ↗(On Diff #39092)

I think you should perhaps extend the comment above this range with a note that writes to perform fixups in .text are still permitted until CR0.WP is set.

kib added inline comments.Feb 12 2018, 10:39 PM
sys/amd64/amd64/pmap.c
1016 ↗(On Diff #39092)

Nevermind, I misread the fragment, thinking that it is creation of PTmap. Ignore me.

sys/amd64/include/cpu.h
73 ↗(On Diff #39221)

Perhaps order alphabetically ?

jtl updated this revision to Diff 39229.Feb 12 2018, 11:04 PM

Implement review comments:

  • Extend a comment to note that fixups in .text are still permitted until we set CR0.WP.
  • Reorder some variable declarations.
jtl added inline comments.Feb 12 2018, 11:07 PM
sys/amd64/amd64/pmap.c
944 ↗(On Diff #39092)

Thanks! I changed it.

1016 ↗(On Diff #39092)

No problem. It took me multiple times through this code to fully grok it. :-)

sys/amd64/include/cpu.h
73 ↗(On Diff #39221)

Done. (I think. Please check if the new ordering looks OK.)

jtl marked 16 inline comments as done.Feb 12 2018, 11:08 PM
kib accepted this revision.Feb 12 2018, 11:33 PM

I am fine with this, but would be curious to see an effect of setting a breakpoint with ddb when write-protection is enabled for the kernel text.

This revision is now accepted and ready to land.Feb 12 2018, 11:33 PM
jtl added a comment.Feb 12 2018, 11:44 PM
In D14282#300571, @kib wrote:

I am fine with this, but would be curious to see an effect of setting a breakpoint with ddb when write-protection is enabled for the kernel text.

There is a backtrace followed by a complaint about writing to an address in .text.

This use-case was my initial target for the tunable. However, we can fix this by disabling cr0.wp before writing the breakpoint and then enabling cr0.wp after writing it. Is that worth doing? Or, is the tunable enough (you must disable the read-only protection in order to set breakpoints)?

jtl updated this revision to Diff 39234.Feb 13 2018, 12:45 AM

Let DDB set/delete breakpoints, as well as generally modify any memory it wants. (If an administrator doesn't trust their users to use this power properly, they don't need to compile DDB support.)

This revision now requires review to proceed.Feb 13 2018, 12:45 AM
kib added inline comments.Feb 13 2018, 8:13 AM
sys/amd64/amd64/db_interface.c
85 ↗(On Diff #39234)

I am not sure why it is reasonable to check for CR0_WP. IMO it is good enough to always clear _WP before write, and always reload previously saved cr0 content below.

Also, if you do this, it removes the need for the tunable.

jtl added inline comments.Feb 13 2018, 8:18 PM
sys/amd64/amd64/db_interface.c
85 ↗(On Diff #39234)

I was trying to cut down on writes to the control register. However, I take your point that this will be true > 99% of the time, so it makes sense to just unconditionally write it.

I agree it removes the need for the tunable. I'll take it out.

jtl updated this revision to Diff 39270.Feb 13 2018, 8:19 PM

Implement review feedback:

  • Unconditionally set CR0 when writing to memory in DDB.
  • Now that DDB breakpoints work, remove the tunable.
jtl updated this revision to Diff 39271.Feb 13 2018, 8:29 PM

If the kernel ever used more than 1G, correctly initialize the right number of PDEs in the direct map.

jtl marked 4 inline comments as done.Feb 13 2018, 8:30 PM

I think I've addressed the review feedback, and I'm relatively happy with the proposed change now. So, if anyone has been sitting on the sidelines waiting for this to stabilize, now is a good time to jump in with feedback. :-)

emaste added inline comments.Feb 13 2018, 8:54 PM
sys/conf/kern.pre.mk
124 ↗(On Diff #39271)

These should have -Wl, wrapping, and it probably makes sense to commit this separately, with a comment.

kib accepted this revision.Feb 13 2018, 9:10 PM
kib added inline comments.
sys/amd64/amd64/db_interface.c
85 ↗(On Diff #39271)

Why () around CR0_WP ?

This revision is now accepted and ready to land.Feb 13 2018, 9:10 PM
markj added a comment.Feb 13 2018, 9:39 PM

I haven't yet spent time on this, but I'll just note that D14062 aims to set the NX bit in mappings created by pmap_qenter(), which includes kernel thread stacks.

jhb added a comment.Feb 13 2018, 11:57 PM

It's more painful to fix, but the GDB remote stub can also write to memory to set breakpoints. My guess is it is done in the handler for the 'M' command in sys/gdb/gdb_main.c. Unfortunately there isn't an existing MD hook before before/after that would let you avoid #ifdef's in that routine. Could add some new macros to sys/<arch>/include/gdb_machdep.h for start/stop of memory write perhaps.

In D14282#300983, @jhb wrote:

Unfortunately there isn't an existing MD hook before before/after that would let you avoid #ifdef's in that routine. Could add some new macros to sys/<arch>/include/gdb_machdep.h for start/stop of memory write perhaps.

That seems reasonable, given that I hope over time we'll apply a change like this one in D14282 to other architectures.

jtl added a comment.Feb 14 2018, 6:36 PM

I haven't yet spent time on this, but I'll just note that D14062 aims to set the NX bit in mappings created by pmap_qenter(), which includes kernel thread stacks.

I think they are orthogonal, and related only in that both try to set the NX bit in more places.

jtl added a comment.Feb 14 2018, 6:39 PM
In D14282#300983, @jhb wrote:

It's more painful to fix, but the GDB remote stub can also write to memory to set breakpoints. My guess is it is done in the handler for the 'M' command in sys/gdb/gdb_main.c. Unfortunately there isn't an existing MD hook before before/after that would let you avoid #ifdef's in that routine. Could add some new macros to sys/<arch>/include/gdb_machdep.h for start/stop of memory write perhaps.

Yeah, that is more complicated. It looks like we don't support using the Z command to set a breakpoint, so gdb will fallback to using the M command (as you suspected). I'll see if I can figure out a MD hook that looks reasonable-ish.

The harder part will be getting remote gdb to work in my environment to test this, but - hopefully - I can get that working with enough wrangling.

markj added a comment.Feb 14 2018, 6:54 PM

DTrace FBT needs fixing as well, doesn't it? fbt_patch_tracepoint() is the place where we write breakpoints.

In D14282#301193, @jtl wrote:

I haven't yet spent time on this, but I'll just note that D14062 aims to set the NX bit in mappings created by pmap_qenter(), which includes kernel thread stacks.

I think they are orthogonal, and related only in that both try to set the NX bit in more places.

Right, I'm pointing it out here just as an FYI in case you planned on making a similar change.

jtl updated this revision to Diff 39313.Feb 14 2018, 7:11 PM

Add code to handle remote GDB memory writes. (untested)

This revision now requires review to proceed.Feb 14 2018, 7:11 PM
jtl added a comment.Feb 15 2018, 12:58 AM

I tested and confirmed that the new code lets gdb set/delete breakpoints.

sys/amd64/amd64/db_interface.c
85 ↗(On Diff #39271)

The parentheses are not necessary; they are just a coding habit (quirk?). I can take them out.

jtl added inline comments.Feb 15 2018, 1:07 AM
sys/conf/kern.pre.mk
124 ↗(On Diff #39271)

Thanks! Good catch! I can commit this separately.

jtl updated this revision to Diff 39327.Feb 15 2018, 1:08 AM

Address review feedback:

  • Add -Wl, wrapping to the LDFLAGS variable.
  • Remove an unneeded set of parentheses.
jtl marked 4 inline comments as done.Feb 15 2018, 1:09 AM
jhb accepted this revision.Feb 15 2018, 8:42 PM

Thanks for fixing the GDB case.

This revision is now accepted and ready to land.Feb 15 2018, 8:42 PM
This revision was automatically updated to reflect the committed changes.