Page MenuHomeFreeBSD

Set the RW/NX bits to protect kernel text and data loaded as part of the main kernel image
ClosedPublic

Authored by jtl on Feb 9 2018, 5:30 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Nov 23, 12:21 AM
Unknown Object (File)
Fri, Nov 22, 11:50 PM
Unknown Object (File)
Fri, Nov 22, 11:14 PM
Unknown Object (File)
Fri, Nov 22, 11:12 PM
Unknown Object (File)
Fri, Nov 22, 10:06 PM
Unknown Object (File)
Fri, Nov 22, 3:03 AM
Unknown Object (File)
Thu, Nov 21, 6:26 PM
Unknown Object (File)
Thu, Nov 21, 6:25 PM

Details

Summary

This change protects the kernel text, data, and BSS by setting the RW/NX bits correctly for the data contained on each memory page.

This change is not comprehensive:

  • It doesn't do anything for modules.
  • It doesn't do anything for kernel memory allocated after the kernel starts running.
  • In order to avoid excessive memory inefficiency, it may let multiple types of data share a 2M page, and assigns the most permissions needed for data on that page.

There are several components to this change:

  1. Restore the binutils' LLD default for page sizes. This helps nudge the R/W segment to start on a 2M page.
  2. Add a variable to indicate the start of the R/W portion of the initial memory.
  3. A kernel option to set the default state of the feature, and a tunable to modify that default at boot-time.
  4. Stop detecting NX bit support for each AP. Instead, use the value from the BSP and, if supported, activate the feature on the other APs just before loading the correct page table. (Functionally, we already assumed that the BSP and all APs had the same support/non-support for the NX bit.)
  5. If the feature is activated, set the RW and NX bits correctly for the kernel text, data, and BSS (subject to the above caveats).

This may need to be documented someplace better than sys/amd64/conf/NOTES; however, I couldn't find an obvious place with the little searching I did.

Test Plan

I've already run this on a machine that supports the NX bit. It boots correctly both with and without the feature activated. As far as I can tell, it works correctly with the feature activated.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 15046
Build 15144: arc lint + arc unit

Event Timeline

sys/amd64/amd64/mpboot.S
235

I do not think that you need a new variable for this. Compare pg_nx with 0 and set EFER_NXE bit if not zero.

sys/amd64/amd64/pmap.c
406

kernelprot seems to be too generic. Might be, kernelnx ?

934

Look at the fpuinit_bsp1, particularly CPUID_EXTSTATE_XSAVEOPT block.

1013

I believe page tables never need exec perms.

sys/conf/options.amd64
8 ↗(On Diff #39092)

I do not see a need in the kernel option. This should work or not be done at all.

In fact, I do not see a need in the tunable. Perhaps only as the opt-out on machines where the feature appeared buggy. But it should be fixed.

sys/amd64/amd64/mpboot.S
235

Good point! I'll change it.

sys/amd64/amd64/pmap.c
406

I think kernelnx is too specific, since it also changes the way the RW bits are set. I agree that kernelprot is too generic, but I struggled to come up with a better name. I'm open to suggestions. (Other similar suggestions could be kernelrwx or kernelnxrw.) I'm really open for something better.)

934

Thanks for pointing that out! I didn't know we actually had self-modifying code in the kernel.

Hmmm... My CPU supports XSAVE, but didn't fault on that. I'll need to dig deeper into why not.

1013

If I read the Intel documentation correctly, setting NX on a PDE propagates the NX behavior to all pages mapped by PTEs referenced by the PDE. I don't think that's what we want.

sys/conf/options.amd64
8 ↗(On Diff #39092)

I decided we might need a tunable for debugging purposes. In particular, one way to set breakpoints is to modify instructions. Or, there might be a need to modify random memory as part of debugging.

I'm perfectly fine with removing the kernel option and activating it by default.

sys/amd64/amd64/pmap.c
934

It probably has XSAVE, but not XSAVEOPT.

1013

This is mapping of the page tables itself, not the introduction of PDEs into the pt hierarchy. The KPTmap does not map kernel objects, but kernel *page tables*.

sys/conf/options.amd64
8 ↗(On Diff #39092)

I did not added the option when I set NX on DMAP, see r316767.

sys/amd64/amd64/pmap.c
408

Tweaking the description here (it's not just read-only) might also help with the tunable name. Maybe "Map the kernel with fine-grained permissions rather than RWX."

You could in fact invert the option to be 'vm.pmap.kernel_rwx' where a 1 setting is the legacy behavior and 0 is the default which is the new behavior. If you go that route than the description would be "Map the entire kernel RWX rather than using fine-grained permissions."

934

Does that imply we should defer the removal of write from kernel text until after doing any fixups? We could use a second pass to walk the page tables and clear W from executable pages?

sys/conf/options.amd64
8 ↗(On Diff #39092)

Agreed on only adding a tunable (and defaulting to new behavior) and not adding the kernel option.

In addition to these changes, I would like to make the corresponding pages in the direct map be read-only. Does that seem acceptable?

sys/amd64/amd64/pmap.c
934

This works because we don't set cr0.wp until a little bit later in the boot process. So, we aren't enforcing write protection at this point.

1013

I'm fairly certain I'm interpreting this correctly; however, I might not have stated it well. So, let me try again.

(I know you know what is below. I'm merely restating it to pass on my understanding of what this particular structure does. This will let you identify the element of my thinking that is incorrect.)

The kernel page table is rooted at KPML4phys, which contains the PML4Es.

The PDPEs are in KPDPphys. The code puts entries for the PDPEs in KPML4phys starting at KPML4BASE.

The PDEs are in KPDphys. The code puts entries for the PDEs into KPDPphys. Some of the PDEs end up mapping 2M pages. Some of the PDEs point to PTEs.

The PTEs are in KPTphys. The code puts entries for the PTEs into KPDphys. (However, some get over-written with 2M pages.)

According to the Intel SDM (Vol. 3a, section 5.13.2, beginning on page 5-30), if "the execute-disable bit is set in any of the paging-structure entries used to map [a] page", execution will be denied. Table 5-5 illustrates this.

So, my understanding is that if I set the NX bit on entries in KPDphys, the processor will disable execution for any PTEs mapped using that PDE.

(Note that this is distinct from the recursive mapping found my mapping KPML4phys[PML4PML4I] to itself, which vtopte() and vtopde() follow. For those, I agree that we could set the NX bit; however, we could do that at the top level. I'll open a separate review for that, since that is a bit different from what I'm trying to do here.)

Incorporate review feedback:

  • No kernel option.
  • Rework the tunable. It now only controls whether pages that contain kernel text are marked read-only. That's the part I think a developer is most likely going to want to disable for debugging purposes. (That also makes the name somewhat easier, maybe.)
  • Remove the efer_nxe variable and just use pg_nx to determine whether the BSP supports the NX bit.

Apply the same read-only protection to the portions of the kernel kept in the direct map.

jtl marked 2 inline comments as done.
sys/amd64/amd64/pmap.c
934

I think you should perhaps extend the comment above this range with a note that writes to perform fixups in .text are still permitted until CR0.WP is set.

sys/amd64/amd64/pmap.c
1013

Nevermind, I misread the fragment, thinking that it is creation of PTmap. Ignore me.

sys/amd64/include/cpu.h
75

Perhaps order alphabetically ?

Implement review comments:

  • Extend a comment to note that fixups in .text are still permitted until we set CR0.WP.
  • Reorder some variable declarations.
sys/amd64/amd64/pmap.c
934

Thanks! I changed it.

1013

No problem. It took me multiple times through this code to fully grok it. :-)

sys/amd64/include/cpu.h
75

Done. (I think. Please check if the new ordering looks OK.)

jtl marked 16 inline comments as done.Feb 12 2018, 11:08 PM

I am fine with this, but would be curious to see an effect of setting a breakpoint with ddb when write-protection is enabled for the kernel text.

This revision is now accepted and ready to land.Feb 12 2018, 11:33 PM
In D14282#300571, @kib wrote:

I am fine with this, but would be curious to see an effect of setting a breakpoint with ddb when write-protection is enabled for the kernel text.

There is a backtrace followed by a complaint about writing to an address in .text.

This use-case was my initial target for the tunable. However, we can fix this by disabling cr0.wp before writing the breakpoint and then enabling cr0.wp after writing it. Is that worth doing? Or, is the tunable enough (you must disable the read-only protection in order to set breakpoints)?

Let DDB set/delete breakpoints, as well as generally modify any memory it wants. (If an administrator doesn't trust their users to use this power properly, they don't need to compile DDB support.)

This revision now requires review to proceed.Feb 13 2018, 12:45 AM
sys/amd64/amd64/db_interface.c
85

I am not sure why it is reasonable to check for CR0_WP. IMO it is good enough to always clear _WP before write, and always reload previously saved cr0 content below.

Also, if you do this, it removes the need for the tunable.

sys/amd64/amd64/db_interface.c
85

I was trying to cut down on writes to the control register. However, I take your point that this will be true > 99% of the time, so it makes sense to just unconditionally write it.

I agree it removes the need for the tunable. I'll take it out.

Implement review feedback:

  • Unconditionally set CR0 when writing to memory in DDB.
  • Now that DDB breakpoints work, remove the tunable.

If the kernel ever used more than 1G, correctly initialize the right number of PDEs in the direct map.

jtl marked 4 inline comments as done.Feb 13 2018, 8:30 PM

I think I've addressed the review feedback, and I'm relatively happy with the proposed change now. So, if anyone has been sitting on the sidelines waiting for this to stabilize, now is a good time to jump in with feedback. :-)

sys/conf/kern.pre.mk
124

These should have -Wl, wrapping, and it probably makes sense to commit this separately, with a comment.

kib added inline comments.
sys/amd64/amd64/db_interface.c
85

Why () around CR0_WP ?

This revision is now accepted and ready to land.Feb 13 2018, 9:10 PM

I haven't yet spent time on this, but I'll just note that D14062 aims to set the NX bit in mappings created by pmap_qenter(), which includes kernel thread stacks.

It's more painful to fix, but the GDB remote stub can also write to memory to set breakpoints. My guess is it is done in the handler for the 'M' command in sys/gdb/gdb_main.c. Unfortunately there isn't an existing MD hook before before/after that would let you avoid #ifdef's in that routine. Could add some new macros to sys/<arch>/include/gdb_machdep.h for start/stop of memory write perhaps.

In D14282#300983, @jhb wrote:

Unfortunately there isn't an existing MD hook before before/after that would let you avoid #ifdef's in that routine. Could add some new macros to sys/<arch>/include/gdb_machdep.h for start/stop of memory write perhaps.

That seems reasonable, given that I hope over time we'll apply a change like this one in D14282 to other architectures.

I haven't yet spent time on this, but I'll just note that D14062 aims to set the NX bit in mappings created by pmap_qenter(), which includes kernel thread stacks.

I think they are orthogonal, and related only in that both try to set the NX bit in more places.

In D14282#300983, @jhb wrote:

It's more painful to fix, but the GDB remote stub can also write to memory to set breakpoints. My guess is it is done in the handler for the 'M' command in sys/gdb/gdb_main.c. Unfortunately there isn't an existing MD hook before before/after that would let you avoid #ifdef's in that routine. Could add some new macros to sys/<arch>/include/gdb_machdep.h for start/stop of memory write perhaps.

Yeah, that is more complicated. It looks like we don't support using the Z command to set a breakpoint, so gdb will fallback to using the M command (as you suspected). I'll see if I can figure out a MD hook that looks reasonable-ish.

The harder part will be getting remote gdb to work in my environment to test this, but - hopefully - I can get that working with enough wrangling.

DTrace FBT needs fixing as well, doesn't it? fbt_patch_tracepoint() is the place where we write breakpoints.

In D14282#301193, @jtl wrote:

I haven't yet spent time on this, but I'll just note that D14062 aims to set the NX bit in mappings created by pmap_qenter(), which includes kernel thread stacks.

I think they are orthogonal, and related only in that both try to set the NX bit in more places.

Right, I'm pointing it out here just as an FYI in case you planned on making a similar change.

Add code to handle remote GDB memory writes. (untested)

This revision now requires review to proceed.Feb 14 2018, 7:11 PM

I tested and confirmed that the new code lets gdb set/delete breakpoints.

sys/amd64/amd64/db_interface.c
85

The parentheses are not necessary; they are just a coding habit (quirk?). I can take them out.

sys/conf/kern.pre.mk
124

Thanks! Good catch! I can commit this separately.

Address review feedback:

  • Add -Wl, wrapping to the LDFLAGS variable.
  • Remove an unneeded set of parentheses.
jtl marked 4 inline comments as done.Feb 15 2018, 1:09 AM

Thanks for fixing the GDB case.

This revision is now accepted and ready to land.Feb 15 2018, 8:42 PM