Page MenuHomeFreeBSD

riscv: Handle hardware-managed dirty bit updates in pmap_promote_l2()
ClosedPublic

Authored by markj on Jun 4 2021, 6:22 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Nov 30, 5:09 PM
Unknown Object (File)
Mon, Nov 25, 9:16 AM
Unknown Object (File)
Sat, Nov 23, 5:17 AM
Unknown Object (File)
Wed, Nov 20, 4:06 AM
Unknown Object (File)
Tue, Nov 19, 6:42 AM
Unknown Object (File)
Sun, Nov 17, 3:44 AM
Unknown Object (File)
Sun, Nov 17, 3:40 AM
Unknown Object (File)
Fri, Nov 15, 3:47 AM

Details

Summary

When adding transparent superpage support I neglected to handle
hardware-managed dirty bits here. In particular, when comparing the
attributes of a run of 512 PTEs, we must handle the possibility that the
hardware will set PTE_D on a clean, writable mapping.

Following the example of amd64 and arm64, change riscv's
pmap_promote_l2() to downgrade clean, writable mappings to read-only, so
that updates are synchronized by the pmap lock.

Note that the change clears PTE_W without an sfence.vma. When hardware
management of the dirty bit is implemented, the implementation is
required to atomically check that the leaf PTE is writable and set
PTE_D. Therefore I believe there is no danger that a downgraded mapping
will be dirtied via a cached translation. In the worst case, we'll get
a spurious page fault, at which point we issue sfence.vma.

Fixes: f6893f09d
MFC after: 1 week

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

I agree with your assessment about not needing sfence.vma, but that's probably worth a comment in the source.

sys/riscv/riscv/pmap.c
2564

Hm, PTE_W is an int not a pt_entry_t, does that not cause issues for the high bits? I guess it's fine because bitwise negation on signed integers is implementation-defined, and in this case two's complement means the int -> unsigned long conversion does the right thing... and we use this pattern elsewhere. Seems dangerous but I think this is all well-defined by a combination of the C spec and knowing the implementation is two's complement, though it would probably make sense for all the PTE_* constants to be the same type as pt_entry_t, both in width and signedness...

I agree with your assessment about not needing sfence.vma, but that's probably worth a comment in the source.

Yes, I'll add one.

sys/riscv/riscv/pmap.c
2564

That assumption is made in pretty much all of the pmaps, it seems. We could perhaps specify a type suffix of ul in the literal value for all of the constants, I think that would do the right thing were we ever to implement riscv32.

sys/riscv/riscv/pmap.c
2564

Yeah, I started writing the comment thinking you had a bug there and would trash the upper bits, but then it turned into general musings. As you say, it's a pretty common pattern, and not just for pmap code. As for riscv32, well, that seems like a bit of a waste of time, but hey if someone wants to do it (properly) I won't stop them...

Add a comment explaining why we downgrade protections and why we don't
execute sfence.vma.

sys/riscv/riscv/pmap.c
2565–2569

I think the atomic compare-and-swap requirement for hardware dirty tracking is important to mention in here (and maybe also that software dirty tracking will fault on the cached translations already anyway due to missing PTE_D)?

sys/riscv/riscv/pmap.c
2567

Just checking, but: there's no possibility that these cached translations will persist past the lifetime of the page table page now being updated? That is, by the time the page containing these l3 PTEs is recycled, we certainly will have done a global TLB invalidation? (Otherwise it seems like there's risk that the PTW might occasionally corrupt the PTE_D bit in bit patterns that happen to look like the PTEs?)

sys/riscv/riscv/pmap.c
2567

See arm64's version of this code; the pmap_insert_pt_page ensures that.

markj added inline comments.
sys/riscv/riscv/pmap.c
2567

Right, during promotion we don't discard the page containing the L3 PTEs. Instead it is saved in a per-pmap radix tree. During demotion we restore it. So it is ok to let stale TLB entries reference the old L3 entries so long as they are not clean and writable.

Try to better explain why it's ok to defer sfence.vma in pmap_promote_l2().

This revision is now accepted and ready to land.Jun 4 2021, 10:13 PM

Wes, could you please check this does indeed fix the panic you were seeing with snmalloc, not just your hand-written test?

Yes, this does look like it fixes the original panic I saw as well as my hand-written test. Thanks!

Can we also commit the test program from https://reviews.freebsd.org/D30550?

Yes, I will try to do this. Of course it can't really be committed as-is since there's never a guarantee that transparent promotion will succeed. But having some basic regression tests which verify that we demote superpages during various operations would be useful.

Wes, could you please check this does indeed fix the panic you were seeing with snmalloc, not just your hand-written test?

I'd be curious how the promotion count with snmalloc compares to jemalloc, particularly for clang.