Page MenuHomeFreeBSD

riscv: Fix pindex level confusion
ClosedPublic

Authored by jrtc27 on Jul 7 2021, 3:03 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 25, 6:36 PM
Unknown Object (File)
Thu, Jan 23, 12:32 AM
Unknown Object (File)
Wed, Jan 15, 3:30 AM
Unknown Object (File)
Tue, Jan 14, 9:01 PM
Unknown Object (File)
Dec 18 2024, 7:19 AM
Unknown Object (File)
Dec 14 2024, 2:49 AM
Unknown Object (File)
Dec 4 2024, 9:31 PM
Unknown Object (File)
Dec 1 2024, 8:38 PM
Subscribers

Details

Summary

The pindex values are assigned from the L3 leaves upwards, meaning there
are NUL2E L3 tables and then NUL1E L2 tables (with a futher NUL0E L1
tables in future when we implement Sv48 support). Therefore anything
below NUL2E is an L3 table's page and anything above or equal to NUL2E
is an L2 table's page (with the threshold of NUL2E + NUL1E marking the
start of the L1 tables' pages in Sv48). Thus all the comparisons and
arithmetic operations must use NUL2E to handle the L3/L2 allocation (and
thus L2/L1 entry) transition point, not NUL1E as all but pmap_alloc_l2
were doing.

To make matters confusing, the NUL1E and NUL2E definitions in the RISC-V
pmap are based on a 4-level page hierarchy but we currently use the
3-level Sv39 format (as that's the only required one, and hardware
support for the 4-level Sv48 is not widespread). This means that, in
effect, the above bug cancels out with the bloated NULxE definitions
such that things "work" (but are still technically wrong, and thus would
break when adding Sv48 support), with one exception. pmap_enter_l2 is
currently the only function to use the correct constant, but since
_pmap_alloc_l3 uses the incorrect constant, it will do complete nonsense
when it needs to allocate a new L2 table (which is rather rare). In this
instance, _pmap_alloc_l3, whilst it would correctly determine the pindex
was for an L2 table, would only subtract NUL1E when computing l1index
and thus go way out of bounds (by 511*512*512 bytes, or 127.75 GiB) of
its own L1 table and, thanks to pmap_distribute_l1, of every other
pmap's L1 table in the whole system. This has likely never been hit as
it would presumably instantly fault and panic.

MFC after: 1 week

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jrtc27 requested review of this revision.Jul 7 2021, 3:03 AM

I haven't reproduced the panic myself, but the panic seems to be happening when hammering on ZFS; the linked PR was running on ZFS and zBeeble (dgilbert) on IRC has hit three panics already in the past few days that are suspiciously similar in terms of what went wrong (seeing unexpectedly-empty PTEs when performing pmap operations), also using ZFS, all during prolonged periods of heavy ports building, and given the wide range of ways the pindex confusion could cause things to break in (and that the pmap has been relatively well exercised for the most part, so there can't be too many egregious bugs in it like this...) it seems highly likely it's this. Hopefully zBeeble can try running with this patch and see the panics disappear, as I don't believe Dennis has seen the panic since reporting it.

With SV39 we have Ln_ENTRIES L1 entries, so Ln_ENTRIES L2 page table pages, so Ln_ENTRIES^2 L3 page table pages. So the (user) L3 page table pages should have pindices in the range [0, Ln_ENTRIES^2 - 1], and (user) L2 page table pages should have pindices in the range [Ln_ENTRIES^2, Ln_ENTRIES^2 + Ln_ENTRIES - 1]. In particular, I think the definitions of NUL1E and NUL2E are wrong, their values should also be shifted right by 9. Or am I misunderstanding?

With SV39 we have Ln_ENTRIES L1 entries, so Ln_ENTRIES L2 page table pages, so Ln_ENTRIES^2 L3 page table pages. So the (user) L3 page table pages should have pindices in the range [0, Ln_ENTRIES^2 - 1], and (user) L2 page table pages should have pindices in the range [Ln_ENTRIES^2, Ln_ENTRIES^2 + Ln_ENTRIES - 1]. In particular, I think the definitions of NUL1E and NUL2E are wrong, their values should also be shifted right by 9. Or am I misunderstanding?

Yes, they are all bigger than they need to be currently by a factor of 2^9, but with Sv48 you'll need the current definitions so I think it's fine to leave it as is (just as on amd64 we always use the LA57 numbering even for 4-level paging) rather than change them now and have to revert that later.

With SV39 we have Ln_ENTRIES L1 entries, so Ln_ENTRIES L2 page table pages, so Ln_ENTRIES^2 L3 page table pages. So the (user) L3 page table pages should have pindices in the range [0, Ln_ENTRIES^2 - 1], and (user) L2 page table pages should have pindices in the range [Ln_ENTRIES^2, Ln_ENTRIES^2 + Ln_ENTRIES - 1]. In particular, I think the definitions of NUL1E and NUL2E are wrong, their values should also be shifted right by 9. Or am I misunderstanding?

Yes, they are all bigger than they need to be currently by a factor of 2^9, but with Sv48 you'll need the current definitions so I think it's fine to leave it as is (just as on amd64 we always use the LA57 numbering even for 4-level paging) rather than change them now and have to revert that later.

So there are no gaps between indices for consecutive L3 PTPs because of the way pmap_l2_pindex() is defined, there is just a range of unused indices between the last L3 index and the first L2 index. Ok, I agree that it's fine.

sys/riscv/riscv/pmap.c
1287–1289

This can be done as a separate change, but we should probably assert that the L1 entry here isn't already valid.

1318–1320

Same here, we should assert that L2 isn't a valid entry.

This revision is now accepted and ready to land.Jul 7 2021, 4:07 PM

With SV39 we have Ln_ENTRIES L1 entries, so Ln_ENTRIES L2 page table pages, so Ln_ENTRIES^2 L3 page table pages. So the (user) L3 page table pages should have pindices in the range [0, Ln_ENTRIES^2 - 1], and (user) L2 page table pages should have pindices in the range [Ln_ENTRIES^2, Ln_ENTRIES^2 + Ln_ENTRIES - 1]. In particular, I think the definitions of NUL1E and NUL2E are wrong, their values should also be shifted right by 9. Or am I misunderstanding?

Yes, they are all bigger than they need to be currently by a factor of 2^9, but with Sv48 you'll need the current definitions so I think it's fine to leave it as is (just as on amd64 we always use the LA57 numbering even for 4-level paging) rather than change them now and have to revert that later.

So there are no gaps between indices for consecutive L3 PTPs because of the way pmap_l2_pindex() is defined, there is just a range of unused indices between the last L3 index and the first L2 index. Ok, I agree that it's fine.

Yeah.

sys/riscv/riscv/pmap.c
1287–1289

That would indeed be wise; amd64's pmap does that, but none of arm, arm64, i386 and mips manage to do that... so probably makes sense to add the assertions to all of them.

With SV39 we have Ln_ENTRIES L1 entries, so Ln_ENTRIES L2 page table pages, so Ln_ENTRIES^2 L3 page table pages. So the (user) L3 page table pages should have pindices in the range [0, Ln_ENTRIES^2 - 1], and (user) L2 page table pages should have pindices in the range [Ln_ENTRIES^2, Ln_ENTRIES^2 + Ln_ENTRIES - 1]. In particular, I think the definitions of NUL1E and NUL2E are wrong, their values should also be shifted right by 9. Or am I misunderstanding?

Yes, they are all bigger than they need to be currently by a factor of 2^9, but with Sv48 you'll need the current definitions so I think it's fine to leave it as is (just as on amd64 we always use the LA57 numbering even for 4-level paging) rather than change them now and have to revert that later.

So there are no gaps between indices for consecutive L3 PTPs because of the way pmap_l2_pindex() is defined, there is just a range of unused indices between the last L3 index and the first L2 index. Ok, I agree that it's fine.

Yeah.

Well, actually, not quite; due to the sign-extension of virtual addresses, the holes are in the middle of each level, which is why this actually matters. If the hole were at the end of the level then NUL1E, due to the 4-level definitions, would actually be a sufficient upper bound. I think.

Hm but that might break stuff... thinking about it more I'm not sure how in practice case 1 ever gets hit?

Yeah, this code is only for user addresses i.e. no negative addresses, so given the current NUL1E is equal to what NUL2E would be if we defined them for just the 3-level Sv39 the existing code, whilst notionally wrong, was actually fine, with the exception of the highlighted pmap_enter_l2 case that was already correctly using NUL2E (albeit not given everything else was "wrong") that must never have been hit since it would otherwise have instantly broken with some kind of page or access fault trying to access way beyond the end of the L1 table. So I don't think this fixes 250866, and it doesn't fix the panics reported in IRC as it got hit again apparently today with this patch applied.

This revision was automatically updated to reflect the committed changes.