During pmap_kenter try to find the correct mapping attribute by
looking in the efi map.
If nothing is found default to device memory.
While here fix WC as it should correspond to uncachable and not write througt
Sponsored by: Ampere Computing, LLC
Details
Tested on : Ampere eMAG, ThunderX2, Pine64
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
Can you do the same for amd64 ?
I think this change should be somewhat improved by taking the most strong attr mode between pair of what is reported by UEFI memory map, and what is requested by a caller. For instance, if caller requests uncacheable, while UEFI reports write-combining, uncacheable should win.
I could have a look latter this week.
I think this change should be somewhat improved by taking the most strong attr mode between pair of what is reported by UEFI memory map, and what is requested by a caller. For instance, if caller requests uncacheable, while UEFI reports write-combining, uncacheable should win.
I don't really understand, the caller of pmap_mapbios never request the mapping attr.
I mean that any mapping created by pmap_mapXXX() functions should take the UEFI map attributes. In particular, pmap should not change DMAP attributes in a way incompatible with the firmware requirements.
Correct the mapping in pmap_kenter based on the requested one and the one in the efi map
sys/arm64/arm64/pmap.c | ||
---|---|---|
1163 | How large is typical map on arm64 ? How many linear walks per page mapping do we add this way ? |
sys/arm64/arm64/pmap.c | ||
---|---|---|
1163 | On the Ampere eMAG it's 25 while on the Pine64 it's 4 and Overdrive 1000 3 it seems. |
sys/arm64/arm64/pmap.c | ||
---|---|---|
1163 | For me, 25 sounds same as 'a lot'. Since each pmap_kenter() would typically walk some part of the list, depending on the ordering of the elements, I believe it is a good reason to pre-process the map into something more optimal. If you do that, the derived data structure might allow to enter additional items not presented in the raw EFI map, which is good for e.g. purpose of using similar approach on amd64 with non-EFI boot. You might consider using pctrie keyed on the start address of the map element, see sys/pctrie.h. Although the tree will be quite sparse, so may be you can propose something even more optimal. |
So with a pctrie (I can post the code somewhere if you want) I have 14 non-leaf node
db> show pctrienode efi_map_trie_zone
node 0xffff000000ad3f90, owner fffffd0010d0b540, children count 19408, level 4305:
slot: 0, val: 0xffff000001568060, value: 0, clev: 4305
slot: 1, val: 0xbfffd3ff18, value: 0, clev: 4305
slot: 2, val: 0xbfffd3fb18, value: 0, clev: 4305
slot: 3, val: 0xfffffdbf7fd30018, value: 0, clev: 4305
slot: 5, val: 0xffff000000760795, value: 0xffff000000760794, clev: 4305
slot: 6, val: 0x1030000, value: 0, clev: 4305
slot: 7, val: 0xfffffdbf6a38b200, value: 0, clev: 4305
slot: 9, val: 0xfffffd00101a2c40, value: 0, clev: 4305
slot: 10, val: 0xfffffd0010b9e400, value: 0, clev: 4305
slot: 12, val: 0x1dcd6500, value: 0, clev: 4305
slot: 13, val: 0xfffffd0010b9e380, value: 0, clev: 4305
slot: 14, val: 0xfffffd0010b9d200, value: 0, clev: 4305
So we might need something else.
You could build a table of descriptors with EFI_MD_ATTR_RT set, sorted by md_virt value, and do a binary search on that table to find which range, and thus which descriptor, had the address you're looking for.
I don't know anything about EFI, so I don't know if the ranges come in sorted order already, or whether we'd have to sort them. And I don't know if this is something that could be done once in efirt.c:efi_init, or would need to be done in efirt_machdep.c:eft_create_1t1_map for each architecture. In any case, it seems not too hard to replace a linear search with a binary search.
Every firmware that I saw populate the table sorted but I don't think that this is a requirement from the spec so we have to assume that it is not.
It should be done once in efi_init, that's what I do in my code with the pctrie (well in a similar way as right now as some calls are made before efi_init is called).
Do you know some code I could borrowed for that ? I guess I'm not the only one who need binary search code in the kernel.
libkern has both qsort and bsearch, and I can grep the source tree for many invocations of each.
I think a better approach if we want to honor the EFI map on both x86 and arm is to check the memory map when creating the direct map itself so that we set the right attributes on the entire range (e.g. the entire MMIO range at once) allowing use of large pages instead of breaking them down into small pages when a device first requests a resource and the only re-promoting if the entire MMIO range is claimed by devices. I would still like to have pmap_mapbios never set an explicit attr but just use whatever is configured as well.
I think that creating DMAP with the right attributes from start is not just better approach, but in fact is required for correctness. CPUs are allowed to make arbitrary speculative accesses, which means that any valid page table entry must be correct. I might work on the DMAP part, at least for amd64.
Tested on += Marvell MACCHIATObin (Armada8k) [upstream EDK2, atf-marvell, ECAM shift removed, ACPI mode]
On this platform, this patch significantly improves the behavior of the (linuxkpi) drivers that ioremap PCIe memory:
- mlx4en fully works instead of hanging!
- amdgpu (WIP) gets further in the init process:
- no patch:
- register mmio base: 0xC0000000 (remapped as 0x7882D000)
- first readl from there (0xffff00007882d000 + 16140 — reading gpu revision) hangs
- with patch:
- register mmio base: 0xC0000000 (remapped as 0x78823000)
- multiple successful readl: 0xffff000078823000 + 16140, + 52228, + 1716, + 1716 — early_init done
- next one hangs: 0xffff000078823000 + 28 — somewhere in bios reading
- no patch:
UPDATE: facepalm, lkpi was using UNCACHEABLE memory instead of DEVICE. this patch helping was just a side effect, I found the real fix. Still, tested here :)
UPDATE: actually this miiiight have been preventing the GPU from working.
EDIT: retested, yeah, it does: https://reviews.freebsd.org/P279
So long ago and I don't know if this is still really needed as we boot fine on Ampere machines.