Page MenuHomeFreeBSD

arm64: pmap: Try to find the correct attribute
Needs ReviewPublic

Authored by manu on May 22 2019, 5:07 AM.

Details

Reviewers
kib
andrew
jhb
dougm
Group Reviewers
arm64
Summary

During pmap_kenter try to find the correct mapping attribute by
looking in the efi map.
If nothing is found default to device memory.
While here fix WC as it should correspond to uncachable and not write througt
Sponsored by: Ampere Computing, LLC

Test Plan

Tested on : Ampere eMAG, ThunderX2, Pine64

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

manu created this revision.May 22 2019, 5:07 AM
kib added a reviewer: jhb.EditedMay 22 2019, 9:00 AM

Can you do the same for amd64 ?

I think this change should be somewhat improved by taking the most strong attr mode between pair of what is reported by UEFI memory map, and what is requested by a caller. For instance, if caller requests uncacheable, while UEFI reports write-combining, uncacheable should win.

manu added a comment.May 22 2019, 9:22 AM
In D20348#438878, @kib wrote:

Can you do the same for amd64 ?

I could have a look latter this week.

I think this change should be somewhat improved by taking the most strong attr mode between pair of what is reported by UEFI memory map, and what is requested by a caller. For instance, if caller requests uncacheable, while UEFI reports write-combining, uncacheable should win.

I don't really understand, the caller of pmap_mapbios never request the mapping attr.

kib added a comment.May 22 2019, 9:46 AM
In D20348#438886, @manu wrote:
In D20348#438878, @kib wrote:

I think this change should be somewhat improved by taking the most strong attr mode between pair of what is reported by UEFI memory map, and what is requested by a caller. For instance, if caller requests uncacheable, while UEFI reports write-combining, uncacheable should win.

I don't really understand, the caller of pmap_mapbios never request the mapping attr.

I mean that any mapping created by pmap_mapXXX() functions should take the UEFI map attributes. In particular, pmap should not change DMAP attributes in a way incompatible with the firmware requirements.

manu added a comment.May 22 2019, 9:53 AM
In D20348#438893, @kib wrote:
In D20348#438886, @manu wrote:
In D20348#438878, @kib wrote:

I think this change should be somewhat improved by taking the most strong attr mode between pair of what is reported by UEFI memory map, and what is requested by a caller. For instance, if caller requests uncacheable, while UEFI reports write-combining, uncacheable should win.

I don't really understand, the caller of pmap_mapbios never request the mapping attr.

I mean that any mapping created by pmap_mapXXX() functions should take the UEFI map attributes. In particular, pmap should not change DMAP attributes in a way incompatible with the firmware requirements.

Ah right, ok, I'll do that.

manu updated this revision to Diff 57749.May 23 2019, 6:43 AM
manu edited the test plan for this revision. (Show Details)

Correct the mapping in pmap_kenter based on the requested one and the one in the efi map

manu retitled this revision from arm64: pmap_mapbios: Try to find the correct attribute to arm64: pmap: Try to find the correct attribute.May 23 2019, 6:51 AM
manu edited the summary of this revision. (Show Details)
manu updated this revision to Diff 57752.May 23 2019, 6:58 AM
manu edited the summary of this revision. (Show Details)

remove a blank line removal.

kib added inline comments.May 23 2019, 7:29 PM
sys/arm64/arm64/pmap.c
1162

How large is typical map on arm64 ? How many linear walks per page mapping do we add this way ?

manu added inline comments.May 27 2019, 1:39 PM
sys/arm64/arm64/pmap.c
1162

On the Ampere eMAG it's 25 while on the Pine64 it's 4 and Overdrive 1000 3 it seems.

kib added inline comments.May 27 2019, 1:56 PM
sys/arm64/arm64/pmap.c
1162

For me, 25 sounds same as 'a lot'. Since each pmap_kenter() would typically walk some part of the list, depending on the ordering of the elements, I believe it is a good reason to pre-process the map into something more optimal. If you do that, the derived data structure might allow to enter additional items not presented in the raw EFI map, which is good for e.g. purpose of using similar approach on amd64 with non-EFI boot.

You might consider using pctrie keyed on the start address of the map element, see sys/pctrie.h. Although the tree will be quite sparse, so may be you can propose something even more optimal.

kib added a reviewer: dougm.May 27 2019, 1:56 PM

May be Doug with his experience with efficient data structures can suggest.

manu added a comment.May 27 2019, 6:15 PM

So with a pctrie (I can post the code somewhere if you want) I have 14 non-leaf node
db> show pctrienode efi_map_trie_zone
node 0xffff000000ad3f90, owner fffffd0010d0b540, children count 19408, level 4305:
slot: 0, val: 0xffff000001568060, value: 0, clev: 4305
slot: 1, val: 0xbfffd3ff18, value: 0, clev: 4305
slot: 2, val: 0xbfffd3fb18, value: 0, clev: 4305
slot: 3, val: 0xfffffdbf7fd30018, value: 0, clev: 4305
slot: 5, val: 0xffff000000760795, value: 0xffff000000760794, clev: 4305
slot: 6, val: 0x1030000, value: 0, clev: 4305
slot: 7, val: 0xfffffdbf6a38b200, value: 0, clev: 4305
slot: 9, val: 0xfffffd00101a2c40, value: 0, clev: 4305
slot: 10, val: 0xfffffd0010b9e400, value: 0, clev: 4305
slot: 12, val: 0x1dcd6500, value: 0, clev: 4305
slot: 13, val: 0xfffffd0010b9e380, value: 0, clev: 4305
slot: 14, val: 0xfffffd0010b9d200, value: 0, clev: 4305

So we might need something else.

dougm added a comment.May 28 2019, 5:31 AM

You could build a table of descriptors with EFI_MD_ATTR_RT set, sorted by md_virt value, and do a binary search on that table to find which range, and thus which descriptor, had the address you're looking for.

I don't know anything about EFI, so I don't know if the ranges come in sorted order already, or whether we'd have to sort them. And I don't know if this is something that could be done once in efirt.c:efi_init, or would need to be done in efirt_machdep.c:eft_create_1t1_map for each architecture. In any case, it seems not too hard to replace a linear search with a binary search.

manu added a comment.May 28 2019, 5:51 AM

You could build a table of descriptors with EFI_MD_ATTR_RT set, sorted by md_virt value, and do a binary search on that table to find which range, and thus which descriptor, had the address you're looking for.
I don't know anything about EFI, so I don't know if the ranges come in sorted order already, or whether we'd have to sort them. And I don't know if this is something that could be done once in efirt.c:efi_init, or would need to be done in efirt_machdep.c:eft_create_1t1_map for each architecture. In any case, it seems not too hard to replace a linear search with a binary search.

Every firmware that I saw populate the table sorted but I don't think that this is a requirement from the spec so we have to assume that it is not.
It should be done once in efi_init, that's what I do in my code with the pctrie (well in a similar way as right now as some calls are made before efi_init is called).
Do you know some code I could borrowed for that ? I guess I'm not the only one who need binary search code in the kernel.

dougm added a comment.May 28 2019, 6:15 AM
In D20348#441196, @manu wrote:

It should be done once in efi_init, that's what I do in my code with the pctrie (well in a similar way as right now as some calls are made before efi_init is called).
Do you know some code I could borrowed for that ? I guess I'm not the only one who need binary search code in the kernel.

libkern has both qsort and bsearch, and I can grep the source tree for many invocations of each.

jhb added a comment.May 28 2019, 5:12 PM

I think a better approach if we want to honor the EFI map on both x86 and arm is to check the memory map when creating the direct map itself so that we set the right attributes on the entire range (e.g. the entire MMIO range at once) allowing use of large pages instead of breaking them down into small pages when a device first requests a resource and the only re-promoting if the entire MMIO range is claimed by devices. I would still like to have pmap_mapbios never set an explicit attr but just use whatever is configured as well.

kib added a comment.May 28 2019, 6:23 PM
In D20348#441376, @jhb wrote:

I think a better approach if we want to honor the EFI map on both x86 and arm is to check the memory map when creating the direct map itself so that we set the right attributes on the entire range (e.g. the entire MMIO range at once) allowing use of large pages instead of breaking them down into small pages when a device first requests a resource and the only re-promoting if the entire MMIO range is claimed by devices. I would still like to have pmap_mapbios never set an explicit attr but just use whatever is configured as well.

I think that creating DMAP with the right attributes from start is not just better approach, but in fact is required for correctness. CPUs are allowed to make arbitrary speculative accesses, which means that any valid page table entry must be correct. I might work on the DMAP part, at least for amd64.

Tested on += Marvell MACCHIATObin (Armada8k) [upstream EDK2, atf-marvell, ECAM shift removed, ACPI mode]

On this platform, this patch significantly improves the behavior of the (linuxkpi) drivers that ioremap PCIe memory:

  • mlx4en fully works instead of hanging!
  • amdgpu (WIP) gets further in the init process:
    • no patch:
      • register mmio base: 0xC0000000 (remapped as 0x7882D000)
      • first readl from there (0xffff00007882d000 + 16140 — reading gpu revision) hangs
    • with patch:
      • register mmio base: 0xC0000000 (remapped as 0x78823000)
      • multiple successful readl: 0xffff000078823000 + 16140, + 52228, + 1716, + 1716early_init done
      • next one hangs: 0xffff000078823000 + 28 — somewhere in bios reading

UPDATE: facepalm, lkpi was using UNCACHEABLE memory instead of DEVICE. this patch helping was just a side effect, I found the real fix. Still, tested here :)

UPDATE: actually this miiiight have been preventing the GPU from working.

EDIT: retested, yeah, it does: https://reviews.freebsd.org/P279

dougm resigned from this revision.Sep 17 2019, 5:39 AM