RB_AUGMENT is supposed to update augmented tree data in an rb-tree when the tree is modified, by being applied to the root of every modified subtree from the bottom up to the root, but its use in the RB_TREE macros doesn't meet that requirement. The only user of RB_AUGMENT that I know of, therefore had to write his augmentation routine to walk to the root itself, leading to pointless duplicate augmentation. This change seeks to fix the macros and simplify augmentation in that one case. It also slightly improves performance when there is no augmentation, by using 'break' as the default augmentation command, to avoid needless walking to root when nothing would be done at each iteration.
Details
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
If RB_ code would do the walk, I am fine with it.
Did you tested this on a machine with DMAR enabled ? You may ask Peter, long-running high-load networking tests exercise the GAS allocator quite efficiently.
sys/sys/tree.h | ||
---|---|---|
576 ↗ | (On Diff #66774) | Why is this assignment correct? It looks as though RB_PARENT(old, field) should be on the RHS. |
Peter, can you help with this test please?
sys/sys/tree.h | ||
---|---|---|
576 ↗ | (On Diff #66774) | In this case, elm is the right child of old, and child is the right, and only, child of elm. We want the value of parent to become elm in line 594, so elm becomes its own parent here to make that happen. If we put "RB_PARENT(elm, field) = old;" here, then old becomes a parent of child in line 597 and is not detached from the tree. |
Here's a panic with dmar enabled:
hdac0: <Intel Patsburg HDA Controller> mem 0xd7f20000-0xd7f23fff irq 22 at device 27.0 numa-domain 0 on pci0 hdac0: dmar1 pci0:0:27:0 rid d8 domain 3 mgaw 48 agaw 48 re-mapped hpanic: driver error: busdma dflt_lock called cpuid = 21 time = 1 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0126881970 vpanic() at vpanic+0x185/frame 0xfffffe01268819d0 panic() at panic+0x43/frame 0xfffffe0126881a30 bus_dma_dflt_lock() at bus_dma_dflt_lock+0x12/frame 0xfffffe0126881a40 dmar_bus_dmamap_complete() at dmar_bus_dmamap_complete+0x3f/frame 0xfffffe0126881a70 bus_dmamap_load_mem() at bus_dmamap_load_mem+0x31c/frame 0xfffffe0126881ae0 dmar_bus_task_dmamap() at dmar_bus_task_dmamap+0xff/frame 0xfffffe0126881b20 taskqueue_run_locked() at taskqueue_run_locked+0x178/frame 0xfffffe0126881b80 taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe0126881bb0 fork_exit() at fork_exit+0x80/frame 0xfffffe0126881bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0126881bf0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- KDB: enter: panic [ thread pid 0 tid 100173 ] Stopped at kdb_enter+0x37: movq $0,0x1085586(%rip) db> x/s version version: FreeBSD 13.0-CURRENT #0 r356795M: Thu Jan 16 17:33:24 CET 2020\012 pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO\012 db>
Correct the augmentation after RB_INSERT so that the newly inserted node is augmented too, before its parent.
Discard some parts from intel_gas.c that, theoretically, are no longer needed to fix augmentation problems.
With dmar enable I get:
hdacc0: <Realtek ALC889 HDA CODEC> at cad 0 on hdac0 ugen2.1: <Intel EHCI root HUB> at usbus2 hdaa0: <Realtek ALC889 Audio Function Group> at nid 1 on hdacc0 ahcich1: DMA load error (aprobe1:ahcich1:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00 (aprobe1:ahcich1:0:0:0): CAM status: CCB request was invalid (aprobe1:ahcich1:0:0:0): Error 22, Unretryable error pcm0: <Realtek ALC889 (Rear Analog 7.1/2.0)> at nid 20,22,21,23 and 24,26 on hdaa0 ahcich1: DMA load error (aprobe1:ahcich1:0:0:0): ATAPI_IDENTIFY. ACB: a1 00 00 00 00 40 00 00 00 00 00 00 (aprobe1:ahcich1:0:0:0): CAM status: CCB request was invalid (aprobe1:ahcich1:0:0:0): Error 22, Unretryable error
Rework intel_gas.c more than I had planned. Along with free_down, keep a first and last value for each entry describing the least and greatest value in the subtree rooted there. Update these in the augment routine. In lowermatch, look for the fit with greatest address less than the lowaddr, which I understand to be an upper bound on the address. In uppermatch, duplicate the lowermatch code, looking for the fit with least address greater than highaddr, which I understand to be a lower bound on the address.
This didn't compile
cc -c -O2 -pipe -fno-strict-aliasing -g -nostdinc -I. -I../../.. -I../../../contrib/ck/include -I../../../contrib/libfdt -D_KERNEL -Dc ../../../x86/iommu/intel_gas.c:178:17: error: no member named 'free_after' in 'struct dmar_map_entry' MPASS(entry->free_after == domain->end - entry->end);
isci0: <Intel(R) C600 Series Chipset SAS Controller> port 0x2000-0x20ff mem 0xd7c00000-0xd7c03fff,0xd7800000-0xd7bfffff irq 16 at device 0.0 numa-domain 0 on pci4 unknown: dmar1 pci0:0:29:0 rid e8 domain 0 mgaw 48 agaw 48 re-mapped unknown: dmar1 pci0:0:26:0 rid d0 domain 1 mgaw 48 agaw 48 re-mapped isci0: dmar1 pci0:4:0:0 rid 400 domain 2 mgaw 48 agaw 48 re-mapped Kernel page fault with the following non-sleepable locks held: exclusive sleep mutex dmardom (dmardom) r = 0 (0xfffff80003d26c30) locked @ x86/iommu/intel_gas.c:616 stack backtrace: #0 0xffffffff80c354b1 at witness_debugger+0x71 #1 0xffffffff80c364d0 at witness_warn+0x430 #2 0xffffffff8106dee0 at trap_pfault+0x80 #3 0xffffffff8106d500 at trap+0x2c0 #4 0xffffffff81043f5c at calltrap+0x8 #5 0xffffffff81021462 at dmar_gas_lowermatch+0x282 #6 0xffffffff81021242 at dmar_gas_lowermatch+0x62 #7 0xffffffff810209f5 at dmar_gas_map+0xf5 #8 0xffffffff810191a3 at dmar_bus_dmamap_load_something+0x153 #9 0xffffffff81018905 at dmar_bus_dmamap_load_buffer+0x1d5 #10 0xffffffff80c07cc6 at bus_dmamap_load+0x86 #11 0xffffffff80fec6fb at isci_allocate_dma_buffer+0xdb #12 0xffffffff80fed9bf at isci_controller_allocate_memory+0xaf #13 0xffffffff80fec54b at isci_initialize+0x18b #14 0xffffffff80feca7f at isci_attach+0x14f #15 0xffffffff80c01ffa at device_attach+0x3ca #16 0xffffffff80c01ba0 at device_probe_and_attach+0x70 #17 0xffffffff80c03228 at bus_generic_attach+0x18 Fatal trap 12: page fault while in kernel mode cpuid = 2; apic id = 02 fault virtual address = 0x0 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff810217a7 stack pointer = 0x28:0xffffffff821182e0 frame pointer = 0x28:0xffffffff82118300 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 0 (swapper) trap number = 12 panic: page fault cpuid = 2 time = 1 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82117f40 vpanic() at vpanic+0x185/frame 0xffffffff82117fa0 panic() at panic+0x43/frame 0xffffffff82118000 trap_fatal() at trap_fatal+0x386/frame 0xffffffff82118060 trap_pfault() at trap_pfault+0x99/frame 0xffffffff821180e0 trap() at trap+0x2c0/frame 0xffffffff82118210 calltrap() at calltrap+0x8/frame 0xffffffff82118210 --- trap 0xc, rip = 0xffffffff810217a7, rsp = 0xffffffff821182e0, rbp = 0xffffffff82118300 --- dmar_gas_match_insert() at dmar_gas_match_insert+0x67/frame 0xffffffff82118300 dmar_gas_lowermatch() at dmar_gas_lowermatch+0x282/frame 0xffffffff82118350 dmar_gas_lowermatch() at dmar_gas_lowermatch+0x62/frame 0xffffffff821183a0 dmar_gas_map() at dmar_gas_map+0xf5/frame 0xffffffff82118430 dmar_bus_dmamap_load_something() at dmar_bus_dmamap_load_something+0x153/frame 0xffffffff82118500 dmar_bus_dmamap_load_buffer() at dmar_bus_dmamap_load_buffer+0x1d5/frame 0xffffffff821185a0 bus_dmamap_load() at bus_dmamap_load+0x86/frame 0xffffffff82118620 isci_allocate_dma_buffer() at isci_allocate_dma_buffer+0xdb/frame 0xffffffff82118650 isci_controller_allocate_memory() at isci_controller_allocate_memory+0xaf/frame 0xffffffff82118690 isci_initialize() at isci_initialize+0x18b/frame 0xffffffff821186d0 isci_attach() at isci_attach+0x14f/frame 0xffffffff82118700 device_attach() at device_attach+0x3ca/frame 0xffffffff82118740 device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118770 bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118790 pci_attach() at pci_attach+0xe0/frame 0xffffffff821187d0 acpi_pci_attach() at acpi_pci_attach+0x19/frame 0xffffffff82118820 device_attach() at device_attach+0x3ca/frame 0xffffffff82118860 device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118890 bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff821188b0 acpi_pcib_pci_attach() at acpi_pcib_pci_attach+0xa0/frame 0xffffffff821188f0 device_attach() at device_attach+0x3ca/frame 0xffffffff82118930 device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118960 bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118980 pci_attach() at pci_attach+0xe0/frame 0xffffffff821189c0 acpi_pci_attach() at acpi_pci_attach+0x19/frame 0xffffffff82118a10 device_attach() at device_attach+0x3ca/frame 0xffffffff82118a50 device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118a80 bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118aa0 acpi_pcib_acpi_attach() at acpi_pcib_acpi_attach+0x43c/frame 0xffffffff82118b10 device_attach() at device_attach+0x3ca/frame 0xffffffff82118b50 device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118b80 bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118ba0 acpi_attach() at acpi_attach+0xd50/frame 0xffffffff82118c30 device_attach() at device_attach+0x3ca/frame 0xffffffff82118c70 device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118ca0 bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118cc0 device_attach() at device_attach+0x3ca/frame 0xffffffff82118d00 device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118d30 bus_generic_new_pass() at bus_generic_new_pass+0xed/frame 0xffffffff82118d60 bus_set_pass() at bus_set_pass+0x46/frame 0xffffffff82118d90 configure() at configure+0x9/frame 0xffffffff82118da0 mi_startup() at mi_startup+0xec/frame 0xffffffff82118df0 btext() at btext+0x2c KDB: enter: panic [ thread pid 0 tid 100000 ] Stopped at kdb_enter+0x37: movq $0,0x1085536(%rip) db> x/s version version: FreeBSD 13.0-CURRENT #3 r356826M: Fri Jan 17 16:59:08 CET 2020\012 pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO\012 db> show registers cs 0x20 ds 0x3b ll+0x1a es 0x3b ll+0x1a fs 0x13 gs 0x28 ll+0x7 ss 0x28 ll+0x7 rax 0x12 rcx 0x80 ll+0x5f rdx 0xffffffff82117e30 rbx 0xffffffff811fa8fc rsp 0xffffffff82117f30 rbp 0xffffffff82117f40 rsi 0x80 ll+0x5f rdi 0xffffffff81c7af98 cnputs_mtx r8 0 r9 0x8080808080808080 r10 0x12 r11 0x1 r12 0 r13 0xffffffff82118220 r14 0 r15 0xffffffff81d9e180 thread0_st rip 0xffffffff80c135a7 kdb_enter+0x37 rflags 0x82 ll+0x61 kdb_enter+0x37: movq $0,0x1085536(%rip) db> (kgdb) l *0xffffffff810217a7 0xffffffff810217a7 is in dmar_gas_match_insert (../../../x86/iommu/intel_gas.c:348). 343 * allocations to ensure that out-of-bounds accesses fault. 344 */ 345 a->entry->end = a->entry->start + a->size; 346 347 next = RB_NEXT(dmar_gas_entries_tree, &a->domain->rb_root, prev); 348 KASSERT(next->start >= a->entry->end && 349 next->start - a->entry->start >= a->size && 350 prev->end <= a->entry->end, 351 ("dmar_gas_match_insert hole failed %p prev (%jx, %jx) " 352 "next (%jx, %jx) entry (%jx, %jx)", a->domain, (kgdb)
Stop sending the prev entry to dmar_gas_match_insert. I was sending the prev entry sometimes and the next entry sometimes, but the insert function was only using them for assertion checking anyway.
I ran a 30 minutest udp test with D23189.66959.diff and hw.dmar.enable=1.
I ran 100 tests in two hours followed by at buildworld / installworld.
No problems seen.
Push the setting of a->entry->start down into dmar_gas_match_one. Add a documentary comment before that function.
sys/x86/iommu/intel_gas.c | ||
---|---|---|
284 ↗ | (On Diff #66959) | Because there is no longer a field 'free_after'. I could add an assertion that RB_RIGHT is null, if you like. |
sys/x86/iommu/intel_gas.c | ||
---|---|---|
369 ↗ | (On Diff #66970) | Am I correct that lowermatch now prefers higher free address ranges over the low ranges ? For instance, assume that there is no upper hole at all, and that we have the only entry pre-allocated that consumes single page in the middle of the guest address space. Then old lowermatch selects lowest addresses for a new request, while your version would select the highest (after the allocated entry) ? |
sys/x86/iommu/intel_gas.c | ||
---|---|---|
369 ↗ | (On Diff #66970) | You are correct that lowermatch, after this change, invokes dmar_gas_match_insert on the highest possible free address range. You are wrong to imply that lowermatch, before this change, invoked dmar_gas_match_insert on the lowest possible free address range. Before, it tried a 'middle' range - whatever happened to start just right of the root of the tree, and if that failed, then it tried searching (recursively) the part before that middle range, then the part after that middle range. So while it was biased toward the lower free address ranges, it didn't pick the lowest, necessarily. I reasoned that the previous implementation was indifferent about which free range got picked, since it sought neither the first nor the last free range, but that the lowest addresses might be more 'precious', in the sense that if one allocation request asks for a range below 1000, and the next request asks for a range below 500, it is better if the first request didn't consume the only available range below 500 when there were other possibilities available. If I understand your 'upper hole' question correctly - and I may not - you're asking about the case when there is one allocated range less than lowaddr and thus two free ranges less than lowaddr, before and after that allocated range. The old lowermatch would try the free range after the allocated range - starting at 'prev->end', where prev points to the allocated range. The new lowermatch would get to the same place, after a call to lowermatch on the right subtree didn't produce an allocation. |
sys/x86/iommu/intel_gas.c | ||
---|---|---|
369 ↗ | (On Diff #66970) | There is no need to save the addresses at low. I think that the ideal scheme for guest AS allocation would be to have a cursor and move it past the end of the last successful allocation, wrapping on the end. Users cannot specify that they want something at low addresses for the given request. What can be done is to declare a 'hole' in the middle of the address space. Then I call the allocations from below the bottom of the hole 'lower', and above the top of the hole 'upper'. The upper range is very rarely supported/requested. Typically the AS is either the whole bus space, or hole starts at 4G and to the end of AS. |
Discard everything about x86/iommu/intel_* and just leave the tree.h changes, so that they don't get trapped in a discussion of other things.
Restore the intel_gas changes that exploit RB_AUGMENT - the ones that eliminate the need to walk from leaf to root to update free_down, and the ones that eliminate the need to use RB_NEXT to update free_after fields. But don't change the topmost-fit search method of lowermatch, or use free_down to speed up uppermatch. The order in which the match functions consider entries is not changed.
I'm told offline that this should all be redone in a way that doesn't use a search tree, but my objective is just to get the search tree to be used more effectively.