Page MenuHomeFreeBSD

Fix definition and use of RB_AUGMENT
ClosedPublic

Authored by dougm on Jan 15 2020, 9:19 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Oct 24, 6:00 PM
Unknown Object (File)
Mon, Oct 20, 7:04 AM
Unknown Object (File)
Fri, Oct 17, 2:34 AM
Unknown Object (File)
Wed, Oct 15, 9:06 AM
Unknown Object (File)
Wed, Oct 15, 9:05 AM
Unknown Object (File)
Wed, Oct 15, 9:05 AM
Unknown Object (File)
Wed, Oct 15, 9:04 AM
Unknown Object (File)
Wed, Oct 15, 12:13 AM
Subscribers

Details

Summary

RB_AUGMENT is supposed to update augmented tree data in an rb-tree when the tree is modified, by being applied to the root of every modified subtree from the bottom up to the root, but its use in the RB_TREE macros doesn't meet that requirement. The only user of RB_AUGMENT that I know of, therefore had to write his augmentation routine to walk to the root itself, leading to pointless duplicate augmentation. This change seeks to fix the macros and simplify augmentation in that one case. It also slightly improves performance when there is no augmentation, by using 'break' as the default augmentation command, to avoid needless walking to root when nothing would be done at each iteration.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

If RB_ code would do the walk, I am fine with it.

Did you tested this on a machine with DMAR enabled ? You may ask Peter, long-running high-load networking tests exercise the GAS allocator quite efficiently.

sys/sys/tree.h
576

Why is this assignment correct? It looks as though RB_PARENT(old, field) should be on the RHS.

In D23189#508661, @kib wrote:

If RB_ code would do the walk, I am fine with it.

Did you tested this on a machine with DMAR enabled ? You may ask Peter, long-running high-load networking tests exercise the GAS allocator quite efficiently.

Peter, can you help with this test please?

sys/sys/tree.h
576

In this case, elm is the right child of old, and child is the right, and only, child of elm. We want the value of parent to become elm in line 594, so elm becomes its own parent here to make that happen.

If we put "RB_PARENT(elm, field) = old;" here, then old becomes a parent of child in line 597 and is not detached from the tree.

Here's a panic with dmar enabled:

hdac0: <Intel Patsburg HDA Controller> mem 0xd7f20000-0xd7f23fff irq 22 at device 27.0 numa-domain 0 on pci0
hdac0: dmar1 pci0:0:27:0 rid d8 domain 3 mgaw 48 agaw 48 re-mapped
hpanic: driver error: busdma dflt_lock called
cpuid = 21
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0126881970
vpanic() at vpanic+0x185/frame 0xfffffe01268819d0
panic() at panic+0x43/frame 0xfffffe0126881a30
bus_dma_dflt_lock() at bus_dma_dflt_lock+0x12/frame 0xfffffe0126881a40
dmar_bus_dmamap_complete() at dmar_bus_dmamap_complete+0x3f/frame 0xfffffe0126881a70
bus_dmamap_load_mem() at bus_dmamap_load_mem+0x31c/frame 0xfffffe0126881ae0
dmar_bus_task_dmamap() at dmar_bus_task_dmamap+0xff/frame 0xfffffe0126881b20
taskqueue_run_locked() at taskqueue_run_locked+0x178/frame 0xfffffe0126881b80
taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe0126881bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe0126881bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0126881bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 0 tid 100173 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1085586(%rip)
db> x/s version
version:        FreeBSD 13.0-CURRENT #0 r356795M: Thu Jan 16 17:33:24 CET 2020\012    pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO\012
db>

Correct the augmentation after RB_INSERT so that the newly inserted node is augmented too, before its parent.

Discard some parts from intel_gas.c that, theoretically, are no longer needed to fix augmentation problems.

With dmar enable I get:

hdacc0: <Realtek ALC889 HDA CODEC> at cad 0 on hdac0
ugen2.1: <Intel EHCI root HUB> at usbus2
hdaa0: <Realtek ALC889 Audio Function Group> at nid 1 on hdacc0
ahcich1: DMA load error
(aprobe1:ahcich1:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00 
(aprobe1:ahcich1:0:0:0): CAM status: CCB request was invalid
(aprobe1:ahcich1:0:0:0): Error 22, Unretryable error
pcm0: <Realtek ALC889 (Rear Analog 7.1/2.0)> at nid 20,22,21,23 and 24,26 on hdaa0
ahcich1: DMA load error
(aprobe1:ahcich1:0:0:0): ATAPI_IDENTIFY. ACB: a1 00 00 00 00 40 00 00 00 00 00 00
(aprobe1:ahcich1:0:0:0): CAM status: CCB request was invalid
(aprobe1:ahcich1:0:0:0): Error 22, Unretryable error
dougm marked an inline comment as done.

Rework intel_gas.c more than I had planned. Along with free_down, keep a first and last value for each entry describing the least and greatest value in the subtree rooted there. Update these in the augment routine. In lowermatch, look for the fit with greatest address less than the lowaddr, which I understand to be an upper bound on the address. In uppermatch, duplicate the lowermatch code, looking for the fit with least address greater than highaddr, which I understand to be a lower bound on the address.

This didn't compile

cc -c -O2 -pipe -fno-strict-aliasing  -g -nostdinc  -I. -I../../.. -I../../../contrib/ck/include -I../../../contrib/libfdt -D_KERNEL -Dc                                                                                                                                    
../../../x86/iommu/intel_gas.c:178:17: error: no member named 'free_after' in 'struct dmar_map_entry'                                                                                                                                                                       
                        MPASS(entry->free_after == domain->end - entry->end);

Update assertion code to test first and last, rather than free_after.

isci0: <Intel(R) C600 Series Chipset SAS Controller> port 0x2000-0x20ff mem 0xd7c00000-0xd7c03fff,0xd7800000-0xd7bfffff irq 16 at device 0.0 numa-domain 0 on pci4
unknown: dmar1 pci0:0:29:0 rid e8 domain 0 mgaw 48 agaw 48 re-mapped
unknown: dmar1 pci0:0:26:0 rid d0 domain 1 mgaw 48 agaw 48 re-mapped
isci0: dmar1 pci0:4:0:0 rid 400 domain 2 mgaw 48 agaw 48 re-mapped
Kernel page fault with the following non-sleepable locks held:
exclusive sleep mutex dmardom (dmardom) r = 0 (0xfffff80003d26c30) locked @ x86/iommu/intel_gas.c:616
stack backtrace:
#0 0xffffffff80c354b1 at witness_debugger+0x71
#1 0xffffffff80c364d0 at witness_warn+0x430
#2 0xffffffff8106dee0 at trap_pfault+0x80
#3 0xffffffff8106d500 at trap+0x2c0
#4 0xffffffff81043f5c at calltrap+0x8
#5 0xffffffff81021462 at dmar_gas_lowermatch+0x282
#6 0xffffffff81021242 at dmar_gas_lowermatch+0x62
#7 0xffffffff810209f5 at dmar_gas_map+0xf5
#8 0xffffffff810191a3 at dmar_bus_dmamap_load_something+0x153
#9 0xffffffff81018905 at dmar_bus_dmamap_load_buffer+0x1d5
#10 0xffffffff80c07cc6 at bus_dmamap_load+0x86
#11 0xffffffff80fec6fb at isci_allocate_dma_buffer+0xdb
#12 0xffffffff80fed9bf at isci_controller_allocate_memory+0xaf
#13 0xffffffff80fec54b at isci_initialize+0x18b
#14 0xffffffff80feca7f at isci_attach+0x14f
#15 0xffffffff80c01ffa at device_attach+0x3ca
#16 0xffffffff80c01ba0 at device_probe_and_attach+0x70
#17 0xffffffff80c03228 at bus_generic_attach+0x18


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff810217a7
stack pointer           = 0x28:0xffffffff821182e0
frame pointer           = 0x28:0xffffffff82118300
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (swapper)
trap number             = 12
panic: page fault
cpuid = 2
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82117f40
vpanic() at vpanic+0x185/frame 0xffffffff82117fa0
panic() at panic+0x43/frame 0xffffffff82118000
trap_fatal() at trap_fatal+0x386/frame 0xffffffff82118060
trap_pfault() at trap_pfault+0x99/frame 0xffffffff821180e0
trap() at trap+0x2c0/frame 0xffffffff82118210
calltrap() at calltrap+0x8/frame 0xffffffff82118210
--- trap 0xc, rip = 0xffffffff810217a7, rsp = 0xffffffff821182e0, rbp = 0xffffffff82118300 ---
dmar_gas_match_insert() at dmar_gas_match_insert+0x67/frame 0xffffffff82118300
dmar_gas_lowermatch() at dmar_gas_lowermatch+0x282/frame 0xffffffff82118350
dmar_gas_lowermatch() at dmar_gas_lowermatch+0x62/frame 0xffffffff821183a0
dmar_gas_map() at dmar_gas_map+0xf5/frame 0xffffffff82118430
dmar_bus_dmamap_load_something() at dmar_bus_dmamap_load_something+0x153/frame 0xffffffff82118500
dmar_bus_dmamap_load_buffer() at dmar_bus_dmamap_load_buffer+0x1d5/frame 0xffffffff821185a0
bus_dmamap_load() at bus_dmamap_load+0x86/frame 0xffffffff82118620
isci_allocate_dma_buffer() at isci_allocate_dma_buffer+0xdb/frame 0xffffffff82118650
isci_controller_allocate_memory() at isci_controller_allocate_memory+0xaf/frame 0xffffffff82118690
isci_initialize() at isci_initialize+0x18b/frame 0xffffffff821186d0
isci_attach() at isci_attach+0x14f/frame 0xffffffff82118700
device_attach() at device_attach+0x3ca/frame 0xffffffff82118740
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118770
bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118790
pci_attach() at pci_attach+0xe0/frame 0xffffffff821187d0
acpi_pci_attach() at acpi_pci_attach+0x19/frame 0xffffffff82118820
device_attach() at device_attach+0x3ca/frame 0xffffffff82118860
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118890
bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff821188b0
acpi_pcib_pci_attach() at acpi_pcib_pci_attach+0xa0/frame 0xffffffff821188f0
device_attach() at device_attach+0x3ca/frame 0xffffffff82118930
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118960
bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118980
pci_attach() at pci_attach+0xe0/frame 0xffffffff821189c0
acpi_pci_attach() at acpi_pci_attach+0x19/frame 0xffffffff82118a10
device_attach() at device_attach+0x3ca/frame 0xffffffff82118a50
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118a80
bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118aa0
acpi_pcib_acpi_attach() at acpi_pcib_acpi_attach+0x43c/frame 0xffffffff82118b10
device_attach() at device_attach+0x3ca/frame 0xffffffff82118b50
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118b80
bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118ba0
acpi_attach() at acpi_attach+0xd50/frame 0xffffffff82118c30
device_attach() at device_attach+0x3ca/frame 0xffffffff82118c70
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118ca0
bus_generic_attach() at bus_generic_attach+0x18/frame 0xffffffff82118cc0
device_attach() at device_attach+0x3ca/frame 0xffffffff82118d00
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff82118d30
bus_generic_new_pass() at bus_generic_new_pass+0xed/frame 0xffffffff82118d60
bus_set_pass() at bus_set_pass+0x46/frame 0xffffffff82118d90
configure() at configure+0x9/frame 0xffffffff82118da0
mi_startup() at mi_startup+0xec/frame 0xffffffff82118df0
btext() at btext+0x2c
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1085536(%rip)
db> x/s version
version:        FreeBSD 13.0-CURRENT #3 r356826M: Fri Jan 17 16:59:08 CET 2020\012    pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO\012
db> show registers
cs                        0x20
ds                        0x3b  ll+0x1a
es                        0x3b  ll+0x1a
fs                        0x13
gs                        0x28  ll+0x7
ss                        0x28  ll+0x7
rax                       0x12
rcx                       0x80  ll+0x5f
rdx         0xffffffff82117e30
rbx         0xffffffff811fa8fc
rsp         0xffffffff82117f30
rbp         0xffffffff82117f40
rsi                       0x80  ll+0x5f
rdi         0xffffffff81c7af98  cnputs_mtx
r8                           0
r9          0x8080808080808080
r10                       0x12
r11                        0x1
r12                          0
r13         0xffffffff82118220
r14                          0
r15         0xffffffff81d9e180  thread0_st
rip         0xffffffff80c135a7  kdb_enter+0x37
rflags                    0x82  ll+0x61
kdb_enter+0x37: movq    $0,0x1085536(%rip)
db> 

(kgdb) l *0xffffffff810217a7
0xffffffff810217a7 is in dmar_gas_match_insert (../../../x86/iommu/intel_gas.c:348).
343              * allocations to ensure that out-of-bounds accesses fault.
344              */
345             a->entry->end = a->entry->start + a->size;
346
347             next = RB_NEXT(dmar_gas_entries_tree, &a->domain->rb_root, prev);
348             KASSERT(next->start >= a->entry->end &&
349                 next->start - a->entry->start >= a->size &&
350                 prev->end <= a->entry->end,
351                 ("dmar_gas_match_insert hole failed %p prev (%jx, %jx) "
352                 "next (%jx, %jx) entry (%jx, %jx)", a->domain,
(kgdb)

Stop sending the prev entry to dmar_gas_match_insert. I was sending the prev entry sometimes and the next entry sometimes, but the insert function was only using them for assertion checking anyway.

Fix the upper limits passed to dmar_gas_match_one from dmar_gas_uppermatch.

I ran a 30 minutest udp test with D23189.66959.diff and hw.dmar.enable=1.
I ran 100 tests in two hours followed by at buildworld / installworld.
No problems seen.

sys/x86/iommu/intel_gas.c
284

Why did you removed this assert ?

286

So maxaddr is the old end, and end is the end of the gap after the a->entry ? Could you please document the args meaning ?

dougm marked an inline comment as done.

Push the setting of a->entry->start down into dmar_gas_match_one. Add a documentary comment before that function.

sys/x86/iommu/intel_gas.c
284

Because there is no longer a field 'free_after'. I could add an assertion that RB_RIGHT is null, if you like.

sys/x86/iommu/intel_gas.c
376

Am I correct that lowermatch now prefers higher free address ranges over the low ranges ?

For instance, assume that there is no upper hole at all, and that we have the only entry pre-allocated that consumes single page in the middle of the guest address space. Then old lowermatch selects lowest addresses for a new request, while your version would select the highest (after the allocated entry) ?

sys/x86/iommu/intel_gas.c
376

You are correct that lowermatch, after this change, invokes dmar_gas_match_insert on the highest possible free address range. You are wrong to imply that lowermatch, before this change, invoked dmar_gas_match_insert on the lowest possible free address range. Before, it tried a 'middle' range - whatever happened to start just right of the root of the tree, and if that failed, then it tried searching (recursively) the part before that middle range, then the part after that middle range. So while it was biased toward the lower free address ranges, it didn't pick the lowest, necessarily.

I reasoned that the previous implementation was indifferent about which free range got picked, since it sought neither the first nor the last free range, but that the lowest addresses might be more 'precious', in the sense that if one allocation request asks for a range below 1000, and the next request asks for a range below 500, it is better if the first request didn't consume the only available range below 500 when there were other possibilities available.

If I understand your 'upper hole' question correctly - and I may not - you're asking about the case when there is one allocated range less than lowaddr and thus two free ranges less than lowaddr, before and after that allocated range. The old lowermatch would try the free range after the allocated range - starting at 'prev->end', where prev points to the allocated range. The new lowermatch would get to the same place, after a call to lowermatch on the right subtree didn't produce an allocation.

sys/x86/iommu/intel_gas.c
376

There is no need to save the addresses at low. I think that the ideal scheme for guest AS allocation would be to have a cursor and move it past the end of the last successful allocation, wrapping on the end.

Users cannot specify that they want something at low addresses for the given request. What can be done is to declare a 'hole' in the middle of the address space. Then I call the allocations from below the bottom of the hole 'lower', and above the top of the hole 'upper'. The upper range is very rarely supported/requested. Typically the AS is either the whole bus space, or hole starts at 4G and to the end of AS.

Discard everything about x86/iommu/intel_* and just leave the tree.h changes, so that they don't get trapped in a discussion of other things.

Restore the intel_gas changes that exploit RB_AUGMENT - the ones that eliminate the need to walk from leaf to root to update free_down, and the ones that eliminate the need to use RB_NEXT to update free_after fields. But don't change the topmost-fit search method of lowermatch, or use free_down to speed up uppermatch. The order in which the match functions consider entries is not changed.

I'm told offline that this should all be redone in a way that doesn't use a search tree, but my objective is just to get the search tree to be used more effectively.

I ran a brief 2 hour test with D23189.67306.diff. No problems seen.

This revision is now accepted and ready to land.Jan 26 2020, 5:07 PM

I ran tests with dmar enabled for the most of a day.
No problems seen.