(amd64 only for now)
Repository: rG FreeBSD src repository
What's the motivation for this change?
I'm worried that with this change, instead of cleanly panicking, the kernel will end up in a catatonic state, requiring some intervention. In particular, we do not set the vm_reclaimfn callback for kernel KVA arenas - after a failure, I think we'll end up sleeping forever.
A regular malloc(M_NOWAIT) call can end up in this function. Before this patch, pmap_growkernel() would panic when we ran out of pages; with the patch, the ENOMEM analog is returned all the way up the allocation stack. The machine can potentially recover later, once the memory hogs are gone. Example trace:
panic() at panic+0x43/frame 0xfffffe040404bfb0
pmap_growkernel() at pmap_growkernel+0x2a3/frame 0xfffffe040404c000
vm_map_insert1() at vm_map_insert1+0x254/frame 0xfffffe040404c090
vm_map_find_locked() at vm_map_find_locked+0x4fc/frame 0xfffffe040404c160
vm_map_find() at vm_map_find+0xaf/frame 0xfffffe040404c1e0
kva_import() at kva_import+0x36/frame 0xfffffe040404c220
vmem_try_fetch() at vmem_try_fetch+0xce/frame 0xfffffe040404c270
vmem_xalloc() at vmem_xalloc+0x578/frame 0xfffffe040404c300
kva_import_domain() at kva_import_domain+0x25/frame 0xfffffe040404c330
vmem_try_fetch() at vmem_try_fetch+0xce/frame 0xfffffe040404c380
vmem_xalloc() at vmem_xalloc+0x578/frame 0xfffffe040404c410
vmem_alloc() at vmem_alloc+0x37/frame 0xfffffe040404c460
kmem_malloc_domainset() at kmem_malloc_domainset+0x99/frame 0xfffffe040404c4d0
keg_alloc_slab() at keg_alloc_slab+0xb9/frame 0xfffffe040404c520
zone_import() at zone_import+0xef/frame 0xfffffe040404c5b0
cache_alloc() at cache_alloc+0x316/frame 0xfffffe040404c620
cache_alloc_retry() at cache_alloc_retry+0x25/frame 0xfffffe040404c660
in_pcballoc() at in_pcballoc+0x1f/frame 0xfffffe040404c690
syncache_socket() at syncache_socket+0x44/frame 0xfffffe040404c710
syncache_expand() at syncache_expand+0x81f/frame 0xfffffe040404c830
tcp_input_with_port() at tcp_input_with_port+0x8e0/frame 0xfffffe040404c980
tcp6_input_with_port() at tcp6_input_with_port+0x6a/frame 0xfffffe040404c9b0
tcp6_input() at tcp6_input+0xb/frame 0xfffffe040404c9c0
ip6_input() at ip6_input+0x82f/frame 0xfffffe040404ca90
netisr_dispatch_src() at netisr_dispatch_src+0x6d/frame 0xfffffe040404cae0
ether_demux() at ether_demux+0x129/frame 0xfffffe040404cb10
ether_nh_input() at ether_nh_input+0x2f4/frame 0xfffffe040404cb60
netisr_dispatch_src() at netisr_dispatch_src+0x6d/frame 0xfffffe040404cbb0
ether_input() at ether_input+0x36/frame 0xfffffe040404cbf0
tcp_lro_flush() at tcp_lro_flush+0x31f/frame 0xfffffe040404cc20
tcp_lro_flush_all() at tcp_lro_flush_all+0x1d3/frame 0xfffffe040404cc60
mlx5e_rx_cq_comp() at mlx5e_rx_cq_comp+0x10d2/frame 0xfffffe040404cd80
mlx5_cq_completion() at mlx5_cq_completion+0x78/frame 0xfffffe040404cde0
mlx5_eq_int() at mlx5_eq_int+0x2ad/frame 0xfffffe040404ce30
mlx5_msix_handler() at mlx5_msix_handler+0x15/frame 0xfffffe040404ce40
lkpi_irq_handler() at lkpi_irq_handler+0x29/frame 0xfffffe040404ce60
ithread_loop() at ithread_loop+0x249/frame 0xfffffe040404cef0
fork_exit() at fork_exit+0x7b/frame 0xfffffe040404cf30
pmap_growkernel() allocates with VM_ALLOC_INTERRUPT, which is able to take any available free page; all other allocation classes must leave vmd->vmd_interrupt_free_min pages free. Today, that means we keep at least 2 free pages per NUMA domain in reserve. Should that threshold be increased instead?
If a thread calling malloc(M_WAITOK) hits this error, won't we end up sleeping forever in VMEM_CONDVAR_WAIT for the kernel KVA arena? kmem_free() doesn't free KVA back to kernel_arena, only to the per-domain arenas.
Also, in your stack, why are we using kmem_malloc() to allocate a slab for inpcbs? That should be going through uma_small_alloc(), which doesn't need KVA. Was the kernel compiled with KASAN enabled?
Of course there are more deadlocks/livelocks lurking behind this change, but we won't see them until this fix is in.
Also, the change is behind a knob. If you insist, I can flip the default. But IMO we do want to find and fix the next layer of problems.
I'm suspicious that the reported panic is related to KASAN, which has special handling in pmap_growkernel(). If so, I'd prefer to fix this by raising the interrupt_min threshold when KASAN/KMSAN is enabled.
@glebius can you confirm whether this is the case? If not, could you please provide sysctl vm.uma output from the system in question? I would like to understand why this fragment appears in the backtrace:
kmem_malloc_domainset() at kmem_malloc_domainset+0x99/frame 0xfffffe040404c4d0
keg_alloc_slab() at keg_alloc_slab+0xb9/frame 0xfffffe040404c520
zone_import() at zone_import+0xef/frame 0xfffffe040404c5b0
cache_alloc() at cache_alloc+0x316/frame 0xfffffe040404c620
cache_alloc_retry() at cache_alloc_retry+0x25/frame 0xfffffe040404c660
in_pcballoc() at in_pcballoc+0x1f/frame 0xfffffe040404c690
No, KASAN is not in the kernel. The inpcb keg has uk_ppera = 4, hence its uk_allocf is page_alloc(), which is basically kmem_malloc_domainset(). Our inpcb is 0x880 bytes, so I guess keg_layout() calculated 4-page slabs as most efficient.
vm.uma.tcp_inpcb.stats.xdomain: 0
vm.uma.tcp_inpcb.stats.fails: 0
vm.uma.tcp_inpcb.stats.frees: 602031303
vm.uma.tcp_inpcb.stats.allocs: 602042494
vm.uma.tcp_inpcb.stats.current: 11191
vm.uma.tcp_inpcb.domain.0.timin: 3
vm.uma.tcp_inpcb.domain.0.limin: 12
vm.uma.tcp_inpcb.domain.0.wss: 762
vm.uma.tcp_inpcb.domain.0.bimin: 1524
vm.uma.tcp_inpcb.domain.0.imin: 1524
vm.uma.tcp_inpcb.domain.0.imax: 3556
vm.uma.tcp_inpcb.domain.0.nitems: 2286
vm.uma.tcp_inpcb.limit.bucket_max: 18446744073709551615
vm.uma.tcp_inpcb.limit.sleeps: 0
vm.uma.tcp_inpcb.limit.sleepers: 0
vm.uma.tcp_inpcb.limit.max_items: 0
vm.uma.tcp_inpcb.limit.items: 0
vm.uma.tcp_inpcb.keg.domain.0.free_slabs: 0
vm.uma.tcp_inpcb.keg.domain.0.free_items: 23424
vm.uma.tcp_inpcb.keg.domain.0.pages: 30444
vm.uma.tcp_inpcb.keg.efficiency: 92
vm.uma.tcp_inpcb.keg.reserve: 0
vm.uma.tcp_inpcb.keg.align: 63
vm.uma.tcp_inpcb.keg.ipers: 7
vm.uma.tcp_inpcb.keg.ppera: 4
vm.uma.tcp_inpcb.keg.rsize: 2176
vm.uma.tcp_inpcb.keg.name: tcp_inpcb
vm.uma.tcp_inpcb.bucket_size_max: 254
vm.uma.tcp_inpcb.bucket_size: 131
vm.uma.tcp_inpcb.flags: 0x850000<VTOSLAB,SMR,FIRSTTOUCH>
vm.uma.tcp_inpcb.size: 2176
I see, thanks. One other question: does the system in question have more than one NUMA domain? If so, we will use a large import quantum, KVA_NUMA_IMPORT_QUANTUM, and we need to be able to allocate more than two PTPs in order to grow the map by that much. That is, when vm_ndomains > 1, we should set vmd_interrupt_min to a larger value.
syzkaller hit a similar issue with fork(2):
https://syzkaller.appspot.com/bug?extid=6cd13c008e8640eceb4c
IMHO, we could propagate the pmap_growkernel() error all the way up and fail the syscall.
Ping. I think this is the way to go; we would then need to handle any problems exposed in the upper VM layers.
Sorry, I missed this. This particular panic is almost certainly due to a limitation of KASAN (we need to allocate shadow memory when growing page table pages), and unrelated to the motivation behind this patch. Actually, I think it means that the vmd_interrupt_free_min limit should be higher in KASAN and KMSAN kernels. But that is unrelated to the problem solved by the patch, I believe.
I still don't really like it. I'd prefer to understand exactly why pmap_growkernel() failed--is it because it tried to allocate more than v_interrupt_free_min (2) pages? Is it because there was some concurrent VM_ALLOC_INTERRUPT page allocation? Why don't we see such panics from stress2?
Again, I think the change is incomplete. Suppose the vm_map_find() in kva_import() fails for an M_WAITOK allocation. The thread will go to sleep on the kernel_arena internal condvar. Who wakes it up? Per-domain KVA arenas never release KVA ranges back to kernel_arena. I think it is likely to simply sleep forever.
Rather than add more complexity, it would be nice to understand why the v_interrupt_free_min limit is insufficient here. Some info from a core dump would be needed: stack traces of all on-CPU threads, and the original values of addr and kernel_vm_end.
I don't object to committing the patch, but I am not confident that it is the right solution.
Perhaps stress2 is not stressful enough. Jokes aside, I do know that the hardware Peter uses is relatively old and underpowered by modern standards. I do not think he has anything larger than 64 threads.
Another issue might be that stress2 mostly tests the top level of the kernel, since that is what can be driven from the syscall layer. There are no tests that would exercise VM_ALLOC_INTERRUPT heavily enough. For instance, a high-speed (200G/400G) ethernet card with a lot of receive queues, on a large multiprocessor under parallel swap load, is something stress2 does not cover.
And then, I read your arguments as an attempt to make vm_page_alloc(VM_ALLOC_INTERRUPT) non-failing. I do not think this is feasible or intended. All callers of vm_page_alloc() must be prepared for failure, and the reserve would not help there.
Regarding the pmap: wouldn't the interrupt reserve be depleted if all CPUs call pmap_growkernel() back-to-back, never giving the pagedaemon threads a chance to become runnable? I think this is possible with different zones needing to grow KVA, or something similar. And no, I do not think that any bump of the interrupt-allocation reserve would solve this in a theoretically correct way.
> Again, I think the change is incomplete. Suppose the vm_map_find() in kva_import() fails for an M_WAITOK allocation. The thread will go to sleep on the kernel_arena internal condvar. Who wakes it up? Per-domain KVA arenas never release KVA ranges back to kernel_arena. I think it is likely to simply sleep forever.
This is the next level of the issue in the stack, I want to take care of the pmap now (or first).
> Rather than add more complexity, it would be nice to understand why the v_interrupt_free_min limit is insufficient here. Some info from a core dump would be needed: stack traces of all on-CPU threads, and the original values of addr and kernel_vm_end.
> I don't object to committing the patch, but I am not confident that it is the right solution.
I do not see why it is right to panic if VM_ALLOC_INTERRUPT did not succeed. Similarly, I do not see why such a request must or can always succeed. More fixes might be due afterward, but this one is needed.
Well, if we are not going to try and fix uses of VM_ALLOC_INTERRUPT, I don't see much need for this allocation class at all.
I think the summary of my view is, we can salvage the VM_ALLOC_INTERRUPT mechanism by ensuring that some minimum number of pages is always reserved, similar to how we handle vmem boundary tag reservations in vmem_startup(). That is probably simpler than trying to handle errors everywhere. But ok, let's try it.
> This is the next level of the issue in the stack, I want to take care of the pmap now (or first).
BTW, it is also possible to see panics in pmap_demote_DMAP() when a vm_page_alloc(VM_ALLOC_INTERRUPT) call fails.
Inline comment on sys/amd64/amd64/pmap.c, line 581:
I think __read_frequently is not really apt; pmap_growkernel() is not called often.