Paths

Table of Contentst

pmap_growkernel(): do not panic immediately, return error
ClosedPublic
Actions

Authored by kib on Dec 5 2024, 11:04 PM.

Details

Reviewers

alc
markj
andrew
manu

Commits

rGef9017aa174d: pmap_growkernel(): do not panic immediately, optionally return the error

Summary

(amd64 only for now)

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kib created this revision.Dec 5 2024, 11:04 PM

Herald added a subscriber: imp. · View Herald TranscriptDec 5 2024, 11:04 PM

kib requested review of this revision.Dec 5 2024, 11:04 PM

What's the motivation for this change?

I'm worried that with this change, instead of cleanly panicking, the kernel will end up in a catatonic state, requiring some intervention. In particular, we do not set the vm_reclaimfn callback for kernel KVA arenas - after a failure, I think we'll end up sleeping forever.

In D47935#1093091, @markj wrote:

What's the motivation for this change?

A regular malloc(M_NOWAIT) call can end up in this function. When we ran out of pages pmap_growkernel() would panic before this patch, but with the patch the ENOMEM analog would be returned all the way up the allocation stack. Potentially machine can revive later, if memory hogs are gone. Example trace:

panic() at panic+0x43/frame 0xfffffe040404bfb0                                                                                                      
pmap_growkernel() at pmap_growkernel+0x2a3/frame 0xfffffe040404c000                                                                                 
vm_map_insert1() at vm_map_insert1+0x254/frame 0xfffffe040404c090                                                                                   
vm_map_find_locked() at vm_map_find_locked+0x4fc/frame 0xfffffe040404c160                                                                           
vm_map_find() at vm_map_find+0xaf/frame 0xfffffe040404c1e0                                                                                          
kva_import() at kva_import+0x36/frame 0xfffffe040404c220                                                                                            
vmem_try_fetch() at vmem_try_fetch+0xce/frame 0xfffffe040404c270                                                                                    
vmem_xalloc() at vmem_xalloc+0x578/frame 0xfffffe040404c300                                                                                         
kva_import_domain() at kva_import_domain+0x25/frame 0xfffffe040404c330                                                                              
vmem_try_fetch() at vmem_try_fetch+0xce/frame 0xfffffe040404c380                                                                                    
vmem_xalloc() at vmem_xalloc+0x578/frame 0xfffffe040404c410                                                                                         
vmem_alloc() at vmem_alloc+0x37/frame 0xfffffe040404c460                                                                                            
kmem_malloc_domainset() at kmem_malloc_domainset+0x99/frame 0xfffffe040404c4d0                                                                      
keg_alloc_slab() at keg_alloc_slab+0xb9/frame 0xfffffe040404c520                                                                                    
zone_import() at zone_import+0xef/frame 0xfffffe040404c5b0                                                                                          
cache_alloc() at cache_alloc+0x316/frame 0xfffffe040404c620                                                                                         
cache_alloc_retry() at cache_alloc_retry+0x25/frame 0xfffffe040404c660                                                                              
in_pcballoc() at in_pcballoc+0x1f/frame 0xfffffe040404c690
syncache_socket() at syncache_socket+0x44/frame 0xfffffe040404c710
syncache_expand() at syncache_expand+0x81f/frame 0xfffffe040404c830
tcp_input_with_port() at tcp_input_with_port+0x8e0/frame 0xfffffe040404c980
tcp6_input_with_port() at tcp6_input_with_port+0x6a/frame 0xfffffe040404c9b0
tcp6_input() at tcp6_input+0xb/frame 0xfffffe040404c9c0
ip6_input() at ip6_input+0x82f/frame 0xfffffe040404ca90
netisr_dispatch_src() at netisr_dispatch_src+0x6d/frame 0xfffffe040404cae0
ether_demux() at ether_demux+0x129/frame 0xfffffe040404cb10
ether_nh_input() at ether_nh_input+0x2f4/frame 0xfffffe040404cb60
netisr_dispatch_src() at netisr_dispatch_src+0x6d/frame 0xfffffe040404cbb0
ether_input() at ether_input+0x36/frame 0xfffffe040404cbf0
tcp_lro_flush() at tcp_lro_flush+0x31f/frame 0xfffffe040404cc20
tcp_lro_flush_all() at tcp_lro_flush_all+0x1d3/frame 0xfffffe040404cc60
mlx5e_rx_cq_comp() at mlx5e_rx_cq_comp+0x10d2/frame 0xfffffe040404cd80
mlx5_cq_completion() at mlx5_cq_completion+0x78/frame 0xfffffe040404cde0
mlx5_eq_int() at mlx5_eq_int+0x2ad/frame 0xfffffe040404ce30
mlx5_msix_handler() at mlx5_msix_handler+0x15/frame 0xfffffe040404ce40
lkpi_irq_handler() at lkpi_irq_handler+0x29/frame 0xfffffe040404ce60
ithread_loop() at ithread_loop+0x249/frame 0xfffffe040404cef0
fork_exit() at fork_exit+0x7b/frame 0xfffffe040404cf30

We can add a knob to control the behavior, panic/not panic.

Add a knob to select panicing behavior.

In D47935#1093140, @glebius wrote:

In D47935#1093091, @markj wrote:

What's the motivation for this change?

A regular malloc(M_NOWAIT) call can end up in this function. When we ran out of pages pmap_growkernel() would panic before this patch, but with the patch the ENOMEM analog would be returned all the way up the allocation stack. Potentially machine can revive later, if memory hogs are gone. Example trace:

pmap_growkernel() allocates with VM_ALLOC_INTERRUPT, which is able to allocate any available free page; vmd->vmd_interrupt_free_min pages are required to be free otherwise. Today, that means we should keep at least 2 free pages per NUMA domain. Should that threshold be increased instead?

If a thread calling malloc(M_WAITOK) hits this error, won't we end up sleeping forever in VMEM_CONDVAR_WAIT for the kernel KVA arena? kmem_free() doesn't free KVA back to kernel_arena, only to the per-domain arenas.

Also, in your stack, why are we using kmem_malloc() to allocate a slab for inpcbs? That should be going through uma_small_alloc(), which doesn't need KVA. Was the kernel compiled with KASAN enabled?

Of course there are more deadlocks/live locks behind this change. But we wouldn't see them until this fix is done.

Also it is under the knob. If you insist, I can flip the defaults. But IMO we do want to find and fix the next level of problems.

In D47935#1093599, @kib wrote:

Of course there are more deadlocks/live locks behind this change. But we wouldn't see them until this fix is done.

Also it is under the knob. If you insist, I can flip the defaults. But IMO we do want to find and fix the next level of problems.

I'm suspicious that the reported panic is related to KASAN, which has special handling in pmap_growkernel(). If so, I'd prefer to fix this by raising the interrupt_min threshold when KASAN/KMSAN is enabled.

@glebius can you confirm whether this is the case? If not, could you please provide sysctl vm.uma output from the system in question? I would like to understand why this fragment appears in the backtrace:

kmem_malloc_domainset() at kmem_malloc_domainset+0x99/frame 0xfffffe040404c4d0                                                                      
keg_alloc_slab() at keg_alloc_slab+0xb9/frame 0xfffffe040404c520                                                                                    
zone_import() at zone_import+0xef/frame 0xfffffe040404c5b0                                                                                          
cache_alloc() at cache_alloc+0x316/frame 0xfffffe040404c620                                                                                         
cache_alloc_retry() at cache_alloc_retry+0x25/frame 0xfffffe040404c660                                                                              
in_pcballoc() at in_pcballoc+0x1f/frame 0xfffffe040404c690

In D47935#1093253, @markj wrote:

Also, in your stack, why are we using kmem_malloc() to allocate a slab for inpcbs? That should be going through uma_small_alloc(), which doesn't need KVA. Was the kernel compiled with KASAN enabled?

No, KASAN is not in the kernel. The inpcb keg has uk_ppera = 4, hence its uk_allocf is page_alloc() which is basically kmem_malloc_domainset(). Our inpcb is 0x880 bytes, so I guess keg_layout() calculated 4 page slabs as most efficient.

vm.uma.tcp_inpcb.stats.xdomain: 0
vm.uma.tcp_inpcb.stats.fails: 0
vm.uma.tcp_inpcb.stats.frees: 602031303
vm.uma.tcp_inpcb.stats.allocs: 602042494
vm.uma.tcp_inpcb.stats.current: 11191
vm.uma.tcp_inpcb.domain.0.timin: 3
vm.uma.tcp_inpcb.domain.0.limin: 12
vm.uma.tcp_inpcb.domain.0.wss: 762
vm.uma.tcp_inpcb.domain.0.bimin: 1524
vm.uma.tcp_inpcb.domain.0.imin: 1524
vm.uma.tcp_inpcb.domain.0.imax: 3556
vm.uma.tcp_inpcb.domain.0.nitems: 2286
vm.uma.tcp_inpcb.limit.bucket_max: 18446744073709551615
vm.uma.tcp_inpcb.limit.sleeps: 0
vm.uma.tcp_inpcb.limit.sleepers: 0
vm.uma.tcp_inpcb.limit.max_items: 0
vm.uma.tcp_inpcb.limit.items: 0
vm.uma.tcp_inpcb.keg.domain.0.free_slabs: 0
vm.uma.tcp_inpcb.keg.domain.0.free_items: 23424
vm.uma.tcp_inpcb.keg.domain.0.pages: 30444
vm.uma.tcp_inpcb.keg.efficiency: 92
vm.uma.tcp_inpcb.keg.reserve: 0
vm.uma.tcp_inpcb.keg.align: 63
vm.uma.tcp_inpcb.keg.ipers: 7
vm.uma.tcp_inpcb.keg.ppera: 4
vm.uma.tcp_inpcb.keg.rsize: 2176
vm.uma.tcp_inpcb.keg.name: tcp_inpcb
vm.uma.tcp_inpcb.bucket_size_max: 254
vm.uma.tcp_inpcb.bucket_size: 131
vm.uma.tcp_inpcb.flags: 0x850000<VTOSLAB,SMR,FIRSTTOUCH>
vm.uma.tcp_inpcb.size: 2176

In D47935#1094163, @glebius wrote:

In D47935#1093253, @markj wrote:

Also, in your stack, why are we using kmem_malloc() to allocate a slab for inpcbs? That should be going through uma_small_alloc(), which doesn't need KVA. Was the kernel compiled with KASAN enabled?

No, KASAN is not in the kernel. The inpcb keg has uk_ppera = 4, hence its uk_allocf is page_alloc() which is basically kmem_malloc_domainset(). Our inpcb is 0x880 bytes, so I guess keg_layout() calculated 4 page slabs as most efficient.

I see, thanks. One other question: does the system in question have more than one NUMA domain? If so, we will use a large import quantum, KVA_NUMA_IMPORT_QUANTUM, and we need to be able to allocate more than two PTPs in order to grow the map by that much. That is, when vm_ndomains > 1, we should set vmd_interrupt_min to a larger value.

In D47935#1094165, @markj wrote:

I see, thanks. One other question: does the system in question have more than one NUMA domain? If so, we will use a large import quantum, KVA_NUMA_IMPORT_QUANTUM, and we need to be able to allocate more than two PTPs in order to grow the map by that much. That is, when vm_ndomains > 1, we should set vmd_interrupt_min to a larger value.

The paniced system has vm_ndomains = 1.

syzkaller hit similar issue with fork(2):
https://syzkaller.appspot.com/bug?extid=6cd13c008e8640eceb4c

IMHO, we could propagate the pmap_growkernel() error all the way up and fail the syscall.

Ping. I think this is the way to go, we would need to handle any problems in the upper VM.

In D47935#1120644, @glebius wrote:

syzkaller hit similar issue with fork(2):
https://syzkaller.appspot.com/bug?extid=6cd13c008e8640eceb4c

IMHO, we could propagate the pmap_growkernel() error all the way up and fail the syscall.

Sorry, I missed this. This particular panic is almost certainly due to a limitation of KASAN (we need to allocate shadow memory when growing page table pages), and unrelated to the motivation behind this patch. Actually, I think it means that the vmd_interrupt_free_min limit should be higher in KASAN and KMSAN kernels. But that is unrelated to the problem solved by the patch, I believe.

In D47935#1161390, @kib wrote:

Ping. I think this is the way to go, we would need to handle any problems in the upper VM.

I still don't really like it. I'd prefer to understand exactly why pmap_growkernel() failed--is it because it tried to allocate more than v_interrupt_free_min (2) pages? Is it because there was some concurrent VM_ALLOC_INTERRUPT page allocation? Why don't we see such panics from stress2?

Again, I think the change is incomplete. Suppose the vm_map_find() in kva_import() fails for an M_WAITOK allocation. The thread will go to sleep on the kernel_arena internal condvar. Who wakes it up? Per-domain KVA arenas never release KVA ranges back to kernel_arena. I think it is likely to simply sleep forever.

Rather than add more complexity, it would be nice to understand why the v_interrupt_free_min limit is insufficient here. Some info from a core dump would be needed: stack traces of all on-CPU threads, and the original values of addr and kernel_vm_end.

I don't object to committing the patch, but I am not confident that it is the right solution.

In D47935#1161832, @markj wrote:

In D47935#1120644, @glebius wrote:

syzkaller hit similar issue with fork(2):
https://syzkaller.appspot.com/bug?extid=6cd13c008e8640eceb4c

IMHO, we could propagate the pmap_growkernel() error all the way up and fail the syscall.

Sorry, I missed this. This particular panic is almost certainly due to a limitation of KASAN (we need to allocate shadow memory when growing page table pages), and unrelated to the motivation behind this patch. Actually, I think it means that the vmd_interrupt_free_min limit should be higher in KASAN and KMSAN kernels. But that is unrelated to the problem solved by the patch, I believe.

In D47935#1161390, @kib wrote:

Ping. I think this is the way to go, we would need to handle any problems in the upper VM.

I still don't really like it. I'd prefer to understand exactly why pmap_growkernel() failed--is it because it tried to allocate more than v_interrupt_free_min (2) pages? Is it because there was some concurrent VM_ALLOC_INTERRUPT page allocation? Why don't we see such panics from stress2?

Perhaps stress2 is not stressful enough. Jokes aside, I do know that hardware that Peter uses is relatively old and underpowered by the modern standards. I do not think that he has anything larger that 64 threads.

Another issue might be that stress2 mostly tests top level of the kernel, since this is what we can load from the syscall layer. There are no tests for things that would utilize VM_ALLOC_INTERRUPT enough. For instance, high-speed (200G/400G) ethernet card with lot of receive queues on large multiprocessor with the parallel swap load is something that is not tested by stress2.

And then, I read your arguments as the attempt to make vm_page_alloc(VM_ALLOC_INTERRUPT) non-failing. I do not think this is feasible or intended. All callers of vm_page_alloc() must be prepared for the failure, and the reserve would not work there.

Regarding the pmap, wouldn't the interrupt reserve depleted if all CPUs call pmap_growkernel() if back-to-back manner, not giving the pagedaemon threads a chance to become runnable? I think this is possible with differrent zones needing to grow KVA, or something similar. And no, I do not think that any bump of the reserve for interrupt allocation would solve this theoretically correct.

Again, I think the change is incomplete. Suppose the vm_map_find() in kva_import() fails for an M_WAITOK allocation. The thread will go to sleep on the kernel_arena internal condvar. Who wakes it up? Per-domain KVA arenas never release KVA ranges back to kernel_arena. I think it is likely to simply sleep forever.

This is the next level of the issue in the stack, I want to take care of the pmap now (or first).

Rather than add more complexity, it would be nice to understand why the v_interrupt_free_min limit is insufficient here. Some info from a core dump would be needed: stack traces of all on-CPU threads, and the original values of addr and kernel_vm_end.

I don't object to committing the patch, but I am not confident that it is the right solution.

I do not see why it is right to panic if VM_ALLOC_INTERRUPT did not succeeded. Similarly, I do not see why such request must or can always succeed. More fixes might be due afterward, but this one is needed.

In D47935#1162094, @kib wrote:

In D47935#1161832, @markj wrote:

I still don't really like it. I'd prefer to understand exactly why pmap_growkernel() failed--is it because it tried to allocate more than v_interrupt_free_min (2) pages? Is it because there was some concurrent VM_ALLOC_INTERRUPT page allocation? Why don't we see such panics from stress2?

Perhaps stress2 is not stressful enough. Jokes aside, I do know that hardware that Peter uses is relatively old and underpowered by the modern standards. I do not think that he has anything larger that 64 threads.

Another issue might be that stress2 mostly tests top level of the kernel, since this is what we can load from the syscall layer. There are no tests for things that would utilize VM_ALLOC_INTERRUPT enough. For instance, high-speed (200G/400G) ethernet card with lot of receive queues on large multiprocessor with the parallel swap load is something that is not tested by stress2.

And then, I read your arguments as the attempt to make vm_page_alloc(VM_ALLOC_INTERRUPT) non-failing. I do not think this is feasible or intended. All callers of vm_page_alloc() must be prepared for the failure, and the reserve would not work there.

Regarding the pmap, wouldn't the interrupt reserve depleted if all CPUs call pmap_growkernel() if back-to-back manner, not giving the pagedaemon threads a chance to become runnable? I think this is possible with differrent zones needing to grow KVA, or something similar. And no, I do not think that any bump of the reserve for interrupt allocation would solve this theoretically correct.

Well, if we are not going to try and fix uses of VM_ALLOC_INTERRUPT, I don't see much need for this allocation class at all.

I think the summary of my view is, we can salvage the VM_ALLOC_INTERRUPT mechanism by ensuring that some minimum number of pages is always reserved, similar to how we handle vmem boundary tag reservations in vmem_startup(). That is probably simpler than trying to handle errors everywhere. But ok, let's try it.