vm: reduce lock contention when processing vm batchqueues
ClosedPublic

Authored by gallatin on Nov 7 2022, 8:09 PM.
Details

Summary

VM batchqueues allow bulk insertion of a group of pages into a vm page queue. Doing the insertion in bulk, with a single acquisition of the vm page queue mutex, reduces overhead. Batchqueues have helped us considerably with our workload.
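
For readers unfamiliar with the mechanism, a rough sketch of the idea follows; the structure, constant, and helper names here are illustrative and are not the exact definitions in the tree:

```c
/*
 * Illustrative sketch only -- not the actual FreeBSD definitions.
 * Each CPU accumulates pages destined for a given page queue in a
 * small per-CPU array; the page queue mutex is then taken once to
 * move the whole batch onto the queue.
 */
#define	BATCHQUEUE_SIZE	32		/* placeholder size */

struct batchqueue {
	vm_page_t	pages[BATCHQUEUE_SIZE];
	int		cnt;
};

/* Drain a batch into the page queue under a single lock acquisition. */
static void
batchqueue_drain(struct vm_pagequeue *pq, struct batchqueue *bq)
{
	vm_pagequeue_lock(pq);
	for (int i = 0; i < bq->cnt; i++) {
		/* ... move bq->pages[i] onto the queue, update counters ... */
	}
	vm_pagequeue_unlock(pq);
	bq->cnt = 0;
}
```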

However, with a large core-count machine (64 or more cores) running a "Netflix"-style workload, where millions of pages per second are processed via sendfile(), we still see lock contention on the inactive queue. According to lockstat, the vm inactive pagequeue mutex is the most contended lock in the system, with the contention coming from vm_page_pqbatch_submit() [1].

This patch changes how batchqueues work. Rather than waiting until the batchqueue is full to acquire the lock and process the queue, we now start trying to acquire the lock with trylocks once the batchqueue is half full. This removes almost all contention on the vm pagequeue mutex for us. [2]

So that the system does not lose the benefit of processing large batchqueues, I've doubled the size of the batchqueues. This way, when there is no contention, we process the same batch size as before.
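
A minimal sketch of the submit-side logic described above, assuming vm_batchqueue_insert() returns the number of slots that were free before the insertion (0 meaning the batch was already full and the page was not added), and assuming a trylock wrapper for the page queue mutex; this is illustrative, not the committed code:

```c
/* Sketch of the "start trying the lock at half full" idea. */
static void
pqbatch_submit_sketch(struct vm_pagequeue *pq, struct vm_batchqueue *bq,
    vm_page_t m)
{
	int slots_free;

	slots_free = vm_batchqueue_insert(bq, m);
	if (slots_free > VM_BATCHQUEUE_SIZE / 2)
		return;			/* less than half full; keep batching */

	if (slots_free > 0) {
		/* Half full or more: drain only if the lock is uncontended. */
		if (!vm_pagequeue_trylock(pq))
			return;		/* lock busy; retry on a later submit */
	} else {
		/* Batch was full and m was not added: we must block. */
		vm_pagequeue_lock(pq);
	}
	/* ... move the batched pages onto the queue ... */
	vm_pagequeue_unlock(pq);
	vm_batchqueue_init(bq);
	if (slots_free == 0)
		(void)vm_batchqueue_insert(bq, m);	/* now has room for m */
}
```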

[1]:
```
Count indv cuml rcnt     nsec Lock                   Caller                  
-------------------------------------------------------------------------------
428303  28%  28% 0.00   179962 vm inactive pagequeue  vm_page_pqbatch_submit+0x234
372617  25%  53% 0.00    17459 counter_fo             counter_fo_add+0x1f4    
206186  14%  67% 0.00     1838 mlx5tx                 mlx5e_xmit+0x243        
157055  10%  77% 0.00     1044 tcp_hpts_lck           tcp_hpts_insert_diag+0x60e
104862   7%  84% 0.00     1143 tcp_hpts_lck           tcp_hpts_remove+0xa8    
```

[2]:
```
Count indv cuml rcnt     nsec Lock                   Caller                  
-------------------------------------------------------------------------------
396499  33%  33% 0.00    13957 counter_fo             counter_fo_add+0x1f4    
203257  17%  50% 0.00     1343 mlx5tx                 mlx5e_xmit+0x243        
162038  13%  63% 0.00      860 tcp_hpts_lck           tcp_hpts_insert_diag+0x60e
108576   9%  72% 0.00      932 tcp_hpts_lck           tcp_hpts_remove+0xa8    
45724   4%  76% 0.00   165111 vm inactive pagequeue  vm_page_pqbatch_submit+0x234
```

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

Only tangentially related, but I wonder if this constant shouldn't be defined for arm64 too?

Yes, most likely it'll help a bit on large server platforms. Though we do define the constant for all platforms, the batch size is just smaller.
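
To illustrate that point, the pattern is roughly the following (the values shown are placeholders, not the actual definitions): an architecture's vmparam.h may define a larger batch size, and everything else falls back to a smaller generic default.

```c
/* Illustrative only -- not the real values or file layout. */
#ifndef VM_BATCHQUEUE_SIZE
#define	VM_BATCHQUEUE_SIZE	7	/* placeholder generic default */
#endif
```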

sys/vm/vm_page.c
3672–3683

I think this diff is missing a change to vm_batchqueue_insert() to return the number of slots?

Updated the diff to restore the missing hunk that makes vm_batchqueue_insert() return the number of free slots, as pointed out by Markj.

Just to confirm: the correct patch has been running all week; it works fine and provides the stated benefit for our workload. I'm still not sure how that hunk went missing.
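
For context, a minimal sketch of the restored helper, under the assumption drawn from the discussion that it reports the number of slots that were free before the insertion (0 meaning the batch was full and the page was not added); the field names are guesses and this is not the committed code:

```c
/* Sketch only: insert a page and report how many slots were free. */
static inline int
vm_batchqueue_insert(struct vm_batchqueue *bq, vm_page_t m)
{
	int slots_free;

	slots_free = VM_BATCHQUEUE_SIZE - bq->bq_cnt;
	if (slots_free > 0)
		bq->bq_pa[bq->bq_cnt++] = m;
	return (slots_free);	/* 0 => full, page not inserted */
}
```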

sys/vm/vm_page.c
3672–3683

I must have lost that hunk somewhere along the way moving between trees or rebasing. I will re-test at the end of the week.

The patch seems ok to me, but:

  • In vm_pageout_reinsert_active(), we should change the call to read if (vm_batchqueue_insert(bq, m) != 0), even though it's just a stylistic change.
  • Did you observe any measurable change in CPU usage with the patch? The reduction in lock contention looks good, but lockstat doesn't really reflect the cost of failed trylocks. Even with no observed difference I think this patch is the right way to go, especially if we can also demonstrate that pagequeue lock hold times don't go up at all.
This revision is now accepted and ready to land. Nov 30 2022, 3:48 PM

The patch seems ok to me, but:

  • In vm_pageout_reinsert_active(), we should change the call to read if (vm_batchqueue_insert(bq, m) != 0), even though it's just a stylistic change.

Will do, thank you.
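
i.e., something along these lines in vm_pageout_reinsert_active() (a sketch of the requested style, not the exact code):

```c
/* Nonzero return means the page was queued in the batch. */
if (vm_batchqueue_insert(bq, m) != 0)
	return;
/* Batch was full: fall back to requeueing under the page queue lock. */
```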

  • Did you observe any measurable change in CPU usage with the patch? The reduction in lock contention looks good, but lockstat doesn't really reflect the cost of failed trylocks. Even with no observed difference I think this patch is the right way to go, especially if we can also demonstrate that pagequeue lock hold times don't go up at all.

I was not able to measure a CPU reduction with any confidence. The goal was to reduce output drops at near saturation on the Netflix NIC kTLS + sendfile workload, and it *does* seem to help with this. The reasoning is that the bulk of the lock contention that this helps with is coming via a stack like this:

```
vm_page_pqbatch_submit
            kernel`vm_page_pqstate_commit+0xaa
            kernel`sendfile_free_mext_pg+0x60
            kernel`mb_free_extpg+0x3e
            kernel`m_free+0xb2
            kernel`m_freem+0x28
            tcp_rack_21q22p13.ko`tcp_rack_21q22p13_rack_do_segment_nounlock+0x11e9
            tcp_rack_21q22p13.ko`tcp_rack_21q22p13_ctf_process_inbound_raw+0x9d
            tcp_rack_21q22p13.ko`tcp_rack_21q22p13_ctf_do_queued_segments+0x36
            kernel`tcp_lro_flush+0xce7
            kernel`tcp_lro_flush_all+0xfb
            kernel`mlx5e_rx_cq_comp+0x1074
            kernel`mlx5_cq_completion+0x7a
            kernel`mlx5_eq_int+0xcb
            kernel`mlx5_msix_handler+0x15
            kernel`lkpi_irq_handler+0x27
            kernel`ithread_loop+0x27a
            kernel`fork_exit+0x7e
kernel`0xffffffff80998d6e
```

For at least the Mellanox driver, spinning or blocking on a lock in the rx path also delays the cleaning of transmit completions. So reducing this lock contention helps to avoid output drops by allowing the Mellanox tx completions to be processed in a timely manner. Between this patch and a fix for lock contention in a not-yet-upstreamed Netflix kernel component, I've seen far more stable performance near saturation, with far fewer seemingly sporadic output drops.

I'm mostly AFK this week, but I'll look into any changes in hold times as soon as I get a chance and before I push the change.
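
For reference, hold-time data can be gathered with lockstat's hold-event mode; something along these lines, with the duration and top-N count chosen arbitrarily:

```
# Record lock hold-time events for 30 seconds and report the top 10 locks.
lockstat -H -D 10 sleep 30
```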

Thanks again for looking at this!