vm_pageout_scan_inactive: take a lock break
ClosedPublic

Authored by rlibby on May 21 2024, 7:32 PM.
Details

Summary

vm_pageout_scan_inactive: take a lock break

In vm_pageout_scan_inactive, release the object lock when we go to
refill the scan batch queue so that someone else has a chance to acquire
it. This improves access latency to the object when the pagedaemon is
processing many consecutive pages from a single object, and also in any
case avoids a hiccup during refill for the last touched object.

Sponsored by: Dell EMC Isilon

Test Plan
truncate -s 10G sparse.10G
dd if=sparse.10G of=/dev/zero &
sudo lockstat -P -x aggsize=4m -D 10 -H sleep 10

Empirically observe vmobject average hold time decreased from milliseconds to microseconds.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

rlibby edited the test plan for this revision.
sys/vm/vm_pageout.c
1492

Don't you want to reset run = 0; before the object == NULL check?

1630

The inactive queue scan operates on batches of pages fetched periodically by vm_pageout_next(). That fetching involves acquiring a page queue lock and a bunch of other work; that'd strike me as a more natural place to drop the object lock, especially since the batch size is close to the proposed value for this tunable.

Did you consider that already? The implementation would require some more work since vm_pageout_next() doesn't give the caller any info about what it's doing internally, but I think it's a more natural way to approach the problem.

sys/vm/vm_pageout.c
1492

If I understand the code suggestion, that would reset it every loop. I'm trying to count the loops under this object lock.

1630

No, I didn't try that, but I can look into it.

In the current patch, when the object lock is dropped, it is dropped across the vm_pageout_next() logic, since we don't reacquire it until the top of the next loop. With this approach, however, we only drop it when we are actually freeing a page, and not in the skip_page or reinsert cases. I'm on the fence about whether that is desirable or not.

sys/vm/vm_pageout.c
1630

The nice things about dropping the object lock when we go to collect the next batch are that:

  • you never hold the object lock while fetching a batch, which itself might be expensive if the pagequeue lock is heavily contended,
  • if you encounter a long run of pages that aren't freed by the scan (perhaps a large run of pages was wired into the buffer cache, which would be caught by vm_pageout_defer(), which doesn't need the object lock), you won't hold onto the object lock the entire time.

Without data it's hard to say whether the second point matters or not, but I suspect that in your case the first point will go further than this patch towards reducing latency caused by object lock hold times.

sys/vm/vm_pageout.c
1492

Sorry, I meant the __predict_false(object == NULL) check immediately above. At that point, we've released the object lock, if any, so it seems to me that we should reset run as well. It's not very important since this is probably a rare case.

sys/vm/vm_pageout.c
1630

I worked this up and I agree it's a cleaner approach. I'll update the diff here.

However, then I thought about also trying to capture the benefit of not doing the vm_page_free under the object lock. I think there's also a straightforward way to do this. We can just define a "to free" batch queue and push to it instead of freeing, then do the frees whenever we drop the object lock. It may involve some tradeoffs though:

  • Significantly less time under object lock
  • Pages may cool off in cache in the meantime
  • May take slightly longer for freed pages to become available (due to batching)
  • More complex

Here's an unpolished code demo. Ignore the sysctls, they're for testing on/off:
https://github.com/rlibby/freebsd/commit/2346c09d9daf3a80cab9080784e4fbc604d78ce0

Some cursory testing on a VM shows this having half the hold time of the object lock compared to just dropping when refilling the batch queue around vm_pageout_next(), which was already orders of magnitude better (microseconds vs milliseconds).

In earlier testing I did on the first diff revision, that was enough to relieve the aspect of the problem I was looking at, but in case this deferred free idea seems attractive to you, I can polish it further.

Implement markj's suggestion

markj added inline comments.
sys/vm/vm_pageout.c
1630

I like this idea. I believe that in the typical case we'll be freeing pages to a per-CPU cache[*], so freed pages will not be immediately available to other CPUs in general.

We are already touching each page twice: once when putting it into the batch, and again when scanning and freeing the page. So we're already incentivized to keep the batch size fairly small such that it fits in L1/L2 cache, though I don't think this has ever been tuned.

@gallatin might be interested in improvements along these lines. I don't personally deal with pagedaemon throughput these days.

[*] You could go even further and add and use an interface to UMA which lets you free an array of pointers with one call. That would probably be generally useful, though a fair bit of work.

rlibby edited the test plan for this revision.

Thanks, this version looks good. Just a couple of minor comments.

sys/vm/vm_pageout.c
1457

I'd suggest elaborating a bit on why, i.e., 1) we want to avoid holding an object lock for a long time, 2) a batch refill is a natural place to drop the object lock.

1459

Could you please add a bool vm_batchqueue_empty() and use that here?

sys/vm/vm_pageout.c
1457

How about this?

		/*
		 * If we need to refill the scan batch queue, release any
		 * optimistically held object lock.  This gives someone else a
		 * chance to grab the lock, and also avoids holding it while we
		 * do unrelated work.
		 */
1459

Sure, I was thinking about that too. Okay to bundle as a single commit, or should I pull that out separately?

sys/vm/vm_pageout.c
1457

Looks good, thanks.

1459

Having it in the main commit is fine.

markj feedback: elaborate on comment and provide vm_batchqueue_empty()

This revision is now accepted and ready to land. May 22 2024, 5:38 PM
alc added inline comments.
sys/vm/vm_pagequeue.h
358

const struct vm_batchqueue *bq?

360

Style(9) no longer requires blank lines in this situation.

rlibby added inline comments.
sys/vm/vm_pagequeue.h
358

Yep, will fix.

360

I'm happy to remove it. Blank lines seem to be the prevailing style in this file, but if we prefer that all new code omit them, that works for me.

rlibby marked an inline comment as done.

alc feedback: const and style fixups

This revision now requires review to proceed. May 22 2024, 6:48 PM
This revision is now accepted and ready to land. May 22 2024, 6:57 PM
sys/vm/vm_pageout.c
1630

Thanks for pointing me at this. I do indeed care about pagedaemon throughput.

I can try testing this patch on one of our servers...

rlibby added inline comments.
sys/vm/vm_pageout.c
1630

That'd be great. I'm intending to push this patch as it is in Diff 4 tomorrow morning PDT, then following up for the deferred free idea. I fixed up the deferred free sketch to apply on top of Diff 4, but haven't otherwise polished it:
https://github.com/rlibby/freebsd/commit/78f8a6ba6590bd05105cb62c7bcca3eb058bf826

I am not sure whether deferred free will change overall throughput much. I was thinking about it as reducing object lock hold times. It could have an effect when multiple page daemon threads end up intersecting on one object.

I also have an idea for a change in UMA that could be relevant to this path (different from what markj mentions above), but it's currently even less baked. I'll add you both to the review if/when I get either of these posted.

This revision was automatically updated to reflect the committed changes.
sys/vm/vm_pageout.c
1630

I was trying to take a baseline before applying the patch. I could not make dtrace-based lockstat behave (it kept getting drops) on my 32c/64t test box (AMD 7502P) running a Netflix workload (~360Gb/s of static content being served via sendfile by nginx across ~160K TCP connections):

c003.was001.dev# lockstat -n 262144 -P -x bufsize=2048m -x aggsize=4m -D 10 -H sleep 10 > /d/ls
lockstat: warning: 27504438 dynamic variable drops with non-empty dirty list
lockstat: warning: 268567693 dynamic variable drops with non-empty dirty list
lockstat: warning: ran out of data records (use -n for more)

Looking at the suspect data, some of it seems believable, but I don't see the object lock as a problem:
R/W writer hold: 3322950 events in 10.148 seconds (327435 events/sec)

Count indv cuml rcnt nsec Lock Caller

322332 51% 51% 0.00 106095 tcpinp _tcp_lro_flush_tcphpts+0xb6f
460419 25% 76% 0.00 35898 tcpinp tcp_hptsi+0x91d
2355753 6% 82% 0.00 1792 vmobject sendfile_free_mext_pg+0x5f
26778 6% 88% 0.00 151093 tcpinp tcp_usr_ready+0x134
21913 5% 93% 0.00 147985 tcpinp tcp_usr_send+0x69d
8899 2% 95% 0.00 175513 vmobject vm_page_grab_pages_unlocked+0x1ec
7025 2% 97% 0.00 190494 tcpinp tcp_usr_rcvd+0x11e
69996 1% 98% 0.00 8559 tcpinp tcp_hptsi+0x4f6
4030 0% 99% 0.00 46096 vmobject vm_pageout_scan_inactive+0x303
785 0% 99% 0.00 217977 tcpinp tcp_rack_23q12p8_rack_do_segment+0xe0

R/W reader hold: 704040 events in 10.148 seconds (69374 events/sec)

Count indv cuml rcnt nsec Lock Caller

448230 48% 48% 0.00 2581 sharedcwnd tcp_rack_23q12p8_rack_output+0x2b05
79220 19% 67% 0.00 5903 evclass_lock audit_syscall_enter+0x63
46497 11% 78% 0.00 5720 vmobject sys_sendfile+0xeb
28643 6% 83% 0.00 4732 vmobject vnode_pager_generic_getpages_done_async+0xe
20494 4% 87% 0.00 4749 vmobject vn_sendfile+0x296
10374 4% 91% 0.00 8276 vmobject vnode_pager_haspage+0xc8
36478 3% 93% 0.00 1731 pmap pv list vm_page_release_locked+0x74
2950 1% 95% 0.00 10598 sharedcwnd tcp_usr_send+0x264
11836 1% 96% 0.00 2301 sharedcwnd tcp_rack_23q32p7_rack_output+0x2c92
807 1% 96% 0.00 19608 pmap pv list vm_page_test_dirty+0x14

So I built a LOCK_PROFILING kernel, and I see very low "avg" hold times (assuming I'm using it right):
c003.was001.dev# sysctl debug.lock.prof.stats | head -2
debug.lock.prof.stats:

max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name

c003.was001.dev# sysctl debug.lock.prof.stats | grep vmob | sort -r -g -k 6 | head -20

  756        12         777          16           4    194      4  0      3 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:589 (rw:vmobject)
10472         0      627594           0        5211    120      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:582 (rw:vmobject)
10480      4819      645901        7425        7351     87      1  0      8 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:868 (rw:vmobject)
 6096         0        6124           0         104     58      0  0      0 /data/ocafirmware/FreeBSD/sys/kern/vfs_subr.c:2371 (rw:vmobject)
  539         0       32041           0         638     50      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_fault.c:1834 (rw:vmobject)
21497      9087  1472284747    14504866    36536614     40      0  0 1174586 /data/ocafirmware/FreeBSD/sys/vm/vm_page.c:5018 (rw:vmobject)
 1958         0        4056           0         260     15      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1552 (rw:vmobject)
 1958         0        4037           0         260     15      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1545 (rw:vmobject)
16849     22432    32195839     8222718     2496986     12      3  0  29332 /data/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:1484 (rw:vmobject)
 4519      3097      161810        5836       20474      7      0  0    158 /data/ocafirmware/FreeBSD/sys/vm/vm_page.c:2850 (rw:vmobject)
   52         0        1425           0         347      4      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vnode_pager.c:1704 (rw:vmobject)
    6         0         166           0          40      4      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_glue.c:616 (rw:vmobject)
 5300         0      222716           0       62007      3      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_map.c:2665 (rw:vmobject)
    7         0          11           0           3      3      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1679 (rw:vmobject)
 8468      6853     1852784       13290      634195      2      0  0     76 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1333 (rw:vmobject)
 2284      8335    80070231     5497770    39413807      2      0  0 211083 /data/ocafirmware/FreeBSD/sys/vm/vnode_pager.c:1242 (rw:vmobject)
    8         0         580           0         218      2      0  0      0 /data/ocafirmware/FreeBSD/sys/kern/uipc_shm.c:217 (rw:vmobject)
    7         0          19           0           8      2      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1681 (rw:vmobject)
 6399     13797     4639661      262450     2736029      1      0  0    247 /data/ocafirmware/FreeBSD/sys/vm/vm_fault.c:357 (rw:vmobject)
 3444         0       50682           0       44757      1      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:2388 (rw:vmobject)

Again, this is the base kernel, before this diff. My assumption is that we are not seeing issues with long hold times in this case.

sys/vm/vm_pageout.c
1630

I agree it doesn't look like that workload is hitting the issue I was focusing on. Are there many different files in the Netflix workload? You say 160k TCP connections, are they generally for clients reading different files or a few hot ones? Do they tend to be read once completely before the pagedaemon gets to them? In my debugging, the issue was most pronounced when dealing with a single file greater than memory size being continuously read.

Since it doesn't seem like this workload sees much vm object contention, I wouldn't expect either the lock break logic or deferring the page frees to after dropping the object lock to help.

Is there any part of the vm system that you think is not keeping up or are you basically doing line rate on that NIC?

In any case, thanks for taking the time to collect and post data.