
vm_pageout_scan_inactive: take a lock break
Closed, Public

Authored by rlibby on Tue, May 21, 7:32 PM.
Details

Summary

vm_pageout_scan_inactive: take a lock break

In vm_pageout_scan_inactive, release the object lock when we go to
refill the scan batch queue so that someone else has a chance to acquire
it. This improves access latency to the object when the pagedaemon is
processing many consecutive pages from a single object, and also in any
case avoids a hiccup during refill for the last touched object.
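
A reconstructed sketch of the shape of the change (not a verbatim excerpt; ss is the local scan state whose bq member is the batch queue):

    /*
     * Sketch: at the top of the scan loop, before refilling the batch
     * queue, drop any optimistically held object lock so that someone
     * else has a chance to acquire it.
     */
    if (object != NULL && vm_batchqueue_empty(&ss.bq)) {
            VM_OBJECT_WUNLOCK(object);
            object = NULL;
    }
    if ((m = vm_pageout_next(&ss, true)) == NULL)
            break;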

Sponsored by: Dell EMC Isilon

Test Plan
    truncate -s 10G sparse.10G
    dd if=sparse.10G of=/dev/zero &
    sudo lockstat -P -x aggsize=4m -D 10 -H sleep 10

Empirically observe that the vmobject average hold time decreased from milliseconds to microseconds.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

rlibby edited the test plan for this revision.
sys/vm/vm_pageout.c
1492

Don't you want to reset run = 0; before the object == NULL check?

1630

The inactive queue scan operates on batches of pages fetched periodically by vm_pageout_next(). That fetching involves acquiring a page queue lock and a bunch of other work; that'd strike me as a more natural place to drop the object lock, especially since the batch size is close to the proposed value for this tunable.

Did you consider that already? The implementation would require some more work since vm_pageout_next() doesn't give the caller any info about what it's doing internally, but I think it's a more natural way to approach the problem.
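
For reference, vm_pageout_next() is approximately the following (paraphrased from sys/vm/vm_pageout.c; the real function may differ in detail):

    static vm_page_t
    vm_pageout_next(struct scan_state *ss, const bool dequeue)
    {

            /* Refill the batch from the page queue when it runs dry. */
            if (ss->bq.bq_cnt == 0)
                    vm_pageout_collect_batch(ss, dequeue);
            return (vm_batchqueue_pop(&ss->bq));
    }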

sys/vm/vm_pageout.c
1492

If I understand the code suggestion, that would reset it on every loop iteration. I'm trying to count the loop iterations made under this object lock.

1630

No, I didn't try that, but I can look into it.

In the current patch, when the object lock is dropped, it will be dropped across the vm_pageout_next() logic since we don't reacquire it until the top of the next loop. However, in this approach, we only drop it when we are actually freeing a page, and not in the skip_page or reinsert conditions. I'm on the fence about whether that is desirable or not.

sys/vm/vm_pageout.c
1630

The nice things about dropping the object lock when we go to collect the next batch are that:

  • you never hold the object lock while fetching a batch, which itself might be expensive if the pagequeue lock is heavily contended,
  • if you encounter a long run of pages that aren't freed by the scan (perhaps a large run of pages was wired into the buffer cache; that case is caught by vm_pageout_defer(), which doesn't need the object lock), you won't hold onto the object lock the entire time.

Without data it's hard to say whether the second point matters or not, but I suspect that in your case the first point will go further than this patch towards reducing latency caused by object lock hold times.

sys/vm/vm_pageout.c
1492

Sorry, I meant the __predict_false(object == NULL) check immediately above. At that point, we've released the object lock, if any, so it seems to me that we should reset run as well. It's not very important, since this is probably a rare case.
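
Something like this, hypothetically (run is the per-object loop counter from the first diff, which isn't shown here):

    object = atomic_load_ptr(&m->object);
    if (__predict_false(object == NULL)) {
            /* The lock was released above, so restart the run count too. */
            run = 0;
            continue;
    }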

sys/vm/vm_pageout.c
1630

I worked this up and I agree it's a cleaner approach. I'll update the diff here.

However, then I thought about also trying to capture the benefit of not doing the vm_page_free under the object lock. I think there's also a straightforward way to do this. We can just define a "to free" batch queue and push to it instead of freeing, then do the frees whenever we drop the object lock. It may involve some tradeoffs though:

  • Significantly less time under object lock
  • Pages may cool off in cache in the meantime
  • May take slightly longer for freed pages to become available (due to batching)
  • More complex

Here's an unpolished code demo. Ignore the sysctls; they're just there to toggle the behavior on and off for testing:
https://github.com/rlibby/freebsd/commit/2346c09d9daf3a80cab9080784e4fbc604d78ce0
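
In outline, the idea is something like the following simplified sketch (not the linked commit; note that vm_page_free() normally requires the object lock for object-backed pages, so a real version has to dissociate each page before the lock is dropped, a detail this sketch elides):

    vm_page_t freeq[VM_BATCHQUEUE_SIZE];    /* local "to free" batch */
    int i, nfree = 0;

    /* Under the object lock, where vm_page_free(m) happens today: */
    freeq[nfree++] = m;
    if (nfree == nitems(freeq)) {
            VM_OBJECT_WUNLOCK(object);
            for (i = 0; i < nfree; i++)
                    vm_page_free(freeq[i]); /* simplified, see above */
            nfree = 0;
            VM_OBJECT_WLOCK(object);
    }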

Some cursory testing on a VM shows this halving the object lock hold time relative to just dropping the lock when refilling the batch queue around vm_pageout_next(), which was itself already orders of magnitude better (microseconds vs. milliseconds).

In earlier testing I did on the first diff revision, that was enough to relieve the aspect of the problem I was looking at, but in case this deferred free idea seems attractive to you, I can polish it further.

Implement markj's suggestion

markj added inline comments.
sys/vm/vm_pageout.c
1630

I like this idea. I believe that in the typical case we'll be freeing pages to a per-CPU cache[*], so freed pages will not be immediately available to other CPUs in general.

We are already touching each page twice: once when putting it into the batch, and again when scanning and freeing the page. So we're already incentivized to keep the batch size fairly small such that it fits in L1/L2 cache, though I don't think this has ever been tuned.

@gallatin might be interested in improvements along these lines. I don't personally deal with pagedaemon throughput these days.

* You could go even further and add and use an interface to UMA which lets you free an array of pointers with one call. That would probably be generally useful, though a fair bit of work.
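
For instance, a hypothetical prototype (no such interface exists today; name and signature are illustrative):

    /* Hypothetical: free an array of 'count' items to 'zone' in one call. */
    void    uma_zfree_array(uma_zone_t zone, void **items, int count);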

rlibby edited the test plan for this revision.

Thanks, this version looks good. Just a couple of minor comments.

sys/vm/vm_pageout.c
1457

I'd suggest elaborating a bit on why, i.e., 1) we want to avoid holding an object lock for a long time, 2) a batch refill is a natural place to drop the object lock.

1459

Could you please add a bool vm_batchqueue_empty() and use that here?
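
Presumably something along these lines (sketch only; the actual helper may differ):

    static inline bool
    vm_batchqueue_empty(struct vm_batchqueue *bq)
    {

            return (bq->bq_cnt == 0);
    }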

sys/vm/vm_pageout.c
1457

How about this?

		/*
		 * If we need to refill the scan batch queue, release any
		 * optimistically held object lock.  This gives someone else a
		 * chance to grab the lock, and also avoids holding it while we
		 * do unrelated work.
		 */
1459

Sure, I was thinking about that too. Okay to bundle as a single commit, or should I pull that out separately?

sys/vm/vm_pageout.c
1457

Looks good, thanks.

1459

Having it in the main commit is fine.

markj feedback: elaborate on comment and provide vm_batchqueue_empty()

This revision is now accepted and ready to land. (Wed, May 22, 5:38 PM)
alc added inline comments.
sys/vm/vm_pagequeue.h
358

const struct vm_batchqueue *bq?

360

Style(9) no longer requires blank lines in this situation.

rlibby added inline comments.
sys/vm/vm_pagequeue.h
358

Yep, will fix.

360

I'm happy to remove it. The blank line seems to be prevailing in this file, but if we prefer all new code omitting it, that works for me.

rlibby marked an inline comment as done.

alc feedback: const and style fixups

This revision now requires review to proceed. (Wed, May 22, 6:48 PM)
This revision is now accepted and ready to land. (Wed, May 22, 6:57 PM)
sys/vm/vm_pageout.c
1630

Thanks for pointing me at this. I do indeed care about pagedaemon throughput.

I can try testing this patch on one of our servers...

rlibby added inline comments.
sys/vm/vm_pageout.c
1630

That'd be great. I'm intending to push this patch as it is in Diff 4 tomorrow morning PDT, and then follow up with the deferred free idea. I fixed up the deferred free sketch to apply on top of Diff 4, but haven't otherwise polished it:
https://github.com/rlibby/freebsd/commit/78f8a6ba6590bd05105cb62c7bcca3eb058bf826

I am not sure whether deferred free will change overall throughput much. I was thinking about it as reducing object lock hold times. It could have an effect when multiple page daemon threads end up intersecting on one object.

I also have an idea for a change in UMA that could be relevant to this path (different from what markj mentions above), but it's currently even less baked. I'll add you both to the review if/when I get either of these posted.

This revision was automatically updated to reflect the committed changes.
sys/vm/vm_pageout.c
1630

I was trying to take a baseline before using the patch. I could not make dtrace-based lockstat behave (it kept getting drops) on my 32c/64t test box (AMD 7502P) running a Netflix workload (~360Gb/s of static content being served via sendfile by nginx across ~160K TCP connections):

    c003.was001.dev# lockstat -n 262144 -P -x bufsize=2048m -x aggsize=4m -D 10 -H sleep 10 > /d/ls
    lockstat: warning: 27504438 dynamic variable drops with non-empty dirty list
    lockstat: warning: 268567693 dynamic variable drops with non-empty dirty list
    lockstat: warning: ran out of data records (use -n for more)

Looking at the (suspect) data, some of it seems believable, but I don't see the object lock as a problem:
    R/W writer hold: 3322950 events in 10.148 seconds (327435 events/sec)

      Count   indv cuml rcnt     nsec Lock         Caller
     322332    51%  51% 0.00   106095 tcpinp       _tcp_lro_flush_tcphpts+0xb6f
     460419    25%  76% 0.00    35898 tcpinp       tcp_hptsi+0x91d
    2355753     6%  82% 0.00     1792 vmobject     sendfile_free_mext_pg+0x5f
      26778     6%  88% 0.00   151093 tcpinp       tcp_usr_ready+0x134
      21913     5%  93% 0.00   147985 tcpinp       tcp_usr_send+0x69d
       8899     2%  95% 0.00   175513 vmobject     vm_page_grab_pages_unlocked+0x1ec
       7025     2%  97% 0.00   190494 tcpinp       tcp_usr_rcvd+0x11e
      69996     1%  98% 0.00     8559 tcpinp       tcp_hptsi+0x4f6
       4030     0%  99% 0.00    46096 vmobject     vm_pageout_scan_inactive+0x303
        785     0%  99% 0.00   217977 tcpinp       tcp_rack_23q12p8_rack_do_segment+0xe0

    R/W reader hold: 704040 events in 10.148 seconds (69374 events/sec)

      Count   indv cuml rcnt     nsec Lock         Caller
     448230    48%  48% 0.00     2581 sharedcwnd   tcp_rack_23q12p8_rack_output+0x2b05
      79220    19%  67% 0.00     5903 evclass_lock audit_syscall_enter+0x63
      46497    11%  78% 0.00     5720 vmobject     sys_sendfile+0xeb
      28643     6%  83% 0.00     4732 vmobject     vnode_pager_generic_getpages_done_async+0xe
      20494     4%  87% 0.00     4749 vmobject     vn_sendfile+0x296
      10374     4%  91% 0.00     8276 vmobject     vnode_pager_haspage+0xc8
      36478     3%  93% 0.00     1731 pmap pv list vm_page_release_locked+0x74
       2950     1%  95% 0.00    10598 sharedcwnd   tcp_usr_send+0x264
      11836     1%  96% 0.00     2301 sharedcwnd   tcp_rack_23q32p7_rack_output+0x2c92
        807     1%  96% 0.00    19608 pmap pv list vm_page_test_dirty+0x14

So I built a LOCK_PROFILING kernel, and I see very low "avg" hold times (assuming I'm using it right):
    c003.was001.dev# sysctl debug.lock.prof.stats | head -2
    debug.lock.prof.stats:
      max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name

    c003.was001.dev# sysctl debug.lock.prof.stats | grep vmob | sort -r -g -k 6 | head -20
      756        12         777          16           4    194      4  0      3 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:589 (rw:vmobject)
    10472         0      627594           0        5211    120      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:582 (rw:vmobject)
    10480      4819      645901        7425        7351     87      1  0      8 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:868 (rw:vmobject)
     6096         0        6124           0         104     58      0  0      0 /data/ocafirmware/FreeBSD/sys/kern/vfs_subr.c:2371 (rw:vmobject)
      539         0       32041           0         638     50      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_fault.c:1834 (rw:vmobject)
    21497      9087  1472284747    14504866    36536614     40      0  0 1174586 /data/ocafirmware/FreeBSD/sys/vm/vm_page.c:5018 (rw:vmobject)
     1958         0        4056           0         260     15      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1552 (rw:vmobject)
     1958         0        4037           0         260     15      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1545 (rw:vmobject)
    16849     22432    32195839     8222718     2496986     12      3  0  29332 /data/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:1484 (rw:vmobject)
     4519      3097      161810        5836       20474      7      0  0    158 /data/ocafirmware/FreeBSD/sys/vm/vm_page.c:2850 (rw:vmobject)
       52         0        1425           0         347      4      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vnode_pager.c:1704 (rw:vmobject)
        6         0         166           0          40      4      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_glue.c:616 (rw:vmobject)
     5300         0      222716           0       62007      3      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_map.c:2665 (rw:vmobject)
        7         0          11           0           3      3      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1679 (rw:vmobject)
     8468      6853     1852784       13290      634195      2      0  0     76 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1333 (rw:vmobject)
     2284      8335    80070231     5497770    39413807      2      0  0 211083 /data/ocafirmware/FreeBSD/sys/vm/vnode_pager.c:1242 (rw:vmobject)
        8         0         580           0         218      2      0  0      0 /data/ocafirmware/FreeBSD/sys/kern/uipc_shm.c:217 (rw:vmobject)
        7         0          19           0           8      2      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:1681 (rw:vmobject)
     6399     13797     4639661      262450     2736029      1      0  0    247 /data/ocafirmware/FreeBSD/sys/vm/vm_fault.c:357 (rw:vmobject)
     3444         0       50682           0       44757      1      0  0      0 /data/ocafirmware/FreeBSD/sys/vm/vm_object.c:2388 (rw:vmobject)

Again, this is the base kernel, before this diff. My assumption is that we are not seeing issues with long hold times in this case.

sys/vm/vm_pageout.c
1630

I agree it doesn't look like that workload is hitting the issue I was focusing on. Are there many different files in the Netflix workload? You say 160K TCP connections; are they generally clients reading different files, or a few hot ones? Do they tend to be read once completely before the pagedaemon gets to them? In my debugging, the issue was most pronounced when a single file larger than memory was being continuously read.

Since it doesn't seem like this workload sees much vm object contention, I wouldn't expect either the lock break logic or deferring the page frees to after dropping the object lock to help.

Is there any part of the vm system that you think is not keeping up, or are you basically doing line rate on that NIC?

In any case, thanks for taking the time to collect and post data.