Differential D20669

Cheaper wakeup_one()/sleepq_signal() alternative for taskqueue purposes.
ClosedPublic
Actions

Authored by mav on Jun 17 2019, 2:17 AM.

Details

Reviewers

mmacy
kib
jhb
markj
jeff
allanjude

Group Reviewers

manpages

Commits

rS349220: Add wakeup_any(), cheaper wakeup_one() for taskqueue(9).

Summary

Profiling ZFS write performance to ZVOLs with small (16KB) block size I found it bottle-necked by single per-objset dnodes sync thread. It could possibly be told that it is a bad design, but it is also bad that ~40% of that thread CPU time is spent inside taskqueue_enqueue() and descendants, and in particular, ~34% is spent inside wakeup_one() and descendants.

Investigating that I found two factors caused by attempt of sleepq_signal() to be fair, waking the longest sleeping thread with highest priority:

Full linear scan through the list of sleeping threads takes more time, that additionally multiplied by the sleepqueue_chain lock congestion. But it generally makes no any sense for taskqueue, since all its idle threads are identical.
Waking up the longest sleeping thread reduces chance of cache hits for both the thread itself and also the scheduler (this part is actually visible in profiler).

To address both of those effects this change introduces new sleepq_signal() flag SLEEPQ_UNFAIR and new wakeup_any() function, not declaring any priority or sleep time fairness, but doing exactly opposite -- waking up thread sleeping less time, not looking on it priority. It would be clean and simple to do just that, but I had to also add workaround for case when thread already present in the sleepqueue is still locked by the ongoing context switch. I found it beneficial to avoid lock spinning by choosing other thread, if there is a choice. Hope it is not too dirty.

As a nice side effect this change, due to not doing round-robin through all the taskqueue threads, but using minimal required number, the mentioned above ZFS bottleneck can now be clearly visible in top, while previously it was unclear why the write is capped at ~3GB/s while everything is quite idle.

Test Plan

As result of this change, benchmarks on 72-core Xeon v4 show ~34% of CPU time reduction from wakeup_one() to wakeup_any(), and resulting ~10% ZFS throughput increase on sequential write to 12 ZVOLs with 16KB volblocksize. Final test with all the changes except the actual switch of taskqueue to wakeup_any() shown the same number as before.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

mav created this revision.Jun 17 2019, 2:17 AM

Herald added a subscriber: imp. · View Herald TranscriptJun 17 2019, 2:17 AM

Harbormaster completed remote builds in B24889: Diff 58720.Jun 17 2019, 2:17 AM

mav updated this revision to Diff 58722.Jun 17 2019, 2:19 AM

Harbormaster completed remote builds in B24890: Diff 58722.Jun 17 2019, 2:19 AM

mav updated this revision to Diff 58723.Jun 17 2019, 2:20 AM

Harbormaster completed remote builds in B24891: Diff 58723.Jun 17 2019, 2:20 AM

mav edited the summary of this revision. (Show Details)Jun 17 2019, 2:26 AM

markj added inline comments.Jun 17 2019, 8:20 PM

kern/subr_sleepqueue.c
920 ↗	(On Diff #58723)	Can't we just change the insertion order in sleepq_add()? I don't see anything else that relies on it.
kern/subr_taskqueue.c
807 ↗	(On Diff #58723)	Should gtaskqueue_thread_enqueue() be updated similarly?

mav added inline comments.Jun 17 2019, 8:26 PM

kern/subr_sleepqueue.c
920 ↗	(On Diff #58723)	I actually did that first, but then I found that there is also list traversal in sleepq_remove_matching(), which may benefit from some fairness. If somebody tell me that wakeup order in case of sleepq_broadcast() is irrelevant, I'll happily switch to that.
kern/subr_taskqueue.c
807 ↗	(On Diff #58723)	May be. I haven't searched through the tree for othse cases. I have no idea about gtaskqueue, never used it, and it seems to not have a man page.

I don't object to the change but it should be straightforward to implement a priority-sorted queue: threads of equal priority live on a list, and one thread of each priority lives on an stailq (and this thread contains the list head). Insertion requires searching the queue, but this should be cheap most of the time because the space of possible priorities is small, and the thread is about to go to sleep anyway. Then sleepq_signal() can simply remove a thread from the head of the stailq, so other callers of wakeup_one() would benefit. This requires an extra pointer in the thread struct, but we already have a large amount of internal fragmentation in thread slabs.

If you prefer to add wakeup_any(), it should be documented in sleep(9).

kern/subr_sleepqueue.c
920 ↗	(On Diff #58723)	I suppose that sx locks may benefit from waking up the oldest waiter first, but there is no guarantee that it will be scheduled before other waking threads.