Page MenuHomeFreeBSD

Cheaper wakeup_one()/sleepq_signal() alternative for taskqueue purposes.

Authored by mav on Mon, Jun 17, 2:17 AM.



Profiling ZFS write performance to ZVOLs with small (16KB) block size I found it bottle-necked by single per-objset dnodes sync thread. It could possibly be told that it is a bad design, but it is also bad that ~40% of that thread CPU time is spent inside taskqueue_enqueue() and descendants, and in particular, ~34% is spent inside wakeup_one() and descendants.

Investigating that I found two factors caused by attempt of sleepq_signal() to be fair, waking the longest sleeping thread with highest priority:

  • Full linear scan through the list of sleeping threads takes more time, that additionally multiplied by the sleepqueue_chain lock congestion. But it generally makes no any sense for taskqueue, since all its idle threads are identical.
  • Waking up the longest sleeping thread reduces chance of cache hits for both the thread itself and also the scheduler (this part is actually visible in profiler).

To address both of those effects this change introduces new sleepq_signal() flag SLEEPQ_UNFAIR and new wakeup_any() function, not declaring any priority or sleep time fairness, but doing exactly opposite -- waking up thread sleeping less time, not looking on it priority. It would be clean and simple to do just that, but I had to also add workaround for case when thread already present in the sleepqueue is still locked by the ongoing context switch. I found it beneficial to avoid lock spinning by choosing other thread, if there is a choice. Hope it is not too dirty.

As a nice side effect this change, due to not doing round-robin through all the taskqueue threads, but using minimal required number, the mentioned above ZFS bottleneck can now be clearly visible in top, while previously it was unclear why the write is capped at ~3GB/s while everything is quite idle.

Test Plan

As result of this change, benchmarks on 72-core Xeon v4 show ~34% of CPU time reduction from wakeup_one() to wakeup_any(), and resulting ~10% ZFS throughput increase on sequential write to 12 ZVOLs with 16KB volblocksize. Final test with all the changes except the actual switch of taskqueue to wakeup_any() shown the same number as before.

Diff Detail

rS FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mav created this revision.Mon, Jun 17, 2:17 AM
mav updated this revision to Diff 58722.Mon, Jun 17, 2:19 AM
mav updated this revision to Diff 58723.Mon, Jun 17, 2:20 AM
mav edited the summary of this revision. (Show Details)Mon, Jun 17, 2:26 AM
markj added inline comments.Mon, Jun 17, 8:20 PM
920 ↗(On Diff #58723)

Can't we just change the insertion order in sleepq_add()? I don't see anything else that relies on it.

807 ↗(On Diff #58723)

Should gtaskqueue_thread_enqueue() be updated similarly?

mav added inline comments.Mon, Jun 17, 8:26 PM
920 ↗(On Diff #58723)

I actually did that first, but then I found that there is also list traversal in sleepq_remove_matching(), which may benefit from some fairness. If somebody tell me that wakeup order in case of sleepq_broadcast() is irrelevant, I'll happily switch to that.

807 ↗(On Diff #58723)

May be. I haven't searched through the tree for othse cases. I have no idea about gtaskqueue, never used it, and it seems to not have a man page.

markj accepted this revision.Wed, Jun 19, 4:58 PM

I don't object to the change but it should be straightforward to implement a priority-sorted queue: threads of equal priority live on a list, and one thread of each priority lives on an stailq (and this thread contains the list head). Insertion requires searching the queue, but this should be cheap most of the time because the space of possible priorities is small, and the thread is about to go to sleep anyway. Then sleepq_signal() can simply remove a thread from the head of the stailq, so other callers of wakeup_one() would benefit. This requires an extra pointer in the thread struct, but we already have a large amount of internal fragmentation in thread slabs.

If you prefer to add wakeup_any(), it should be documented in sleep(9).

920 ↗(On Diff #58723)

I suppose that sx locks may benefit from waking up the oldest waiter first, but there is no guarantee that it will be scheduled before other waking threads.

This revision is now accepted and ready to land.Wed, Jun 19, 4:58 PM
mav added a comment.Wed, Jun 19, 5:08 PM

I did think about sorting the queue on priority. What stopped me is the fact that priority may change.

markj added a comment.Wed, Jun 19, 5:19 PM
In D20669#447323, @mav wrote:

I did think about sorting the queue on priority. What stopped me is the fact that priority may change.

We have turnstile_adjust() to handle this scenario for turnstiles; I don't see why it is a fatal problem here.

This revision was automatically updated to reflect the committed changes.