Profiling ZFS write performance to ZVOLs with a small (16KB) block size, I found it bottlenecked by the single per-objset dnode sync thread. One could argue that this is a design problem, but it is also a problem that ~40% of that thread's CPU time is spent inside taskqueue_enqueue() and its callees, and in particular ~34% inside wakeup_one() and its callees.
Investigating that, I found two costly effects of sleepq_signal()'s attempt to be fair by waking the longest-sleeping thread of the highest priority (a toy sketch of both selection policies follows the list):
- A full linear scan through the list of sleeping threads takes more time, further multiplied by congestion on the sleepqueue_chain lock. And the scan generally makes no sense for taskqueue(9), since all of its idle threads are identical.
- Waking up the longest-sleeping thread reduces the chance of cache hits, both for the thread itself and for the scheduler (this part is actually visible in the profiler).
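To make the difference concrete, here is a stand-alone toy model of the two selection policies. All structure and helper names below (sq_thread, pick_fair, pick_unfair) are made up for illustration and are not the real sleepqueue code:

    #include <stdio.h>

    /* Toy model of a sleepqueue entry; not the real kernel structures. */
    struct sq_thread {
        const char *name;
        int prio;                 /* lower value = higher priority */
        int sleep_seq;            /* higher value = went to sleep later */
        struct sq_thread *next;   /* list is kept newest-first */
    };

    /*
     * "Fair" selection, as plain sleepq_signal() is described above:
     * scan the whole queue for the highest-priority thread, preferring
     * the longest sleeper among equals.  O(n) and touches every entry.
     */
    static struct sq_thread *
    pick_fair(struct sq_thread *head)
    {
        struct sq_thread *best = head;

        for (struct sq_thread *t = head; t != NULL; t = t->next) {
            if (t->prio < best->prio ||
                (t->prio == best->prio && t->sleep_seq < best->sleep_seq))
                best = t;
        }
        return (best);
    }

    /*
     * "Unfair" selection in the spirit of SLEEPQ_UNFAIR / wakeup_any():
     * take the most recent sleeper (the list head) and ignore priority.
     * O(1), and the picked thread is the most likely to be cache-hot.
     */
    static struct sq_thread *
    pick_unfair(struct sq_thread *head)
    {
        return (head);
    }

    int
    main(void)
    {
        struct sq_thread c = { "c", 20, 1, NULL };  /* oldest sleeper */
        struct sq_thread b = { "b", 10, 2, &c };
        struct sq_thread a = { "a", 20, 3, &b };    /* newest sleeper */

        printf("fair picks:   %s\n", pick_fair(&a)->name);    /* "b" */
        printf("unfair picks: %s\n", pick_unfair(&a)->name);  /* "a" */
        return (0);
    }

The fair pick has to walk the whole list while holding the chain lock; the unfair pick only touches the newest entry, which is also the one most likely to still be warm in cache.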
To address both of those effects, this change introduces a new sleepq_signal() flag, SLEEPQ_UNFAIR, and a new wakeup_any() function. They declare no priority or sleep-time fairness, but do exactly the opposite: wake up the thread that has been sleeping the shortest time, without looking at its priority. It would be clean and simple to do just that, but I also had to add a workaround for the case when a thread already present in the sleepqueue is still locked by an ongoing context switch. I found it beneficial to avoid spinning on that lock by choosing another thread, if there is a choice. I hope it is not too dirty.
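For reference, wakeup_any() is intended to be a thin wrapper over sleepq_signal() with the new flag, mirroring the existing wakeup_one(). The snippet below is only an approximate sketch of that shape, not the exact patch; the header list and the wakeup_swapper/kick_proc0() handling are assumptions:

    /*
     * Approximate shape only; the real kern_synch.c includes more
     * headers and may differ in details.
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/proc.h>
    #include <sys/sleepqueue.h>

    /*
     * Wake one thread sleeping on ident with no fairness guarantees:
     * SLEEPQ_UNFAIR makes sleepq_signal() pick the most recent sleeper
     * instead of scanning for the highest-priority/longest-sleeping one.
     */
    void
    wakeup_any(void *ident)
    {
        int wakeup_swapper;

        sleepq_lock(ident);
        wakeup_swapper = sleepq_signal(ident, SLEEPQ_SLEEP | SLEEPQ_UNFAIR,
            0, 0);
        sleepq_release(ident);
        if (wakeup_swapper)
            kick_proc0();
    }

With that in place, the taskqueue(9) enqueue path can presumably call wakeup_any() where it previously called wakeup_one().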
As a nice side effect of this change, since it no longer round-robins through all the taskqueue threads but uses only the minimal required number, the ZFS bottleneck mentioned above is now clearly visible in top, whereas previously it was unclear why writes were capped at ~3GB/s while everything looked quite idle.