Add a new PI_SOFTCLOCK for use by softclock threads. Currently this
maps to PI_AV which is the second-highest ithread priority.
This is perhaps a bit more of an "is this a good idea?" review. It is intended to resolve the issues described in D29663.
I'd say such a re-balance is impossible without a proper re-examination of what priority levels are used for what these days. Until very recently we had Giant-locked callout handlers, sometimes blocking console refresh for seconds, for example. Those are all fixed now, but we probably still have some overly heavy handlers that don't really need high priority. I suspect the system may have more real-time tasks than those. Or do we just affirm that any callout handler taking a non-trivial amount of time is evil?
I think that the answer is basically yes: any callout handler doing a non-trivial amount of work is buggy, unlike a (threaded) interrupt handler doing a lot of work. We have enough mechanisms like fast taskqueues and taskqueue_enqueue_timeout() etc. to delegate heavy work to a proper context.
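As a rough, hedged illustration of that delegation pattern (the softc layout, my_task_fn, and the 10 ms delay are hypothetical names/values, not from this review), a driver can push heavy work out of callout context onto a taskqueue:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/taskqueue.h>

struct my_softc {
        struct timeout_task delayed_task;       /* hypothetical softc field */
};

/* Heavy work runs here, in a taskqueue thread, not in softclock. */
static void
my_task_fn(void *arg, int pending)
{
        struct my_softc *sc = arg;

        /* ... expensive processing ... */
        (void)sc;
}

static void
my_attach(struct my_softc *sc)
{
        TIMEOUT_TASK_INIT(taskqueue_thread, &sc->delayed_task, 0,
            my_task_fn, sc);
}

static void
my_schedule(struct my_softc *sc)
{
        /* Run my_task_fn() roughly 10 ms from now, in taskqueue context. */
        taskqueue_enqueue_timeout(taskqueue_thread, &sc->delayed_task,
            hz / 100);
}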
But I do not like assigning the highest ithread priority to the clock thread. I think it is wise to leave some freedom for special consumers to claim that they are more important than system handlers, if desired. PI_AV is used by audio drivers (I do not believe that this is reasonable), and I also used it for DMAR fault/logging interrupts. I do not see why the clock cannot live at PI_AV, leaving some priorities free for something that feels more important.
Adding @glebius since he commented on the old review. I think kib@'s suggestion of PI_AV is probably fine, but do we also want to have a slightly larger discussion about what the relative priority of different interrupts should be? Do we think network vs storage should still be different and if so does the existing order (network over disk) still make sense? The existing priorities today are:
#define PI_REALTIME     (PRI_MIN_ITHD + 0)
#define PI_AV           (PRI_MIN_ITHD + 4)
#define PI_NET          (PRI_MIN_ITHD + 8)
#define PI_DISK         (PRI_MIN_ITHD + 12)
#define PI_TTY          (PRI_MIN_ITHD + 16)
#define PI_DULL         (PRI_MIN_ITHD + 20)
PI_REALTIME is only used by al_eth(4) (that seems like it is probably wrong) and asmc(4) (that may be ok), and by INTR_TYPE_CLK. All of the "real" clock interrupt handlers are filters and don't use the ithread at all (two GPIO drivers, aw_gpio and gpiopps, use threaded handlers with INTR_TYPE_CLK). For callouts it is basically assumed (e.g. C_DIRECT_EXEC) that the interrupt handler runs as a filter, and I consider softclock to be, in effect, the ithread handler for clock interrupts.
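For reference, a direct-execution callout looks roughly like this ("sc", "sc->timer", and "my_handler" are hypothetical names); the handler runs from the clock interrupt (filter) context rather than from a softclock thread, so it must not sleep or take non-spin locks:

/*
 * With C_DIRECT_EXEC the handler is invoked directly from the hard
 * clock interrupt instead of being queued to a softclock thread.
 */
callout_reset_sbt(&sc->timer, SBT_1MS, 0, my_handler, sc, C_DIRECT_EXEC);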
PI_AV is used by INTR_TYPE_AV and dmar.
So this is complicated....
I'm starting to think we need a PI_FLASH which is separate from PI_DISK. NVMe is on the cusp of being fundamentally different in its needs from RAID and/or AHCI HBAs, though I don't have any good data to prove this assertion at present. I have some weak, anecdotal evidence that's suggestive, though.
We sometimes see (or saw) at work cases where heavy-weight PI_NET jobs would starve NVMe (and in extreme cases mpr/mps, though that was conjecture rather than something actually measured). Drew bumped the priority of PI_DISK so it was more important than PI_NET (or maybe lowered PI_NET below other things including PI_DISK, I forget) in response to timeouts we were seeing that turned out to be scheduling-induced rather than misbehaving hardware. cperciva observed something similar in AWS, which often triggered a race between the error recovery and completion code. To get IOPS up on NVMe, you really want the completion ISR to run ahead of almost everything else so more work can be scheduled to keep the queues full (though I don't care too much about IOPS, I know others do, and I suspect this will become a discussion point once mav@ has cleared out all the other IOPS-limiting issues in the system).
I think that PI_AV is fine for this, btw, though maybe we need a PI_TIMEOUT that defaults to PI_AV if it's not otherwise defined, if you want people to be able to tweak this stuff.... Better to do it in the kernel config (IMHO) with some supported mechanism than via hacks to priority.h, but maybe I'm too optimistic that it would be doable without much hassle.
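A minimal sketch of that fallback, assuming a hypothetical PI_TIMEOUT knob (this is not part of the patch under review):

/*
 * Hypothetical: let the kernel config override the softclock priority
 * (e.g. "options PI_TIMEOUT=..."), falling back to PI_AV otherwise.
 */
#ifndef PI_TIMEOUT
#define PI_TIMEOUT      PI_AV
#endif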
I think those two quotes are related. If network interrupt threads didn't try to monopolize the CPU, we would not be having this discussion. From an interrupt latency perspective disks are usually much more forgiving than NICs, since they don't lose packets on delays; but unlike NICs, disks rarely cause a DoS, which is why I suppose there is a wish to swap them. But whatever order is selected there, user-space will still suffer, since none of those interrupt threads are time-share. I think there should be some sort of throttling, or some voluntary or involuntary context switching for interrupt threads running for too long; otherwise we may discuss the priorities forever.
Yea. The problem isn't so much the relative priority, but that the amount of work being done in this context affects others' ability to get interrupts....
So in the older review I had more or less stated that timers should probably run ahead of all other interrupts, but that the other interrupts should probably all run at about the same priority. If we had a way to force interrupt threads to timeshare (in essence), then I wonder how much that would resolve some of the other concerns, especially if we were to collapse most interrupts to the same priority so that, for example, network and disk handlers would effectively time-share rather than network always starving disk.
I'm good with this if it improves timer accuracy.
Did you do any statistics from user-space on this?
I've tried several times to make a 1.0 ms timer for an audio app on FreeBSD, but time after time it deviates from the 1.0 ms point.
--HPS
Hans, what API have you used for your timers? Many of the user-level time APIs should not depend on the callout threads at all; they wake the user thread directly from interrupt context via CALLOUT_DIRECT, so they should not care about this priority. In any case, in the default kernel the callout threads are not bound to CPUs, so unless your system is completely busy the callout threads should freely migrate to other CPUs if the right one is busy.
Alexander, I've tried using clock_nanosleep() and usleep(). It was almost impossible to sustain a precise 1.0 ms UDP packet send from user-space. Just try it for yourself.
I measured the time using tcpdump.
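For what it's worth, here is a minimal sketch of the kind of test being described (my assumption; the UDP send is omitted). Using an absolute deadline with clock_nanosleep() avoids accumulating the sleep/wakeup overhead itself, though it cannot hide scheduling latency:

#include <time.h>

int
main(void)
{
        struct timespec next;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
                /* Advance the absolute deadline by exactly 1.0 ms. */
                next.tv_nsec += 1000000;
                if (next.tv_nsec >= 1000000000) {
                        next.tv_nsec -= 1000000000;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                /* A real test would send the UDP packet here. */
        }
}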
Don't all real-time priority threads run to completion? In other words, if all ithreads are put at the same priority, then there is no preemption to let another ithread run until the current ithread finishes? To put it differently, making ithreads time-share interrupt time requires changes to the schedulers, not just adjusting the priorities.
BTW: Another thing we should do is bump the default hz to 2000, because many applications operate at millisecond granularity, and task switching every ms is sometimes too slow!
--HPS
@hselasky Do you know of time-critical places that still depend on hz and have not been rewritten to use the _sbt() variants?
@mav : This is not about sbt vs HZ.
When multiple programs run at the same priority, I think they are shifted around every system tick?
Every (or every n'th) stathz tick, not hz, and only for user-space time-share threads, not kernel/real-time ones. The scheduler uses hz for some accounting, but IIRC does not schedule on it.
@mav: It is user-space I'm most interested in. I have a program that sleeps exactly 1 ms and then wakes up to send a UDP packet. It is very difficult to get it running correctly, because of scheduling taking time (this is my suspicion), and I believe HZ=2000 helps.
If you have time you can install audio/hpsjam and audio/jack. Then set up a server and watch the timing of the UDP packets going out on the wire.
--HPS
Yes, it would require scheduler changes. Also, if we do those changes we don't necessarily have to collapse them all down to the same priority. I think I want to try this next: my idea is to add time-sharing for ithreads by setting some sort of quantum for them (settable via a tunable) and forcing a preemption if an ithread uses an entire quantum without yielding. I would also lower the priority of the ithread one "level" (probably 4 values) each time it is preempted (but never lower than PRI_MAX_ITHD). When an ithread yields it would resume its "normal" priority. There are some extra wrinkles around priority propagation, but I think this is doable, and unlike my previous ideas that depended on voluntary yielding in drivers, this approach doesn't require driver changes and doesn't require tying driver ISRs to the same ithread in order to get sharing.
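To make that concrete, here is a rough sketch of just the demotion/restore step (not the actual patch; the td_base_ithd_pri field and both function names are assumptions, and thread locking is omitted):

#include <sys/param.h>
#include <sys/proc.h>
#include <sys/sched.h>

#define ITHD_PRI_STEP   4       /* one priority "level" */

/* Called when an ithread burns a full quantum without yielding. */
static void
ithread_demote(struct thread *td)
{
        u_char pri;

        pri = td->td_priority + ITHD_PRI_STEP;
        if (pri > PRI_MAX_ITHD)         /* never leave the ithread range */
                pri = PRI_MAX_ITHD;
        sched_prio(td, pri);
}

/* Called when the ithread voluntarily goes idle again. */
static void
ithread_restore(struct thread *td)
{
        sched_prio(td, td->td_base_ithd_pri);   /* hypothetical field */
}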
It's possible for userspace to schedule callouts directly, via kevent(EVFILT_TIMER). Periodic kevent timers rearm themselves. One problem we've seen is that kevent timer periods can be shorter than the amount of time it takes to schedule a callout, so you can cause a softclock thread to consume 100% of a core doing nothing but rearming itself. Kostik mitigated this by ensuring that SIGSTOP and SIGKILL pause periodic timers. But now it's even worse since the softclock thread will preempt everything else including realtime threads. So an unprivileged user can bring down the system easily, e.g., by preempting watchdogd forever. I tend to think that callouts armed directly by userspace (kevent(EVFILT_TIMER), setitimer(2), probably some others) should be scheduled at a relatively low priority. Or at least they should somehow be differentiated from driver callouts, e.g., by limiting the maximum rate, though I have no idea how to choose that rate.
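To illustrate the userspace side being described (a plain example of the kevent(2) API, not code from this review), an unprivileged process can arm a periodic timer with a very short period, and the kernel rearms it on every expiration:

#include <sys/event.h>
#include <err.h>

int
main(void)
{
        struct kevent kev, ev;
        int kq;

        if ((kq = kqueue()) == -1)
                err(1, "kqueue");
        /* Periodic 1 us timer; it rearms itself after every expiration. */
        EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD, NOTE_USECONDS, 1, NULL);
        if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
                err(1, "kevent");
        for (;;) {
                if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
                        err(1, "kevent wait");
        }
}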
Note that softclock was already preempting realtime threads, as realtime threads are always below ithread priority. The only threads affected here are hardware ithreads running at priorities >= PI_AV. I did come up with a first cut of changes to do time-sharing of ithreads yesterday, though, which might help with that problem somewhat.
So I haven't done much exhaustive testing yet, but I have a first cut of ithread time-sharing for ULE available at https://github.com/freebsd/freebsd-src/compare/main...bsdjhb:swi_refactor
I'm actually able to provoke it in a VM by overscheduling the host (running a -j <numcpus> build on the host while the VM is running), though that is not the intended use case. The intention is for live-lock type cases as Drew had described in D29663.