Hmm, the later commits in the series that shrink the range make it harder to turn this into a tunable. Also, I hope to eventually rework iflib to use regular ithreads in place of its internal cooperative time-sharing (which only works among iflib drivers, whereas this approach works with all ithreads, including between, say, nvme and a NIC).

kib added inline comments. Jul 2 2022, 9:34 AM

sys/kern/sched_ule.c
2380	I suspect that the word 'idle' in the comment should be changed. BTW, why do it there and not in sched_add(), as is done for normal timeshare threads?

markj added inline comments. Jul 5 2022, 1:57 PM

sys/kern/sched_ule.c
318	Explicitly include counter.h?
2380	I wonder as well.
2641	I think this condition will always be true, tdq_lowpri includes the priority of the thread currently on the CPU.
2650	It'd be a bit nicer to balance the two cases: if (PRI_BASE(...) == PRI_ITHD) { ... } else { ... } and you can move the timeshare check above into the else case.

jhb added inline comments. Jul 6 2022, 6:35 PM

sys/kern/sched_ule.c
2380	Hmm, it should be "idle ithread". I commented on the other review (which used sched_wakeup in the ithread code) as to my reasoning here. Not all sched_add() invocations of an ithread should reset the priority. If a spinning ithread (i.e. one that is running long enough to have its priority demoted) gets preempted either via sched_clock() or when a new interrupt arrives, we should leave its priority demoted when it is placed on the run queue when it is preempted (which is a call to sched_add()). We only want to restore its priority back to "normal" when the ithread goes idle. I originally tried restoring the ithread's priority in kern_intr.c, etc. where sched_add() was called to resume an idle ithread, but it became gross as I needed a way to reset, e.g. 'ts_slice' in ULE at those points. Calling sched_wakeup() which already dealt with that was simpler.
2641	Oh, hmm, I had thought it was only the threads on the runq. Oh well, in that case it's fine for it to just always preempt, which matches what I ended up doing for 4BSD.
2650	Ok. I had done that to minimize the diff, but I agree that is cleaner.
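To make the line 2380 reasoning concrete, here is a minimal user-space sketch of the distinction between requeueing a preempted ithread and waking an idle one. It is only a model under stated assumptions: `struct ithread_model`, `model_clock()`, `model_add()`, `model_wakeup()`, and the slice and demotion constants are hypothetical names and values, not the code in this revision.

```c
/* Hypothetical stand-in for the scheduler state of one ithread. */
struct ithread_model {
	int base_pri;	/* normal ithread priority (lower is better) */
	int cur_pri;	/* current, possibly demoted, priority */
	int slice_used;	/* ticks consumed in the current slice */
};

/* Assumed values, chosen only for illustration. */
enum { MODEL_SLICE = 10, MODEL_DEMOTE_STEP = 4 };

/* Per-tick accounting: demote an ithread that uses up a full slice. */
static void
model_clock(struct ithread_model *it)
{
	if (++it->slice_used >= MODEL_SLICE) {
		it->cur_pri += MODEL_DEMOTE_STEP;
		it->slice_used = 0;
	}
}

/* Requeue after preemption: the demoted priority is left alone. */
static void
model_add(struct ithread_model *it)
{
	(void)it;
}

/* Wakeup of an idle ithread: restore priority and start a fresh slice. */
static void
model_wakeup(struct ithread_model *it)
{
	it->cur_pri = it->base_pri;
	it->slice_used = 0;
	model_add(it);
}
```

In this model a spinning ithread that is preempted and requeued via `model_add()` stays demoted, and only `model_wakeup()` brings it back to its normal priority.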

jhb edited the summary of this revision. Jul 11 2022, 10:40 PM

Address some review feedback.

Harbormaster completed remote builds in B46371: Diff 108040. Jul 11 2022, 10:40 PM

jhb marked 2 inline comments as done. Jul 11 2022, 10:41 PM

jhb added inline comments. Jul 11 2022, 10:44 PM

sys/kern/sched_ule.c
2650	So rather than moving the earlier code into the else, I ended up refactoring a bit differently (and in a way that ends up matching the 4BSD change more closely): I removed the duplication of the changes to td_slice and instead use a wrapper function to determine the value to compare ts_slice against, which keeps the ts_slice code otherwise shared.
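A rough user-space sketch of that wrapper idea, for readers following along. Everything here is an assumption made for illustration: the `model_*` names, the fixed slice for ithreads, and the load-scaled slice for other threads are hypothetical and not taken from the diff.

```c
#include <stdbool.h>

/* Hypothetical priority classes for the model. */
enum pri_class { MODEL_PRI_ITHD, MODEL_PRI_TIMESHARE };

struct thread_model {
	enum pri_class pri_class;
	int ts_slice;		/* ticks used in the current slice */
};

struct queue_model {
	int load;		/* runnable threads on this CPU */
};

/* Assumed values, chosen only for illustration. */
enum { MODEL_SCHED_SLICE = 10, MODEL_SLICE_MIN = 1 };

/* Assumption: the slice for timeshare threads shrinks with queue load. */
static int
model_queue_slice(const struct queue_model *q)
{
	int slice;

	if (q->load <= 1)
		return (MODEL_SCHED_SLICE);
	slice = MODEL_SCHED_SLICE / q->load;
	return (slice < MODEL_SLICE_MIN ? MODEL_SLICE_MIN : slice);
}

/* Wrapper: pick the limit that ts_slice is compared against. */
static int
model_slice_limit(const struct thread_model *td, const struct queue_model *q)
{
	if (td->pri_class == MODEL_PRI_ITHD)
		return (MODEL_SCHED_SLICE);
	return (model_queue_slice(q));
}

/* Shared tick path: one comparison serves both classes. */
static bool
model_tick(struct thread_model *td, const struct queue_model *q)
{
	if (++td->ts_slice < model_slice_limit(td, q))
		return (false);
	td->ts_slice = 0;
	return (true);		/* slice exhausted */
}
```

The point of the shape is that `model_tick()` contains no per-class branching; only the wrapper knows that the two classes use different limits.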

One observation is that ithreads are well suited to cooperative scheduling. If an ithread consumes its full slice, there is at least one natural preemption point at the top of its work loop. Setting td_owepreempt works, of course, but it seems to me that there's a good chance the thread will yield while holding locks.

sys/kern/sched_ule.c
2375	Do we need to call sched_interact_update() for non-timeshare threads?
3360	This perhaps belongs under kern.sched.stats. Right now that's only added if SCHED_STATS is configured, but we could define it unconditionally.

This revision is now accepted and ready to land. Jul 12 2022, 1:56 PM

In D35644#812036, @markj wrote:

One observation is that ithreads are well suited to cooperative scheduling. If an ithread consumes its full slice, there is at least one natural preemption point at the top of its work loop. Setting td_owepreempt works, of course, but it seems to me that there's a good chance the thread will yield while holding locks.

Well, except that the use case that needs this (livelock conditions) basically involves ithread handlers that run forever. We already do the "interrupt storm" protection thing in the main ithread loop but that doesn't kick in during livelock.

I had previously considered trying to have more explicit cooperative scheduling, but it required rather large changes in drivers and also relied on drivers being able to estimate work well enough to know when yielding might make sense. One thing we might consider is adding a sched_intr_yield() function that device drivers can call from their interrupt routines in a place where they don't hold locks (e.g. after if_input in a NIC driver). If we notice an ithread doing that, we would wait to preempt on its next call to that function instead of forcing the preemption from sched_clock. A way that might work is that calling that function would normally just set a flag in the ts noting that a yield was attempted. Then when sched_clock wants to preempt, if it sees the "would yield" flag it doesn't force a preemption but instead sets a second "do yield" flag, and the next call to sched_intr_yield would then preempt. Setting the second flag would also clear the first flag, so that if the ithread keeps running without calling the new routine it would eventually get force-preempted (e.g. if it switched to a different handler due to a shared interrupt and the new handler didn't call the function). You'd also want to clear the first flag in sched_wakeup(), I think.
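A minimal user-space sketch of the two-flag handshake described above, just to pin down the state transitions. This is an illustration under assumptions, not an existing API: `struct yield_model` and the `model_*` functions are hypothetical stand-ins for the proposed sched_intr_yield(), the sched_clock slice-expiry path, and sched_wakeup(). Each function returns true when the ithread would be preempted at that point.

```c
#include <stdbool.h>

/* Hypothetical per-thread flags, as they might live in the ts. */
struct yield_model {
	bool would_yield;	/* handler has called the yield hook */
	bool do_yield;		/* clock asked for a yield at the next hook */
};

/* Stand-in for sched_intr_yield(): called by a handler holding no locks. */
static bool
model_intr_yield(struct yield_model *ym)
{
	if (ym->do_yield) {
		ym->do_yield = false;
		return (true);		/* cooperative preemption here */
	}
	ym->would_yield = true;		/* just note that we are cooperative */
	return (false);
}

/* Stand-in for the point where sched_clock finds the slice used up. */
static bool
model_slice_expired(struct yield_model *ym)
{
	if (ym->would_yield) {
		/* Defer to the next yield hook; clear the first flag. */
		ym->would_yield = false;
		ym->do_yield = true;
		return (false);
	}
	/* No recent yield attempt: force the preemption. */
	ym->do_yield = false;
	return (true);
}

/* Stand-in for sched_wakeup(): forget yield attempts from the last run. */
static void
model_wakeup(struct yield_model *ym)
{
	ym->would_yield = false;
}
```

With these transitions a handler that keeps calling the hook is preempted at the hook, while a handler that stops calling it (for example after switching to a different handler on a shared interrupt) is force-preempted at the following slice expiry.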

In D35644#812266, @jhb wrote:

In D35644#812036, @markj wrote:

One observation is that ithreads are well suited to cooperative scheduling. If an ithread consumes its full slice, there is at least one natural preemption point at the top of its work loop. Setting td_owepreempt works, of course, but it seems to me that there's a good chance the thread will yield while holding locks.

Well, except that the use case that needs this (livelock conditions) basically involves ithread handlers that run forever. We already do the "interrupt storm" protection thing in the main ithread loop but that doesn't kick in during livelock.

Right, the main ithread loop isn't the place to yield. It'd have to be handled in the consumer somehow. For netisr threads, for example, you'd want to yield at the beginning of the loop that pulls an mbuf chain off a workqueue.

I had previously considered trying to have more explicit cooperative scheduling, but it required rather large changes in drivers and also relied on drivers being able to estimate work well enough to know when yielding might make sense. One thing we might consider is adding a sched_intr_yield() function that device drivers can call from their interrupt routines in a place where they don't hold locks (e.g. after if_input in a NIC driver). If we notice an ithread doing that, we would wait to preempt on its next call to that function instead of forcing the preemption from sched_clock. A way that might work is that calling that function would normally just set a flag in the ts noting that a yield was attempted. Then when sched_clock wants to preempt, if it sees the "would yield" flag it doesn't force a preemption but instead sets a second "do yield" flag, and the next call to sched_intr_yield would then preempt. Setting the second flag would also clear the first flag, so that if the ithread keeps running without calling the new routine it would eventually get force-preempted (e.g. if it switched to a different handler due to a shared interrupt and the new handler didn't call the function). You'd also want to clear the first flag in sched_wakeup(), I think.

This is roughly what I had in mind: a mechanism similar to owepreempt, except with larger sections determined by the unit of work of the ithread. I'm not sure it's worth it though, given that this mechanism is mainly intended to keep the system responsive in the face of a DoS.

jhb added inline comments. Jul 13 2022, 10:19 PM

sys/kern/sched_ule.c
2375	It doesn't hurt. sched_clock only calls it for PRI_CLASS_TIMESHARE but various other places call it always. td_slptick should normally be zero for ithreads except for the busdma swi thread which uses plain wakeup/sleep.
3360	Ok. SCHED_STAT_DEFINE_VAR is rather weird, but I guess it can't use SYSCTL_ADD as it needs the runtime linker to resolve the dpcpu symbol? Also, it seems like these stats should now just be reimplemented as counter_u64 instead of dpcpu vars.

jhb added inline comments. Jul 13 2022, 10:45 PM

sys/kern/sched_ule.c
3360	I punted on trying to redo sched_stats as counter_u64 (someone should do that someday), but I will probably reimplement these stats as SCHED_STATS instead.