
ule: Simplistic time-sharing for interrupt threads.
ClosedPublic

Authored by jhb on Jun 28 2022, 10:31 PM.
Details

Summary

If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.

Sponsored by: Netflix

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

jhb requested review of this revision. Jun 28 2022, 10:31 PM

Can we have some knob to disable this behavior, keeping ithreads at fixed priorities? Perhaps it has to be a tunable.

In D35644#808382, @kib wrote:

Can we have some knob to disable this behavior, keeping ithreads at fixed priorities? Perhaps it has to be a tunable.

Hmm, the later commits in the series that shrink the priority range make this harder to expose as a tunable. Also, I hope to eventually rework iflib to use regular ithreads in place of its internal cooperative time-sharing (which only works among iflib drivers, whereas this approach works across all ithreads, including between, say, nvme and a NIC).

sys/kern/sched_ule.c
2339

I suspect that the word 'idle' in the comment should be changed.

BTW, why do it here and not in sched_add(), as is done for normal timeshare threads?

sys/kern/sched_ule.c
302

Explicitly include counter.h?

2339

I wonder as well.

2599

I think this condition will always be true, tdq_lowpri includes the priority of the thread currently on the CPU.

2608

It'd be a bit nicer to balance the two cases:

if (PRI_BASE(...) == PRI_ITHD) {
    ...
} else {
    ...
}

and you can move the timeshare check above into the else case.

sys/kern/sched_ule.c
2339

Hmm, it should be "idle ithread".

I commented on the other review (which used sched_wakeup in the ithread code) as to my reasoning here. Not all sched_add() invocations for an ithread should reset its priority. If a spinning ithread (i.e. one that has run long enough to have its priority demoted) gets preempted, either via sched_clock() or when a new interrupt arrives, its priority should stay demoted when it is placed back on the run queue (which is a call to sched_add()). We only want to restore the priority to "normal" when the ithread goes idle.

I originally tried restoring the ithread's priority in kern_intr.c, etc., where sched_add() was called to resume an idle ithread, but it became gross as I needed a way to reset, e.g., ts_slice in ULE at those points. Calling sched_wakeup(), which already dealt with that, was simpler.

2599

Oh, humm, I had thought it was only the threads on the runq. Oh well, in that case it's fine for it to just always preempt which matches what I ended up doing for 4BSD.

2608

Ok. I had done that to minimize the diff, but I agree that is cleaner.

jhb marked an inline comment as done.
  • Address some review feedback.
jhb marked 2 inline comments as done. Jul 11 2022, 10:41 PM
sys/kern/sched_ule.c
2608

So rather than moving the earlier code into the else, I ended up refactoring a bit differently (in a way that ends up matching the 4BSD change more closely): the changes to ts_slice are no longer duplicated, and a wrapper function determines the value to compare ts_slice against, so the ts_slice code stays shared otherwise.

One observation is that ithreads are well suited to cooperative scheduling. If an ithread consumes its full slice, there is at least one natural preemption point at the top of its work loop. Setting td_owepreempt works, of course, but it seems to me that there's a good chance the thread will yield while holding locks.

sys/kern/sched_ule.c
2334

Do we need to call sched_interact_update() for non-timeshare threads?

3314

This perhaps belongs under kern.sched.stats. Right now that's only added if SCHED_STATS is configured, but we could define it unconditionally.

This revision is now accepted and ready to land. Jul 12 2022, 1:56 PM

One observation is that ithreads are well suited to cooperative scheduling. If an ithread consumes its full slice, there is at least one natural preemption point at the top of its work loop. Setting td_owepreempt works, of course, but it seems to me that there's a good chance the thread will yield while holding locks.

Well, except that the use case that needs this (livelock conditions) basically involves ithread handlers that run forever. We already do the "interrupt storm" protection thing in the main ithread loop but that doesn't kick in during livelock.

I had previously considered trying to have more explicit cooperative scheduling, but it required rather large changes in drivers and also relied on drivers being able to estimate their work well enough to know when yielding might make sense.

One thing we might consider is adding a sched_intr_yield() function that device drivers can call from their interrupt routines at a point where they hold no locks (e.g. after if_input in a NIC driver). If we notice an ithread doing that, we wait to preempt it at the next call to that function instead of forcing it from sched_clock.

A way that might work: calling the function would normally just set a flag in the ts noting that a yield was attempted. Then, when sched_clock wants to preempt, if it sees the "would yield" flag it doesn't force a preemption but instead sets a second "do yield" flag, and the next call to sched_intr_yield would then preempt. Setting the second flag would also clear the first flag, so that if the ithread keeps running without calling the new routine it would eventually get force-preempted (e.g. if it switched to a different handler due to a shared interrupt and the new handler didn't call the function). You'd also want to clear the first flag in sched_wakeup(), I think.

In D35644#812266, @jhb wrote:

One observation is that ithreads are well suited to cooperative scheduling. If an ithread consumes its full slice, there is at least one natural preemption point at the top of its work loop. Setting td_owepreempt works, of course, but it seems to me that there's a good chance the thread will yield while holding locks.

Well, except that the use case that needs this (livelock conditions) basically involves ithread handlers that run forever. We already do the "interrupt storm" protection thing in the main ithread loop but that doesn't kick in during livelock.

Right, the main ithread loop isn't the place to yield. It'd have to be handled in the consumer somehow. e.g., for netisr threads, you'd want to yield at the beginning of the loop which pulls an mbuf chain off a workqueue.

I had previously considered trying to have more explicit cooperative scheduling, but it required rather large changes in drivers and also relied on drivers being able to estimate their work well enough to know when yielding might make sense.

One thing we might consider is adding a sched_intr_yield() function that device drivers can call from their interrupt routines at a point where they hold no locks (e.g. after if_input in a NIC driver). If we notice an ithread doing that, we wait to preempt it at the next call to that function instead of forcing it from sched_clock.

A way that might work: calling the function would normally just set a flag in the ts noting that a yield was attempted. Then, when sched_clock wants to preempt, if it sees the "would yield" flag it doesn't force a preemption but instead sets a second "do yield" flag, and the next call to sched_intr_yield would then preempt. Setting the second flag would also clear the first flag, so that if the ithread keeps running without calling the new routine it would eventually get force-preempted (e.g. if it switched to a different handler due to a shared interrupt and the new handler didn't call the function). You'd also want to clear the first flag in sched_wakeup(), I think.

This is roughly what I had in mind: a mechanism similar to owepreempt, except with larger sections determined by the unit of work of the ithread. I'm not sure it's worth it, though, given that this mechanism is mainly intended to keep the system responsive in the face of a DoS.

sys/kern/sched_ule.c
2334

It doesn't hurt. sched_clock only calls it for PRI_CLASS_TIMESHARE but various other places call it always.

td_slptick should normally be zero for ithreads except for the busdma swi thread which uses plain wakeup/sleep.

3314

Ok. SCHED_STAT_DEFINE_VAR is rather weird, but I guess it can't use SYSCTL_ADD as it needs the runtime linker to resolve the dpcpu symbol?

Also, it seems like these stats should now just be reimplemented as counter_u64 instead of dpcpu vars.

sys/kern/sched_ule.c
3314

I punted on trying to redo sched_stats as counter_u64 (someone should do that someday), but I will probably reimplement these stats as SCHED_STATS instead.

This revision now requires review to proceed. Jul 13 2022, 11:09 PM
This revision is now accepted and ready to land. Jul 14 2022, 12:57 PM