Page MenuHomeFreeBSD

Make ULE and 4BSD coexists
Needs ReviewPublic

Authored by kib on Thu, Jan 22, 7:13 PM.
Tags
None
Referenced Files
F142860055: D54831.id170274.diff
Sat, Jan 24, 6:03 AM
F142860053: D54831.id170297.diff
Sat, Jan 24, 6:03 AM
F142860044: D54831.id170267.diff
Sat, Jan 24, 6:03 AM
F142859931: D54831.diff
Sat, Jan 24, 6:02 AM
Unknown Object (File)
Fri, Jan 23, 4:19 AM
Unknown Object (File)
Fri, Jan 23, 2:46 AM
Unknown Object (File)
Fri, Jan 23, 12:07 AM
Unknown Object (File)
Thu, Jan 22, 10:20 PM

Details

Reviewers
shurd
olce
andrew
manu
Group Reviewers
iflib
Summary

Make both our schedulers configurable into the same kernel image. Add tunable kern.sched.name to select the scheduler on boot (ULE or 4BSD).

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision.Thu, Jan 22, 7:13 PM

I only tried to boot into ULE so far.
There are two MD places that need to be handled for each arch, but I only did that for amd64 right now:

  • cpu_switch() made unconditionally wait for blocked thread lock unblock
  • sched_instance_select() must be called before ifuncs are resolved.

As a consequence, armv7 is broken in principle until it implements ifuncs for kernel. In the recent tradition, I do not consider this a hard block from committing the change.

@olce and I are working for the same goal but with different approach. Instead of putting them both in kernel binary, our approach is to make sched_4bsd as a kernel module which overrides sched_ule's weak symbol on boot. This gives us two advantages:

  1. Kernel binary size stays the same (but letting them coexisting won't have significant impact on kernel size so this is not a huge advantage)
  2. We can let users to create third-party scheduler as kernel modules. This way, they can run their own scheduler by just building their kernel module without buildkernel and installworld .

The first step of this approach is D54830 which I opened a few hours ago.

sys/amd64/amd64/cpu_switch.S
495

This should be done for other arch's swtch.S as well.

kib marked an inline comment as done.Thu, Jan 22, 8:32 PM

@olce and I are working for the same goal but with different approach. Instead of putting them both in kernel binary, our approach is to make sched_4bsd as a kernel module which overrides sched_ule's weak symbol on boot. This gives us two advantages:

  1. Kernel binary size stays the same (but letting them coexisting won't have significant impact on kernel size so this is not a huge advantage)
  2. We can let users to create third-party scheduler as kernel modules. This way, they can run their own scheduler by just building their kernel module without buildkernel and installworld .

The first step of this approach is D54830 which I opened a few hours ago.

Weak symbols do not work exactly this way generally, and esp. in kernel. Anyway, you would not get what you want without significantly hacking either very early linker or even the loader.
I am slightly curious how far this would lead you.

sys/amd64/amd64/cpu_switch.S
495

Did you read what I wrote in the first comment?

In D54831#1253386, @kib wrote:

@olce and I are working for the same goal but with different approach. Instead of putting them both in kernel binary, our approach is to make sched_4bsd as a kernel module which overrides sched_ule's weak symbol on boot. This gives us two advantages:

  1. Kernel binary size stays the same (but letting them coexisting won't have significant impact on kernel size so this is not a huge advantage)
  2. We can let users to create third-party scheduler as kernel modules. This way, they can run their own scheduler by just building their kernel module without buildkernel and installworld .

The first step of this approach is D54830 which I opened a few hours ago.

Weak symbols do not work exactly this way generally, and esp. in kernel. Anyway, you would not get what you want without significantly hacking either very early linker or even the loader.
I am slightly curious how far this would lead you.

I think I was trying to tackle that idea without deeper understanding in kernel linker. I'll do more research on this. But at least I want hear opinions on loading scheduler as kernel module.

sys/amd64/amd64/cpu_switch.S
495

Yes. What I'm trying to say is

  1. I would prefer doing this for all arch at once instead of just doing it for amd64. Or at least split this into two commits where the second one handles MD part specifically (e.g. amd64: ....). So far I don't see any commits under commits tab.
  2. You said armv7 is broken in principle due to this patch. I didn't know the recent tradition says we allow commits that breaks build or run. Even recently, I heard from other committers that we still want all commits to be buildable and runnable. Otherwise it makes bisect harder. Maybe I'm misunderstanding, so please correct me if I'm wrong.

Side note: In CI team, we are planning to introduce something like merge queue where commits should pass CI tests (or at least build) before being merged into main branch. This is not confirmed, but the idea of commits should be buildable still remains.

Minsoo beat me to it.

I suggested weak symbols, but indeed anticipating that some work in the kernel linker was needed. If you think that this is a lot of work, perhaps we could start with your approach instead.

The nice thing here is that there's no runtime overhead thanks to ifuncs (that is not an advantage with respect to the other approach, but clears the concern about the overhead of dynamic dispatch).

A not so nice thing is that we are now exporting sysctl knobs for both schedulers, although only one is really in use. That's quite easily fixable.

Also, both for clarity and backwards compatibility, it might be better to keep the knobs hierarchy as it is now, i.e., without the additional node named after the specific scheduler.

I'd like also that we investigate why the loop on the blocked lock was elided for 4BSD.

kib marked an inline comment as done.Thu, Jan 22, 9:07 PM
In D54831#1253386, @kib wrote:

@olce and I are working for the same goal but with different approach. Instead of putting them both in kernel binary, our approach is to make sched_4bsd as a kernel module which overrides sched_ule's weak symbol on boot. This gives us two advantages:

  1. Kernel binary size stays the same (but letting them coexisting won't have significant impact on kernel size so this is not a huge advantage)
  2. We can let users to create third-party scheduler as kernel modules. This way, they can run their own scheduler by just building their kernel module without buildkernel and installworld .

The first step of this approach is D54830 which I opened a few hours ago.

Weak symbols do not work exactly this way generally, and esp. in kernel. Anyway, you would not get what you want without significantly hacking either very early linker or even the loader.
I am slightly curious how far this would lead you.

I think I was trying to tackle that idea without deeper understanding in kernel linker. I'll do more research on this. But at least I want hear opinions on loading scheduler as kernel module.

It is possible, but it requires work to allow kernel to reference modules symbols. We currently cannot do this at all.
Also, there is no symbol preemption AKA interposing for kernel or modules, not even GOT is there.

For weak symbols, the only requirements are really that linker allows non-resolved weak references, and that two symbols with the same name, one of which is weak, do not conflict for static linker. We somewhat abused weak symbols in the runtime loader(s) so that e.g. I have to introduce LD_DYNAMIC_WEAK by request of a sanitizers' developer. Unfortunately, the special weak treatment cannot be removed since it would break the ABI, but at least we should not introduce new cases.

sys/amd64/amd64/cpu_switch.S
495

This is a diff against the branch, which contains around 20 commits.

I do not see a reason to spend time on machines I cannot even test, until the idea of this change is circulated in public and get accepted enough for the MD work to make sense.

For armv7, the arch maintainers or people who want the platform to stay alive, need to finally implement ifuncs for the kernel. It is not much work, much less than for userspace (where it is not too hard as well).

kib marked an inline comment as done.Thu, Jan 22, 9:07 PM

I'd like also that we investigate why the loop on the blocked lock was elided for 4BSD.

Simply because threads are never blocked because there is only one global scheduler lock on 4BSD instead of per-runq lock on ULE. So on ULE you are not guaranteed that the sched_switch() finished with removing the thread from CPU while cpu_switch() already tries to switch to the new context. On 4BSD it is impossible (global lock).

Checking the blocked state on 4BSD would give 4 instructions overhead which I dismiss with prejudice.

In D54831#1253408, @kib wrote:

I'd like also that we investigate why the loop on the blocked lock was elided for 4BSD.

Simply because threads are never blocked because there is only one global scheduler lock on 4BSD instead of per-runq lock on ULE. So on ULE you are not guaranteed that the sched_switch() finished with removing the thread from CPU while cpu_switch() already tries to switch to the new context. The blocked state makes this race closed without the need to somehow lock several runqs. On 4BSD it is impossible (global lock).

Checking the blocked state on 4BSD would give 4 instructions overhead which I dismiss with prejudice.

sys/amd64/amd64/cpu_switch.S
495

Make sense. We can ask other arch developers for testing and once they report success (hopefully with additional investigation in the MD code) it can be merged. There are many voices in the mailing list who want to keep support for armv7, but I'm not sure how many of them will actually test this and do the ifunc implementation.

Since the idea of scheduler as kernel module takes time for more investigation, I'm fine with this approach. Once merged, I'll ask 4BSD users on the mailing list to benchmark 4BSD vs ULE.

sys/kern/sched_ule.c
3323

__unused?

kib marked an inline comment as done.Thu, Jan 22, 10:01 PM
kib added inline comments.
sys/kern/sched_ule.c
3323

What do you mean? It is used in the sense that its pointer is taken by the slots table.

sys/kern/sched_ule.c
3323

I mean the parameter. Is it considered as used if the function pointer is stored in the slots table?

kib marked 2 inline comments as done.Thu, Jan 22, 10:43 PM
kib added inline comments.
sys/kern/sched_ule.c
3323

Well, there is no compiler' complain, so why add this?

sys/kern/sched_ule.c
3323

Interesting... I see __unused everywhere in sys so I thought compiler was strict on this. But if not this should be fine...

In D54831#1253408, @kib wrote:

I'd like also that we investigate why the loop on the blocked lock was elided for 4BSD.

Simply because threads are never blocked because there is only one global scheduler lock on 4BSD instead of per-runq lock on ULE. So on ULE you are not guaranteed that the sched_switch() finished with removing the thread from CPU while cpu_switch() already tries to switch to the new context. The blocked state makes this race closed without the need to somehow lock several runqs. On 4BSD it is impossible (global lock).

I well know what it is used for in ULE. Was not sure for 4BSD as it uses the block lock as well. After a check, that is just to release sleepqueues et alter's locks earlier (as ULE also does), and indeed the new selected thread cannot have the block lock.

Checking the blocked state on 4BSD would give 4 instructions overhead which I dismiss with prejudice.

And rightly so, as it's not going to make any noticeable performance difference.

sys/amd64/amd64/cpu_switch.S
495

Since the idea of scheduler as kernel module takes time for more investigation, I'm fine with this approach. Once merged, I'll ask 4BSD users on the mailing list to benchmark 4BSD vs ULE.

Please coordinate at least with me before sending anything. We have to be careful in the choice of words, since the really important thing is that users do these tests after my ULE fixes. If people want to test 4BSD vs. ULE in the meantime, that's of course fine, but we don't want to press them to do that.

kib marked an inline comment as done.Fri, Jan 23, 3:19 PM
In D54831#1253408, @kib wrote:

I'd like also that we investigate why the loop on the blocked lock was elided for 4BSD.

Simply because threads are never blocked because there is only one global scheduler lock on 4BSD instead of per-runq lock on ULE. So on ULE you are not guaranteed that the sched_switch() finished with removing the thread from CPU while cpu_switch() already tries to switch to the new context. The blocked state makes this race closed without the need to somehow lock several runqs. On 4BSD it is impossible (global lock).

I well know what it is used for in ULE. Was not sure for 4BSD as it uses the block lock as well. After a check, that is just to release sleepqueues et alter's locks earlier (as ULE also does), and indeed the new selected thread cannot have the block lock.

I do not see what the code fragment in cpu_switch.S has to do with sleepqueue. It can only observe either runq lock, or blocked lock, if ever.
But anyway, the outcome is same.

Checking the blocked state on 4BSD would give 4 instructions overhead which I dismiss with prejudice.

And rightly so, as it's not going to make any noticeable performance difference.

So we agreed, I believe.

Make kern.sched sysctls working when 4BSD is selected:
sysctl kern.sched.ule.topology_spec: allow to run if ULE is not initialized
sched_shim: restore kern.ccpu sysctl

It is apparently should be considered part of the ABI, and is used by
the base top(1).  But do not declare the ccpu variable in headers, it is
needed only by 4bsd. So put the variable definition into sched_shim.c to
make the kernel buildable without SCHED_4BSD.
In D54831#1253557, @kib wrote:

I do not see what the code fragment in cpu_switch.S has to do with sleepqueue. It can only observe either runq lock, or blocked lock, if ever.

I was talking about the use of the block lock in sched_4bsd.c.

So we agreed, I believe.

Sure. I just wanted to understand why the code elision for 4BSD, in case it had been a mistake, in addition to making sure executing it now does not change anything functional.

sys/amd64/amd64/cpu_switch.S
495

Got that.

Handle all arches for cpu_switch.S.

Globally enable both schedulers for LINT.
Enable both schedulers for GENERIC on amd64.

I believe this variant of the branch is more or less complete, modulo the feedback.

kib marked 2 inline comments as done.Fri, Jan 23, 4:18 PM
sys/conf/NOTES
214

I have no idea what this character does. My pending review removes this as a cleanup but it will take some time to land. Could you check if this is a typo?

kib marked an inline comment as done.Fri, Jan 23, 4:37 PM
kib added inline comments.
sys/conf/NOTES
214

This is ASCII 'form feed' character. Sometimes referenced as 'next page'. It skips page for printers, and some editors might use it to split the view.

Why removing something that you do not understand?

sys/conf/NOTES
214

Sorry. There was no documentation or comment on that and different editors (phab, vim, github) displayed it differently, so I was confused.

In D54831#1253559, @kib wrote:

Make kern.sched sysctls working when 4BSD is selected:

I do not see any new change related to that (compared to the previous diff)?

sysctl kern.sched.ule.topology_spec: allow to run if ULE is not initialized

Mmm... That kind of problem would be automatically avoided if not exporting settings for both 4BSD and ULE, which also changes the sysctl hierarchy.

Could we just have the same kern.sched as before with all applicable settings of the current scheduler (and not showing those that are not applicable), without the intermediate ule or 4bsd? E.g., by converting all the static SYSCTL_*() macros into SYSCTL_ADD_*() ones in the particular sched_setup() functions.

Once done, the kern.sched.instance_name can just be removed (there is already kern.sched.name). kern.sched.available can of course be kept.

sched_shim: restore kern.ccpu sysctl

(It is used by ps(1) as well.)

Other points:

  1. I find having the sched_ule_ and sched_4bsd_ prefixes of function names a bit ugly. Could we have a macro similar to SLOT() for these, so that we would write, e.g.:
...
static void
SCHED(fork)(struct thread *td, struct thread *childtd)
...

instead of:

...
static void
sched_4bsd_fork(struct thread *td, struct thread *childtd)
...
  1. Please see inline comments.
  2. sswitch is unfortunate, but I agree bypasses are not really worth the trouble. :-)
sys/kern/sched_shim.c
254

I'd use kern.sched.name here, to match the kern.sched.name knob (see herald comment).

256

There's the problem here that 4BSD is not the default when it is the only one compiled-in.

Could you change the logic so that, if there's only one scheduler, it is automatically selected?

Additionally, if an explicit scheduler has been passed and is not available, then what about printing a warning and continuing running with some arbitrary one?

kib marked 3 inline comments as done.Fri, Jan 23, 5:26 PM
In D54831#1253559, @kib wrote:

Make kern.sched sysctls working when 4BSD is selected:

I do not see any new change related to that (compared to the previous diff)?

The KASSERT() in sysctl_kern_sched_ule_topology_spec() was changed to if(), so it avoids a panic or UB when ULE is not initialized.

sysctl kern.sched.ule.topology_spec: allow to run if ULE is not initialized

Mmm... That kind of problem would be automatically avoided if not exporting settings for both 4BSD and ULE, which also changes the sysctl hierarchy.

I want users that tweak schedulers, to put any schedulers' tweaks into their /etc/sysctl.conf, without causing errors if some of the scheduler is not enabled or not configured.
Such users must be intelligent enough to understand that settings for non-active schedulers are not active.
Also I want to be very clear which setting belong to which scheduler.
So I prefer the new scheme.

Could we just have the same kern.sched as before with all applicable settings of the current scheduler (and not showing those that are not applicable), without the intermediate ule or 4bsd? E.g., by converting all the static SYSCTL_*() macros into SYSCTL_ADD_*() ones in the particular sched_setup() functions.

See above, I want this to be explicit and always available.

Once done, the kern.sched.instance_name can just be removed (there is already kern.sched.name). kern.sched.available can of course be kept.

I renamed the tunable to kern.sched.name.
I am not sure what do you mean by kern.sched.instance_name, I cannot find such thing.

sched_shim: restore kern.ccpu sysctl

(It is used by ps(1) as well.)

Ok, so restored already.

Other points:

  1. I find having the sched_ule_ and sched_4bsd_ prefixes of function names a bit ugly. Could we have a macro similar to SLOT() for these, so that we would write, e.g.:
...
static void
SCHED(fork)(struct thread *td, struct thread *childtd)
...

instead of:

...
static void
sched_4bsd_fork(struct thread *td, struct thread *childtd)
...

I do not like it, It adds a level of obfuscation and I do not see any advantages. I been there with the i386 pmap variants, but for pmap it at least was needed. It is still ugly.

  1. Please see inline comments.
  2. sswitch is unfortunate, but I agree bypasses are not really worth the trouble. :-)

Add sched_instance_select() call to all arches.
Rename the tunable to kern.sched.name.
Automatically fall back to some scheduler if the named one is not found, and there is one.

Apparently, both armv7 and riscv lack ifunc for kernel.

In D54831#1253606, @kib wrote:

Mmm... That kind of problem would be automatically avoided if not exporting settings for both 4BSD and ULE, which also changes the sysctl hierarchy.

I want users that tweak schedulers, to put any schedulers' tweaks into their /etc/sysctl.conf, without causing errors if some of the scheduler is not enabled or not configured.
Such users must be intelligent enough to understand that settings for non-active schedulers are not active.
Also I want to be very clear which setting belong to which scheduler.
So I prefer the new scheme.

I see. I don't like it very much as it is slightly confusing for regular users and breaking the existing interface, but OK.

I am not sure what do you mean by kern.sched.instance_name, I cannot find such thing.

Indeed, I meant kern.sched.sched_instance_name. Could you rename it to simply kern.sched.name?

While here, I'd also rename the sched_instance_name variable to just sched_name.

I do not like it, It adds a level of obfuscation and I do not see any advantages. I been there with the i386 pmap variants, but for pmap it at least was needed. It is still ugly.

Still ugly, yes, but less ugly IMHO. Anyway.

In D54831#1253606, @kib wrote:

I am not sure what do you mean by kern.sched.instance_name, I cannot find such thing.

Indeed, I meant kern.sched.sched_instance_name. Could you rename it to simply kern.sched.name?

I still do not understand. There is no such sysctl, and I do not remember introducing it. On the patched kernel,

root@:/ # sysctl kern.sched | grep instance
root@:/ #

While here, I'd also rename the sched_instance_name variable to just sched_name.

ok.

Rename sched_instance_name variable to sched_name.

Looks good.

You're better placed than me to know the policy about arm and riscv and if this can be committed even before ifuncs are developed for them.

In D54831#1253638, @kib wrote:

I still do not understand. There is no such sysctl, and I do not remember introducing it. On the patched kernel,

My bad, I misread twice, the sysctl is already named kern.sched.name.

armv7 and riscv64 are tier 2, breaking them is not permitted. I have an idea for how to make them work though, without being too invasive.

I think this warrants a "Relnotes: yes" tag.

armv7 and riscv64 are tier 2, breaking them is not permitted. I have an idea for how to make them work though, without being too invasive.

ifunc can be emulated with the function pointer. This is of course breaks the microoptimizations real ifunc provides, but should keep the arches going until maintainers finally implement the proper solution.
I do not want to code that.

In D54831#1253758, @kib wrote:

armv7 and riscv64 are tier 2, breaking them is not permitted. I have an idea for how to make them work though, without being too invasive.

ifunc can be emulated with the function pointer. This is of course breaks the microoptimizations real ifunc provides, but should keep the arches going until maintainers finally implement the proper solution.
I do not want to code that.

I was going to make the shims actual shims rather than ifuncs, yes. I didn't implement kernel ifuncs for riscv when I did userspace because there was no use for them at the time, unlike userspace where they had been requested. But given armv7 needs support too and AFAIR is the only arch to not even have support in rtld you'd still need such a workaround. I plan to write the shim version later tonight.

Add function-pointers based workaround for risvc and arm.
Hopefully at least riscv would grow ifuncs in some time frame.

For now such arches are marked with __DO_NOT_HAVE_SYS_IFUNC.

https://reviews.freebsd.org/P685 is a proposal for how to drastically reduce the amount of boilerplate here (even compared with the ifunc-only version). The wrapper vs ifunc difference is mostly hidden, with the only way it leaks out being that the macro needs to take argument types and names separately so it can use just the names in the call for the wrapper variant. That could be simplified back down in a future ifunc-only version. Built and test-booted arm64 GENERIC and riscv64 QEMU (with other patches to deal with the other issues here).

sys/kern/sched_4bsd.c
1610

This has been unused since c72188d85a793c7610208beafb83af544de6e3b7, no need to keep it and add a stub for ULE

1801

This and the corresponding ULE change don't build on non-KTR kernels, which is most of them (powerpc's DPAA is the only non-LINT one), since ts_name isn't defined in the corresponding struct for !KTR. Either the slot and shim need to be conditional too or it should be made a stub in each for !KTR (and the callers can drop the #ifdef). Did you test build this at all?

sys/kern/sched_shim.c
374
380
527–538

This currently matches if one is a prefix of the other, since the loop terminates as soon as you hit the terminator of one string and treats it as a match. This is one way of writing a fixed version (I believe), based on lib/libc/string/strcmp.c's structure.

kib marked 5 inline comments as done.

Take the DEFINE_SHIM() proposal.
Fix inlined strcmp().
Also hopefully fix compilation issues, but I only started tinderbox.

sys/kern/sched_4bsd.c
1742

This one's still missing a KTR #ifdef, no?

sys/kern/sched_4bsd.c
1801

I built and run-time tested by debug config on amd64.
I started tinderbox after posting the patch, and now restarted it for the latest update.

The patch is still in the state of discussion, so I did not see it reasonable to put that resources on it.
Thank you for taking a try and finding the issues.

kib marked an inline comment as done.

One more #ifdef KTR for 4bsd.

Deduplicate sched stats, sdt probes, and kdtrace hooks helper vars.
At least amd64 GENERIC and LINT build.