Page MenuHomeFreeBSD

Scheduler and Mutex micro optimizations
Needs ReviewPublic

Authored by lex.studio.base_gmail.com on Thu, Feb 19, 9:08 PM.

Details

Reviewers
kib
mav
olce
Summary

Scheduler and Mutex micro optimizations

This optimizations where made and tested with GENERIC configuration.

Test 1: 16 Threads

MetricOptimizedStandardDifference
Total Events850074811952+38122 events
Avg Latency (ms)2.262.36-0.10
95th Percentile (ms)16.2117.32-1.11
Events (avg/stddev)53129.6250/620.0250747.0000/584.61Better throughput

Test 2: 32 Threads

MetricOptimizedStandardDifference
Total Events506721500707+6014 events
Avg Latency (ms)7.587.67-0.09
95th Percentile (ms)62.1963.32-1.13
Events (avg/stddev)15835.0312/768.7015647.0938/714.52Better throughput

Test 3: 64 Threads

MetricOptimizedStandardDifference
Total Events595346583812+11534 events
Avg Latency (ms)12.8813.16-0.28
95th Percentile (ms)102.97106.75-3.78
Events (avg/stddev)9317.9062/408.559122.0625/342.80Better distribution

*the tests were made using sysbench threads --threads=N --time=120 run. The system was Virtual Box with Ryzen 7 4800H (16) and VirtualBox had access to all of the threads.
*FreeBSD 16.0

For more informations (or if i need to post another patch or something) email me at lex.studio.base@gmail.com

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

lex.studio.base_gmail.com created this object with edit policy "Administrators".

Could you please generate diff with full context, as described here and without separately attached patch? See https://wiki.freebsd.org/Phabricator.

pi changed the edit policy from "Administrators" to "committers (Project)".Sat, Feb 21, 11:45 AM

Being out-of-time (and off for the next week), I only reviewed changes to ULE, and the first one in this revision. But I'll also take a look at the rest when coming back.

General comments:

  • The inlining proposals I've reviewed in general look good, but I'm really unsure about their actual effect in the results you obtain.
  • Prefetching can be good, but it has to be done sufficiently early to be useful. There are several of occurrences here where it is useless for this reason. Even for prefetching changes I tagged with 'Fine.' (because they conceptually are), I'd like to know what is their practical effect in the results you're seeing.
  • __predict_false() and __predict_true() should not be used unless it's absolutely certain that almost all evaluations will be false or true and using them can be proved to have a positive practical effect. Branch predictors today (and that has been the case for more than a decade) are good enough to the point that any hint that does not meet these conditions is likely to be an overall pessimization.
  • There are gratuitous changes, such as comments removal (please never do that unless you have a good reason, and in some occurrences I reviewed, I don't see any good reason), or using the preprocessor where compilers have long been smart enough to do constant propagation and eliminate impossible branches, or changing the style again without good reason. Please refrain from doing that.
  • Benchmarking in a virtual machine can give you hints, but really is to be taken with a grain of salt, and final benchmarking should happen on a real machine.

It would be great if you can benchmark each change separately, or at least changes to each file separately, and for sched_ule.c, with additional runs where changes that I flagged as useless or likely to have no practical effect are removed and the others present or not.

sys/kern/kern_mutex.c
612–623

I think this reasoning and code is wrong.

First, the prefetch does not happen ahead of the actual use, so should not have any positive effect (actually, it could be a slight pessimization). This is also inconsistent with not prefetching when owner is changed, which anyway is the same case (and there's no need for prefetching). Second, you try to save one lv_mtx_owner(), which is only a single bitwise-and, at the expense of a branch prediction, which is much more costly even assuming the annotation __predict_false() is right most of the time (and it may be not for heavily contended mutexes).

All in all, I only see pessimizations here.

sys/kern/sched_ule.c
615

This one looks reasonable.

640

Reasonable too.

697

Feels reasonable too. clang sometimes makes inlining decisions that can be weird. Inlining this function may (didn't actually check) have the "drawback" that tdq_choose() will stop being inlined in tdq_setlowpri(), but I think there's indeed more potential synergies from inlining the tdq_setowpri() code in callers.

792–811

Fine.

884–897

Fine.

1117–1124

Please do not remove comments unless you have a good reason to (I see none here).

1317–1323

Fine.

1775–1787

We are not in the business of only optimizing for CPU-heavy threads. Most threads on a machine are quiescent and run infrequently, and you'll be penalizing them.

Anyway, I doubt this has any practical effect. See also general comment about __predict_true() and __predict_false().

1840–1844

Same here.

1906–1907

This one seems to make sense.

1929–1930

Gratuitous change. Compilers have been smart enough to do that for a very long time.

1934–1947

This change is OK, but not for the justification you're giving, which is most probably wrong.

Indeed, inverting * and / should not change anything in terms of assembly instructions used, as these are unsigned integers, and the compiler will use a shift instead of a multiply by 4 in both cases.

IMO, the actual reason to do this change is that, given the * 4 does not overflow, accuracy of the result will be better.

2038–2063

Yes, this function was designed exactly for that (nothing changes if lu_span is 0).

It's a rare case, except if a thread is constantly doing a syscall that needs waiting in the kernel (I/O mostly), in which case really that should not matter in relative real thread time.

So I don't expect this to have a visible practical effect.

2529–2534

Useless prefetch (too close to use).

2641

Most probably wrong prediction.

See also general comment.

2884–2890

sched_pctcpu_update() must be call every tick, given the current decaying mechanism is sensitive to the timing of the calls. So this change is not acceptable.

3121–3127

Gratuitous. This slightly obfuscates the initial intent with no practical difference in produced assembly.