
Optimize curthread
AbandonedPublic

Authored by jtl on Oct 18 2018, 3:24 PM.

Details

Reviewers
jhb
kib
Summary

curthread is aliased to __curthread(), which is currently an inline function with the const attribute.

The const attribute should mean that the compiler emits only one call to __curthread() per function and caches the result. However, in practice, the compiler is emitting multiple calls (each of which involves a %gs-relative memory read) and even assumes the result can change between invocations.

This change makes __curthread() a real function. To avoid some of the potential pitfalls of making this a function, the function is defined with the no_caller_saved_registers attribute and the function is written in assembly to avoid unnecessarily storing/restoring the stack pointer.

The end result appears to be more compact assembly with fewer instructions overall.

NOTE: This is intended for head *after* stable/12 branches.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

I think this is a very questionable idea. In my experience, clang mostly does *not* re-read. With this change, you are adding a function call in all cases.

See vfs_ref/vfs_rel or fsetown as examples.

Can you show me a specific function which compiles to multiple reads of gs just to get curthread?

Also, the biggest culprits for kernel size are the inlined locking primitives (which I plan to deinline after 12, but I need to do more tests) and the hand-rolled asm ops which set/test. While clang does not support indicating the result in flags, this can be worked around by pretending cmpxchg and the others are C11 atomics.

In D17609#375841, @mjg wrote:

I think this is a very questionable idea. In my experience, clang mostly does *not* re-read. With this change, you are adding a function call in all cases.

See vfs_ref/vfs_rel or fsetown as examples.

Can you show me a specific function which compiles to multiple reads of gs just to get curthread?

epoch_enter_preempt_KBI() and spinlock_enter() both illustrate the problem. Examine the assembly before and after the change. In the epoch_enter_preempt_KBI case, the emitted assembly is ~10% smaller.

Also, the biggest culprits for kernel size are the inlined locking primitives (which I plan to deinline after 12, but I need to do more tests) and the hand-rolled asm ops which set/test. While clang does not support indicating the result in flags, this can be worked around by pretending cmpxchg and the others are C11 atomics.

I suspect one of the bigger culprits is actually excessive inlining. For example, clang doesn't actually inline the new epoch_enter_preempt() call in every case. Instead, it emits one copy per file. (Look at the address of the epoch_enter_preempt() call instructions in tcp_timer_discard() and tcp_input(). You'll see they call different copies of epoch_enter_preempt().)

In D17609#375881, @jtl wrote:

epoch_enter_preempt_KBI() and spinlock_enter() both illustrate the problem. Examine the assembly before and after the change. In the epoch_enter_preempt_KBI case, the emitted assembly is ~10% smaller.

Indeed. I suspect this is due to overly broad constraints/clobbers in the macro. I'll see about addressing that.

Note that the majority of the kernel does not suffer from the problem, and the proposed change would be a pessimization for those cases.

If it turns out the behaviour can't be helped, I think the way to go will be to explicitly pass curthread as an argument.

Also, the biggest culprits for kernel size are the inlined locking primitives (which I plan to deinline after 12, but I need to do more tests) and the hand-rolled asm ops which set/test. While clang does not support indicating the result in flags, this can be worked around by pretending cmpxchg and the others are C11 atomics.

I suspect one of the bigger culprits is actually excessive inlining. For example, clang doesn't actually inline the new epoch_enter_preempt() call in every case. Instead, it emits one copy per file. (Look at the address of the epoch_enter_preempt() call instructions in tcp_timer_discard() and tcp_input(). You'll see they call different copies of epoch_enter_preempt().)

If memory serves, not inlining mutexes shrinks the entire kernel by about 7%. Part of the bloat stems from the overly big cmpxchg inline asm; the other part is repeated tests of lockstat_enabled. I plan to create simple assembly funcs without the check and hotpatch them at runtime as lock profiling gets enabled/disabled.

So I am having trouble fighting off the compiler. OFFSETOF_CURTHREAD is 0, so as a sanity check I modified the macro to just hardcode it:

-       __asm("movq %%gs:%1,%0" : "=r" (td)
-           : "m" (*(char *)OFFSETOF_CURTHREAD));
+       __asm("movq %%gs:0,%0" : "=r" (td)
+           : : );

This stopped the re-reads.

The real fix is supposed to use the 'p' ("address operand") constraint, but using it somehow makes clang go through the stack. Probably something stupid is going on.

In the absolute worst case we can go for lame stringification of the OFFSETOF_* macros and just glue the offsets in as strings instead of providing them as input constraints.

Either way, with the above confirmed I don't think there is any need for a function.

I gave up on fighting this. Neither NetBSD nor DragonFly provides a better approach.

The following lame piece of crap does the job, and unless someone can provide a nicer macro with a short input argument, I think this is the way to go:

diff --git a/sys/amd64/include/pcpu.h b/sys/amd64/include/pcpu.h
index f1493176708a..60d680225d7c 100644
--- a/sys/amd64/include/pcpu.h
+++ b/sys/amd64/include/pcpu.h
@@ -227,8 +227,7 @@ __curthread(void)
 {
        struct thread *td;
 
-       __asm("movq %%gs:%1,%0" : "=r" (td)
-           : "m" (*(char *)OFFSETOF_CURTHREAD));
+       __asm("movq %%gs:" __XSTRING(OFFSETOF_CURTHREAD) ",%0" : "=r" (td) : );
        return (td);
 }
 #ifdef __clang__

Note there are other places which will need updating.

Welp, kind of embarrassing. Somewhat by default I did not even bother checking OpenBSD, but they are the ones who got it right:

diff --git a/sys/amd64/include/pcpu.h b/sys/amd64/include/pcpu.h
index f1493176708a..c5fda8526de5 100644
--- a/sys/amd64/include/pcpu.h
+++ b/sys/amd64/include/pcpu.h
@@ -227,8 +227,8 @@ __curthread(void)
 {
        struct thread *td;
 
-       __asm("movq %%gs:%1,%0" : "=r" (td)
-           : "m" (*(char *)OFFSETOF_CURTHREAD));
+       __asm("movq %%gs:%P1,%0" : "=r" (td)
+           : "n" (OFFSETOF_CURTHREAD));
        return (td);
 }
 #ifdef __clang__

The 'n' constraint means 'immediate integer with a known value'. I had tried it myself, but did not s/%1/%P1/. The above works like a charm and is my proposal.

An alternative, better version was implemented by @mjg in rS339449.