Page MenuHomeFreeBSD

sort: Cache value of MB_CUR_MAX
ClosedPublic

Authored by cyril_freebsdfoundation.org on May 7 2021, 9:58 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Nov 6, 10:12 PM
Unknown Object (File)
Wed, Oct 16, 12:35 AM
Unknown Object (File)
Tue, Oct 15, 11:10 AM
Unknown Object (File)
Mon, Oct 14, 9:43 PM
Unknown Object (File)
Sat, Oct 12, 11:10 AM
Unknown Object (File)
Sat, Oct 12, 11:09 AM
Unknown Object (File)
Sat, Oct 12, 11:09 AM
Unknown Object (File)
Sat, Oct 12, 11:09 AM
Subscribers
None

Details

Summary

Every usage of MB_CUR_MAX results in a call to __mb_cur_max. This is inefficient and redundant. Caching the value of MB_CUR_MAX in a global variable removes these calls and speeds up the runtime of sort. For numeric sorting, runtime is almost halved. Helps with bug 255551.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

usr.bin/sort/sort.c
102

This isn't particularly obvious, but the type of MB_CUR_MAX is size_t, not int, so the type of the cached value should match.

1000

I believe the assignment should come after the call to set_locale(). I would suggest having the assignment occur at the end of that function.

Fixed issues mentioned by markj. mb_cur_max is now updated right after the call to setlocale(LC_ALL, "").

Is there a way we could make MB_CUR_MAX or __mb_cur_max suck less, in a more general way? Like, isn't it basically always four or one?

Nothing looks mechanically wrong with this patch, except that the solution is specialized to only sort(1).

In D30170#678138, @cem wrote:

Is there a way we could make MB_CUR_MAX or __mb_cur_max suck less, in a more general way? Like, isn't it basically always four or one?

Not sure if they're often used, but some locales defined in libc have a value of 2 or 3.

Nothing looks mechanically wrong with this patch, except that the solution is specialized to only sort(1).

___mb_cur_max() is pretty simple: it fetches the __mb_cur_max definition for the current thread's CTYPE locale. The cost comes from the TLS lookup. I agree it would be better if we had some optimization that applies to more uses. The regex code in libc uses in what look like commonly executed code.

Hmm. setlocale(3) specifies the locale for the entire process, and uselocale(3) overrides the selected locale for the current thread. Now that I look a bit more, we shouldn't be trying to load the current thread's locale at all, we should have __get_thread_locale == 0. Running sort under gdb confirms this, but the disasm of ___mb_cur_max() looks like the __tls_get_addr() call is unconditional:

Dump of assembler code for function ___mb_cur_max:
   0x000000080115e300 <+0>:     push   %rbp
   0x000000080115e301 <+1>:     mov    %rsp,%rbp
   0x000000080115e304 <+4>:     push   %rbx
   0x000000080115e305 <+5>:     push   %rax
   0x000000080115e306 <+6>:     mov    0x113fbb(%rip),%rbx        # 0x8012722c8 <-- load __has_thread_locale
   0x000000080115e30d <+13>:    data16 lea 0x113fa3(%rip),%rdi        # 0x8012722b8
   0x000000080115e315 <+21>:    data16 data16 rex.W call 0x8012654b0 <__tls_get_addr@plt>
   0x000000080115e31d <+29>:    mov    (%rax),%rax
   0x000000080115e320 <+32>:    test   %rax,%rax
   0x000000080115e323 <+35>:    mov    0x113e6e(%rip),%rcx        # 0x801272198
   0x000000080115e32a <+42>:    cmove  %rcx,%rax
   0x000000080115e32e <+46>:    cmpl   $0x0,(%rbx) <-- test __has_thread_locale
   0x000000080115e331 <+49>:    cmove  %rcx,%rax
   0x000000080115e335 <+53>:    mov    0x18(%rax),%rax
   0x000000080115e339 <+57>:    mov    0x70(%rax),%eax
   0x000000080115e33c <+60>:    add    $0x8,%rsp
   0x000000080115e340 <+64>:    pop    %rbx
   0x000000080115e341 <+65>:    pop    %rbp
   0x000000080115e342 <+66>:    ret

So clang appears to be pessimizing what should be the common case, by inserting an indirect function call whose result is unused...

So clang appears to be pessimizing what should be the common case, by inserting an indirect function call whose result is unused...

I haven't yet dug into why clang is making this tradeoff. Splitting the function into two indeed helps since we no longer call into rtld each time MB_CUR_MAX is loaded. But caching the value in sort itself helps further by avoiding calls into libc.so. So I think we should go ahead with this change and submit a PR to investigate the general problem further.

This revision is now accepted and ready to land.May 12 2021, 8:41 PM

But caching the value in sort itself helps further by avoiding calls into libc.so.

To be clear, I don't think we can really avoid that in general.

This revision was automatically updated to reflect the committed changes.