Page MenuHomeFreeBSD

lib/libc/amd64: add archlevel-based simd dispatch framework
ClosedPublic

Authored by fuz on Jun 21 2023, 3:27 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Sep 5, 11:33 AM
Unknown Object (File)
Wed, Sep 4, 6:33 PM
Unknown Object (File)
Wed, Sep 4, 1:31 PM
Unknown Object (File)
Tue, Sep 3, 9:28 AM
Unknown Object (File)
Sat, Aug 24, 10:22 AM
Unknown Object (File)
Mon, Aug 19, 11:12 AM
Unknown Object (File)
Sat, Aug 17, 9:50 PM
Unknown Object (File)
Thu, Aug 15, 2:40 PM

Details

Summary

Add a framework for selecting from one of multiple implementations
of a function based on amd64 architecture level (cf. amd64 SysV
ABI supplement). As a sample use, a set of optimised strlen
functions is implemented.

Documentation is provided in the new manual page simd(7).

Open questions:

  • Should we provide a way to override the detected architecture level from within a program using libc? This could e.g. be done by defining a weak symbol that can be overridden to supply a custom archlevel limit.
  • kib@ mentioned that using SIMD inside the libc may lead to worse performance for applications that usually do not use SIMD due to lazy state initialisation. Unclear if this still applies as the C compiler has started to use SIMD instructions in ordinary code anyway.
  • Should we provide a utility to print the current architecture level?

Sponsored by: FreeBSD Foundation

Test Plan

libc test suite passes. Test suite needs to be expanded so it tests all
architecture levels of each SIMD-enhanced function that can be run on
the test machine.

For benchmarks, see https://github.com/clausecker/strperf

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
libexec/rtld-elf/rtld-libc/Makefile.inc
41 ↗(On Diff #124029)

I do not see why limit this to amd64.

fuz planned changes to this revision.Jun 30 2023, 4:04 PM
fuz added inline comments.
lib/libc/amd64/amd64_archlevel.h
50 ↗(On Diff #124029)

The q suffix is not required as we have a register operand.

88 ↗(On Diff #124029)

I'm not sure how there is any “pollution” when these macros logically belong here. This isn't even an external header, it's private to the amd64 machdep bits of libc. Are you sure I should spread this feature over yet another file instead of keeping related bits in one file?

I try to avoid assembly macros as they are harder to introspect when things go wrong. In this particular case, some of the macros cannot be turned into assembly macros as they make use of CPP macros (e.g. ARCHEND and ARCHRESOLVE). So I could only turn some into asm macros, which would yield an inconsistency for little to no benefit.

lib/libc/amd64/string/amd64_archlevel.c
231 ↗(On Diff #124029)

Will change to <machine/atomic.h>.

Do I understand you correctly that you do not want me to add defensive code for when architecture level comes to a different conclusion in different threads? It should just misbehave and select different architecture levels? There's a difference between supporting something and adding defensive code just in case (the latter is intended). The performance impact of the atomic instruction is really insignificant; it only runs once per process in the usual case.

libexec/rtld-elf/rtld-libc/Makefile.inc
41 ↗(On Diff #124029)

I want these changes to be minimal as I do not have a way to test this on other architectures. Chances are some libc warnings that do not occur on amd64 will break the build. Also, we currently only have the ifunc dispatch logic on amd64, so other architectures can use machdep string functions just fine (except perhaps ppc64, but I don't understand how that's handled right now).

apply kib@'s feedback to archlevel functions

lib/libc/amd64/amd64_archlevel.h
50 ↗(On Diff #124029)

It is not required but commonly used for consistency, so I request you to do the same.

88 ↗(On Diff #124029)

Archlevel definitions might be useful in other places.

Also, I suggest to make the bitmasks for features for CPUID_XXX/CPUID2_XXX etc symbols. Even more, I want the symbols for higher levels to be defined by lower level symbols, so it is obvious that level n+1 extends level n.

The last, the header would go somewhere into sys/amd64/include and installed.

lib/libc/amd64/string/amd64_archlevel.c
231 ↗(On Diff #124029)

I believe I already expressed my PoV in the previous round of discussion, please look back in the comments history.

libexec/rtld-elf/rtld-libc/Makefile.inc
41 ↗(On Diff #124029)

I want rtld to be as similar for all arches as possible. Esp. not due to such trivial reason.

Compiler warnings are catched with make tinderbox, this cannot be an argument.

I noted $elsewhere that preferable way would be to export routines pick by the kernel in the shared page. The upshot is that jails running different versions (postdating the general change) will get updates transparently and even share some i-cache associated with it, but this is not a hard requirement.

lib/libc/amd64/string/amd64_archlevel.c
232 ↗(On Diff #124032)

this routine should only get called once and even then only in a single-threaded environment, so i don't understand what's up with the atomic to begin with

lib/libc/amd64/string/strlen.S
3

stock file is entirely mine, just relicense to whatever and it will be fine

lib/libc/amd64/string/strlen.S
3

I'd omit the copyright in this case and have the foundation update its policies to reflect project practices

share/man/man7/simd.7
6 ↗(On Diff #124032)

I'd strongly suggest using the new policy of just having the spdx license expression lines here and for other new files.

179 ↗(On Diff #124032)

I don't think you need the arch history

lib/libc/amd64/string/amd64_archlevel.c
232 ↗(On Diff #124032)

I object not to atomic as is, but to the idea that we make active steps to handle unsupported situation.

lib/libc/amd64/string/strlen.S
3

I am instructed by the FreeBSD Foundation contract to put a copyright notice like this on existing files I modify. I have asked for clarification on how to proceed with public domain files but did not receive any guidance yet. Perhaps @jrm could help?

Here is an excerpt from our #bsdmips@efnet conversation on 2023-06-09.

[2023-06-09 15:01] <bsdimp_> fuz if it is public domain, you can copyright it and license it how you want. However, you should add a note saying that it was derived from material created by author X and donated to the public domain
[2023-06-09 15:03] <bsdimp_> since the public domain dedication may or may not be sufficient. If we document the intent and who did it, this will insulate ourselves from future claims against us by someone else that claims copyright over the now quite similar material and from a change of heart by author X.

[2023-06-09 19:11] <jrm> fuz: What bsdimp says seems reasonable (caveat IANAL), but let's get confirmation from emaste_. It might be decent to give mjg@ a heads-up.

Given @mjg's comment, is this now a closed case?

lib/libc/amd64/string/strlen.S
3

Yea. The missing context that changed my answer is that in the past any pd code in the tree stays pd unless the original author consents to a change. The foundation's instructions to authors likely should be updated for this admittedly unusual case. The vast majority of public domain code in the tree can't be copyrighted though.

But if mjd is happy with what you want to do... then that's covered by our ask first tradition...

The answer you quoted was from a hypothetical that was too abstract. Just for future reference I huess

fuz marked 7 inline comments as done.Jul 15 2023, 9:50 PM
fuz added inline comments.
lib/libc/amd64/amd64_archlevel.h
88 ↗(On Diff #124029)

I do not plan to expose this mechanism to libc consumers as of now as details may change in the future. Perhaps we can talk about moving it into a public header file at a later date, once it has landed and a good number of functions have been accelerated.

Also, I suggest to make the bitmasks for features for CPUID_XXX/CPUID2_XXX etc symbols. Even more, I want the symbols for higher levels to be defined by lower level symbols, so it is obvious that level n+1 extends level n.

Will do so in an upcoming change.

lib/libc/amd64/string/amd64_archlevel.c
232 ↗(On Diff #124032)

Yes, this is called “defensive programming.” It costs pretty much nothing and removes a whole lot of headache should the race condition I describe occur.

libexec/rtld-elf/rtld-libc/Makefile.inc
41 ↗(On Diff #124029)

Will make the change general with my next revision.

share/man/man7/simd.7
6 ↗(On Diff #124032)

Unfortunately I am mandated by the Foundation contract to use this license header. Omitting it would be breach of contract.

179 ↗(On Diff #124032)

I spent some time collating it and I think it might be interesting for our users. Would you prefer if I took it out?

Assorted updates responding to reviewer feedback:

  • lib/libc/amd64/string/archlevel.c: use symbolic constants from <machine/specialreg.h>
  • lib/libc/amd64/string/archlevel.c: address additional review comments
  • lib/libc/amd64: rename archlevel.[ch] to amd64_archlevel.[ch]
  • libexec/rtld-elf/rtld-libc/Makefile.inc: do not use machdep string functions on amd64
  • lib/libc/amd64/string/strlen.S: use # to introduce comments
  • lib/libc/amd64/string/amd64_archlevel.c: implement jrtc27's recommendations
  • lib/libc/amd64/amd64_archlevel.h: replace .int with .4byte
  • lib/libc/amd64: apply kib@'s feedback to archlevel functions
  • share/man/man7/simd.7: remove mention of ffs
  • lib/libc/amd64/string/amd64_archlevel.h: use __CONCAT for concatenation
  • lib/libc/amd64/string/amd64_archlevel.c: use symbolic names for archlevel CPU features
  • libexec/rtld-elf/rtld-libc/Makefile.inc: unconditionally use generic string functions

One thing I am thinking about adding is to put each architecture level's
functions into a subsection of .text (e.g. .text.baseline) with another
section for resolver functions. This should improve cache usage slightly
as the pages corresponding to architecture levels we don't use need not to
be mapped or touched. Resolver functions only need to be looked at once,
so they too can go into a block on their own.

What do you think?

As per @kib's request, I ran make tinderbox on this changeset. There were some failures (riscv64 buildworld [due to an exec format error?] and various kernels) but they did not seem to be related to the changeset.

lib/libc/amd64/amd64_archlevel.h
88 ↗(On Diff #124029)

I am not suggesting to expose it to libc consumers. I want it to be available inside amd64 libc, not just libc string functions.

lib/libc/amd64/string/amd64_archlevel.c
56 ↗(On Diff #124721)
57 ↗(On Diff #124721)

and same for all other defines

libexec/rtld-elf/rtld-libc/Makefile.inc
41 ↗(On Diff #124029)

Please extract the rtld change into a separate review. It should be a dedicated commit, and I believe that it is ready in its current form.

(In fact all objects need to be recompiled, but this is for later).

share/man/man7/simd.7
6 ↗(On Diff #124032)

This is easily handled by asking FF. Least the foundation wants is to impede the project improvements.

fuz marked 5 inline comments as done.Jul 16 2023, 2:41 PM

Will submit an update with the other requested changes shortly.

lib/libc/amd64/amd64_archlevel.h
88 ↗(On Diff #124029)

It already is available to other amd64 libc functions. They can just include it as "amd64_archlevel.h" and it works.

libexec/rtld-elf/rtld-libc/Makefile.inc
41 ↗(On Diff #124029)

Please see D41050 for a DR of just these changes.

fuz marked 2 inline comments as done.

Remove parentheses around macros. Move rtld bits into D41050.

lib/libc/amd64/string/amd64_archlevel.c
241 ↗(On Diff #124736)

Adding a typedef for the return type and using it there and several other places would be a significant step forward in readability.

242 ↗(On Diff #124736)

First four arg-receiving registers are already pre-filled with cpu_feature, cpu_feature2, cpu_stdext_feature, cpu_stdext_feature2. This might shave off one CPUID in archlevel() and costs nothing because asm just needs to pass the functions table in %r8 instead of %rdi.

lib/libc/amd64/string/amd64_archlevel.c
241 ↗(On Diff #124736)

Would you be okay with me using dlfunc_t as the return type? It seems to be appropriate. Note that <x86/ifunc.h> doesn't use a typedef either (but maybe it should).

242 ↗(On Diff #124736)

Are you sure we want this tight coupling between rtld and libc? If yes, I can change the code to make use of this detail. I think we might be able to avoid all cpuid calls except the ones to get the AMD flags.

lib/libc/amd64/string/amd64_archlevel.c
241 ↗(On Diff #124736)

Type is ok, my only itch there would be a header pollution, but this should be mostly irrelevant.

242 ↗(On Diff #124736)

I added these args so perhaps yes, I do want and encourage the coupling.

share/man/man7/simd.7
6 ↗(On Diff #124032)

https://docs.freebsd.org/en/articles/license-guide/

all license files will be stored in the LICENSES/ directory of the repository

We can certainly update the FF policy but need to have a canonical copy of the license in the tree.

lib/libc/amd64: use dlfunc_t for __archlevel_resolve() return value

Also refactor code to make use of rtld supplied cpuid values
as per kib@'s recommendation. As a side effect, providing
__archlevel() as an interface is no longer useful: the caller
would have to give this function the cpuid values, too, which
would be quite annoying for the user to do.

I'm not really happy with this one. Shoving the four cpuid values
thorugh three function calls makes the code harder to read and
mixes concerns that could be separated. Is that really worth it?

lib/libc/amd64/string/amd64_archlevel.c
165 ↗(On Diff #125043)

Use libc_private.h to get envrion declaration, do not do this.

221 ↗(On Diff #125043)

the line is not needed

237 ↗(On Diff #125043)

So why not use dlfunc_t in the return statement?

fuz planned changes to this revision.Jul 24 2023, 7:20 PM

Will submit another update shortly. I hope the colour of the bikeshed is finally to your likening.

Is there anything else missing on your part, @kib, that prevents the DR from being accepted? @mjg requested me to first present benchmarks on the performance of strlen, which is something I am currently working on. I would really like to finally get done with this DR so I can focus on writing string functions, of which there are many left to cover.

lib/libc/amd64/string/amd64_archlevel.c
165 ↗(On Diff #125043)

Thank you. I was not aware of the declaration in that header.

221 ↗(On Diff #125043)

Yes, it is. This code ensures that if we fail to be the first thread to set amd64_archlevel, we return whatever amd64_archlevel was set to (which could be different from wantlevel in case there is a programming error on my part, some unaccounted condition, or a modification to the environment). This prevents misuse or programming errors on my part from causing the process to disagree on what the architecture level should be. I am still not sure why you are so hung up on this extremely simple piece of defensive programming that saves me a lot of headache down the line in case the assumption that all threads come to the same conclusion on what the architecture level is ever fails.

237 ↗(On Diff #125043)

I forgot to do this, thank you for the remark.

Address review feedback due to @kib.

lib/libc/amd64/string/amd64_archlevel.c
221 ↗(On Diff #125043)

I mean that the 'else' line is not needed, not the second return.

More important, amd64_archlevel should be not proclaimed volatile, but the atomic_load_int() used for load in the second return. The 'volatile' semantic is implementation-defined, and we avoid using it explicitly, relying on our memory model. You definitely need to ensure that amd64_archlevel is reloaded if atomic_cmpset failed. Or use atomic_fcmpset().

lib/libc/amd64/string/amd64_archlevel.c
221 ↗(On Diff #125043)

The atomic(9) man page tells me that atomic variables should be qualified volatile. With C11, _Atomic would be correct, but you told me not to use this, so I used the old API as it is defined. Am I to assume that the documentation is incorrect?

Also, could you please just tell me all the issues in the code at once instead of only raising one issue at a time and then waiting for me to prepare a new revision? Every revision has me stop work on the SIMD code and go back to the framework just for you to then demand another change. And another. And another. This is really not a productive way of conduct code review. This change has now been lingering for over a month with 10 revisions, most due to you and if this wasn't sponsored work I would have already given up twice over.

lib/libc/amd64/string/amd64_archlevel.c
221 ↗(On Diff #125043)

As for atomic_fcmpset, this one is documented to possibly fail spuriously. Despite the underlying primitive on x86 not having this possibility, I would prefer not to assume properties of system interfaces that are not guaranteed by the specification.

Apply @kib's feedback on use of atomic_load.

lib/libc/amd64/string/strlen.S: revise and improve performance

The original code was meant as a proof of concept to demonstrate the use
of the dispatch framework. But as stated by @mjg, a decision to accept
the framework won't be made until one accelerated function is production
ready. So update strlen() to the requested state.

  • do head load as early as possible.
  • rework v4 kernel head to be more similar to v3 and baseline
  • rework main loops to increment later
  • merge the two cases for main loop matches

Slightly better for all SIMD versions, v3 and v4 still slower than
baseline on short strings. Baseline beats the bionic strlen routine
except on very long strings.

The poor performance of the v3 kernel is likely due to VZEROUPPER, an
instruction needed to avoid adverse effects the performance of SSE code in
functions after the call to strlen.

Benchmark results can be found here: https://github.com/clausecker/strperf
Comparison of results using the benchstat command:

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ strlen_baseline.out │          strlen_bionic.out           │           strlen_scalar.out           │         strlen_x86-64-v3.out         │        strlen_x86-64-v4.out         │
        │       sec/op        │    sec/op     vs base                │    sec/op     vs base                 │    sec/op     vs base                │   sec/op     vs base                │
Short             43.50µ ± 1%    48.03µ ± 1%  +10.40% (p=0.000 n=20)    69.84µ ± 1%   +60.55% (p=0.000 n=20)    53.11µ ± 2%  +22.09% (p=0.000 n=20)   49.66µ ± 3%  +14.16% (p=0.000 n=20)
Mid               13.30µ ± 1%    14.52µ ± 0%   +9.21% (p=0.000 n=20)    21.33µ ± 1%   +60.39% (p=0.000 n=20)    14.00µ ± 1%   +5.29% (p=0.000 n=20)   13.81µ ± 1%   +3.89% (p=0.000 n=20)
Long             2227.3n ± 0%   2035.9n ± 0%   -8.59% (p=0.000 n=20)   7587.2n ± 0%  +240.64% (p=0.000 n=20)   2128.4n ± 0%   -4.44% (p=0.000 n=20)   671.4n ± 0%  -69.85% (p=0.000 n=20)
geomean           10.88µ         11.24µ        +3.29%                   22.44µ       +106.24%                   11.65µ        +7.10%                  7.723µ       -29.03%

        │ strlen_baseline.out │          strlen_bionic.out          │          strlen_scalar.out           │         strlen_x86-64-v3.out         │          strlen_x86-64-v4.out          │
        │         B/s         │     B/s       vs base               │     B/s       vs base                │     B/s       vs base                │      B/s       vs base                 │
Short            2.676Gi ± 1%   2.424Gi ± 1%  -9.42% (p=0.000 n=20)   1.667Gi ± 1%  -37.71% (p=0.000 n=20)   2.192Gi ± 2%  -18.09% (p=0.000 n=20)    2.344Gi ± 3%   -12.41% (p=0.000 n=20)
Mid              8.756Gi ± 1%   8.017Gi ± 0%  -8.43% (p=0.000 n=20)   5.459Gi ± 1%  -37.65% (p=0.000 n=20)   8.316Gi ± 1%   -5.03% (p=0.000 n=20)    8.428Gi ± 1%    -3.74% (p=0.000 n=20)
Long             52.27Gi ± 0%   57.18Gi ± 0%  +9.40% (p=0.000 n=20)   15.34Gi ± 0%  -70.64% (p=0.000 n=20)   54.70Gi ± 0%   +4.65% (p=0.000 n=20)   173.38Gi ± 0%  +231.73% (p=0.000 n=20)
geomean          10.70Gi        10.36Gi       -3.19%                  5.188Gi       -51.51%                  9.990Gi        -6.63%                   15.07Gi        +40.90%

The test cases are: *Short* has 128 KiB worth of strings with an average length of 16,
*Mid* is the same but with an average length of 64, and *Long* is one big string of
128 KiB length. One "op" is one full pass through the test buffer.

Given this state, I propose to commit only the baseline implementation.

I am fine with the selector. I did not even started to look at the str* function(s)

lib/libc/amd64/string/amd64_archlevel.c
221 ↗(On Diff #125180)

Still the line is not needed

This revision is now accepted and ready to land.Jul 26 2023, 10:52 AM
fuz retitled this revision from lib/libc/amd64: add archlevel-based simd dispatch framework (DRAFT) to lib/libc/amd64: add archlevel-based simd dispatch framework.Jul 26 2023, 10:59 AM
fuz edited the summary of this revision. (Show Details)
fuz edited the test plan for this revision. (Show Details)

I'm ok with the sse2 strlen, as the rest is currently pessimal it should not be going in yet.

Since kib reviewed the machinery, I think you should make one commit which adds it with him in 'reviewed by'. Then another one which plops in sse2 strlen (aka baseline) with 'reviewed by' me.