This was surprisingly annoying to get right, despite being such a simple
function.  A scalar implementation is also provided, it just calls into
our optimised memchr(), memcpy(), and memset() routines to carry out its
job.
The unit test for stpncpy has been extended significantly and now covers
the function quite well.
I'm quite happy with the performance.  glibc only beats us for very long
strings, likely due to the use of AVX-512.  The scalar implementation
just calls into our optimised memchr(), memcpy(), and memset() routines,
so it has a high overhead to begin with but then performs ok for the
amount of effort that went into it.  Still beats the old C code, except
for very short strings.
os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ stpncpy.pre.out │          stpncpy.scalar.out          │        stpncpy.baseline.out         │
        │     sec/op      │    sec/op     vs base                │   sec/op     vs base                │
Short        106.88µ ± 0%   134.08µ ± 0%  +25.45% (p=0.000 n=20)   75.95µ ± 0%  -28.94% (p=0.000 n=20)
Mid          49.292µ ± 0%   18.558µ ± 0%  -62.35% (p=0.000 n=20)   9.911µ ± 1%  -79.89% (p=0.000 n=20)
Long         48.106µ ± 0%   11.614µ ± 0%  -75.86% (p=0.000 n=20)   3.606µ ± 0%  -92.50% (p=0.000 n=20)
geomean       63.28µ         30.69µ       -51.51%                  13.95µ       -77.96%
        │ stpncpy.pre.out │           stpncpy.scalar.out           │          stpncpy.baseline.out           │
        │       B/s       │      B/s       vs base                 │      B/s       vs base                  │
Short       1115.4Mi ± 0%    889.1Mi ± 0%   -20.29% (p=0.000 n=20)   1569.5Mi ± 0%    +40.72% (p=0.000 n=20)
Mid          2.362Gi ± 0%    6.273Gi ± 0%  +165.61% (p=0.000 n=20)   11.746Gi ± 1%   +397.36% (p=0.000 n=20)
Long         2.420Gi ± 0%   10.024Gi ± 0%  +314.22% (p=0.000 n=20)   32.284Gi ± 0%  +1234.05% (p=0.000 n=20)
geomean      1.840Gi         3.794Gi       +106.22%                   8.345Gi        +353.66%
os: Linux
arch: x86_64
cpu:
        │ stpncpy.glibc.out │
        │      sec/op       │
Short           153.9µ ± 0%
Mid             11.96µ ± 1%
Long            2.623µ ± 0%
geomean         16.90µ
        │ stpncpy.glibc.out │
        │        B/s        │
Short          774.6Mi ± 0%
Mid            9.731Gi ± 1%
Long           44.39Gi ± 0%
geomean        6.888GiSponsored by: The FreeBSD Foundation