Allow __builtin_memset instead of bzero for small buffers of known size
In particular this eliminates function calls and related register save/restore
when only few writes would suffice.
Example speed up can be seen in a fstat microbenchmark on AMD Ryzen cpus, where
the throughput went up by ~4.5%.
Thanks to cem@ for benchmarking and reviewing the patch.
MFC after: 1 week