For writing and reading single pixels, avoid some pessimizations for

Description

For writing and reading single pixels, avoid some pessimizations for
depths > 8. Add some smaller optimizations for these depths. Use a
more generic method for all depths >= 8, although this gives tiny
pessimizations for these depths.

For clearing the whole frame buffer, avoid the same pessimizations
for depths > 8. Add some larger optimizations for these depths. Use
an even more generic method for all depths >= 8 to give the optimizations
for depths > 8 and a tiny pessimization for depth 8.

The main pessimization was that old versions of bcopy() copy 1 byte at a
time for all trailing bytes. (i386 still does this. amd64 now pessimizes
large sizes instead of small ones if the CPU supports ERMS. dev/fb gets
this wrong by mostly not using the bcopy() family or the technically correct
bus space functions but by mostly copying 2 bytes at a time using an
unoptimized loop without even volatile declarations to prevent the compiler
rewriting it.)

The sizes here are 1, 2, 3 or 4 bytes, so depths 9-16 were up to twice as
slow as necessary and depths 17-24 were up to 3 times slower than necessary.
Fix this (except depths 17-24 are still up to 2 times slower than necessary)
by using (builtin) memcpy() instead of bcopy() and reorganizing so that the
compiler can see the small constant sizes. Reduce special cases while
reorganizing although this is slightly slower than adding special cases.
Compiler inlining (and even -O2 vs -O0) makes little difference compared
with reducing the number of accesses, except that on modern hardware it
gives a small improvement.
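
As a rough illustration of the single-pixel change described above (not the
committed libvgl code), the sketch below dispatches on the pixel size so that
every memcpy() call has a small compile-time constant length, which lets the
compiler turn each case into one or two plain stores or loads. The helper
names put_pixel() and get_pixel() are hypothetical, and the pixel bytes are
assumed to be stored little-endian.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store the low `bpp` bytes of `color`, little-endian. */
static inline void
put_pixel(uint8_t *dst, uint32_t color, int bpp)
{
	/* Build the little-endian byte image once, independent of host order. */
	uint8_t bytes[4] = {
		(uint8_t)color, (uint8_t)(color >> 8),
		(uint8_t)(color >> 16), (uint8_t)(color >> 24)
	};

	assert(bpp >= 1 && bpp <= 4);
	/* Each memcpy() has a constant size, so it can be inlined. */
	switch (bpp) {
	case 1: memcpy(dst, bytes, 1); break;
	case 2: memcpy(dst, bytes, 2); break;
	case 3: memcpy(dst, bytes, 3); break;
	case 4: memcpy(dst, bytes, 4); break;
	}
}

/* Hypothetical helper: load a `bpp`-byte little-endian pixel. */
static inline uint32_t
get_pixel(const uint8_t *src, int bpp)
{
	uint8_t bytes[4] = { 0, 0, 0, 0 };

	assert(bpp >= 1 && bpp <= 4);
	switch (bpp) {
	case 1: memcpy(bytes, src, 1); break;
	case 2: memcpy(bytes, src, 2); break;
	case 3: memcpy(bytes, src, 3); break;
	case 4: memcpy(bytes, src, 4); break;
	}
	return (uint32_t)bytes[0] | (uint32_t)bytes[1] << 8 |
	    (uint32_t)bytes[2] << 16 | (uint32_t)bytes[3] << 24;
}
```

A byte-at-a-time copy would make 3 accesses for a 3-byte pixel; with constant
sizes the compiler is free to use wider accesses where the target allows it.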

Clearing was also pessimized mainly by the extra accesses. Fix it quite
differently by creating a MEMBUF containing 1 line (in fast memory using
a slow method) and copying this. This is only slightly slower than reducing
everything to efficient memset()s and bcopy()s, but simpler, especially
for the segmented case. This works for planar modes too, but don't use it
then since the old method was actually optimal for planar modes (it works
by moving the slow i/o instructions out of inner loops), while for direct
modes the slow instructions were all in the invisible inner loop in bcopy().
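
The clearing strategy can be sketched roughly as follows, assuming a simple
linear (unsegmented, direct-mode) frame buffer. struct fb and fb_clear() are
hypothetical names for illustration, not libvgl's VGLBitmap or VGLClear();
the caller is assumed to supply a scratch line of at least width * bpp bytes
in ordinary memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical frame buffer description (assumption for this sketch). */
struct fb {
	uint8_t	*base;		/* first byte of the first line */
	size_t	 width;		/* pixels per line */
	size_t	 height;	/* number of lines */
	size_t	 stride;	/* bytes per line, >= width * bpp */
	int	 bpp;		/* bytes per pixel, 1..4 */
};

static void
fb_clear(struct fb *fb, uint32_t color, uint8_t *line)
{
	size_t x, y, linebytes;
	int i;

	assert(fb->bpp >= 1 && fb->bpp <= 4);
	linebytes = fb->width * (size_t)fb->bpp;
	assert(linebytes <= fb->stride);

	/* Slow method, but it only touches one line of fast memory. */
	for (x = 0; x < fb->width; x++)
		for (i = 0; i < fb->bpp; i++)
			line[x * (size_t)fb->bpp + (size_t)i] =
			    (uint8_t)(color >> (8 * i));

	/* Fast part: one bulk copy per frame buffer line. */
	for (y = 0; y < fb->height; y++)
		memcpy(fb->base + y * fb->stride, line, linebytes);
}
```

The per-pixel work is done once per clear instead of once per pixel of the
frame buffer, and padding bytes between lines are left untouched.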

Use htole32() and le32toh() and some type puns instead of unoptimized
functions for converting colors. This optimization is mostly in the noise.
libvgl is only supported on x86, so it could hard-code the assumption that
the byte order is le32, but the old conversion functions didn't hard-code
this.
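
A minimal sketch of the color conversion idea, assuming a 32-bit color value
whose low bytes form the pixel. The function names are hypothetical, and
memcpy() is used here in place of the type puns mentioned above so that the
example stays free of aliasing problems. htole32() and le32toh() come from
<sys/endian.h> on FreeBSD; the <endian.h> fallback is an assumption for
building the sketch on glibc systems.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* glibc only: expose htole32()/le32toh() */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__FreeBSD__)
#include <sys/endian.h>
#else
#include <endian.h>
#endif

/* Hypothetical helper: write the low `bpp` bytes of `color` to `out`. */
static inline void
color_to_bytes(uint32_t color, uint8_t *out, int bpp)
{
	uint32_t le = htole32(color);	/* no-op on x86 */

	assert(bpp >= 1 && bpp <= 4);
	memcpy(out, &le, (size_t)bpp);
}

/* Hypothetical helper: rebuild a color from `bpp` little-endian bytes. */
static inline uint32_t
bytes_to_color(const uint8_t *in, int bpp)
{
	uint32_t le = 0;

	assert(bpp >= 1 && bpp <= 4);
	memcpy(&le, in, (size_t)bpp);
	return le32toh(le);		/* no-op on x86 */
}
```

On little-endian hosts both conversions compile to nothing, which matches the
observation that this optimization is mostly in the noise.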

Details

Provenance
Authored by bde
Parents
rS346214: MFC r345319: