
address a performance problem w/ partial sscanf on long strings...
AcceptedPublic

Authored by jmg on Mar 2 2021, 12:02 AM.

Details

Reviewers
pstef
Summary

Every time sscanf is called, strlen is called... If the string is
long (maybe megabytes), but only a small part of the string is
consumed, we get quadratic growth from rescanning the remaining, but
never parsed, part of the string...

This breaks the string up into CHUNK_SIZE sized chunks, and only
verifies that a NUL is not in the chunk before processing it...
This currently uses a hack by directly frobbing fp->_p, but as we're
already munging the internals of the FILE, it's not a terrible solution..

allow redefining the chunk size to make it easier to benchmark..

wrap memchr, which has a better implementation than this...

This is only 30% slower than strlen (pre-asm version) on large strings
like 200k in size, rather than 600%...

Test Plan

add additional sscanf tests to verify functionality across chunks...

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 37491
Build 34380: arc lint + arc unit

Event Timeline

jmg requested review of this revision. Mar 2 2021, 12:02 AM

The strnlen will be committed as a separate commit, likely before this change, as this change depends upon it for performance.

emaste added inline comments.
lib/libc/string/strnlen.c
39–44

I think '\0' is pretty typical

Attached is a set of benchmarks. First number is chunk size, second number is number of 0's in the string that is sscanf'd.
Bench (results from an amd64, AMD PRO A10-8770E, 2.8GHz):

source file:

script to run to generate files for ministat:

Small nit in commit message: Missing ( in the second paragraph.

I tend to agree with @emaste on '\0'.

lib/libc/stdio/vsscanf.c
83
100–101

I'd tend to swap these two lines to match the struct definition order.

lib/libc/string/strnlen.c
41
43
  • use a more common spelling of NUL...
jmg marked an inline comment as done. Mar 2 2021, 12:21 AM

style(9) uses \0 exclusively, so I guess I'll change.

  • address brooks's comments,
jmg marked 4 inline comments as done. Mar 2 2021, 12:26 AM

address comments.

strlen tests pass:

freebsd@test:/usr/tests $ kyua test lib/libc/string/strlen_test
lib/libc/string/strlen_test:strlen_basic  ->  passed  [0.004s]
lib/libc/string/strlen_test:strlen_huge  ->  passed  [0.027s]
lib/libc/string/strlen_test:strnlen_basic  ->  passed  [0.004s]

Results file id is usr_tests.20210302-003845-703041
Results saved to /home/freebsd/.kyua/store/results.usr_tests.20210302-003845-703041.db

3/3 passed (0 failed)

Is 128 a reasonable chunk here? How does it relate to the internal buffering of FILE?

(I’m amused that this libc microoptimization of a 30? year old behavior was inspired by GTA V load times.)

In D29007#649700, @cem wrote:

Is 128 a reasonable chunk here? How does it relate to the internal buffering of FILE?

I'm going to run some perf numbers on ARM, but w/ the attached file w/ benchmarks, it looks like the chunk should be closer to 16.

Here's a graph from my amd64 box:

64 bytes has some interesting outliers, but at the larger sizes the smaller chunks make a bigger difference, which is likely because there will be 8 refills for every 128 bytes consumed w/ 16-byte chunks, so the strnlen chunk really wants to be short, otherwise we are wasting a LOT of time doing strnlen just to throw out the results.

(I’m amused that this libc microoptimization of a 30? year old behavior was inspired by GTA V load times.)

you and me both. :)

In D29007#649973, @jmg wrote:
In D29007#649700, @cem wrote:

Is 128 a reasonable chunk here? How does it relate to the internal buffering of FILE?

I'm going to run some perf numbers on ARM, but w/ the attached file w/ benchmarks, it looks like the chunk should be closer to 16.

That sounds about right if we're optimizing for single conversions (%d, %g, whatever) at the start of long input strings. Does the small size penalize larger conversions much? It seems like we'll be bouncing around _vsscanf_read every N bytes now. (I wonder how common the GTA use case actually is.)

In D29007#650040, @cem wrote:
In D29007#649973, @jmg wrote:
In D29007#649700, @cem wrote:

Is 128 a reasonable chunk here? How does it relate to the internal buffering of FILE?

I'm going to run some perf numbers on ARM, but w/ the attached file w/ benchmarks, it looks like the chunk should be closer to 16.

That sounds about right if we're optimizing for single conversions (%d, %g, whatever) at the start of long input strings. Does the small size penalize larger conversions much? It seems like we'll be bouncing around _vsscanf_read every N bytes now. (I wonder how common the GTA use case actually is.)

yeah, I did a quick grep over the FreeBSD tree for sscanf, and most are conversions that likely consume <16 bytes. There are a few that likely will exceed 16 bytes, but rarely over 32 bytes.

The only case that I can think of it impacting is if sscanf is used to parse (copy) a long string, "%128s" or something similar, where it'll just copy the bytes, then call back in to fill the buffer. The thing is that the buffer fill function is REALLY small/fast. We don't have any cases like that in our tree, but that doesn't mean other code isn't doing it.

Also, for short strings the difference between the small sizes (16 vs 32 vs 64) is either no change in performance or a small improvement, whereas for longer strings the shorter chunks FAR outperform the longer ones.

The other issue that I thought about is w/ unaligned strings. I am thinking that the first chunk should be CHUNK_SIZE + enough_to_realign. This helps ensure that future calls to strnlen (aka memchr) will be aligned allowing the use of the word path every time instead of having to realign each time.

The performance for short strings above is interesting when combined w/ the previous point (about unaligned strings), because it means that even spending time in a non-word loop outperforms larger strnlen calls that can spend more time in the optimized word loop.

Ok, the results from an arm64 run (A53, Pine64 A64-LTS).

Graph:

Bench text:

The biggest difference is that there are fewer differences on arm64 than on amd64. On a short string (20 bytes) there are no provable differences, and on slightly longer strings the shorter 16-byte chunk does seem to win out.

It does seem like 16, 32 and 64 each have different benefits. 64 does seem to have a slight benefit on amd64, but less so on arm64.

I think I will go with 32, unless someone wants to make the case for 16, which is not much worse (and sometimes better) on the small sizes, and significantly better on the long strings.

Thoughts?

pstef added a subscriber: pstef.
pstef added inline comments.
lib/libc/stdio/vsscanf.c
51–56

Why do names of sscancookie and vsscanf_read() need to start with an underscore?

This revision is now accepted and ready to land. Sep 23 2021, 4:14 PM