Paths

Table of Contentst

address a performance problem w/ partial sscanf on long strings...
AcceptedPublic
Actions

Authored by jmg on Mar 2 2021, 12:02 AM.

Details

Reviewers

Summary

Every time sscanf is called, strlen is called... If the string is
long, maybe megabytes), but only a small part of the string is
consumed, we have geometric growth scanning the remaining, but never
parsed part of the string...

This breaks the string up into CHUNK_SIZE sized chunks, and only
verifies that a NUL is not in the chunk, before processing it...
This currently uses a hack by directly frobing fp->_p, but as we're
already munging the internals of the FILE, not a terrible solution..

allow redefining the chunk size to make it easier to benchmark..

wrapper memchr which has a better implementation than this...

This is only 30% slower than strlen (pre-asm version) on large strings
like 200k in size, rather than 600%...

Test Plan

add additional sscanf tests to verify functionality across chunks...

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 37490
Build 34379: arc lint + arc unit

Event Timeline

jmg created this revision.Mar 2 2021, 12:02 AM

Herald added a subscriber: imp. · View Herald TranscriptMar 2 2021, 12:02 AM

jmg requested review of this revision.Mar 2 2021, 12:02 AM

Harbormaster completed remote builds in B37489: Diff 84928.Mar 2 2021, 12:02 AM

The strnlen will be committed as a separate commit, likely before this change, as this change depends upon it for performance.

emaste added a subscriber: emaste.Mar 2 2021, 12:07 AM

emaste added inline comments.

lib/libc/string/strnlen.c
39–44	I think `'\0'` is pretty typical

Attached is a set of benchmarks. First number is chunk size, second number is number of 0's in the string that is sscanf'd.
Bench (results from an amd64, AMD PRO A10-8770E, 2.8GHz):

comp.bench.txt10 KBDownload

source file:

t.c1 KBDownload

script to run to generate files for ministat:

script605 BDownload

Small nit in commit message: Missing ( in the second paragraph.

I tend to agree with @emaste on '\0'.

lib/libc/stdio/vsscanf.c
84
101–102	I'd tend to swap these two lines to match the struct definition order.
lib/libc/string/strnlen.c
42
44

use a more common spelling of NUL...

Harbormaster completed remote builds in B37490: Diff 84930.Mar 2 2021, 12:20 AM

style(9) uses \0 exclusively, so I guess I'll change.

address brooks's comments,

Harbormaster completed remote builds in B37491: Diff 84931.Mar 2 2021, 12:25 AM

address comments.

strlen tests pass:

freebsd@test:/usr/tests $ kyua test lib/libc/string/strlen_test
lib/libc/string/strlen_test:strlen_basic  ->  passed  [0.004s]
lib/libc/string/strlen_test:strlen_huge  ->  passed  [0.027s]
lib/libc/string/strlen_test:strnlen_basic  ->  passed  [0.004s]

Results file id is usr_tests.20210302-003845-703041
Results saved to /home/freebsd/.kyua/store/results.usr_tests.20210302-003845-703041.db

3/3 passed (0 failed)

Is 128 a reasonable chunk here? How does it relate to the internal buffering of FILE?

(I’m amused that this libc microoptimization of a 30? year old behavior was inspired by GTA V load times.)

In D29007#649700, @cem wrote:

Is 128 a reasonable chunk here? How does it relate to the internal buffering of FILE?

I'm going to run some perf numbers on ARM, but w/ the attached file w/ benchmarks, it looks like the chunk should be closer to 16.

Here's a graph from my amd64 box:

64 bytes has some interesting outliers, but on the larger sizes, the smaller makes a bigger difference, which is likely because there will be 8 calls to sscanf for every 16 byte chunk, so that means strnlen really wants to be short, otherwise we are wasting a LOT of time doing strnlen just to throw out the results.

(I’m amused that this libc microoptimization of a 30? year old behavior was inspired by GTA V load times.)

you and me both. :)

In D29007#649973, @jmg wrote:

In D29007#649700, @cem wrote:

Is 128 a reasonable chunk here? How does it relate to the internal buffering of FILE?

I'm going to run some perf numbers on ARM, but w/ the attached file w/ benchmarks, it looks like the chunk should be closer to 16.

That sounds about right if we're optimizing for single conversions (%d, %g, whatever) at the start of long input strings. Does the small size penalize larger conversions much? It seems like we'll be bouncing around _vsscanf_read every N bytes now. (I wonder how common the GTA use case actually is.)

In D29007#650040, @cem wrote:

In D29007#649973, @jmg wrote:

In D29007#649700, @cem wrote:

Is 128 a reasonable chunk here? How does it relate to the internal buffering of FILE?

I'm going to run some perf numbers on ARM, but w/ the attached file w/ benchmarks, it looks like the chunk should be closer to 16.

That sounds about right if we're optimizing for single conversions (%d, %g, whatever) at the start of long input strings. Does the small size penalize larger conversions much? It seems like we'll be bouncing around _vsscanf_read every N bytes now. (I wonder how common the GTA use case actually is.)

yeah, I did a quick grep over the FreeBSD tree for ssacnf, and most are conversions that likely consume <16 bytes. There are a few that likely will exceed 16 bytes, but rarely over 32 bytes.

The only case that I can think of it impacting is if sscanf is used to parse(copy) a long string, "%128s" or something similar, where it'll just copy the bytes, then call back in to fill the buffer. The thing is that the buffer fill function is REALLY small/fast. But we don't have any cases like that in our tree, but doesn't mean other code isn't doing that.

Also the difference in the small sizes (16 vs 32 vs 64) for short strings is either it doesn't change performance, or it improves it by a small amount, where as for longer strings, the shorter strings FAR out perform the longer ones.

The other issue that I thought about is w/ unaligned strings. I am thinking that the first chunk should be CHUNK_SIZE + enough_to_realign. This helps ensure that future calls to strnlen (aka memchr) will be aligned allowing the use of the word path every time instead of having to realign each time.

The performance for short strings above is interesting when combined w/ the previous point (about unaligned strings), because it means that even spending time in a non word loop out performs larger strnlen calls that can spend more time in the optimized word loop.

Ok, the results from an arm64 run (A53, Pine64 A64-LTS).

Graph:

Bench text:

comp.arm64.bench.txt10 KBDownload

The biggest difference is that there are less differences on arm64 than amd64. On a short string (20 bytes), no provable differences, and slightly longer strings, shorter 16 bytes does seem to win out.

It does seem like 16, 32 or 64 have different benefits to each. 64 does seem to have a slight benefit on amd64, but less on arm64.

I think I will go with 32 unless someone wants to make the case for 16 being not too much worse/better on the small sizes, and significantly better on the long strings.

Thoughts?

pstef accepted this revision.Sep 23 2021, 4:14 PM

pstef added a subscriber: pstef.

pstef added inline comments.