
pipe: reduce atime precision
ClosedPublic

Authored by mjg on Mar 4 2020, 10:12 PM.

Details

Summary

The routine is called on successful write and read, which on pipes happens a lot and for small sizes.

The precision provided by default seems far higher than necessary, and it causes problems in VMs on amd64 (it executes rdtscp, which causes a vmexit). getnanotime seems to provide precision roughly in line with Linux, so we should be good here.

Sample result in a vm running pipe1_processes on virtualbox + haswell:

[23:09] test:~/will-it-scale (130) # ./pipe1_processes -t 1 -s 5
testcase:pipe read/write
warmup
min:416236 max:416236 total:416236
min:427691 max:427691 total:427691
min:413516 max:413516 total:413516
min:426149 max:426149 total:426149
min:422719 max:422719 total:422719
min:433463 max:433463 total:433463
measurement
min:423289 max:423289 total:423289
min:434897 max:434897 total:434897
min:424951 max:424951 total:424951
min:429241 max:429241 total:429241
min:419946 max:419946 total:419946
average:426464
./pipe1_processes -t 1 -s 5 0.00s user 11.54s system 99% cpu 11.536 total
[23:09] test:~/will-it-scale # ./pipe1_processes -t 1 -s 5
[23:09] test:~/will-it-scale (130) # sysctl vfs.timestamp_precision=1
vfs.timestamp_precision: 2 -> 1
[23:09] test:~/will-it-scale # ./pipe1_processes -t 1 -s 5
testcase:pipe read/write
warmup
min:3341419 max:3341419 total:3341419
min:3301849 max:3301849 total:3301849
min:3227739 max:3227739 total:3227739
min:3215604 max:3215604 total:3215604
min:3442148 max:3442148 total:3442148
min:3268131 max:3268131 total:3268131
measurement
min:3219759 max:3219759 total:3219759
min:3211822 max:3211822 total:3211822
min:3276166 max:3276166 total:3276166
min:3291221 max:3291221 total:3291221
min:3238138 max:3238138 total:3238138
average:3247421
./pipe1_processes -t 1 -s 5 1.19s user 10.11s system 100% cpu 11.299 total

That is 761% of the baseline. kvm + cascade lake had a similar win.
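The 761% figure can be rechecked from the measurement lines in the log. The sketch below is only a sanity check: the array and function names are chosen for this example, and truncating integer division is an assumption that happens to match the averages printed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Measurement samples copied from the log above. */
static const uint64_t precision2[] = {
	423289, 434897, 424951, 429241, 419946
};
static const uint64_t precision1[] = {
	3219759, 3211822, 3276166, 3291221, 3238138
};

/* Integer mean of n samples, truncated. */
static uint64_t
mean_u64(const uint64_t *v, size_t n)
{
	uint64_t sum = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return (sum / n);
}

/* Result expressed as a truncated percentage of the baseline. */
static uint64_t
percent_of_baseline(uint64_t result, uint64_t baseline)
{
	assert(baseline > 0);
	return (result * 100 / baseline);
}
```

With these inputs the means come out as 426464 and 3247421, and 3247421 * 100 / 426464 truncates to 761.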

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Am I right that this does not apply to named pipes?

sys/kern/sys_pipe.c
398 ↗(On Diff #69190)

typo, unnecessary

This revision is now accepted and ready to land. Mar 4 2020, 10:34 PM

Am I right that this does not apply to named pipes?

Ah, seems it would. I am not sure it is right to give the same treatment to named pipes, since then they get less precision than every other file type, but we have the global timestamp_precision to control this if performance is a concern.

Wouldn't other vfs_timestamp() calls on similar systems have similar issues, e.g. on tmpfs or even on ufs with data blocks cached?
Why wouldn't you simply change the default vfs.timestamp_precision when virtualized? Then the user has full control and can fix this if they want better precision.

This indeed covers all pipes; it's trivial to make it switchable on PIPE_NAMED.

I don't think it's of significance right now. Most filesystems keep deferring fetching the timestamp, meaning the actual precision is not anywhere near the one claimed. Moreover, this completely ignores noatime, and it's quite unclear to me what would end up getting written by the fs anyway. With this in mind, I think that, short of the actual fix, this can use a comment saying that it also covers named pipes when it perhaps should not. The actual fix would add a VOP to call to take care of that case.

In D23964#526760, @kib wrote:

Wouldn't other vfs_timestamp() calls on similar systems have similar issues, e.g. on tmpfs or even on ufs with data blocks cached?
Why wouldn't you simply change the default vfs.timestamp_precision when virtualized? Then the user has full control and can fix this if they want better precision.

This was posted as I was writing my other comment.

To elaborate on the vfs_timestamp problem, other places already avoid calling it very often. For instance tmpfs_read:

tmpfs_set_status(VFS_TO_TMPFS(vp->v_mount), node, TMPFS_NODE_ACCESSED);

... which only conditionally sets the flag which then can get updated on close or next getattr. Meaning the timestamp precision is already lackluster.

ufs has the same trick in ffs_read:

if ((error == 0 || uio->uio_resid != orig_resid) &&
    (vp->v_mount->mnt_flag & (MNT_NOATIME | MNT_RDONLY)) == 0)
        UFS_INODE_SET_FLAG_SHARED(ip, IN_ACCESS);

which again does not update the timestamp. This waits, in the worst case, for the syncer. And so on.

As such, should this stick to de facto file system precision, it would get much worse.

Given how the routine is used by aforementioned callers (and pipes) it is way too expensive.
Changing the default to affect everyone without inspecting all consumers is imo questionable and said inspection is not worth doing.
As noted earlier the actual fix in my opinion would provide something callable by pipes where filesystems can decide what to do (in particular ignore atime if noatime is set and get whatever precision they feel like).
The change as proposed is basically neutral from a functionality perspective and takes care of most of the problem until someone sorts this out.

That said, I consider changing the default timestamp precision in vms to be a separate issue.
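One way to picture the "actual fix" mentioned above (something callable by pipes where the filesystem decides what to do) is a small policy function. Everything in this sketch is hypothetical: struct model_mount, enum atime_clock and model_pipe_atime_policy do not exist in FreeBSD, and the choice of a coarse clock for anonymous pipes simply mirrors what this revision does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-mount policy; these names exist only in this sketch. */
struct model_mount {
	bool	noatime;	/* mount asked for no access-time updates */
	bool	want_precise;	/* filesystem prefers full clock precision */
};

enum atime_clock { ATIME_COARSE, ATIME_PRECISE };

/*
 * Decide whether a pipe read should update atime and with which clock.
 * mp == NULL models an anonymous pipe with no filesystem behind it.
 * Returns false when the update should be skipped entirely.
 */
static bool
model_pipe_atime_policy(const struct model_mount *mp, enum atime_clock *clk)
{
	assert(clk != NULL);
	if (mp == NULL) {
		*clk = ATIME_COARSE;
		return (true);
	}
	if (mp->noatime)
		return (false);
	*clk = mp->want_precise ? ATIME_PRECISE : ATIME_COARSE;
	return (true);
}
```

The point of the design is that the pipe code stops hard-coding a precision for named pipes: noatime is honored, and each filesystem picks the clock it considers appropriate.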

Your change is not neutral; it enforces a policy. I.e., the same argument you use against the vfs_timestamp default precision is equally applicable to pipe_timestamp. Also, the timestamps are costly only in specific configurations.

It would be fine if you added a separate knob for pipe timestamp precision and, again, automatically adjusted it for virtualized environments.

Well I claimed the change only affects pipes for which it's not a problem and it avoids touching anything else. I also noted this matches Linux precision for pipe timestamps.

However, looking around it seems the comment is more true than I thought. Namely the modified counters here are only ever accessed by pipe_stat and only for anonymous pipes, meaning this really does not change anything with respect to named pipes.

Named pipes never see the in-pipe updates: actual writes/reads never call anything fs-specific to denote them, and when someone stats one, they get redirected to the fs-specific stat routine.

Thus the change really only affects de facto timestamps of anonymous pipes, despite the code operating also on named pipes. I think the latter is a bug orthogonal to the change at hand and the precision change localized to anon pipes is perfectly harmless (with a win in vms).

This revision was automatically updated to reflect the committed changes.