
vfs: Fix runningspace tuning after maxphys was bumped
Closed, Public

Authored by markj on Nov 1 2024, 11:56 PM.
Details

Summary

The previous maximum value for the upper watermark was based on the old
value of MAXPHYS. Raise it to allow more parallel I/O on large systems.

This is still a rather flawed mechanism, since the limit is global: it is
applied without regard to the number of filesystems or block devices between
which this mechanism sits. But we might as well bump the limits at this
point, as they haven't been revised in quite a long time.

Fixes: cd8537910406 ("Make MAXPHYS tunable. Bump MAXPHYS to 1M.")
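
For context, the watermarks in question are the buffer cache's
hirunningspace / lorunningspace globals, tuned at boot. The standalone
sketch below mirrors the shape of that tuning; the variable names match the
real kernel globals, but the clamp constants and example inputs are
illustrative, not the committed values.

```c
#include <stdio.h>

/*
 * Standalone sketch of the runningspace watermark tuning, modeled on
 * the buffer cache's boot-time computation.  The names mirror the real
 * kernel globals (hirunningspace, lorunningspace, hibufspace, maxphys,
 * maxbcachebuf); the clamp constants and inputs are illustrative, not
 * the committed values.
 */
static long
lroundup(long x, long y)
{
	return (((x + y - 1) / y) * y);
}

static long
lclamp(long x, long lo, long hi)
{
	return (x < lo ? lo : (x > hi ? hi : x));
}

int
main(void)
{
	long maxphys = 1024 * 1024;		/* 1M after the bump */
	long maxbcachebuf = 64 * 1024;
	long hibufspace = 200L * 1024 * 1024;	/* example value only */
	long hirunningspace, lorunningspace;

	/*
	 * Upper watermark: a fraction of buffer space, clamped so the
	 * ceiling scales with maxphys rather than the old MAXPHYS-era
	 * constant; the floor admits at least a few maximal transfers.
	 */
	hirunningspace = lclamp(lroundup(hibufspace / 64, maxbcachebuf),
	    8 * maxphys, 64 * maxphys);
	/* Blocked writers resume once in-flight writes drain below this. */
	lorunningspace = lroundup((hirunningspace * 2) / 3, maxbcachebuf);

	printf("hirunningspace=%ld lorunningspace=%ld\n",
	    hirunningspace, lorunningspace);
	return (0);
}
```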

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Nov 1 2024, 11:56 PM

We should scale this to be .1s of write bandwidth, partitioned per device... but that's a much bigger project

This revision is now accepted and ready to land. Nov 2 2024, 12:27 AM
In D47398#1081034, @imp wrote:

> We should scale this to be .1s of write bandwidth, partitioned per device... but that's a much bigger project

More generally, I don't see why the buffer cache is the right place to enforce a limit at all. If an async write is blocked by the runningspace mechanism, its pages are still wired and unreclaimable. Is the idea to avoid starving synchronous writes (which are not subject to this limit)?

I tend to think that this responsibility belongs to CAM or geom_disk.
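
For reference, the mechanism under discussion charges each in-flight write
against a global runningbufspace counter and puts writers to sleep once the
total exceeds the upper watermark. A simplified sketch of that flow, modeled
on vfs_bio.c (condensed, not verbatim):

```c
/*
 * Simplified sketch of the runningspace throttle, modeled on
 * vfs_bio.c (condensed, not verbatim).
 */

/* Writers call this before issuing I/O: sleep while too much write
 * I/O is already in flight. */
static void
waitrunningbufspace(void)
{
	mtx_lock(&rbreqlock);
	while (runningbufspace > hirunningspace) {
		runningbufreq = 1;
		msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
	}
	mtx_unlock(&rbreqlock);
}

/* Write completion calls this: uncharge the buf and wake waiters once
 * the total has drained below the lower watermark. */
void
runningbufwakeup(struct buf *bp)
{
	long space;

	if (bp->b_runningbufspace == 0)
		return;
	space = atomic_fetchadd_long(&runningbufspace,
	    -bp->b_runningbufspace) - bp->b_runningbufspace;
	bp->b_runningbufspace = 0;
	mtx_lock(&rbreqlock);
	if (runningbufreq && space < lorunningspace) {
		runningbufreq = 0;
		wakeup(&runningbufreq);
	}
	mtx_unlock(&rbreqlock);
}
```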

> In D47398#1081034, @imp wrote:
>> We should scale this to be .1s of write bandwidth, partitioned per device... but that's a much bigger project
>
> More generally, I don't see why the buffer cache is the right place to enforce a limit at all. [...] I tend to think that this responsibility belongs to CAM or geom_disk.

The buffer cache does need to provide write throttling.

Also, some buffer writes require reads to happen first. E.g., a new buffer needs to get a physical block allocated, for which we might need to read an indirect block into a buffer. There is currently no other way to ensure that a buf header is available so that the write can proceed to update the metadata.

I am somewhat skeptical of the change, since e.g. UFS SU does not block writes: if a write buffer is created, it can be written immediately. But it certainly should not hurt.
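
To make the read-before-write dependency concrete: allocating a backing
block for a new buffer can require reading the indirect block that will map
it. A schematic fragment follows; bread() and bwrite() are the real
buffer-cache interfaces, but the helper function and its parameters are
hypothetical, and this is not actual UFS code:

```c
/*
 * Schematic fragment (not actual UFS code): recording a newly
 * allocated block in an indirect block requires a read before the
 * metadata write can happen.  bread() and bwrite() are the real
 * buffer-cache interfaces; the helper and its parameters are
 * hypothetical.
 */
static int
record_indirect_mapping(struct vnode *devvp, daddr_t indirect_lbn,
    int fs_bsize, int slot, ufs2_daddr_t newblkno)
{
	struct buf *ibp;
	int error;

	/* The write depends on a read: fetch the indirect block. */
	error = bread(devvp, indirect_lbn, fs_bsize, NOCRED, &ibp);
	if (error != 0)
		return (error);
	/* Record the allocation in the indirect block... */
	((ufs2_daddr_t *)ibp->b_data)[slot] = newblkno;
	/* ...and write the metadata back. */
	return (bwrite(ibp));
}
```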

> In D47398#1081034, @imp wrote:
>> We should scale this to be .1s of write bandwidth, partitioned per device... but that's a much bigger project
>
> More generally, I don't see why the buffer cache is the right place to enforce a limit at all. [...] I tend to think that this responsibility belongs to CAM or geom_disk.

Well, it's a shared responsibility. CAM / geom_disk get requests, and by then the resources are in use. They have to do *something* to create back pressure up the stack so that (a) any 'failed' I/O is tried again and (b) the producers of I/Os issue them at a rate that matches, or slightly exceeds, the capacity of the underlying storage. So the buffer cache has to do something to retry the stalled writes in a timely manner, as well as limit the number of requests that are bounced back from the lower layers. Also, the lower layers have no notion of sync vs. async calls.

geom_disk might be able to do some of this, but not all of it. CAM certainly could just say 'this write would be delayed more than X ms' if we had a flag that told it to make such estimates. And even then, it's only probabilistic: we would need really good statistics to be able to opine on the expected delay of a write.

So I partially agree: the buffer cache isn't the right place to ENFORCE a limit, but neither can it just ignore that limits exist; it needs to respect them to some degree to keep system writes from overwhelming all other traffic (which is why it exists in the first place, as kib alluded to in his response). Runningbuf is a blunt tool to make sure that we can always read something, even if that read takes a while.

Having said that, we run our systems at work with runningbuf very close to completely nerf'd. It acts only in the worst write flooding events to ensure that we can do the required read traffic to avoid deadlock.
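
For what it's worth, a rough sketch of what the "0.1 s of write bandwidth,
partitioned per device" idea might look like. Nothing like this exists in
the tree today; the struct, function, and example numbers below are
hypothetical:

```c
#include <stdio.h>

/*
 * Hypothetical per-device runningspace state; no such structure
 * exists in the tree today.
 */
struct dev_runningspace {
	long	write_bw_bps;	/* measured write bandwidth, bytes/s */
	long	hirunning;	/* per-device upper watermark */
	long	lorunning;	/* per-device resume watermark */
};

/*
 * Size each device's in-flight write allowance to roughly 0.1 s of
 * its measured write bandwidth, with a floor of one maximal transfer.
 */
static void
dev_runningspace_tune(struct dev_runningspace *drs, long maxphys)
{
	long target = drs->write_bw_bps / 10;	/* 0.1 s of writes */

	drs->hirunning = target > maxphys ? target : maxphys;
	drs->lorunning = (drs->hirunning * 2) / 3;
}

int
main(void)
{
	struct dev_runningspace nvme = { .write_bw_bps = 1000L * 1024 * 1024 };
	struct dev_runningspace hdd = { .write_bw_bps = 150L * 1024 * 1024 };
	long maxphys = 1024 * 1024;	/* 1M */

	dev_runningspace_tune(&nvme, maxphys);
	dev_runningspace_tune(&hdd, maxphys);
	printf("nvme: hi=%ld lo=%ld\n", nvme.hirunning, nvme.lorunning);
	printf("hdd:  hi=%ld lo=%ld\n", hdd.hirunning, hdd.lorunning);
	return (0);
}
```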