
nfsclient: eliminate ncl_writebp()
ClosedPublic

Authored by kib on Dec 30 2023, 6:50 PM.

Details

Summary

Use plain bufwrite() instead. ncl_writebp() evolved to mostly repeat the bufwrite() code, with some omissions.
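
Illustrative only (the exact call sites are in the committed diff, not reproduced here): where the NFS client previously went through its private ncl_writebp() helper, the buffer is now simply handed to the generic buffer-cache write path.

    /*
     * Sketch of the substitution at a hypothetical call site; only
     * bufwrite() and ncl_writebp() are taken from this revision, the
     * surrounding context is assumed.
     */
    error = bufwrite(bp);   /* was: a call into ncl_writebp() */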

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

kib requested review of this revision. Dec 30 2023, 6:50 PM

This looks ok to me and testing with the patch has not
found any regression.

However, I am not that familiar with the buffer cache
code, so I cannot tell if the replacement results in any
semantics change.
As such, it would be better if someone more familiar
with the buffer cache code reviews it.
(If there is no one, I can click "reviewed" on it, if you'd like?)

Since this is NFS, and since you have read and agree with the patch, I would very much like you to accept the revision.

I will ask Peter Holm to test the patch set in the end.

Looks ok to me, although I have limited knowledge w.r.t.
the buffer cache.

Testing with the patch has not found any issues so far.

This revision is now accepted and ready to land. Dec 31 2023, 9:21 PM

The main difference between this function and bufwrite() appears to be that bufwrite() implements the runningbufspace mechanism. I can't see any problem with enabling that.
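
Roughly, that mechanism charges each write in flight against a global counter and puts writers to sleep once the counter exceeds vfs.hirunningspace, waking them again when completed writes bring it back under vfs.lorunningspace. A paraphrased sketch of the accounting (not the literal vfs_bio.c code; the helper name is invented):

    /* Paraphrase only; runningbuf_charge() is not a real kernel function. */
    static void
    runningbuf_charge(struct buf *bp)
    {
            atomic_add_long(&runningbufspace, bp->b_bufsize);
            /*
             * Sleep while the total is above vfs.hirunningspace; the
             * write-completion path subtracts the buffer's size again and
             * wakes sleepers once the total drops below vfs.lorunningspace.
             */
    }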

Runningbufs is a useful moderator of traffic. However, it's way undersized for the larger systems that have become common... we should think of some way of sanely scaling it. NICs and modern HBAs can handle a lot more than the defaults.

Although I have not found any correctness issue with this patch,
I am seeing a little performance degradation.
For example, a kernel build over NFS takes about 5% longer
on the setup/hardware I have.

I do not know if 5% is worth worrying about?

Although I have not found any correctness issue with this patch,
I am seeing a little performance degradation.
For example, a kernel build over NFS takes about 5% longer
on the setup/hardware I have.

I do not know if 5% is worth worrying about?

Are you hitting the runningbufs limit?

vfs.hirunningspace: 16777216
vfs.lorunningspace: 11206656
vfs.runningbufspace: 0

The first two are control, and the last one is the currently used space...
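
In case it helps to check this during a build, here is a small standalone observer that polls the two sysctls (a sketch: the one-second interval and the long-sized counters are assumptions on my side):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            long running, hi;
            size_t len;

            for (;;) {
                    /* Current bytes of writes in flight. */
                    len = sizeof(running);
                    if (sysctlbyname("vfs.runningbufspace", &running, &len,
                        NULL, 0) == -1)
                            return (1);
                    /* The high-water control limit. */
                    len = sizeof(hi);
                    if (sysctlbyname("vfs.hirunningspace", &hi, &len,
                        NULL, 0) == -1)
                            return (1);
                    printf("runningbufspace %ld / hirunningspace %ld\n",
                        running, hi);
                    sleep(1);
            }
    }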

It might not be the limit, but simply the cost of the runningbufspace management. There are two atomics per buffer write; what is your CPU?

Also there are RACCT-related things, whose overhead depends on the kernel options and runtime config.

Anyway, IMO if effort is spent on optimization, it should be spent on the generic code, after removing the unnecessary divergence between the fs-specific and generic routines.

In D43249#987015, @imp wrote:

Although I have not found any correctness issue with this patch,
I am seeing a little performance degradation.
For example, a kernel build over NFS takes about 5% longer
on the setup/hardware I have.

I do not know if 5% is worth worrying about?

Are you hitting the runningbufs limit?

vfs.hirunningspace: 16777216
vfs.lorunningspace: 11206656
vfs.runningbufspace: 0

The first two are control, and the last one is the currently used space...

I will have to run it again to see, but since both the src and obj trees are on the
NFS mount for a "make buildkernel", I suspect it might.

It might not be the limit, but simply the cost of the runningbufspace management. There are two atomics per buffer write; what is your CPU?

Core i5 (4 cores) on an old Dell laptop.

Also there are RACCT-related things, whose overhead depends on the kernel options and runtime config.

Anyway, IMO if effort is spent on optimization, it should be spent on the generic code, after removing the unnecessary divergence between the fs-specific and generic routines.

Yes, I agree it makes sense to use the generic routine and look at tuning it.
(One possibility would be to make the runningbuf management optional in it, based on "yet another" B_xxx flag?)
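
Purely to illustrate that idea (no such flag exists today; the name and placement are made up), it would amount to something like:

    /* Hypothetical flag; B_NORUNNINGBUF is not a real flag in the tree. */
    if ((bp->b_flags & B_NORUNNINGBUF) == 0)
            runningbuf_charge(bp);  /* see the paraphrased sketch above */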

If we are hitting the runningbuf limits, maybe we should consider making them more dynamically tuned:

* Note: The 16 MiB upper limit for hirunningspace was chosen
* arbitrarily and may need further tuning. It corresponds to
* 128 outstanding write IO requests (if IO size is 128 KiB),
* which fits with many RAID controllers' tagged queuing limits.
* The lower 1 MiB limit is the historical upper limit for
* hirunningspace.

Maybe we're to the point where further tuning will be helpful?

I routinely NERF this on systems I'm running on because otherwise things get too sluggish when I'm doing a lot of writes.

We've since raised MAXPHYS to 1MB, suggesting we should bump it.

Given the complex interrelationships between all the parameters set in bufinit(), I'm not sure what to suggest as a better value. But a ham-fisted *8 to 128MB or even another doubling to 256MB wouldn't even be noticed by most systems with large numbers of disks. Systems with 100 drives are now quite common, and RAID controllers have advanced somewhat since 2010 when we last bumped it from 1MB. We're already scaled to 1/64 of the buffer space; maybe that's fine w/o an arbitrary limit. Certainly not one based speculatively on using RAID cards from 15 years ago....

(One possibility would be to make the runningbuf management optional in it, based on "yet another" B_xxx flag?)

you can already exempt processes from the runningbuf limit, which I think is also a mistake....

This revision was automatically updated to reflect the committed changes.
In D43249#987333, @imp wrote:

(One possibility would be to make the runningbuf management optional in it, based on "yet another" B_xxx flag?)

you can already exempt processes from the runningbuf limit, which I think is also a mistake....

What do you mean? If you refer to TDP_NORUNNINGBUFSPACE, it is about the syncer and the buffer daemon, where rate-limiting of writes is effectively done by other means, and where adding the runningbufspace limiter would cause additional deadlock risk.

Otherwise, the runningbufspace is crucial to allow the VM and BIO subsystems to not deadlock by overscheduling writes.

It is a known issue that runningbufspace would be better scaled by the system's write throughput capacity (or, alternatively, that the mechanism is too simple-minded). But it is quite hard to predict at the top bio layer where the writes end up. For instance, a slow USB stick can easily accumulate writes in flight that exhaust the running space, blocking processes wanting to write to fast storage not related to USB. But a proper solution is not obvious.

That said, I do believe that NFS participating properly in the limiting would improve the overall system resilience to overbooked writes, which is a good thing. This is why I went ahead and committed the change.

In D43249#987491, @kib wrote:
In D43249#987333, @imp wrote:

(One possibility would be to make the runningbuf management optional in it, based on "yet another" B_xxx flag?)

you can already exempt processes from the runningbuf limit, which I think is also a mistake....

What do you mean? If you refer to TDP_NORUNNINGBUFSPACE, it is about the syncer and the buffer daemon, where rate-limiting of writes is effectively done by other means, and where adding the runningbufspace limiter would cause additional deadlock risk.

Yea, but it's also set in the UFS snapshot code, the UFS flush code, g_journal and dev/md. The last one makes some sense because we know md is fast.... Hence I'm uncomfortable about it.

Otherwise, the runningbufspace is crucial to allow the VM and BIO subsystems to not deadlock by overscheduling writes.

I agree. I think we've grown far too conservative about setting it though....

It is a known issue that runningbufspace would be better scaled by the system's write throughput capacity (or, alternatively, that the mechanism is too simple-minded). But it is quite hard to predict at the top bio layer where the writes end up. For instance, a slow USB stick can easily accumulate writes in flight that exhaust the running space, blocking processes wanting to write to fast storage not related to USB. But a proper solution is not obvious.

Yea, but why should we guard against that on systems where slow USB sticks will never happen? And why do we justify the current extremely low limits with what is no longer true about RAID HBAs and is even less true about NVMe drives?

That said, I do believe that NFS participating properly in the limiting would improve the overall system resilience to overbooked writes, which is a good thing. This is why I went ahead and committed the change.

Yea, I think it would be good, but it also suffers from starvation issues, so it's a bit of a double-edged sword.

I also agree this is a hard problem. I looked at 100% eliminating this mechanism in favor of a mechanism that would return up the stack an estimate of how much write throughput each element could handle. In the geom layers I had to do some hacks for gmirror, gstripe, etc to fan in data in intelligent ways from the different devices. I also then passed this up to the bufobj that's scheduling the writes and hacked bufwrite to try to not schedule more than 100ms of writes (so the wait moves from after to before).... But I never got it stable, and the patches then decayed with other changes in vfs_bio and I never returned to it. The key, I think, is to provide some back-pressure signal up the I/O stack so that the top-level bufobjs don't write more than a fraction of what the underlying disks / etc can handle in some small time slice. But I agree with your statement that the actual solution isn't clear, and I'm no closer to presenting one than I was several years ago when I started down this path... and I'll not have the time it needs to try again for a while :(. There might be a simpler way to do this as well based on allowing some writes through initially, but then backing off if they congest, but it would likely need to be per-bufobj driven rather than global (though I acknowledge that at the top end of doing this dynamically there's other limiting factors that I've not fully comprehended).

It's also one reason I think it would be generally safe to raise it from 16MB to 64MB (very safe) or 1GB (a lot riskier on smaller-memory machines) or something in between. We've had 13 years since we did this, and a doubling of memory size every 3 years suggests 4 doublings (so bump it to 256MB), especially since most USB drives have gotten much faster than they were 13 years ago. Or bump it to 128MB based on the factor-of-8 increase in MAXPHYS. Well, my failure isn't a reason for this, but just the growth of memory suggests the arbitrary 16MB limit should be re-justified every so often.

But again, I think this was a good change, and maybe there's a better place to have this discussion since we're space-limited here and fragmented.

One possible design could be to split the runningbufspace global into a per-geom counter and limit, with the geom responsible for setting the limit. Indeed, a pointer from the bufobj to the pair could be arranged, as you noted. For the bio subsystem, it is a relatively easy transition, with the only negative effect being the addition of a pointer to each bufobj, which increases the vnode size. Filesystems would get the initial pointer on geom_vfs_open(), and then clone it into each instantiated vnode.

For geom, IMO, this is where all the complications lie. First you need to somehow estimate reasonable limits per HBA (perhaps by experiment); then transforming geoms like RAIDs need to apply some type-specific formula; then e.g. gpart should share the underlying limit, etc.

I agree that it is reasonable to bump the current limit to, say, 64M.
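
A minimal sketch of the pair this design implies (the names are invented for illustration; nothing like this exists in the tree today):

    /* Invented names; illustrative only. */
    struct runningbuf_domain {
            long    rbd_space;      /* bytes of writes currently in flight */
            long    rbd_limit;      /* limit chosen by the GEOM provider */
    };

    /*
     * Each bufobj would carry a pointer to such a pair instead of touching
     * the global runningbufspace; a filesystem would pick the pointer up at
     * geom_vfs_open() time and copy it into every vnode it instantiates.
     */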

I ran a test cycle mounting the client laptop locally
(which is all I can do at this time) and with another
experimental patch removed, and... I see no performance
degradation and do not go anywhere near the runningbuf
limit (it never exceeded 1Mbyte, from what I saw).

So, it was either the other experimental patch or
a very slow server on a 100Mbps link.

Sorry about the noise, although it did result in
an interesting discussion.
(I will rerun the original test without that other
experimental patch, but it won't happen for weeks.)

In D43249#987508, @imp wrote:

I also agree this is a hard problem. I looked at 100% eliminating this mechanism in favor of a mechanism that would return up the stack an estimate of how much write throughput each element could handle. In the geom layers I had to do some hacks for gmirror, gstripe, etc to fan in data in intelligent ways from the different devices. I also then passed this up to the bufobj that's scheduling the writes and hacked bufwrite to try to not schedule more than 100ms of writes (so the wait moves from after to before).... But I never got it stable, and the patches then decayed with other changes in vfs_bio and I never returned to it. The key, I think, is to provide some back-pressure signal up the I/O stack so that the top-level bufobjs don't write more than a fraction of what the underlying disks / etc can handle in some small time slice. But I agree with your statement that the actual solution isn't clear, and I'm no closer to presenting one than I was several years ago when I started down this path... and I'll not have the time it needs to try again for a while :(. There might be a simpler way to do this as well based on allowing some writes through initially, but then backing off if they congest, but it would likely need to be per-bufobj driven rather than global (though I acknowledge that at the top end of doing this dynamically there's other limiting factors that I've not fully comprehended).

But again, I think this was a good change, and maybe there's a better place to have this discussion since we're space-limited here and fragmented.

In D43249#987860, @kib wrote:

One possible design could be to split the runningbufspace global into a per-geom counter and limit, with the geom responsible for setting the limit. Indeed, a pointer from the bufobj to the pair could be arranged, as you noted. For the bio subsystem, it is a relatively easy transition, with the only negative effect being the addition of a pointer to each bufobj, which increases the vnode size. Filesystems would get the initial pointer on geom_vfs_open(), and then clone it into each instantiated vnode.

For geom, IMO, this is where all the complications lie. First you need to somehow estimate reasonable limits per HBA (perhaps by experiment); then transforming geoms like RAIDs need to apply some type-specific formula; then e.g. gpart should share the underlying limit, etc.

I agree that it is reasonable to bump the current limit to, say, 64M.

For lack of a better place, I'm adding thoughts here for the time being, so that they are not lost.

The approach you propose and tried to implement at some point seems very reasonable to me and a logical direction for investigation and implementation. At the same time, I also feel it's a refinement rather than a base; let me explain why.

As always, the difficult part is encountering and enumerating different problematic use cases, seeing what they have in common and what they do not, and also trying to evaluate the importance of the different use cases. I have no doubt that you and kib@ currently have a much better view of these and of the current architecture of the system than I have. However, I think I have a slightly different use case that may not be completely addressed by what you've evoked here so far. It is the malfunctioning USB key, whose throughput varies during use and possibly completely stalls for some LBAs. Of course, you want to stay away from such hardware as much as possible, but the reality is that you have to deal with it from time to time. kib@ evoked above a simpler variant where someone uses a functioning but always slow USB key.

What happens in both scenarios is that exercising some UFS filesystem can use up all buffers and stall filesystem-level I/O as soon as this filesystem is on a USB key that stops responding, sporadically or for long periods of time (and sometimes, it's just a matter of having bad connectors or having inadvertently bound the key). And I do think there may be a simple solution here to avoid what I consider the most pressing problem, i.e., the complete I/O stall. It is to reserve a very small number of buffers per mounted filesystem (this can be refined, e.g., to be per actual hardware device/GEOM leaf) that is always available to it. Of course, this amount won't be enough to sustain the maximum performance of, e.g., NVMe devices, or perhaps any decently fast device for that matter, but at least it would allow the system to continue making progress in the face of a long raw I/O stall towards some device. Based on my experience, this would already make a big difference (the biggest, actually). Would it do so as well for the use cases you have in mind? If not, why?
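
To make the idea concrete, a minimal sketch of what such a per-mount reserve could look like (all names are invented; this is not a proposal for an actual structure layout):

    /* Invented names; illustrative only. */
    struct mount_bufreserve {
            int     mbr_reserved;   /* buffers guaranteed to this mount */
            int     mbr_inuse;      /* how many of them are currently used */
    };

    /*
     * The buffer allocator would refuse to hand a mount's reserved buffers
     * to anyone else, so a stalled USB key could tie up only the shared
     * pool plus its own small reserve, and other filesystems would keep
     * making progress on theirs.
     */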

Then, as a refinement, it would be nice if each leaf GEOM gave some relative indication of its speed and/or reliability, which would be used to adjust the buffer reserve evoked above, so that all devices stop getting the same amount of reserve buffers and instead get one proportional to the speed/reliability measure. I think this is similar in spirit to what kib@ evoked above. IIUC, the difference would be that I only propose this for the "reserve buffers", but not for the total amount of buffers that can exist in the system (the limit).

Finally, and as a second refinement, these indications of speed/reliability would be dynamically reported, and the reserve of buffers changed accordingly (if possible). I've not studied this idea in enough detail for a clear stance, but intuitively I don't think it's going to be that beneficial to apply this scheme to the total amount of buffers outside of a smaller "reserve", simply because if all buffers are used, dynamic adaptations are impossible, so the "reserve" can't simply be made to cover all buffers without other mechanisms (such as, perhaps, a minimal amount per mount or GEOM leaf, which anyway is already required for the first refinement). This refinement would improve handling of the more complex USB key scenario I mentioned above, and I have a feeling it would also improve some of your use cases as well.