
Add BIO_SPEEDUP
ClosedPublic

Authored by imp on Nov 27 2018, 12:45 AM.

Details

Summary

Add BIO_SPEEDUP bio command and g_io_speedup wrapper. It tells the
lower layers that the upper layers are dealing with some shortage
(dirty pages and/or disk blocks). The lower layers should do what they
can to speed up anything that's been delayed.

The first use will be to tell the CAM I/O scheduler that any TRIM
shaping should be short-circuited because the system needs
blocks. We'll also call it when there are too many resources in use
by UFS.
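
As a rough illustration of the intended use, an upper layer holding a GEOM consumer could issue the hint like this. The g_io_speedup() call form is taken from the UFS changes later in this review; the surrounding function and includes are only a sketch:

#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

/*
 * Sketch only: ask the lower layers to hurry along any delayed TRIMs
 * because the filesystem is short on free blocks.  The g_io_speedup()
 * call form matches the UFS diff in this review; everything else here
 * is illustrative.
 */
static void
example_block_shortage(struct g_consumer *cp)
{
	size_t resid;

	g_io_speedup(0, BIO_SPEEDUP_TRIM, &resid, cp);
}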

Add BIO_SPEEDUP signalling to UFS

When we have a resource shortage in UFS, send down a BIO_SPEEDUP to
give the CAM I/O scheduler a heads up that we have a resource shortage
and that it should bias its decisions knowing that.

Implement bio_speedup

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

mckusick added a subscriber: mckusick.

Generally this looks reasonable.

In the places where you use:

cp = (struct g_consumer *)ump->um_devvp->v_bufobj.bo_private;

you should be using

cp = ump->um_cp;

Indeed you can probably just eliminate the copy into cp and just use ump->um_cp.
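
For illustration, the simplification being asked for (surrounding code elided):

/* Before: dig the GEOM consumer out of the bufobj private pointer. */
cp = (struct g_consumer *)ump->um_devvp->v_bufobj.bo_private;

/* After: the ufsmount already caches its consumer. */
cp = ump->um_cp;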

This revision now requires changes to proceed. Nov 27 2018, 6:10 AM

Minor typos in comments.

sys/geom/geom_io.c
342 ↗(On Diff #51144)

s/IBO/BIO

343 ↗(On Diff #51144)

s/has/have

I never saw a situation where SU workitems were the cause of the memory shortage. I understand that slow trims can cause filesystem write hiccups and that the change would somewhat lessen this.

Since, as noted above, the SU workitems shortage does not cause system instability, only performance problems, I would prefer that the abort of requested trims were controlled by an option, perhaps even per-mount. For instance, on my w/s I never have a low memory condition but often do large metadata-intensive ops, and I do not want to lose the useful hints which manage the SSD's lifespan.

Address review feedback.

Again with the changes, this time including all the changes.

In D18351#389989, @kib wrote:

I never saw a situation where SU workitems were the cause of the memory shortage. I understand that slow trims can cause filesystem write hiccups and that the change would somewhat lessen this.

Since, as noted above, the SU workitems shortage does not cause system instability, only performance problems, I would prefer that the abort of requested trims were controlled by an option, perhaps even per-mount. For instance, on my w/s I never have a low memory condition but often do large metadata-intensive ops, and I do not want to lose the useful hints which manage the SSD's lifespan.

In conversations with Kirk, and reading the code, I concluded that speedup is called when there is no journal space on the drive. The eliminated BIO_DELETEs are for blocks that are waiting to be re-used. Since they will be re-used shortly, the benefit from doing the trim is slim to none. TRIMs are only useful for LBAs that lie free long enough for the drive's garbage collection to kick in, reducing the write amplification that would otherwise happen when unused blocks are copied forward because the SSD doesn't know the blocks aren't in use. So in this case, the eliminated TRIMs are for blocks that will be used right away, so they are safe to eliminate. We already do similar BIO_DELETE elimination in UFS after Kirk's recent trim consolidation work.

In D18351#390022, @imp wrote:
In D18351#389989, @kib wrote:

I never saw a situation where SU workitems were the cause of the memory shortage. I understand that slow trims can cause filesystem write hiccups and that the change would somewhat lessen this.

Since, as noted above, the SU workitems shortage does not cause system instability, only performance problems, I would prefer that the abort of requested trims were controlled by an option, perhaps even per-mount. For instance, on my w/s I never have a low memory condition but often do large metadata-intensive ops, and I do not want to lose the useful hints which manage the SSD's lifespan.

In conversations with Kirk, and reading the code, I concluded that speedup is called when there is no journal space on the drive. The eliminated BIO_DELETEs are for blocks that are waiting to be re-used. Since they will be re-used shortly, the benefit from doing the trim is slim to none. TRIMs are only useful for LBAs that lie free long enough for the drive's garbage collection to kick in, reducing the write amplification that would otherwise happen when unused blocks are copied forward because the SSD doesn't know the blocks aren't in use. So in this case, the eliminated TRIMs are for blocks that will be used right away, so they are safe to eliminate. We already do similar BIO_DELETE elimination in UFS after Kirk's recent trim consolidation work.

I do not see that the speedup is called only when there is no journal space. Moreover, from pho's testing of the code surrounding the places where the calls are added, I know that these fragments are executed for the non-J case as well, esp. ast_cleanup(). That said, even if they were called only for a full journal, I do not see what guarantees that all in-flight TRIMs are for the journal.

Kirk sent me this (the patch has been removed here but is integrated into the next update; I've not yet internalized it enough to comment tonight):

There are two functions that get called to clean up:
softdep_request_cleanup() and request_cleanup().

The softdep_request_cleanup() function only gets called when the
filesystem appears to have run out of space and wants to flush
blocks so as to be able to allocate new files (rather than report
ENOSPC). This is a rare condition and dumping TRIMs to help avoid
the error is likely a good idea.

The request_cleanup() function is only called from the
softdep_ast_cleanup_proc() function. The AST gets scheduled for
three types of shortages:
(1) block removal overload,
(2) too many new blocks being allocated, and
(3) too many inodes with metadata updates.

Throwing away TRIM commands for these latter two cases would not
help move things along. So it is really only sensible to do the
TRIM dumping when calling request_cleanup() for FLUSH_BLOCKS. So,
I would not put the TRIM dump call in softdep_ast_cleanup_proc(),
but rather only in the FLUSH_BLOCKS case in request_cleanup().

The comment that precedes the switch statement in request_cleanup()
says that we cannot do the flushing here because `we probably hold
an inode locked at this point'. Since we moved request_cleanup()
into an AST (specifically so we *could* take action), the pause to
wait for the syncer should be replaced with a call to clear_remove() in
the FLUSH_BLOCKS case and to clear_inodedeps() in the FLUSH_INODES case.

My specific changes along these lines are given below.

There is the question of how many TRIMs to dump before calling
clear_remove(). In my diff below, I have a global variable "trimclear"
with the number of TRIM blocks to dump each time this happens. It
could be a sysctl to be global for all UFS filesystems, or it could
be a tuneable stored in the superblock to be per-filesystem, or it
could be set as a mount option. Not sure what is most sensible here.

I should have put this into the Phabricator thread, but I don't
know how to add my diffs without taking it over. So, please feel
free to put it in the Phabricator thread as I would like others
to be able to see this discussion.
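
Since Kirk's diff itself was dropped from the thread, the following is only a rough sketch of the shape of the change he describes; the function name, arguments, and surrounding context are illustrative, not his actual patch:

/*
 * Sketch only, not Kirk's actual patch: call the cleanup routines
 * directly instead of pausing for the syncer, and dump TRIMs only for
 * the block-shortage case.
 */
static void
request_cleanup_sketch(struct mount *mp, struct ufsmount *ump, int resource)
{
	size_t resid;

	switch (resource) {
	case FLUSH_BLOCKS:
		/* A block shortage is the case that benefits from dumping TRIMs. */
		g_io_speedup(0, BIO_SPEEDUP_TRIM, &resid, ump->um_cp);
		clear_remove(mp);
		break;
	case FLUSH_INODES:
		/* An inode shortage gains nothing from dropping TRIMs. */
		clear_inodedeps(mp);
		break;
	}
}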

Fold Kirk's suggestions in, but only his diff, not the one to remove one of my calls.

Ignore write speedup requests.

Update based on Kirk's comments and some thinking about the problem.

Fixup after rebase oops

OK. I've updated this in a few ways:

First, I've signaled that we should speed up trim '0' blocks where Kirk has the trimfree stuff. I'll shortly be changing the meaning of this in the I/O scheduler to mean that trims shouldn't be delayed, for a little while (currently planned ~10 scheduler quanta). We do this right before the vnode flushing which is supposed to give us more blocks. So any trims that come in while the speedup is in place are skipped. This may result in a tiny write-amp increase if those blocks aren't the right ones, but the effects on the drive of skipping a few trims are in the noise as far as improved lifetime goes. This will, however, allow us to recycle blocks more quickly when in speedup. When we don't know what a good number is, this seems like a reasonable thing to do. The 0 change hasn't been coded up yet (though currently it means flush them all, which I agree is too much).

Next, I've added a speedup write to the top of softdep_speedup(). This is just an instruction to the I/O scheduler to forgo write-limiting activities for a short time. I've not coded this up, but for the dynamic scheduler, it will stop doing read-bias during that time, and it will also ignore any rate limiting that's been set up. This should help us clear the clogged system better. I plan on doing this for the same ~10 quanta. It may cause new I/O to be scheduled right away, or its effects may be deferred to when the next I/O completes and we go to select the right I/O to do. Since it's clear we're doing a lot of flushing of dirty things during speedup, this should help that process along by turning off the performance biases one would normally have in place with the I/O scheduler and having it revert to a more-fair scheduler (the 'fair' part is why I haven't coded this up yet, still thinking about that).
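
A minimal sketch of that write hint, assuming the BIO_SPEEDUP_WRITE flag used elsewhere in this review; the surrounding function body is illustrative, not the actual patch:

/*
 * Sketch only: hint to the I/O scheduler that write limiting should be
 * relaxed for a short while.  The flag and call form follow the rest
 * of this review; the rest of softdep_speedup() is elided.
 */
static int
softdep_speedup_sketch(struct ufsmount *ump)
{
	size_t resid;

	g_io_speedup(0, BIO_SPEEDUP_WRITE, &resid, ump->um_cp);
	/* ... existing softdep_speedup() work continues here ... */
	return (1);
}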

I think that these two changes will address Kostik's concerns about dumping too many TRIMs in low memory situations.

First pass at turning off write limiting when in speedup mode.

This now looks reasonable to me.

At the moment during block shortages we request speedup of retiring TRIMs. But if there are no TRIMs (or we are on a disk that does not support TRIMs) then it seems to me that we should fall back to doing write speedup. Is this a change that we should make at the filesystem level, or should the drive recognize that if it is asked to speedup TRIMs and there are no TRIMs to speed up, then it should fall back to speeding up writes?

This revision is now accepted and ready to land. Nov 29 2018, 12:49 AM


If I threaded through the code right, and that's a big if with this code, we'll have already done so. If I've not threaded through the code right, then we can request both write and trim speedups at that point in the code. The changes I made to the I/O scheduler will allow that. It seems to me that speeding up writing will help ensure that we've gotten the blocks out that will result in being able to free blocks as well, so maybe I should just make the change.
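
Assuming the two flags are independent bits that may be combined in one request (an assumption, not confirmed in this thread), asking for both at once would just be:

/* Assumes BIO_SPEEDUP_WRITE and BIO_SPEEDUP_TRIM can be OR'd together. */
g_io_speedup(0, BIO_SPEEDUP_WRITE | BIO_SPEEDUP_TRIM, &resid, ump->um_cp);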

imp marked an inline comment as done. Nov 30 2018, 7:07 AM
imp added inline comments.
sys/ufs/ffs/ffs_softdep.c
13400 ↗(On Diff #51300)

I think this is the spot Kirk is talking about. Would speeding up writes here and also down a few lines where we start to push items in the worklist make sense?

And I still want my knob to not lose any trim :(.

sys/ufs/ffs/ffs_softdep.c
13650 ↗(On Diff #51300)

You are changing the execution model from stopping processes at the kernel->user boundary and waiting for the syncer to catch up, to all processes suddenly trying to do cleanups in their contexts. Did you evaluate the effect of this on a system where a lot of threads modify fs metadata? It sounds like an introduction of the known 'too many workers' issue to me.

imp marked an inline comment as done. Nov 30 2018, 7:53 PM
In D18351#391315, @kib wrote:

And I still want my knob to not lose any trim :(.

That doesn't belong at the FS layer; it belongs in the io sched layer. I'll add a max-discarded-trims sysctl and respin the change.
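
For illustration only, such a knob in the I/O scheduler might look like the following; the variable, sysctl name, parent node, and default are hypothetical, not the committed interface:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/*
 * Hypothetical knob: cap how many queued TRIMs one BIO_SPEEDUP request
 * may discard.  The name and default are made up for illustration.
 */
static int max_trims_discarded = 64;
SYSCTL_INT(_kern, OID_AUTO, iosched_max_trims_discarded, CTLFLAG_RWTUN,
    &max_trims_discarded, 0,
    "Maximum queued TRIMs dropped per BIO_SPEEDUP (example only)");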

sys/ufs/ffs/ffs_softdep.c
13650 ↗(On Diff #51300)

I brought this in from Kirk's patch, so I'll have to defer to him for analysis.

sys/ufs/ffs/ffs_softdep.c
13650 ↗(On Diff #51300)

I don't think that the 'too many workers' issue is a problem here. Each worker will pick a different vnode on which to work. If all the available vnodes are being worked on then they will just wait. The common use case for this condition is 'make installworld'. Many of the files in the typically small root filesystem are being replaced. There are a large set of blocks from the old files waiting to be released. So many in fact that there are not enough free blocks to finish the install. So there is a large set of work items for multiple processes to work on cleaning up.

sys/ufs/ffs/ffs_softdep.c
13400 ↗(On Diff #51300)

Looking at this more closely, the above code should request a speedup for writes. The speedup for TRIMs should appear as the first statement in the next 'if' statement (after the big block comment). This first case is the one that usually solves the problem, so we just want to accelerate writes. If it fails, then we are in deeper trouble and want to pull out all the stops including flushing some TRIMs.
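
A sketch of that ordering, with a hypothetical condition standing in for "the first attempt did not free enough blocks"; it is not the actual softdep_request_cleanup() code:

/*
 * Sketch only: accelerate writes first, and only fall back to flushing
 * queued TRIMs if that was not enough.
 */
static void
cleanup_order_sketch(struct ufsmount *ump, int still_short_of_blocks)
{
	size_t resid;

	/* The common case: speeding up writes frees enough blocks. */
	g_io_speedup(0, BIO_SPEEDUP_WRITE, &resid, ump->um_cp);
	if (still_short_of_blocks) {
		/* Deeper trouble: pull out all the stops, including TRIMs. */
		g_io_speedup(0, BIO_SPEEDUP_TRIM, &resid, ump->um_cp);
	}
}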

Update the pacing base that Kirk sent me to cope better with multiple workers.

This revision now requires review to proceed. Dec 4 2018, 10:24 PM

Has this been resolved / committed?

In D18351#445683, @mckusick wrote:

Has this been resolved / committed?

Not committed. I lost track of this, but have started back up after all the hardening work has landed.

Also note: there's a lot here, but for the commit I'll break it down into smaller pieces.

imp marked 2 inline comments as done. Dec 10 2019, 6:33 PM

Fix typos in comments from review.

A couple of minor nits noted inline.

sys/ufs/ffs/ffs_softdep.c
13379 ↗(On Diff #65461)

Above should be cp = ump->um_cp; (and you likely don't need the cp variable).

13596 ↗(On Diff #65461)

Above should be cp = ump->um_cp

sys/ufs/ffs/ffs_softdep.c
13593 ↗(On Diff #65576)

I'm still not convinced that making so many calls to speedup is the right thing.

IMO it should come from the su work thread when needed. Or, to be explicit, put speedup into check_clear_deps() when there is a request for cleanup.

sys/ufs/ffs/ffs_softdep.c
13593 ↗(On Diff #65576)

So you're suggesting that I push this down into request_cleanup() instead? And that this is too high in the loop, so it will get called too often?

sys/ufs/ffs/ffs_softdep.c
13593 ↗(On Diff #65576)

e.g., something more like

diff --git a/sys/ufs/ffs/ffs_softdep.c b/sys/ufs/ffs/ffs_softdep.c
index 0a7db7fa96c..43da1b7a8d2 100644
--- a/sys/ufs/ffs/ffs_softdep.c
+++ b/sys/ufs/ffs/ffs_softdep.c
@@ -13590,7 +13589,6 @@ softdep_ast_cleanup_proc(struct thread *td)
                        return;
                if (ffs_own_mount(mp) && MOUNTEDSOFTDEP(mp)) {
                        ump = VFSTOUFS(mp);
-                       g_io_speedup(0, BIO_SPEEDUP_TRIM, &resid, ump->um_cp);
                        for (;;) {
                                req = false;
                                ACQUIRE_LOCK(ump);
@@ -13826,6 +13825,7 @@ clear_remove(mp)
                        BO_UNLOCK(bo);
                        vput(vp);
                finish_write:
+                       g_io_speedup(0, BIO_SPEEDUP_TRIM, &resid, ump->um_cp);
                        vn_finished_write(mp);
                        ACQUIRE_LOCK(ump);
                        return;

Is that what you're suggesting?

sys/ufs/ffs/ffs_softdep.c
13593 ↗(On Diff #65576)

It is closer but still not quite.

  1. I do not see a point in calling bio_speedup for each individual workitem.
  2. Besides clear_remove(), clear_inodedeps() should also participate in the cleaning.

So I would call it in check_clear_deps() before calling any of three specific cleanup routines.
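
A sketch of the placement being suggested, assuming the existing check_clear_deps() dispatch stays as-is; this is illustrative, not the committed diff:

/*
 * Sketch only: issue the speedup once in check_clear_deps(), before
 * dispatching to the specific cleanup routines, instead of per work
 * item.
 */
static void
check_clear_deps_sketch(struct mount *mp)
{
	struct ufsmount *ump = VFSTOUFS(mp);
	size_t resid;

	g_io_speedup(0, BIO_SPEEDUP_TRIM, &resid, ump->um_cp);
	/* ... existing clear_remove()/clear_inodedeps() dispatch follows ... */
}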

Update per kib. This also looks like exactly the right place to speed
up writes for when iosched implements that.

sys/ufs/ffs/ffs_softdep.c
13761 ↗(On Diff #65662)

I think this should be done before clear_inodedeps(), which is also called to help a suspended journal. In other words, move this to the beginning of the function.

Update speedup call, per kib. I think I finally understand... :)

This revision is now accepted and ready to land. Dec 15 2019, 7:53 PM
This revision was automatically updated to reflect the committed changes.