Details

Reviewers

mckusick
markj

Commits

rG50acaaef54b4: ffs_softdep: force sync if journal is low in journal_check_space
rG2126f103e043: ffs_softdep.c: add journal_check_space() helper
rG64b494a1050a: softdep_prelink(): only do sync if other thread changed the vnode metadata…
rGf7565466622a: ufs_rename(): only do softdep_prerename() when other thread changed a vnode
rGd4d289cd5107: ffs: mark block (re-)allocations as seqc writes
rG5eacde3eb83d: ufs_rename(): softdep_prerename() does something only for SU+J
rGd0929a990c80: ffs: reduce number of dvp relocks in softdep_prelink()
rGb2b40b28b178: ufs_vnops.c: style
rG190110f2eba1: unionfs: do not use bare struct componentname
rG802cf4ab0ee9: namei: add NDPREINIT() macro
rG71faea93070e: seqc: add symbolic name for in-modify bit

Summary

We call into softdep_prerename() and softdep_prelink() when there is low free space in the journal. Functions sync all vnodes participating in the VOP, in the hope that this would reduce journal utilization. But if the vnodes are already synced, doing sync would only spend writes, journal is filled not due to the records from modifications of our vnodes.

Remember original seqc numbers for vnodes, and only initiate syncs when seqc changed.

Tested by: pho

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kib requested review of this revision.Apr 29 2021, 10:43 PM

kib created this revision.

Such treatment probably should be done for all callers of softdep_prelink(), but there are much more of them and loops are not local. So for first try this should be good enough.

kib updated this revision to Diff 88479.May 1 2021, 9:57 PM

kib retitled this revision from ufs_rename: avoid unneeded calls to softdep_prerename() to ufs: avoid unneeded calls to softdep_prerename() and softdep_prelink().

kib edited the summary of this revision. (Show Details)

This change does accurately determine when a vnode does not need to be sync'ed. Does it help solve the stalling problem?

In D30041#677509, @mckusick wrote:

This change does accurately determine when a vnode does not need to be sync'ed. Does it help solve the stalling problem?

I did not received any feedback from the reporter.

In fact, I am not sure that the change is completely correct for vp argument (leaf vnodes) as opposed to dvp. I need to do some more code reading for this, which is why I did not asked for your opinion about this approach yet.

I have been working with folks on these two bug reports which I suspect are caused by this problem:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255473
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255799

The problem may be solved by simply having a bigger journal.

The size of the journal should be based on the number of operations that are expected to be done on it in a one-minute period. Unfortunately, tunefs has no direct way to compute this operation count since it knows neither how fast the disk can do transactions nor the capabilities of the system on which it will run. Tunefs is reduced to using the size of the filesystem as an indicator of how big to make the journal.

The journal size ranges from 4Mb (SUJ_MIN) to 32Mb (SUJ_MAX) where the maximum is hit for filesystems of 4Gb and above. The maximum journal size was set more than a decade ago and was based on the number of operations expected to be done in a one-minute period on systems of that era. By my estimates it needs to be about 10x larger to meet that criterion today.

The easiest way to test this theory would be to bump up SUJ_MAX to 350Mb in sys/ufs/ffs/fs.h and then recompile tunefs. Using the recompiled tunefs, disable journaled soft updates, remove the old .sujournal, then reenable journaled soft updates to get a bigger journal.

The problem is that if we either still under-estimate the journal size, or journal write-out is stalled for any reason (not untypical with e.g. USB umass kind of hardware), then machine still gets into the dead loop. Journal is not flushed, we start modifying some directory that even does not have any related journaled entries, and then we create more writes from useless attempts to flush this directory. In the worst case writer cannot cope with ERELOOKUP loop creating more useless writes, which causes livelock with constant stream of writes.

So I do believe that this change, which avoids trying to fsync inodes that in principle cannot add to the journal state, is both improving the eternal writes problem, and simply optimizes UFS. Note that on first entry to ufs_makeentry() or softdep_prelink/softdep_prerename, we do write the directory inode if journal is full. But then, we avoid further entries or writes if we can prove that the directory did not changed since the last flush.

If you can make some of the reporter to try this patch and report back, I believe it would be quite useful. I do not think that Peter can reproduce the issue.

Expanding the journal is just to reduce the incidence of running into this problem. It is still necessary to fix the problem and this looks to be a viable approach.

I have requested the reporter of

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255473

to try running with your code in this review to see if it helps the performance of their system.

bz added a subscriber: bz.May 20 2021, 10:27 AM

FYI I am running the current version now as well. I'll see if my occasional 100% Syncer for a few minutes go away (or get better).
Do you want to track this here or should I follow-up on the PR?

In D30041#682108, @bz wrote:

FYI I am running the current version now as well. I'll see if my occasional 100% Syncer for a few minutes go away (or get better).
Do you want to track this here or should I follow-up on the PR?

No need to add to the PR.

What would be helpful would be helpful is to try running with this patch and see if it mitigates your 100% Syncer pauses. Separately, it would also be helpful to disable journaling as described in comment #10 in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255473 to see if that mitigates the problem.

To make it completely clear, this patch could change something only for +J mounts. On the other hand, I am almost sure that all reports of 'eternal writes' come only about +J.

ffs: mark block (re-)allocations as seqc writes

In D30041#682273, @mckusick wrote:

In D30041#682108, @bz wrote:

FYI I am running the current version now as well. I'll see if my occasional 100% Syncer for a few minutes go away (or get better).
Do you want to track this here or should I follow-up on the PR?

No need to add to the PR.

What would be helpful would be helpful is to try running with this patch and see if it mitigates your 100% Syncer pauses. Separately, it would also be helpful to disable journaling as described in comment #10 in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255473 to see if that mitigates the problem.

A simple rm -rf on an extracted src tree still gets me to
38 root 16 - 0B 16K syncer 4 6:19 96.78% syncer
It doesn't seem to hang as easily though. I'll try to see if I can run a few of this in parallel and how things will behave. I'll also try to keep a file or two open from within them before starting the delete. For me it's hard to say where the real "hangs" came from so (unless you have one) finding a highly reproducible test case would also be good.

Patch as it was tested by pho. Peter was able to reliably reproduce some situation with infinite writes and confirmed that his case was fixed by the patch.

In D30041#688536, @kib wrote:

Patch as it was tested by pho. Peter was able to reliably reproduce some situation with infinite writes and confirmed that his case was fixed by the patch.

That's good news. I found this while waiting for some obj directories to clear before updating machines; I'll apply the new one.

In D30041#688587, @bz wrote:

In D30041#688536, @kib wrote:

Patch as it was tested by pho. Peter was able to reliably reproduce some situation with infinite writes and confirmed that his case was fixed by the patch.

That's good news. I found this while waiting for some obj directories to clear before updating machines; I'll apply the new one.

I can say that the latest version does improve the situation; syncer doesn't go up to 100% tops anymore and rm is a lot faster on suj;

I'll go and disable suj next and see how much more of a difference that makes and then might also play with the journal size (but the latter probably is a different problem).
@mckusick so disabling jounraling (still with the patch applied) makes syncer completely disappear from the radar and rm is faaaaast (as one was used to from UFS in the old days); it's like seconds vs. 10 minutes. So the next question is: should I try again with a journal size 10 times we have currently?
(just to say; before the patch with suj it did less than 3 dirs in 1000+ seconds; now with patch and without journal rm did 8 dirs while I was typing this comment).

In D30041#688788, @bz wrote:

In D30041#688587, @bz wrote:

In D30041#688536, @kib wrote:

Patch as it was tested by pho. Peter was able to reliably reproduce some situation with infinite writes and confirmed that his case was fixed by the patch.

That's good news. I found this while waiting for some obj directories to clear before updating machines; I'll apply the new one.

I can say that the latest version does improve the situation; syncer doesn't go up to 100% tops anymore and rm is a lot faster on suj;

I'll go and disable suj next and see how much more of a difference that makes and then might also play with the journal size (but the latter probably is a different problem).
@mckusick so disabling jounraling (still with the patch applied) makes syncer completely disappear from the radar and rm is faaaaast (as one was used to from UFS in the old days); it's like seconds vs. 10 minutes. So the next question is: should I try again with a journal size 10 times we have currently?
(just to say; before the patch with suj it did less than 3 dirs in 1000+ seconds; now with patch and without journal rm did 8 dirs while I was typing this comment).

Ok, I kept the patch and I bumped the journal size to 512M and that's also still very bearable.
Deleting 6 "work" directories in under 1 minute is probably as was with patch but w/o journal.
Someone should run a real benchmark with multiple runs comparing the 2-3 cases on "non-production" hardware ;-)

In D30041#682108, @bz wrote:

FYI I am running the current version now as well. I'll see if my occasional 100% Syncer for a few minutes go away (or get better).
Do you want to track this here or should I follow-up on the PR?

No need to add to the PR.

In D30041#688830, @bz wrote:

In D30041#688788, @bz wrote:

In D30041#688587, @bz wrote:

In D30041#688536, @kib wrote:

Patch as it was tested by pho. Peter was able to reliably reproduce some situation with infinite writes and confirmed that his case was fixed by the patch.

That's good news. I found this while waiting for some obj directories to clear before updating machines; I'll apply the new one.

I can say that the latest version does improve the situation; syncer doesn't go up to 100% tops anymore and rm is a lot faster on suj;

I'll go and disable suj next and see how much more of a difference that makes and then might also play with the journal size (but the latter probably is a different problem).
@mckusick so disabling jounraling (still with the patch applied) makes syncer completely disappear from the radar and rm is faaaaast (as one was used to from UFS in the old days); it's like seconds vs. 10 minutes. So the next question is: should I try again with a journal size 10 times we have currently?
(just to say; before the patch with suj it did less than 3 dirs in 1000+ seconds; now with patch and without journal rm did 8 dirs while I was typing this comment).

Ok, I kept the patch and I bumped the journal size to 512M and that's also still very bearable.
Deleting 6 "work" directories in under 1 minute is probably as was with patch but w/o journal.
Someone should run a real benchmark with multiple runs comparing the 2-3 cases on "non-production" hardware ;-)

Thanks very much for doing this testing. First, it is helpful confirming that the problem is triggered by the extra work required to handle the journaling. Journaling will always run a few percent slower because you are doing extra work to (dramatically) reduce the running time of fsck after a crash. But we obviously need to bound its extra cost.

Second, thanks for testing out the bigger journal. With big enough bursts of metadata (create, remove, rename, etc) we will always be able to exhaust eithrer the journal space or the amount of memory that we are willing to commit to dependency information. But it is good to know that expanding the journal size did mitigate the problem. What it tells me is that what was considered an excessive burst size 20 years ago is a common burst size today. I will raise the maximum journal size to be more appropriate.

emaste added a subscriber: emaste.Jun 8 2021, 2:35 PM

This looks good and based on testing appears to mitigate effectively. I also recommend changing SUJ_MAX in sys/ufs/ffs/fs.h to 512Mb or at least 256Mb.

This revision is now accepted and ready to land.Jun 10 2021, 4:46 AM

People experiencing this problem report that running the sync(8) command cleared up the slowdown. The reason that it helps is because the problem is triggered when the journal is low on space to make new entries. The journal is written as a circular queue, so running low on space means that the head of the list is nearing the tail of the list. The only way to make more space in the journal is to flush the oldest entries in it so that the tail can advance. What we do in softdep_prelink() is to flush out the entry that we have at hand. But it is almost never causes the oldest entry in the journal to be flushed. So the problem persists and we are just going to keep hitting it until some event causes the oldest entry to be flushed. The reason that sync(8) is so helpful is that it causes the older entries in the journal to be cleared which makes the problem go away.

My conclusion is that when we find that journal space is low we should do both our current action and also a sync. This will penalize the process that happens to find the problem, but will relieve the pressure on the filesystem going forward.

markj added inline comments.Jun 14 2021, 4:08 PM

sys/sys/namei.h
235 ↗	(On Diff #90444)	Shouldn't the last parameter be just `offsetof(struct nameidata, ni_dvp_seqc)`? Otherwise we are just filling the first 8 bytes.
sys/ufs/ffs/ffs_softdep.c
3410 ↗	(On Diff #90444)	unionfs does not pass a full nameidata to VOP_MKDIR in unionfs_mkshadowdir(). There is a similar problem with unionfs_mkwhiteout() and unionfs_vn_create_on_upper().
3414 ↗	(On Diff #90444)	I think it is worth adding a comment explaining the problem solved by this check.

Fix NDINIT_PREFILL.
Use nameidata in unionfs.
Add comment explaining the seqc check in softdep_prelink().

This revision now requires review to proceed.Jun 14 2021, 7:10 PM

Good catch on Mark's part.

Inline suggestion to do sync to try to push the oldest part of the journal.

sys/ufs/ffs/ffs_softdep.c
3464 ↗	(On Diff #90889)	I suggest that you add FREE_LOCK(ump); VFS_SYNC(mp, MNT_NOWAIT); ffs_sbupdate(ump, MNT_WAIT, 0); ACQUIRE_LOCK(ump); As I suggested in my comment this will likely clear the oldest entries in the journal which will avoid causing us to suspend the filesystem. Note that you you will need to save mp above.