
Print more info when super blocks don't match, implement ability to ignore mismatches.
ClosedPublic

Authored by imp on Jul 13 2017, 10:41 PM.
Tags
None

Diff Detail

Repository: rS FreeBSD src repository - subversion
Lint: Not Applicable
Unit Tests: Not Applicable

Event Timeline

imp created this revision.

Updating D11589: Print more info when super blocks don't match, implement ability to ignore mismatches.

imp retitled this revision from "Print more info when super blocks don't match, implement ability to ignore mismatches." to "Print more info when super blocks don't match, implement ability toignore mismatches." on Jul 13 2017, 10:44 PM.
imp edited the summary of this revision. (Show Details)
imp added reviewers: mckusick, kib, scottl.
imp retitled this revision from "Print more info when super blocks don't match, implement ability toignore mismatches." to "Print more info when super blocks don't match, implement ability to ignore mismatches."

Can you remind me of the context where the previous discussion occurred? I do not remember a bit about it, sorry.

Overall this looks technically good, except I do not understand the problem you are trying to solve.

In D11589#240154, @kib wrote:

Can you remind me of the context where the previous discussion occurred? I do not remember a bit about it, sorry.

After commit r276737 you rightly told me that proto was now being used uninitialized. I told you at the time I'd take care of it, but it has taken until now to do so.
(Originally said 'preen' here due to autocowrecked text.)

Overall this looks technically good, except I do not understand the problem you are trying to solve.

We have had an issue, still not completely root-caused, where the last alternate superblock is bad (all zeros, either due to enemy action or a newfs run gone awry). We needed some way to proceed with the fsck anyway to help track down the problem. In looking at that problem, the unused proto became obvious, so that was fixed as well. So the CHK changes are to give more data about how the superblocks don't match, and the prompt is to allow fsck to proceed in the face of this error to find out if it is localized to that one block or more systemic. I have the two logical commits separated out in my git tree that I'll commit from.

In D11589#240165, @imp wrote:

We have had an issue, still not completely root-caused, where the last alternate superblock is bad (all zeros, either due to enemy action or a newfs run gone awry). We needed some way to proceed with the fsck anyway to help track down the problem. In looking at that problem, the unused proto became obvious, so that was fixed as well. So the CHK changes are to give more data about how the superblocks don't match, and the prompt is to allow fsck to proceed in the face of this error to find out if it is localized to that one block or more systemic. I have the two logical commits separated out in my git tree that I'll commit from.

But why did you unconditionally remove reading of the alternative superblocks? Isn't a prompt to agree/disagree with the action enough? Also, for your needs, you might add a flag to disable alternative reads from the command line.

The verbose comparison changes are fine.

In D11589#240652, @kib wrote:
In D11589#240165, @imp wrote:

We have had an issue, still not completely root-caused, where the last alternate superblock is bad (all zeros, either due to enemy action or a newfs run gone awry). We needed some way to proceed with the fsck anyway to help track down the problem. In looking at that problem, the unused proto became obvious, so that was fixed as well. So the CHK changes are to give more data about how the superblocks don't match, and the prompt is to allow fsck to proceed in the face of this error to find out if it is localized to that one block or more systemic. I have the two logical commits separated out in my git tree that I'll commit from.

But why did you unconditionally remove reading of the alternative superblocks? Isn't a prompt to agree/disagree with the action enough? Also, for your needs, you might add a flag to disable alternative reads from the command line.

The data needed to find the alternate superblocks is no longer returned from the kernel since we transitioned to gpart a decade ago. And in most cases it now no longer even exists, with the transition to GPT. So fsck has no way of knowing where the alternate superblocks actually live...

The prompt to proceed is to ignore the mismatch and soldier on.
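
As an aside, here is a minimal standalone sketch of the behavior being described. The change under review touches sbin/fsck_ffs/setup.c; the struct, the field names, and the ask() helper below are simplified stand-ins for struct fs and fsck's reply(), not the committed code.

```
/*
 * Sketch only: compare a few fields of the primary and alternate
 * superblocks, print both values on any mismatch, and offer to
 * ignore the mismatch and continue.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct sb {
	int64_t	fs_size;	/* filesystem size in fragments */
	int32_t	fs_fpg;		/* fragments per cylinder group */
	int32_t	fs_ncg;		/* number of cylinder groups */
};

static bool
ask(const char *question)
{
	int c;

	printf("%s? [y/N] ", question);
	c = getchar();
	return (c == 'y' || c == 'Y');
}

#define CHK(p, a, field)						\
	do {								\
		if ((p)->field != (a)->field) {				\
			printf("%s mismatch: primary %jd, alternate %jd\n", \
			    #field, (intmax_t)(p)->field,		\
			    (intmax_t)(a)->field);			\
			mismatch = true;				\
		}							\
	} while (0)

static bool
check_alt_superblock(const struct sb *primary, const struct sb *alt)
{
	bool mismatch = false;

	CHK(primary, alt, fs_size);
	CHK(primary, alt, fs_fpg);
	CHK(primary, alt, fs_ncg);
	if (!mismatch)
		return (true);
	/* New behavior: allow the check to soldier on despite the error. */
	return (ask("IGNORE MISMATCH OF ALTERNATE SUPERBLOCK"));
}
```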

Being able to search for alternate superblocks can be very helpful. As discussed by Warner and myself in the email thread below, to do so we need to be able to store a single 32-bit number somewhere external to the filesystem. I really hope that someone on this thread has an idea on where it could be placed.

From: Kirk McKusick <mckusick@chez.mckusick.com>
To: Warner Losh <imp@bsdimp.com>
Subject: Re: fsck (again)
Date: Thu, 13 Jul 2017 15:29:41 -0700

From: Warner Losh <imp@bsdimp.com>
Date: Thu, 13 Jul 2017 08:13:34 -0600
Subject: Re: fsck (again)
To: Kirk McKusick <mckusick@mckusick.com>

On Wed, Jul 12, 2017 at 8:23 PM, Kirk McKusick <mckusick@mckusick.com> wrote:

It actually is useful to have fsck able to search for alternate
superblocks. So, the "correct" solution is to find an alternative
to the deprecated ioctls to get the information needed to
populate the proto superblock.

I agree it's useful. However, I'm wondering how we can do that when the
IOCTLs have been removed from the kernel and are impossible to implement,
as far as I can tell, in GPT where the various parameters needed aren't saved
to the GPT label. Geom removed this ability long before I GC'd the code. I
agree it would be useful, and I'm open to any ideas you might have on how
it could be restored.

Barring that, I'll proceed with my cleanup in the area...

Warner

We need just two pieces of information to be able to find the alternate
superblocks. The filesystem size (and the size of the partition is
good enough for that purpose) and the blocks per cylinder group.
So, if there is some place that we can store a single 32-bit integer
in a GPT label (a name field or elsewhere), we could keep this most
useful feature.

~Kirk

-=-

From: Kirk McKusick <mckusick@chez.mckusick.com>
To: Warner Losh <imp@bsdimp.com>
Subject: Re: fsck (again)
Date: Thu, 13 Jul 2017 15:49:20 -0700

From: Warner Losh <imp@bsdimp.com>
Date: Thu, 13 Jul 2017 16:29:37 -0600
Subject: Re: fsck (again)
To: Kirk McKusick <mckusick@mckusick.com>

I'll look into that. If we have the size, is there no way we can
guess that parameter based on what newfs would do by default?

Warner

We can guess based just on the size, but changing almost any parameter
during newfs will make that guess wrong. While having only BPG will
not cover all possible variations, it covers all of the commonly
changed things like block size, fragment size, inodes per group, etc.

Also, if someone created a filesystem with default parameters, then
decided they really wanted something different and recreated it with
different parameters, all of the original superblocks will still be
on the disk, and until every one of them is overwritten we would find
them and think them valid. The result would not be a pretty sight.

~Kirk

-=-

From: Kirk McKusick <mckusick@chez.mckusick.com>
To: Warner Losh <imp@bsdimp.com>
Subject: Re: fsck (again)
Date: Thu, 13 Jul 2017 16:08:06 -0700

From: Warner Losh <imp@bsdimp.com>
Date: Thu, 13 Jul 2017 16:53:13 -0600
Subject: Re: fsck (again)
To: Kirk McKusick <mckusick@mckusick.com>

OK. Fair enough. I'd contemplated doing a brute force scan, but this
is a good reason not to do that.

Brute force was my original implementation and after several spectacular
meltdowns with out-of-date superblocks, I came up with the plan of saving
key parameters in the disk label. Originally we needed the disk geometry
details, but newfs no longer uses any of that, so just a single number
(BPG) will nearly always suffice. And when it does not, we are unlikely
to find bogus superblocks.

BTW, https://reviews.freebsd.org/D11589 has the first round of fixes (which
is cleaning up the now moribund feature). I'll work the gpt angle a bit and
see if it's possible to save that data.

Warner

If you can find a way to store BPG, I'll volunteer to figure out how
to get calcsb to use it.

~Kirk
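
To make the arithmetic above concrete, here is a rough sketch of the calculation Kirk outlines: with only the partition size and the fragments per cylinder group, the start of every cylinder group, and hence every superblock copy, can be derived. The fragment size and the in-group offset of the copy (fs_sblkno) are passed in as assumptions; this is illustrative, and the real code would use the cgsblock()/cgbase() macros from ufs/ffs/fs.h rather than this hand-rolled math.

```
/*
 * Sketch: derive alternate superblock locations from the partition
 * size and the fragments per cylinder group.  fsize and sblkno are
 * supplied by the caller as assumptions (newfs defaults or recovered
 * data); this is illustrative, not the fsck code.
 */
#include <stdint.h>
#include <stdio.h>

static void
list_alt_superblocks(int64_t part_bytes, int32_t fsize, int32_t fpg,
    int32_t sblkno)
{
	int64_t fs_frags = part_bytes / fsize;		/* fs size in frags */
	int64_t ncg = (fs_frags + fpg - 1) / fpg;	/* cylinder groups */

	for (int64_t cg = 0; cg < ncg; cg++) {
		/*
		 * Each group keeps a superblock copy sblkno fragments
		 * past its start.
		 */
		int64_t off = (cg * (int64_t)fpg + sblkno) * fsize;
		printf("cg %jd: superblock copy at byte offset %jd\n",
		    (intmax_t)cg, (intmax_t)off);
	}
}
```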

Being able to search for alternate superblocks can be very helpful. As discussed by Warner and myself in the email thread below, to do so we need to be able to store a single 32-bit number somewhere external to the filesystem. I really hope that someone on this thread has an idea on where it could be placed.

If you can find a way to store BPG, I'll volunteer to figure out how
to get calcsb to use it.

Kirk, thank you very much. This (indirectly) answers the question that I could not answer myself and which blocked my understanding of the patch. Am I right in the following statements:

  1. If newfs was used with default settings, and the partition was not resized, then alternate superblocks can be found by fsck using the same calculation as was done by newfs.
  2. The problematic case is, e.g., resizing of the partition and the filesystem after the initial filesystem creation, since alternate superblocks with stale content would be found.

In the resizing case, wouldn't checking sb.fs_size against the truncated partition size give enough confidence in validity, in addition to the signature check?

In other words, the alternative superblock search must become more 'timid' and grow some additional checks, possibly with an even more verbose explanation to the user, e.g. mentioning real size vs. fs_size. But it should be kept even if BPG cannot be provided by the storage subsystem.

  1. If newfs was used with default settings, and the partition was not resized, then alternate superblocks can be found by fsck using the same calculation as was done by newfs.
  2. The problematic case is, e.g., resizing of the partition and the filesystem after the initial filesystem creation, since alternate superblocks with stale content would be found.

If I newfs the filesystem again with non-default parameters, I'll find one or more of the old superblocks with bogus values in them. Since we don't rewrite the whole volume, but just lazily splat the cg and superblocks in the new places, the old superblocks won't be overwritten.

In the resizing case, wouldn't checking sb.fs_size against the truncated partition size give enough confidence in validity, in addition to the signature check?

For the case you are suggesting, it may be. However, there are more problematic cases than just that one. If we use a problematic superblock, we'll destroy the filesystem.

In other words, the alternative superblock search must become more 'timid' and grow some additional checks, possibly with an even more verbose explanation to the user, e.g. mentioning real size vs. fs_size. But it should be kept even if BPG cannot be provided by the storage subsystem.

Given the risks for fairly common cases, I'm not sure we can guess at BPG to get the location of the superblocks. I can't think of any additional tests that would reliably ensure safety, though the tests you have identified will ensure that the known unsafe blocks are ignored. When we know the last known BPG, we have good confidence that any alternate blocks we find have a very high likelihood of being legit. When we go guessing at it, the problems multiply...

In D11589#242322, @imp wrote:

Given the risks for fairly common cases, I'm not sure we can guess at BPG to get the location of the superblocks. I can't think of any additional tests that would reliably ensure safety, though the tests you have identified will ensure that the known unsafe blocks are ignored. When we know the last known BPG, we have good confidence that any alternate blocks we find have a very high likelihood of being legit. When we go guessing at it, the problems multiply...

I am fine with not guessing BPG and alternate superblock locations in the default mode, but I would very much prefer to have a mode, enabled by an option, where fsck would make a pass with the guessed value. For instance, all my machines have default layouts for UFS volumes, so the default BPG is correct for all my use cases [1].

[1] Assuming that the default block size has not changed since the newfs time.

Date: Sat, 22 Jul 2017 10:25:09 +0000
To: mckusick@freebsd.org
From: "kib (Konstantin Belousov)" <phabric-noreply@FreeBSD.org>
Subject: [Differential] D11589: Print more info when super blocks don't match, implement ability to ignore mismatches.

kib added a comment.

In https://reviews.freebsd.org/D11589#242174, @mckusick wrote:

Being able to search for alternate superblocks can be very
helpful. As discussed by Warner and myself in the email thread
below, to do so we need to be able to store a single 32-bit number
somewhere external to the filesystem. I really hope that someone
on this thread has an idea on where it could be placed.

If you can find a way to store BPG, I'll volunteer to figure
out how to get calcsb to use it.

Kirk, thank you very much. This (indirectly) answers the question
that I could not answer myself and which blocked my understanding of
the patch. Am I right in the following statements:

If newfs was used with default settings, and the partition was not
resized, then alternate superblocks can be found by fsck using the
same calculation as was done by newfs.

If newfs defaults are used, then fsck can correctly calculate where
the alternate superblocks are located. If it was resized then there
are two cases:

  1. The original filesystem was built smaller than the partition. This is not really a case, because the newfs defaults were not used: a smaller filesystem size was specified. Using the newfs defaults to calculate alternate superblocks would look for some superblocks that do not exist.
  2. The partition was made bigger and then the filesystem was grown into it. In this case, using the newfs defaults to calculate alternate superblocks would work, provided that the growth fully utilized the new space. If it did not, then the problem of looking for non-existent superblocks from case (1) would exist.

The problematic case is, e.g., resizing of the partition and the
filesystem after the initial filesystem creation, since alternate
superblocks with stale content would be found.

The alternate superblocks in the original part of the filesystem
would have the old filesystem size stored in them as you note.
This could be trivially fixed if growfs were to update all the
old alternate superblocks (which it probably should do).

In the resizing case, wouldn't checking sb.fs_size against the
truncated partition size give enough confidence in validity, in
addition to the signature check?

I am not sure what you mean by "checking sb.fs_size against the
truncated partition size" since we don't know what the old partition
size used to be. I assume you mean that if we find that sb.fs_size
is less than the current partition size, we should treat it as stale
and assume that the filesystem actually is the size of the partition.
This will work if the filesystem was not created smaller than the
partition when it was newfs'ed. We can likely catch this case, as
there will not be alternate superblocks past sb.fs_size unless they
are left over from a previous filesystem creation.

In other words, the alternative superblock search must become more
'timid' and grow some additional checks, possibly with an even more
verbose explanation to the user, e.g. mentioning real size vs. fs_size.
But it should be kept even if BPG cannot be provided by the storage
subsystem.

This is where it gets tricky. Suppose a filesystem is created using
the default newfs parameters, which will lay out superblocks where we
expect to find them. But now the system administrator decides that
they actually want a filesystem with, say, 64K blocks instead of the
default 32K blocks. So they run newfs again with this change. With
64K blocks cylinder group maps can be bigger, which means cylinder
groups are bigger and the alternate superblocks will be laid out
in completely different places. Other common non-standard changes
to newfs, like the ratio of frags to blocks (32K/8K or 32K/32K, say)
or an increased number of bytes per inode, will also change the size
of the cylinder groups.

The case where they first ran newfs with defaults, then ran it with
changes that affect the size of cylinder groups is common. And fatal
if we find and use the default newfs superblocks to try to recover the
filesystem. The default superblocks will pass all of our tests, yet
they are wrong for the filesystem that we actually have. This is why
we really need to know the number of blocks per cylinder group to have
a viable chance of finding correct alternate superblocks.

I am hoping that we can store that number either somewhere in the disk
label or perhaps at the end of the area reserved at the beginning of the
filesystem for boot blocks.

~Kirk
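
For what it is worth, here is a sketch of the size sanity check being discussed; the thresholds and names are illustrative assumptions, not proposed fsck code.

```
/*
 * Sketch: reject a candidate superblock that claims a size larger
 * than the partition, or one that sits beyond the size it claims
 * (i.e. it is left over from an earlier newfs).  The "wildly
 * smaller" slack is an arbitrary illustrative threshold.
 */
#include <stdbool.h>
#include <stdint.h>

static bool
plausible_size(int64_t cand_fs_size, int32_t cand_fsize,
    int64_t part_bytes, int64_t cand_byte_off)
{
	int64_t part_frags = part_bytes / cand_fsize;

	/* Stale: the candidate lies beyond its own claimed fs size. */
	if (cand_byte_off / cand_fsize >= cand_fs_size)
		return (false);
	/*
	 * The fs may be a bit smaller than the partition (cg
	 * alignment), but not larger, and not wildly smaller.
	 */
	return (cand_fs_size <= part_frags &&
	    cand_fs_size >= part_frags - part_frags / 16);
}
```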

I am not sure what you mean by "checking sb.fs_size against the
truncated partition size" since we don't know what the old partition
size used to be.

I mean that UFS does not use the whole partition due to cg alignment requirements, which causes it to use a truncated partition size. I did not mean the old size, but the new one with the alignment requirement applied at the end of the filesystem.

But it should be kept even if BPG cannot be provided by the storage
subsystem.

This is where it gets tricky.

I do not disagree. But I want an option for the sysadmin to use when they know what they are doing.

I would posit that fsck is the wrong place to replicate newfs' calculations, unless there's some value that's added that I'm not seeing.

I'd suggest adding it to newfs or maybe fsdb: newfs --list-alt-sblock <other parameters> /dev/foo would be preferable, especially if you could have it list only the Nth one so you could feed it on the command line to fsck -b.

In D11589#242393, @imp wrote:

I would posit that fsck is the wrong place to replicate newfs' calculations, unless there's some value that's added that I'm not seeing.

I'd suggest adding it to newfs or maybe fsdb: newfs --list-alt-sblock <other parameters> /dev/foo would be preferable, especially if you could have it list only the Nth one so you could feed it on the command line to fsck -b.

I can tell you a real story that happened to me at my previous job. We had a relatively large RAID 6+1 served by mfi(4). By large I mean that a full copy of the volume could not be stored on fewer than 15 max-sized desktop-grade HDDs of that time, which made non-destructive experiments in case of any problem rather unfeasible. It was formatted as UFS.

One day we were bitten by (most likely) an LSI firmware bug which caused corruption of several hundred blocks at the start of the volume. It was not critical to lose several thousand files stored on the volume, but losing all the content would have terminated the whole project that the storage served. It was quite a joy when a single fsck run recovered the problem in almost automatic mode. We lost cg 0 and all inodes stored in it, but it was acceptable. Of course, the main superblock was damaged, but I did not even notice this.

I am not sure that I would have been capable of correctly using the non-destructive newfs to get the list of alternate superblocks in that stressful situation, while fsck just worked. I want a switch that allows that, if the default behavior is changed.

Every UFS filesystem has a chunk of space reserved at the front for a boot block. Historically 8K, later 64K, now 256K. This size is known, so if we put a structure at the end of it with the BPG size, then fsck can find it and with great reliability find alternative superblocks. If the boot block is *entirely* filled, then we will lose that info, but it almost never is completely filled. As kib points out, administrators are running fsck when they are trying to fix their filesystems. It is far easier if fsck just deals with the problem rather than sending them off on some wild goose chase with other utilities to find alternate superblocks. Especially because this is a rare situation, but nearly always critical when it happens. We should not be lazy here. We should fix this issue.
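
A hypothetical sketch of what such a record could look like; the struct name, fields, magic, and offset are illustrative assumptions, not a format that exists on disk or the one that was eventually committed.

```
/*
 * Hypothetical recovery record stored at the tail of the reserved
 * boot area so fsck can recompute alternate superblock locations.
 * Everything here (name, fields, magic, offset) is illustrative.
 */
#include <stddef.h>
#include <stdint.h>

#define BOOT_RESERVED	(64 * 1024)	/* reserved boot space assumed here */
#define SBREC_MAGIC	0x46535243U	/* example magic ("FSRC") */

struct sb_recovery {
	uint32_t rec_magic;	/* identifies a valid record */
	uint32_t rec_bsize;	/* block size, for sanity checking */
	uint32_t rec_fpg;	/* frags per cylinder group (the key value) */
	uint32_t rec_sblkno;	/* frag offset of the sb copy within a cg */
};

/* Written by newfs, preserved by boot-block installers, read by fsck: */
#define SBREC_OFFSET	(BOOT_RESERVED - sizeof(struct sb_recovery))
```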

newfs already has a -N flag which will tell you the alternate blocks, so we're not totally screwed at the moment.

I agree the problem should be solved, but not in the context of this code review. I plan to move forward on this.

As for boot blocks, that's an interesting idea. Let's see what we can do there.

  1. we only use /boot/boot on BSD labeled 'a' partitions
  2. /boot/boot has, on x86, jmp main at the start, and we can look for the EB byte as a magic number
  3. No other platforms have boot2 (though sparc64 has boot1, which lives in the sunlabel and starts with a call to ofw_init, also code we can recognize, if we care about sparc64). powerpc does have boot1.chrp, but it's plopped into its own partition and loaded. There's some weirdness in arm land as well, but we don't start partitions at the start of the drive, to avoid issues like this.
  4. GPT partitions never touch this area
  5. /boot/boot on x86 has 108 bytes free today, but compiler technology makes that a constant battle

So there might be room here for a solution. However, any solution that puts data into the boot blocks (which we might be able to do) would need to know how to preserve it when new boot blocks are installed. And we'd have to deal with the common situation where boot blocks are installed after newfs, with newfs having no clue as to their length or whether they are present at all. It would be a bit of a pita to work through all the places we do this. This is a bigger undertaking than it sounds initially, I think...

So, unless there's an objection to cleaning up what already is functionality that's lost, I'd like to move forward on this.

A thought in the shower this morning...

To fix this in most cases, and not reinvent the wheel too much, maybe it makes sense to say the ufs2 boot block limit is, say, 62k, and just make 4 extra copies before the current primary one. For older systems, those copies wouldn't be there and would likely be zeros. This would give us 5 copies of the super block at fixed locations that we could read. It suffers a little from issues if multiple sectors in a row near the start of the disk fail, but I would wager that isn't a common failure mode anymore, and if we see it the odds of recovery are super small anyway.

This would help the largest number of systems, could easily be 'fixed' in fsck to do the searching and the extra writing (which could be done on a live system if need be, though a live-system update would need to be an ioctl, I think), wouldn't run into the boot block issues, and would solve the problem on the vast majority of systems (leaving out only old, UFS1, BSD-labeled disks).
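
A sketch of how probing those fixed copies could look. SBLOCK_UFS2, SBLOCKSIZE, and FS_UFS2_MAGIC are the real on-disk constants; the extra-copy offsets below the primary are the assumption from the proposal above, not anything that exists on disk today.

```
/*
 * Sketch: probe the primary UFS2 superblock location and the
 * hypothetical fixed extra copies just below it, returning the
 * offset of the first candidate whose magic checks out.  1372 is
 * the byte offset of fs_magic within struct fs.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SBLOCK_UFS2	65536		/* primary UFS2 superblock offset */
#define SBLOCKSIZE	8192		/* size of a superblock */
#define FS_UFS2_MAGIC	0x19540119	/* UFS2 on-disk magic */

static off_t
find_superblock_copy(int fd)
{
	/* primary first, then the hypothetical extra copies below it */
	static const off_t probes[] = {
		SBLOCK_UFS2,
		SBLOCK_UFS2 - 1 * SBLOCKSIZE,
		SBLOCK_UFS2 - 2 * SBLOCKSIZE,
		SBLOCK_UFS2 - 3 * SBLOCKSIZE,
		SBLOCK_UFS2 - 4 * SBLOCKSIZE,
	};
	unsigned char buf[SBLOCKSIZE];
	uint32_t magic;

	for (size_t i = 0; i < sizeof(probes) / sizeof(probes[0]); i++) {
		if (pread(fd, buf, sizeof(buf), probes[i]) !=
		    (ssize_t)sizeof(buf))
			continue;
		memcpy(&magic, buf + 1372, sizeof(magic));
		if (magic == FS_UFS2_MAGIC)
			return (probes[i]);
	}
	return (-1);
}
```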

This revision was automatically updated to reflect the committed changes.
emaste added inline comments.
head/sbin/fsck_ffs/setup.c:61

This needs to be removed as well; I'll take care of it in a moment.

From: Warner Losh <imp@bsdimp.com>
Date: Mon, 7 Aug 2017 23:56:36 -0600
Subject: Re: [Differential] D11589: Print more info when super blocks don't match, implement ability to ignore mismatches.
To: Kirk McKusick <mckusick@mckusick.com>
Cc: Warner Losh <imp@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com>, Scott Long <scottl@netflix.com>

On Mon, Aug 7, 2017 at 9:05 PM, Kirk McKusick <mckusick@mckusick.com> wrote:

I think you are making this more complicated than it needs to be.
Attached is a proposed diff that has newfs store the needed information
at the end of the boot block area and it is used by fsck to find
alternate superblocks. I simplified it so it only works with UFS2.

If the recovery data is lost, it falls back to the current behavior
(advising you to use -b to specify alternate superblocks).

I have also added some code to allow a system administrator to add
the recovery information to an existing filesystem when running fsck
in manual (non-preen) mode. If the filesystem does not have the
parameters saved away, they are given the opportunity to have them
written.

Do either of you see any reason why I should not add these features?

tl;dr: I'm cool with this, but consider the following....

This is encroaching into the ufs2 boot block area. I thought it would be a
big deal. Or at least worthy of a minor footnote. One could, in theory,
create boot blocks that are up to 64k. But you'd have to try really hard,
because the standard tools (gpart and friends) enforce an 8k limit on how
big you can write (making basically a pessimal assumption of UFS1). Anybody
with these crazy boot blocks is DDing them today onto disks that are small
enough to fit BSD labels (eg, non GPT which ignores this area entirely). So
I doubt there's anybody but really crazy folks that have > 8k boot blocks.
So this would be totally safe to do for the overwhelming majority of
our user base (and those that contrived to do this today would know how to
cope). So as a technique, you're right, I was making it too hard....

I had a thought in the shower this morning: wouldn't it be simpler to
just write two extra super blocks in the same slot? Then they could be
read directly if the primary one failed, since they'd also be at known
offsets. Seems like a more direct way to get two redundant copies of
the superblock... You couldn't add them to the search list directly (since
one of the validity checks is an offset check), but they could be the first
SBs searched instead of the superblock in the last cylinder group. And
since we have 56k, we could store 28 copies of the superblock w/o anybody
noticing.... fsck or tunefs could easily be hacked to write these extra
copies (whatever the number) if it was safe to do so (eg, if the sblock
offset was really 64k) depending on how automatic you'd want to make it
(all the time, only when asked via a flag, etc). This way, you wouldn't
need to calculate the location because they are in a known location that's
invariant. Since you're writing new stuff anyway, why not shoot the moon,
eh? Nobody's using the space...

But in reality, I'm easy either way, though... Both produce similar
outcomes and both have the same issues with clustered failure. I think
extra copies are cleaner (the 28-copy one being more resilient), but
you have code in hand for the other way.
Warner
{F1552362}