
vfsmountroot: Add a limit to waiting for root holds to be released
Accepted · Public

Authored by imp on Jul 2 2025, 3:44 AM.

Details

Reviewers
olce
Summary

For UFS filesystems, we don't wait for root holds at all
anymore. Instead, if the device is available, we proceed to mount
root. We wait up to 3 seconds for the device to become available, by
default.

For ZFS, we always wait for the root holds to clear. On systems that
have lots of hard disks, this can take several minutes in some
pathological situations (greater than 15 minutes has been
observed).

Since the disks that take this long to be discovered or tasted are not
typically root, and in fact are typically garbage that we don't want to
use, there's no harm in not waiting for them. Introduce a new
hold_timeout variable in both the mount.conf(5) scripts and globally as
vfs.mountroot.hold_timeout. It defaults to 10 seconds. If these values
are set to 0, no waiting is done. If set to -1, it will wait forever
(which was FreeBSD 14 and earlier behavior, to varying degrees).
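The timeout semantics described above (0 = don't wait, -1 = wait forever, otherwise a bounded wait) can be sketched as a small model. This is Python purely for illustration, not the kernel code; the function name and `holds_released` callback are hypothetical stand-ins for the kernel's actual hold accounting:

```python
import time

def wait_for_root_holds(holds_released, hold_timeout, poll=0.05):
    """Model the hold_timeout semantics: 0 means don't wait at all,
    -1 (any negative value) means wait forever, and a positive value
    bounds the wait to that many seconds."""
    if hold_timeout == 0:
        return holds_released()          # no waiting at all
    deadline = (None if hold_timeout < 0
                else time.monotonic() + hold_timeout)
    while not holds_released():
        if deadline is not None and time.monotonic() >= deadline:
            return False                 # gave up; proceed to mount root anyway
        time.sleep(poll)
    return True
```

With a negative `hold_timeout` the loop has no deadline, matching the FreeBSD 14-and-earlier "wait forever" behavior the summary mentions.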

We already wait for the root holds to be released, with a 30 second
timeout, in /etc/rc.d in a few spots, so this should dovetail with that
feature nicely.
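As a rough illustration of how the knob might be set, assuming the spellings given in the summary (`vfs.mountroot.hold_timeout`) and in the inline review comments (the `.hold_timeout` mount.conf(5) directive); the surrounding syntax here is illustrative, not taken from the patch:

```
# /boot/loader.conf: global tunable (default 10 seconds;
# 0 = don't wait for root holds, -1 = wait forever)
vfs.mountroot.hold_timeout="10"

# /.mount.conf: per-configuration directive, per the review's
# inline comments on the new .hold_timeout entry
.hold_timeout 10
```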

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

imp requested review of this revision. Jul 2 2025, 3:44 AM
share/man/man5/mount.conf.5:244

Nit: Doc typo; should read "is non-zero."

There's a small typo. I would add a bit more explanation in the man page, along the lines suggested in the inline comments, and perhaps factor out the code of parse_dir_timeout()/parse_dir_hold_timeout().

share/man/man5/mount.conf.5:139–140

Not directly related to the change here, but I would add a reference to the explanation of root holds under the new .hold_timeout below.

146–150

I would explain here what "root holds" are (which, BTW, are more "filesystem holds" than "root" ones), as I have not found notes on that topic elsewhere (maybe I missed something), along with the specific cases of ZFS, NFS, and P9FS.

243

Typo.

This revision is now accepted and ready to land. Jul 2 2025, 5:48 PM

Root holds are an old notion by which a driver says "I've probed all my devices now"; once all devices that could contain a root partition have been probed, we can proceed to mounting /.
This was updated a few years ago: UFS ignores root holds unless it can't find the device we're mounting quickly enough. It's more of a "hold for all disks to arrive," ignoring the usually short (but sometimes not) delay that tasting introduces between the last disk arriving and when we can actually mount it. Sometimes that's caught by g_wait_idle, sometimes not. But in this case, we wait for root holds (or 30s) before importing ZFS and in a few other places. In practice this is generally good.

Where things go off the rails, though, is when you have a system with 40 drives and one of them is failing in certain ways. We've tried to filter most of the failures out, but some remain that are known to cause minutes' worth of delay before the initial scan of devices completes. Some SIM drivers have no notion of "I started the scan, but it's taking too long, move on," and some of the firmware they talk to makes this hard. It's rare enough that these SIMs haven't been hardened against it (mps is the one I've fought most recently).

But for ZFS, P9FS, etc., we have no fixed disk that we can look up, so we wait. This may be silly in the NFS case, and maybe the P9FS case too. For ZFS you have to wait for something... and it's unclear what to wait for, since you usually don't know the name of the root pool, so it can't just be the first pool. ZFS also makes strongish assumptions about importing the disks at boot, which the marginal disks violate.

We do 'failure in place' and have about a dozen of these disks in our fleet of several thousand. But tolerating this condition means we save the RMA and equipment replacement costs which can range to 4 or 5 figures. It doesn't take too many 'saves' to be worth the hassle.

Also, it turns out this is bogus, or our handling of the breaking assumption is bogus. If we just proceed after so many seconds, CAM might not have finished the initial scan, which means none of the periphs have had a chance to probe, so the devices we need for the zpool aren't there. For UFS, we fall back to waiting, so it's not a problem there; for ZFS we just fail. So better logic would help here as well. But the need to switch back and forth between the two makes this tricky, since some of the "quick" timeouts later in the boot assume we're basically done and not arbitrarily delayed.

So with that background, now you know why I likely will abandon this change... It's insufficient.