Before this change, the first probed member of a pool initialized the vdev tree for that pool. Now imagine a machine with a disk that was removed from a pool but whose ZFS label was never erased. That is a typical scenario: the disk goes offline, it is replaced with a spare, and no further data is written to the departed disk. Later the disk reappears at boot time and happens to be the first one probed by the loader. It carries the same pool GUID as all the other members, so a naive loader sees no conflict, and the disk ends up being used as the source of truth for reading the bootenv.

To fix that, provide vdev_free(), which rolls back an already built vdev tree so that a new one can be constructed from scratch. Upon encountering a newer configuration for an already known top-level part of a pool, call vdev_free() and let vdev_probe() build a new tree.

The change has been tested with loader_lua and userboot.so, but it should have the same effect on the legacy boot1 loader.
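For readers following along, here is a minimal, purely illustrative sketch of the kind of recursive teardown the summary describes. The trimmed-down structure and the cleanup steps are assumptions for the sake of the example; the real vdev_t in stand/libsa/zfs/zfsimpl.c carries many more fields, and the committed vdev_free() may differ in detail.

```c
#include <sys/queue.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for the loader's vdev_t; the real one has many more fields. */
typedef struct vdev {
	uint64_t		v_guid;		/* unique within the parent's list */
	char			*v_name;
	STAILQ_ENTRY(vdev)	v_childlink;	/* linkage in parent's v_children */
	STAILQ_HEAD(, vdev)	v_children;	/* this vdev's own children */
} vdev_t;

/*
 * Recursively tear down an already built vdev (sub)tree so that a later
 * vdev_probe() pass can rebuild it from a newer label.
 */
static void
vdev_free(vdev_t *vd)
{
	vdev_t *kid, *tmp;

	STAILQ_FOREACH_SAFE(kid, &vd->v_children, v_childlink, tmp)
		vdev_free(kid);
	free(vd->v_name);
	free(vd);
}
```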
Event Timeline
I wonder what happens if the system crashes after updating the label on only one disk out of several. In that case vdev_label_read_config() will return that vdev's config for that TXG, but no other vdev has a config for that TXG yet. So if we have already probed some other vdevs, the added vdev_free(spa->spa_root_vdev) will wipe the root vdev and all of the children configs for the pool and restart from scratch, populating only the vdev the currently probed disk belongs to. The already probed vdevs, I guess, will not be re-probed, leaving the pool-wide configuration incomplete. It does not seem to be implemented here, judging by the UINT64_MAX always passed to vdev_label_read_config(), but IIRC the normal pool import code tries to find the maximum common TXG for which it has a sufficient quorum. That might not be the maximum TXG but the previous one, or theoretically even the one before that, if something went very wrong.
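Just to illustrate the "maximum common TXG with sufficient quorum" idea in code (a toy model, not the kernel's import logic, which lives in the code paths mentioned later in this thread): given the label TXGs observed on the probed members, prefer the highest TXG that a majority of members have reached, rather than the single highest value.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy illustration: return the highest TXG that at least a simple majority
 * of the probed members have reached.  The real import code uses more
 * elaborate per-vdev quorum rules; this only demonstrates why the single
 * highest label TXG is not automatically trustworthy.
 */
static uint64_t
best_common_txg(const uint64_t *label_txgs, size_t nmembers)
{
	uint64_t best = 0;

	for (size_t i = 0; i < nmembers; i++) {
		size_t reached = 0;

		for (size_t j = 0; j < nmembers; j++) {
			if (label_txgs[j] >= label_txgs[i])
				reached++;
		}
		if (reached * 2 > nmembers && label_txgs[i] > best)
			best = label_txgs[i];
	}
	return (best);
}
```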
You are right, the loader doesn't collect all txgs to draw a conclusion later, like the kernel does. So a situation where the loader selects an incorrect vdev (now the one with the largest, but invalid, txg) is still possible. By the way, it is also questionable whether that is really invalid for the purposes of a loader. This was possible before my change, too. Anyway, the change is an improvement: the situation of a removed disk that later reappears is orders of magnitude more probable than a crash during a label update.
> The situation of a removed disk that later reappears is orders of magnitude more probable than a crash during a label update.
The labels are updated on each TXG commit, which may happen many times per second. So races there are quite probable, and unlike a resurrected old disk they are completely unpredictable and also harder to recover from by just removing that disk.
Just for my education, on what events is the txg of a pool updated? I see that with normal filesystem I/O it is not updated. Is it only on pool property changes?
Anyway, my point is that:
- a vdev with a high txg that has not been committed to the other vdevs could have been chosen before my patch, too.
- I'm not sure that selecting a not fully committed txg is a mistake for a loader at all. E.g. I have an unstable, panicking boot environment that I want to switch away from ASAP. As I run bectl, the system panics after updating only one vdev. I would actually prefer the loader to pick up the new setting from that vdev. So for the read-only purposes of a loader, that could be the right choice.
TXGs are committed once some minimal amount of dirty data has accumulated, but no less often than every 5 seconds if there were any changes. Under a constant data stream TXGs are committed as fast as the underlying storage can handle metadata updates and cache flushes, which may vary from a fraction of a second to multiple seconds. Many zpool and zfs commands force a commit; zpool sync does exactly that, for example, if you need up-to-date stats, etc.
> Anyway, my point is that:
> - a vdev with a high txg that has not been committed to the other vdevs could have been chosen before my patch, too.
> - I'm not sure that selecting a not fully committed txg is a mistake for a loader at all. E.g. I have an unstable, panicking boot environment that I want to switch away from ASAP. As I run bectl, the system panics after updating only one vdev. I would actually prefer the loader to pick up the new setting from that vdev. So for the read-only purposes of a loader, that could be the right choice.
The problem may be not so much that you take the latest TXG, but that you wipe all of the previous configuration. The configuration in a vdev label, due to its limited size, covers only that specific top-level vdev. So if you are booting from a pool with more than one top-level vdev, after vdev_free(spa->spa_root_vdev) you may end up in a situation where you have no idea about half of your pool. I wonder whether it could work if you replaced only that specific top-level vdev, not the root vdev. But then you would need to track the TXG for each top-level vdev, not only for the whole pool.
I tried zpool sync, and after that zdb shows the same pool txg that has been there for the last hours.
> The problem may be not so much that you take the latest TXG, but that you wipe all of the previous configuration. The configuration in a vdev label, due to its limited size, covers only that specific top-level vdev. So if you are booting from a pool with more than one top-level vdev, after vdev_free(spa->spa_root_vdev) you may end up in a situation where you have no idea about half of your pool. I wonder whether it could work if you replaced only that specific top-level vdev, not the root vdev. But then you would need to track the TXG for each top-level vdev, not only for the whole pool.
I see. That will happen if we have a pool with multiple top-level vdevs, where one of the top-level vdevs has no redundancy left and its very last member has a lower txg than the members of the other top-level vdevs. What would the kernel do in that case?
Good observation. I see the labels are not updated every time, but only when needed; in other cases only the individual uberblocks are written. So the TXG numbers in the labels match the pool TXG at the time of the last configuration change. In case some disk failed and was replaced, the configuration on the other disks is updated, and so are the TXG numbers there. This way, if the old disk resurrects at some point, its configuration for that top-level vdev is no longer relevant as long as something newer is available.
> I see. That will happen if we have a pool with multiple top-level vdevs, where one of the top-level vdevs has no redundancy left and its very last member has a lower txg than the members of the other top-level vdevs. What would the kernel do in that case?
As I see it, the real ZFS takes the best available configuration for each top-level vdev, unless it is explicitly told to take an older one as part of a rollback attempt. An older TXG in a label does not by itself mean the configuration is useless if we have nothing better; it can be that the only difference was some insignificant device path change or something alike. See get_configs() in lib/libzutil/zutil_import.c and process_vdev_config() in module/os/freebsd/zfs/vdev_geom.c.
But as I understand it, for your code it might be even worse than some case of lost redundancy. Assume we have a striped pool of two top-level vdevs A and B. We first read A and see a TXG of 25; then we read B and see a TXG of 26. Even though those two vdevs are top-level and independent, your vdev_free(spa->spa_root_vdev) will wipe out A's configuration, even though it should not. It should just replace the configuration for that specific top-level vdev with the same ZPOOL_CONFIG_ID, if a newer version is now available.
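For concreteness, a sketch of the per-top-level replacement being suggested here, reusing the toy vdev_t and vdev_free() from the sketch near the top of this review. The helper name and the exact surgery are assumptions, not the eventual diff; the point is only that the stale subtree under the matching top-level guid is dropped while its siblings are left alone.

```c
/*
 * Drop only the stale top-level subtree whose guid matches the newer label,
 * leaving the other top-level vdevs (e.g. "A" in the stripe example) intact,
 * and let the subsequent vdev_probe() rebuild the freed part from the newer
 * configuration.  Illustrative only.
 */
static void
vdev_replace_toplevel(vdev_t *root, uint64_t top_guid)
{
	vdev_t *kid;

	STAILQ_FOREACH(kid, &root->v_children, v_childlink) {
		if (kid->v_guid == top_guid) {
			STAILQ_REMOVE(&root->v_children, kid, vdev,
			    v_childlink);
			vdev_free(kid);
			/* guids are unique within v_children, so stop here */
			break;
		}
	}
}
```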
As Alexander noted, a single label doesn't store information about the entire pool, but only about one top-level vdev. Thus, we should blow away only this part of the vdev tree, not the entire tree.
Tested this with a stripe of mirrors.
Add the missing 'break;'. We aren't using a _SAFE iterator here, and we don't expect any more matches.
This seems super straightforward here, at least for ZFS code.
Just one question...
stand/libsa/zfs/zfsimpl.c:2135
why are no more matches expected?

stand/libsa/zfs/zfsimpl.c:2135
There shall not be two instances of the same top-level vdev: every time we create a top-level vdev, we first check whether it already exists. In general, any vdev on any v_children list shall have a unique v_guid within that list.

stand/libsa/zfs/zfsimpl.c:2135
Oh! That makes sense...
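The inline answer above boils down to an invariant worth seeing in code, again using the toy vdev_t from the first sketch: v_guid values are unique within a v_children list, so the first match is the only match. The committed loop does more work in its body, hence the explicit break there; this just shows the shape of the argument.

```c
/*
 * Because v_guid is unique within a v_children list, the lookup can stop at
 * the first match; scanning further would never find another one.
 */
static vdev_t *
vdev_find_child(vdev_t *parent, uint64_t guid)
{
	vdev_t *kid;

	STAILQ_FOREACH(kid, &parent->v_children, v_childlink) {
		if (kid->v_guid == guid)
			return (kid);
	}
	return (NULL);
}
```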