
zpool: can no longer add a cache device which doesn't contain a previous ZFS vdev label
ClosedPublic

Authored by asomers on Nov 14 2017, 4:12 PM.

Details

Summary

Since this commit:

commit 116e85d29648e0ee6ef07be8f75760280b403d2a
    MFV r323790: 8567 Inconsistent return value in zpool_read_label

I'm no longer able to add a cache device which doesn't contain a previous vdev label to a zpool:

[15:13 root@sh-3-4-fbsd][~]# zpool add -n testpool da126
internal error: out of memory

da126 is a new disk which was never used.
Note that the zpool "testpool" is big: it contains 65 vdevs/slogs/caches.
If I try to add da126 to a smaller zpool it works, and after that I'm able to swap it into my "testpool" zpool.

I'm attaching a patch that fixes my issue. In my case I guess errno is set to ENOMEM by a previous, unrelated call rather than by the fstat64 call that happens in zpool_read_label().
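
To make the failure mode concrete, here is a stand-alone toy program (my own illustration, not libzfs code) that reproduces the same misdiagnosis:

#include <errno.h>
#include <stdio.h>

static int
read_label(void)
{
        return (-1);    /* "no vdev label": a logical failure, errno untouched */
}

int
main(void)
{
        errno = ENOMEM; /* pretend an earlier ioctl retry left this behind */

        if (read_label() != 0 && errno == ENOMEM) {
                printf("internal error: out of memory\n");      /* misdiagnosis */
                return (1);
        }
        printf("no label found, continuing normally\n");
        return (0);
}

Clearing errno (or setting it to anything other than ENOMEM) before the check makes the second branch run instead.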

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

I think, since we are returning -1 (a failure), it makes more sense to set errno to 0 at the top of the function, so that the only value we clear is one left over from before this function was run.

But looking at the code flow, it likely does not matter. It might be worth trying to figure out where the ENOMEM error is coming from, and seeing if there is some undersized buffer or something.

I agree with Allan's suggestion. I suspect that the ENOMEM comes from one of the earlier ioctls, as a signal to increase the buffer size for an output nvlist.
E.g., look at lzc_ioctl.
So, if the problem lies in first setting errno but not clearing it after a later successful call, then the problem should be fixed at its source.
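
To show what I mean, here is a rough stand-alone illustration with made-up helper names (not the actual libzfs code): the first call deliberately fails with ENOMEM to ask for a bigger output buffer, the caller grows the buffer and retries, and the stale ENOMEM survives the eventual success.

#include <errno.h>
#include <stdio.h>

/*
 * fake_ioctl() stands in for the real ZFS ioctl path: it fails with
 * ENOMEM while the output buffer is too small and succeeds otherwise,
 * leaving errno alone on success.
 */
static int
fake_ioctl(size_t bufsize, size_t needed)
{
        if (bufsize < needed) {
                errno = ENOMEM;         /* "grow the nvlist buffer and retry" */
                return (-1);
        }
        return (0);
}

int
main(void)
{
        size_t bufsize = 16 * 1024;
        const size_t needed = 256 * 1024;       /* a big pool needs a big nvlist */

        while (fake_ioctl(bufsize, needed) != 0 && errno == ENOMEM)
                bufsize *= 2;

        /* The last call succeeded, yet errno still holds ENOMEM here. */
        printf("done, errno = %d\n", errno);
        return (0);
}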

Thanks for your comments, Allan and Andriy.
Concerning the errno set to 12 (ENOMEM): it is indeed set by the ZFS_IOC_POOL_STATS ioctl, which is first called with a too-small nvlist.
I'm attaching the conditional watchpoint output showing where it happens.

Watchpoint 1 hit:
old value: 0
new value: 12
Process 11604 stopped
* thread #1, name = 'zpool', stop reason = watchpoint 1
    frame #0: 0x0000000800e951e9 libzfs.so.2`zcmd_ioctl(fd=3, request=5, zc=0x00007fffffffc640) at libzfs_compat.c:117
   114 	
   115 		if (ret == 0 && oldsize < zc->zc_nvlist_dst_size) {
   116 			ret = -1;
-> 117 			errno = ENOMEM;
   118 		}
   119 	
   120 		return (ret);
(lldb) bt
* thread #1, name = 'zpool', stop reason = watchpoint 1
  * frame #0: 0x0000000800e951e9 libzfs.so.2`zcmd_ioctl(fd=3, request=5, zc=0x00007fffffffc640) at libzfs_compat.c:117
    frame #1: 0x0000000800e87099 libzfs.so.2`zpool_refresh_stats(zhp=0x0000000802c30000, missing=0x00007fffffffd844) at libzfs_config.c:282
    frame #2: 0x0000000800e800e5 libzfs.so.2`zpool_open_canfail(hdl=0x0000000802c20000, pool="bld311") at libzfs_pool.c:995
    frame #3: 0x0000000800e802ef libzfs.so.2`zpool_open(hdl=0x0000000802c20000, pool=<unavailable>) at libzfs_pool.c:1051
    frame #4: 0x000000000040820c zpool`zpool_do_add(argc=3, argv=0x00007fffffffeb70) at zpool_main.c:540
    frame #5: 0x000000000040766f zpool`main(argc=<unavailable>, argv=0x00007fffffffeb60) at zpool_main.c:5642
    frame #6: 0x00000000004043ff zpool`_start + 399

Concerning your suggestion on how to fix it, I would rather modify zpool_read_label() to return -1 in the case of a libc failure with errno == ENOMEM, and return -2 in the case of other internal, logical ZFS "failures" (no vdev label, for example).
In my opinion the real bug here is checking the value of errno after a call to an internal ZFS function rather than right after the libc call. Currently we read errno too late and cannot easily differentiate a libc failure from a normal ZFS failure.

for example:

if (zpool_read_label(fd, &config) != 0 && errno == ENOMEM) {
        (void) no_memory(hdl);
        return (-1);
}

would be:

if (zpool_read_label(fd, &config) == -1) {
        (void) no_memory(hdl);
        return (-1);
}

What do you think about that?

Also, note that this bug is present in stable/11, so basically if your zpool is too big you cannot add more (cache?) disks.


Maybe instead:

if (zpool_read_label(fd, &config) != 0 && errno == ENOMEM) {
        (void) no_memory(hdl);
        /* After calling no_memory(), we can reset errno */
        errno = 0;
        return (-1);
}

Or, is the caller of ZFS_IOC_POOL_STATS seeing the ENOMEM and then calling it again with a bigger nvlist? In that case it should reset errno before calling it again.

> Maybe instead:
>
> if (zpool_read_label(fd, &config) != 0 && errno == ENOMEM) {
>         (void) no_memory(hdl);
>         /* After calling no_memory(), we can reset errno */
>         errno = 0;
>         return (-1);
> }

No, it won't help. The bug is that we must not enter the if block at all for a disk without a vdev label, as was the case before 116e85d29648e0ee6ef07be8f75760280b403d2a.

> Or, is the caller of ZFS_IOC_POOL_STATS seeing the ENOMEM and then calling it again with a bigger nvlist? In that case it should reset errno before calling it again.

I don't think that's a good idea. You are not supposed to have to reset errno; I proposed doing it in the current patch only because it was the quickest way to fix the issue.

The best fix would have been to check the errno value inside zpool_read_label() right after the libc calls, the way all errno checks should be done. But that requires adding one more return value to zpool_read_label() (a rough sketch of a caller follows the list):

  • 0 == vdev label was read (continue the normal execution)
  • -1 == ENOMEM, do the no_memory() call and fail.
  • -2 == vdev label not found (don't enter the if block and continue the normal execution)
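
Roughly, the caller would then dispatch on the return value instead of peeking at errno. A stand-alone sketch with stub names (not the actual code):

#include <stdio.h>

#define LABEL_READ      0       /* vdev label was read */
#define LABEL_ENOMEM    (-1)    /* real out-of-memory from libc */
#define LABEL_NONE      (-2)    /* device has no vdev label */

/* Stub standing in for zpool_read_label(); pretend the disk is blank. */
static int
read_label_stub(void)
{
        return (LABEL_NONE);
}

int
main(void)
{
        switch (read_label_stub()) {
        case LABEL_READ:
                printf("label read, continue as before\n");
                break;
        case LABEL_ENOMEM:
                printf("internal error: out of memory\n");
                return (1);
        case LABEL_NONE:
                printf("no label, continue normally\n");
                break;
        }
        return (0);
}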

> The best fix would have been to check the errno value inside zpool_read_label() right after the libc calls, the way all errno checks should be done.

I think changing the return value of zpool_read_label() would need to be coordinated with the upstream openzfs project. I would suggest posting the patch against https://github.com/openzfs/openzfs and seeing what they think of it.

Meanwhile, other people have started reporting similar or identical problems:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226096

I'm surprised that Alan was never added, as he is the author of the zpool_read_label change in question.

This revision is now accepted and ready to land. Mar 2 2018, 9:05 AM

I think that solution is good for now. I have only a tiny bit of doubt about setting errno to zero when returning an error indication (-1).
Maybe we should instead set it to ENOENT to mean that the label was not found.
I think that all C library functions that return a generic failure code like -1 (as opposed to a specific error code) set errno to a non-zero value.
I think that any errno value other than ENOMEM should work.

asomers requested changes to this revision. Mar 2 2018, 3:20 PM

I agree with @avg. Better to set ENOENT in this case. If the function is going to set errno sometimes, then it should set it on every abnormal return.
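
Something along these lines, as a stand-alone toy sketch (stub names, not the actual patch): the label-reading code always sets errno on an abnormal return, ENOENT when no label is found, so the caller's ENOMEM check can no longer be fooled by a stale value.

#include <errno.h>
#include <stdio.h>

/* Stub standing in for zpool_read_label() on a blank disk. */
static int
read_label_stub(void)
{
        errno = ENOENT;         /* no vdev label on this device */
        return (-1);
}

int
main(void)
{
        errno = ENOMEM;         /* stale value from an earlier call */

        if (read_label_stub() != 0 && errno == ENOMEM) {
                printf("internal error: out of memory\n");
                return (1);
        }
        printf("no label (errno %d), continuing normally\n", errno);
        return (0);
}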

This revision now requires changes to proceed. Mar 2 2018, 3:20 PM

I'm not sure that Nikita will update this review, as he changed jobs and doesn't work on ZFS/FreeBSD anymore. Maybe one of you could commandeer this review?
Thanks

I see. I think that Alan would be the best person.
But if Alan is busy with something else then I can make the change.

asomers edited reviewers, added: nikita_elyzion.net; removed: asomers.

I can do it. I don't have the ability to run a full regression test on it. But this change is small enough that I think the risk is minimal.

This revision is now accepted and ready to land. Mar 2 2018, 4:48 PM
This revision was automatically updated to reflect the committed changes.

> I can do it. I don't have the ability to run a full regression test on it. But this change is small enough that I think the risk is minimal.

I get a panic on boot after this commit.

panic: solaris assert: refcount_count(&spa->spa_refcount) > spa->spa_minref || MUTEX_HELD(&spa_namespace_lock), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c, line: 952
cpuid = 10
time = 1520207367
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe23f57a2420
vpanic() at vpanic+0x18d/frame 0xfffffe23f57a2480
panic() at panic+0x43/frame 0xfffffe23f57a24e0
assfail() at assfail+0x1a/frame 0xfffffe23f57a24f0
spa_close() at spa_close+0x5d/frame 0xfffffe23f57a2520
spa_get_stats() at spa_get_stats+0x481/frame 0xfffffe23f57a2700
zfs_ioc_pool_stats() at zfs_ioc_pool_stats+0x25/frame 0xfffffe23f57a2740
zfsdev_ioctl() at zfsdev_ioctl+0x76b/frame 0xfffffe23f57a27e0
devfs_ioctl() at devfs_ioctl+0xcb/frame 0xfffffe23f57a2830
VOP_IOCTL_APV() at VOP_IOCTL_APV+0x102/frame 0xfffffe23f57a2860
vn_ioctl() at vn_ioctl+0x124/frame 0xfffffe23f57a2970
devfs_ioctl_f() at devfs_ioctl_f+0x1f/frame 0xfffffe23f57a2990
kern_ioctl() at kern_ioctl+0x2c2/frame 0xfffffe23f57a29f0
sys_ioctl() at sys_ioctl+0x15c/frame 0xfffffe23f57a2ac0
amd64_syscall() at amd64_syscall+0x786/frame 0xfffffe23f57a2bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe23f57a2bf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80049afda, rsp = 0x7fffffffbd18, rbp = 0x7fffffffbd90 ---
KDB: enter: panic
[ thread pid 56 tid 100606 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db>

Not sure if it's related, but I only updated 1-2 days ago and then again today and hit this; nothing else seems to have touched ZFS code.

> I get a panic on boot after this commit.
>
> panic: solaris assert: refcount_count(&spa->spa_refcount) > spa->spa_minref || MUTEX_HELD(&spa_namespace_lock), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c, line: 952
>
> Not sure if it's related, but I only updated 1-2 days ago and then again today and hit this; nothing else seems to have touched ZFS code.

I don't think it's related. My change only affected userland, not the kernel. And it shouldn't have any effect for pools with fewer than 60 drives. Are you booting from a monster pool?

> I don't think it's related. My change only affected userland, not the kernel. And it shouldn't have any effect for pools with fewer than 60 drives. Are you booting from a monster pool?

No, and actually now I randomly get it on older r329841 too. I mailed it to the list and am moving on, as it will sometimes boot.

> I get a panic on boot after this commit.

That panic is certainly not related to this commit.
The panic happens at random, and it was first reported quite some time ago:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210409

Alan, by the way, do you plan to open an upstream issue and a PR to upstream this change?

In D13088#306143, @avg wrote:
> Alan, by the way, do you plan to open an upstream issue and a PR to upstream this change?

No. I no longer have any Illumos equipment. I could file an issue upstream, but my experience has been that Illumos doesn't merge patches from the issue tracker; you have to open a PR. And I don't have the ability to do that.

> No. I no longer have any Illumos equipment. I could file an issue upstream, but my experience has been that Illumos doesn't merge patches from the issue tracker; you have to open a PR. And I don't have the ability to do that.

You don't need much beyond git and a GitHub account to submit a PR.
The rest (build, tests, etc.) is done automatically from there.