
Fix vm_object_collapse <-> vm_fault race (again)
Abandoned (Public)

Authored by cem on Nov 7 2015, 8:19 PM.

Details

Reviewers
kib
alc
imp
Summary

This is a follow-up to the race closed in r221714. Due to locking order
in fault, an invalid page may be present in a VM object during collapse.

We can hit the same race as in the earlier revision, but with
OBSC_COLLAPSE_WAIT. In WAIT mode, we must drop the object locks, drain
the busy status of the invalid parent page, and try again from the top.
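For illustration only (a sketch of the approach described above, not the literal patch), the WAIT-mode handling can mirror the existing busy-page pattern in vm_object_backing_scan(): drop the locks, sleep until the parent page pp drains its busy state, and restart the scan.

/*
 * Sketch only, assuming the scan loop's usual locals (object,
 * backing_object, p, pp, op); "vmocol" is the wmesg the stock code
 * already uses for busy-sleeping in this function.
 */
if ((op & OBSC_COLLAPSE_WAIT) != 0 && pp != NULL &&
    pp->valid == 0 && vm_page_busied(pp)) {
        VM_OBJECT_WUNLOCK(backing_object);
        vm_page_lock(pp);
        VM_OBJECT_WUNLOCK(object);
        vm_page_busy_sleep(pp, "vmocol");
        VM_OBJECT_WLOCK(object);
        VM_OBJECT_WLOCK(backing_object);
        /* Anything may have changed while asleep; restart the scan. */
        p = TAILQ_FIRST(&backing_object->memq);
        continue;
}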

Test Plan

The symptom is "panic @ time 1446721794.625, thread 0xfffff8004cf09000: freeing
mapped page 0xfffff8007e1a2480".

I was able to reproduce this quickly with some very synthetic failpoints.

In particular,

@@ -168,10 +169,12 @@ unlock_and_deallocate(struct faultstate *fs)
 {

        vm_object_pip_wakeup(fs->object);
        VM_OBJECT_WUNLOCK(fs->object);
        if (fs->object != fs->first_object) {
+               KFAIL_POINT_CODE(DEBUG_FP, unlock_and_deallocate_relock,
+                   /* nop */(void)0);
                VM_OBJECT_WLOCK(fs->first_object);
                vm_page_lock(fs->first_m);
                vm_page_free(fs->first_m);
                vm_page_unlock(fs->first_m);
                vm_object_pip_wakeup(fs->first_object);

And

@@ -1650,11 +1659,16 @@ vm_object_backing_scan(vm_object_t object, int op)
                         *
                         * If the page was mapped to a process, it can remain
                         * mapped through the rename.
                         * vm_page_rename() will handle dirty and cache.
                         */
-                       if (vm_page_rename(p, object, new_pindex)) {
+                       rename_unlock = 0;
+                       KFAIL_POINT_CODE(DEBUG_FP, vmob_scan_rename_unlock,
+                           rename_unlock = 1);
+                       if ((rename_unlock &&
+                            (op & OBSC_COLLAPSE_NOWAIT) == 0) ||
+                           vm_page_rename(p, object, new_pindex)) {
                                if (op & OBSC_COLLAPSE_NOWAIT) {
                                        p = next;
                                        continue;
                                }
                                VM_OBJECT_WUNLOCK(backing_object);

By injecting a slight delay at debug.fail_point.unlock_and_deallocate_relock
(e.g. '50%sleep(10)') and some percentage of synthetic failures for
debug.fail_point.vmob_scan_rename_unlock (e.g. '20%return(1)'), I can hit this
within seconds (on OneFS). I haven't tried the failpoints on vanilla FreeBSD,
but I suspect it's an issue there as well.

After the attached patch is applied, the kernel no longer panics when these
failpoints are enabled. Instead, userspace programs' faults start failing,
which I suspect is a separate bug hidden by this one.

It would not be surprising if many more bugs exist in the VM when radix tree node
UMA allocations fail (this is how we originally discovered the vm_page_rename()
error that the synthetic failpoint emulates).

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 1087
Build 1090: arc lint + arc unit

Event Timeline

cem retitled this revision from to Fix vm_object_collapse <-> vm_fault race (again).
cem updated this object.
cem edited the test plan for this revision. (Show Details)
cem added reviewers: alc, imp, kib.
cem added a subscriber: eadler.

Can you convince us that swap space is being correctly managed? Note the difference between the _NOWAIT and _WAIT cases:

if (op & OBSC_COLLAPSE_NOWAIT) {
        if (!p->valid || vm_page_busied(p)) {
                p = next;
                continue;
        }
} else if (op & OBSC_COLLAPSE_WAIT) {
        if (vm_page_busied(p)) {

When the _NOWAIT case frees swap space, it does so with the knowledge that "p" is valid.

In D4103#86245, @alc wrote:

Can you convince us that swap space is being correctly managed?

I'm not sure what you mean, sorry. In particular, I don't know any of the details of how swap is implemented in the VM.

Note the difference between the _NOWAIT and _WAIT cases:

...

When the _NOWAIT case frees swap space, it does so with the knowledge that "p" is valid.

Should freeing of swap space be conditional on p being valid? Something like:

if (backing_object->type == OBJT_SWAP && p->valid != 0)
        swap_pager_freespace(backing_object,
            p->pindex, 1);

If we have made it to this point in the loop, we know p's pindex is within the parent object's range because of this earlier clause:

if (
    p->pindex < backing_offset_index ||
    new_pindex >= object->size
) {
        ...
        p = next;
        continue;
}
In D4103#86245, @alc wrote:

Can you convince us that swap space is being correctly managed? Note the difference between the _NOWAIT and _WAIT cases:

if (op & OBSC_COLLAPSE_NOWAIT) {
        if (!p->valid || vm_page_busied(p)) {
                p = next;
                continue;
        }
} else if (op & OBSC_COLLAPSE_WAIT) {
        if (vm_page_busied(p)) {

When the _NOWAIT case frees swap space, it does so with the knowledge that "p" is valid.

Hmm, in fact I cannot convince myself that the swap space is correctly managed in the stock HEAD. Why is the backing object's swap freed for the pp->valid == 0 case in the branch from the mlaier fix? IMO this just causes data loss if the parent page is swapped out. I initially thought that this was impossible since the scan cannot happen for an object with swapping in progress, but I am wrong for the NOWAIT scan.

In D4103#86320, @kib wrote:
In D4103#86245, @alc wrote:

Can you convince us that swap space is being correctly managed? Note the difference between the _NOWAIT and _WAIT cases:

if (op & OBSC_COLLAPSE_NOWAIT) {
        if (!p->valid || vm_page_busied(p)) {
                p = next;
                continue;
        }
} else if (op & OBSC_COLLAPSE_WAIT) {
        if (vm_page_busied(p)) {

When the _NOWAIT case frees swap space, it does so with the knowledge that "p" is valid.

Hmm, in fact I cannot convince myself that the swap space is correctly managed in the stock HEAD. Why is the backing object's swap freed for the pp->valid == 0 case in the branch from the mlaier fix? IMO this just causes data loss if the parent page is swapped out. I initially thought that this was impossible since the scan cannot happen for an object with swapping in progress, but I am wrong for the NOWAIT scan.

The best answer that I can give to your question is: Before Attilio's change (r254141), it was easy to overlook the fact that swap space was being freed in mlaier's fix. In other words, I think that mlaier's fix was only accidentally freeing swap space.

To be clear, I agree with you. I don't think that it's correct to free the swap space in the _NOWAIT case.

In D4103#86355, @alc wrote:

To be clear, I agree with you. I don't think that it's correct to free the swap space in the _NOWAIT case.

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 0a3c2ef..846b368 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -1570,10 +1570,6 @@ vm_object_backing_scan(vm_object_t object, int op)
 			    (op & OBSC_COLLAPSE_NOWAIT) != 0 &&
 			    (pp != NULL && pp->valid == 0)
 			) {
-				if (backing_object->type == OBJT_SWAP)
-					swap_pager_freespace(backing_object, 
-					    p->pindex, 1);
-
 				/*
 				 * The page in the parent is not (yet) valid.
 				 * We don't know anything about the state of
cem edited edge metadata.

Drop a bogus swap_pager_freespace() that snuck in with the earlier fix.

In D4103#86374, @kib wrote:
In D4103#86355, @alc wrote:

To be clear, I agree with you. I don't think that it's correct to free the swap space in the _NOWAIT case.

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 0a3c2ef..846b368 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -1570,10 +1570,6 @@ vm_object_backing_scan(vm_object_t object, int op)
 			    (op & OBSC_COLLAPSE_NOWAIT) != 0 &&
 			    (pp != NULL && pp->valid == 0)
 			) {
-				if (backing_object->type == OBJT_SWAP)
-					swap_pager_freespace(backing_object, 
-					    p->pindex, 1);
-
 				/*
 				 * The page in the parent is not (yet) valid.
 				 * We don't know anything about the state of

Yes. This should be committed.

I'm also puzzled by the "if (!p->valid ..." existing in one case but not the other. Can you find anything in the commit logs that suggests why this check only exists in the _NOWAIT case?

It would make more sense to me for "if (!p->valid) {" to come right before the rename call and only free the physical page but not the swap space.

In D4103#86479, @alc wrote:
In D4103#86374, @kib wrote:
In D4103#86355, @alc wrote:

To be clear, I agree with you. I don't think that it's correct to free the swap space in the _NOWAIT case.

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 0a3c2ef..846b368 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -1570,10 +1570,6 @@ vm_object_backing_scan(vm_object_t object, int op)
 			    (op & OBSC_COLLAPSE_NOWAIT) != 0 &&
 			    (pp != NULL && pp->valid == 0)
 			) {
-				if (backing_object->type == OBJT_SWAP)
-					swap_pager_freespace(backing_object, 
-					    p->pindex, 1);
-
 				/*
 				 * The page in the parent is not (yet) valid.
 				 * We don't know anything about the state of

Yes. This should be committed.

I'm also puzzled by the "if (!p->valid ..." existing in one case but not the other. Can you find anything in the commit logs that suggests why this check only exists in the _NOWAIT case?

It would make more sense to me for "if (!p->valid) {" to come right before the rename call and only free the physical page but not the swap space.

Alternatively, the "if (!p->valid..." is redundant because we won't have an invalid page in the backing object that isn't busy. Remember that rename dirties the page, so if we ever renamed an invalid page (in the _WAIT case), then we would get an assertion failure.
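For reference, the assertion being alluded to is the validity check that vm_page_dirty() performs under INVARIANTS (paraphrased from sys/vm/vm_page.c; the exact wording may differ slightly):

/* Inside vm_page_dirty_KBI(), reached via vm_page_dirty(). */
KASSERT(m->valid == VM_PAGE_BITS_ALL,
    ("vm_page_dirty: page is invalid!"));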

In D4103#86479, @alc wrote:

Yes. This should be committed.

Ok, this change is unrelated to the wait fix.

I'm also puzzled by the "if (!p->valid ..." existing in one case but not the other. Can you find anything in the commit logs that suggests why this check only exists in the _NOWAIT case?

What is the 'other' case? Do you mean WAIT/NOWAIT, or another if () in the stock HEAD code?

It would make more sense to me for "if (!p->valid) {" to come right before the rename call and only free the physical page but not the swap space.

I do not understand this proposal; could you please explain it some more?

IMO, the real case that should be explicitly handled is the situation where the parent page exists but is not valid. This is the test condition in the revision. But the handling for the WAIT case seems not quite right, since I do not think that the invariant you mentioned in the other reply, 'an invalid page is always busy', actually holds. E.g. invalidation+wiring definitely breaks this for vnode objects, and I am not sure how much this would leak besides the kernel stack objects.

If the waitable scan finds a non-busy invalid page in the parent, it should be removed to allow the shadowed page's content to replace the invalid page.
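A minimal sketch of that removal, purely illustrative (it is not part of this revision, it assumes both object locks are held by the scan, and wired pages would need extra care, per the concern above):

if (pp != NULL && pp->valid == 0 && !vm_page_busied(pp)) {
        /* Hypothetical: discard the invalid parent page so the
         * backing object's page can be renamed into its place. */
        vm_page_lock(pp);
        vm_page_free(pp);
        vm_page_unlock(pp);
        pp = NULL;
}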

For what it's worth, we've hit a related issue in vm_object_collapse while artificially failing "RADIX NODE" UMA zone M_NOWAIT allocations:

                        /*
                         * Discard backing_object.
                         *
                         * Since the backing object has no pages, no pager left,
                         * and no object references within it, all that is
                         * necessary is to dispose of it.
                         */
                        KASSERT(backing_object->ref_count == 1, (
"backing_object %p was somehow re-referenced during collapse!",
                            backing_object));
                        VM_OBJECT_WUNLOCK(backing_object);
                        vm_object_destroy(backing_object);

(We tripped the assert.)

In D4103#86547, @kib wrote:
In D4103#86479, @alc wrote:

Yes. This should be committed.

Ok, this change is unrelated to the wait fix.

I'm also puzzled by the "if (!p->valid ..." existing in one case but not the other. Can you find anything in the commit logs that suggests why this check only exists in the _NOWAIT case?

What is the 'other' case? Do you mean WAIT/NOWAIT, or another if () in the stock HEAD code?

I'm referring to this code:

if (op & OBSC_COLLAPSE_NOWAIT) {
        if (!p->valid || vm_page_busied(p)) {
                p = next;
                continue;
        }
} else if (op & OBSC_COLLAPSE_WAIT) {
        if (vm_page_busied(p)) {

Again, I don't see why we test p->valid in one case but not the other.

It would make more sense to me for "if (!p->valid) {" to come right before the rename call and only free the physical page but not the swap space.

I do not understand this proposal; could you please explain it some more?

I'm suggesting that we remove the existing "!p->valid" test from the _NOWAIT case and either

  1. Decide that it's completely unnecessary. A non-busy page in a DEFAULT/SWAP backing object will be valid. That we don't see assertion failures in the _WAIT case when we call vm_page_rename() and it calls vm_page_dirty() strongly suggests that it's unnecessary.
  2. Decide that it's needed. In which case, I don't see how it can be needed in the _NOWAIT case but not the _WAIT case. However, I see no point in testing p->valid until just before the call to vm_page_rename(). In other words, we might as well see if either

    /*
     * Page is out of the parent object's range, we
     * can simply destroy it.
     */

or

/*
 * page already exists in parent OR swap exists
 * for this location in the parent.  Destroy
 * the original page from the backing object.
 *
 * Leave the parent's page alone
 */

apply and we can free p before checking if it's valid.
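Sketched out, purely as an illustration of the ordering described above (comment bodies stand in for the existing handling; this is not code from the tree):

if (p->pindex < backing_offset_index || new_pindex >= object->size) {
        /* Out of the parent's range: free p (and its swap) and move on. */
} else if (pp != NULL /* or swap exists at new_pindex in the parent */) {
        /* Parent already covers this index: free p, leave pp alone. */
} else if (p->valid == 0) {
        /* Only now, right before the rename, does p's validity matter:
         * free the physical page but not the swap space. */
} else if (vm_page_rename(p, object, new_pindex)) {
        /* Rename failed (e.g. a radix node allocation failure):
         * _NOWAIT skips the page, _WAIT drops the locks and retries. */
}
p = next;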

IMO, the real case that should be explicitly handled is the situation where the parent page exists but is not valid. This is the test condition in the revision. But the handling for the WAIT case seems not quite right, since I do not think that the invariant you mentioned in the other reply, 'an invalid page is always busy', actually holds. E.g. invalidation+wiring definitely breaks this for vnode objects, and I am not sure how much this would leak besides the kernel stack objects.

If the waitable scan finds a non-busy invalid page in the parent, it should be removed to allow the shadowed page's content to replace the invalid page.

I agree with this last statement.

In D4103#86985, @cem wrote:

For what it's worth, we've hit a related issue in vm_object_collapse while artificially failing "RADIX NODE" UMA zone M_NOWAIT allocations:

                        /*
                         * Discard backing_object.
                         *
                         * Since the backing object has no pages, no pager left,
                         * and no object references within it, all that is
                         * necessary is to dispose of it.
                         */
                        KASSERT(backing_object->ref_count == 1, (
"backing_object %p was somehow re-referenced during collapse!",
                            backing_object));
                        VM_OBJECT_WUNLOCK(backing_object);
                        vm_object_destroy(backing_object);

(We tripped the assert.)

Can you change the assert to report the ref count?
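For example, one possible phrasing (the message format is just a suggestion):

KASSERT(backing_object->ref_count == 1,
    ("backing_object %p was somehow re-referenced during collapse! "
    "ref_count: %d", backing_object, backing_object->ref_count));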

In D4103#87037, @alc wrote:
In D4103#86985, @cem wrote:

For what it's worth, we've hit a related issue in vm_object_collapse while artificially failing "RADIX NODE" UMA zone M_NOWAIT allocations:

                        KASSERT(backing_object->ref_count == 1, (
"backing_object %p was somehow re-referenced during collapse!",

(We tripped the assert.)

Can you change the assert to report the ref count?

I have the machine in GDB, let me see if I can find it.

(gdb) set $bobj=(vm_object_t)0xfffff800534a1600
(gdb) p $bobj.ref_count
$3 = 2
(gdb) p *$bobj
$4 = {
  lock = {
    lock_object = {
      lo_name = 0xffffffff8093b196 "vm object",
      lo_flags = 90374144,
      lo_data = 0,
      lo_witness = 0x0
    },
    rw_lock = 18446735279433054016
  },
  object_list = {
    tqe_next = 0xfffff80053309500,
    tqe_prev = 0xfffff800765e5a20
  },
  shadow_head = {
    lh_first = 0xfffff8000b58d000
  },
  shadow_list = {
    le_next = 0xffffffffffffffff,
    le_prev = 0xffffffffffffffff
  },
  memq = {
    tqh_first = 0x0,
    tqh_last = 0xfffff800534a1648
  },
  rtree = {
    rt_root = 0,
    rt_flags = 0 '\000'
  },
  size = 38,
  rpid = 4 '\004',
  generation = 1,
  ref_count = 2,
  shadow_count = 1,
  memattr = 6 '\006',
  type = 0 '\000',
  flags = 4104,
  pg_color = 1595,
  paging_in_progress = 0,
  resident_page_count = 0,
  backing_object = 0x0,
  backing_object_offset = 0,
  pager_object_list = {
    tqe_next = 0xffffffffffffffff,
    tqe_prev = 0xffffffffffffffff
  },
  rvq = {
    lh_first = 0x0
  },
  cache = {
    rt_root = 0,
    rt_flags = 0 '\000'
  },
  handle = 0x0,
  un_pager = {
    vnp = {
      vnp_size = 0,
      writemappings = -8794695657776
    },
    devp = {
      devp_pglist = {
        tqh_first = 0x0,
        tqh_last = 0xfffff800534a16d0
      },
      ops = 0xffffffff80bb9ce8 <old_dev_pager_ops>,
      dev = 0xfffff80004bec000
    },
    sgp = {
      sgp_pglist = {
        tqh_first = 0x0,
        tqh_last = 0xfffff800534a16d0
      }
    },
    swp = {
      swp_tmpfs = 0x0,
      swp_bcount = 1397364432
    }
  },
  cred = 0xfffff8001d2fc800,
  charge = 155648
}

Edit: Note that flags = 4104 does not include OBJ_DEAD (0x8).

In D4103#87036, @alc wrote:

I'm suggesting that we remove the existing "!p->valid" test from the _NOWAIT case and either ...

I tried to codify my understanding of the scan, adjusted to handle the invalid == busy pages, plus two fixes: D4146.

In D4103#87037, @alc wrote:
In D4103#86985, @cem wrote:

For what it's worth, we've hit a related issue in vm_object_collapse while artificially failing "RADIX NODE" UMA zone M_NOWAIT allocations:

                        /*
                         * Discard backing_object.
                         *
                         * Since the backing object has no pages, no pager left,
                         * and no object references within it, all that is
                         * necessary is to dispose of it.
                         */
                        KASSERT(backing_object->ref_count == 1, (
"backing_object %p was somehow re-referenced during collapse!",
                            backing_object));
                        VM_OBJECT_WUNLOCK(backing_object);
                        vm_object_destroy(backing_object);

(We tripped the assert.)

Can you change the assert to report the ref count?

Oddly, we don't set OBJ_DEAD in the NOWAIT case. Can you think of a reason for that? In that case I'm not sure how the ref_count assertion is valid, since this object isn't DEAD and may continue to be referenced.

In D4103#87213, @cem wrote:

Oddly, we don't set OBJ_DEAD in the NOWAIT case. Can you think of a reason for that? In that case I'm not sure how the ref_count assertion is valid, since this object isn't DEAD and may continue to be referenced.

In the NOWAIT case, we might be unable to terminate the backing_object after all, e.g. due to a busy page or a vm_page_rename() failure. Because of this, we cannot set OBJ_DEAD in advance.

OTOH, we do not drop the object lock in the NOWAIT case. Or do we?

In D4103#87217, @kib wrote:
In D4103#87213, @cem wrote:

Oddly, we don't set OBJ_DEAD in the NOWAIT case. Can you think of a reason for that? In that case I'm not sure how the ref_count assertion is valid, since this object isn't DEAD and may continue to be referenced.

In the NOWAIT case, we might be unable to terminate the backing_object after all, e.g. due to a busy page or a vm_page_rename() failure. Because of this, we cannot set OBJ_DEAD in advance.

Oh, right.

OTOH, we do not drop the object lock in the NOWAIT case. Or do we?

The assertion that trips is in vm_object_collapse after the WAIT case of vm_object_backing_scan anyway. So it seems something is ref'ing objects with OBJ_DEAD set (while WAIT drops the lock).