
Collapse/vm_fault races and small fixes for radix insertion failures
Abandoned · Public

Authored by kib on Apr 25 2016, 12:01 PM.

Details

Reviewers
alc
markj
Summary

The collapse scan may drop the object lock to sleep. Apparently, this allows for two races: one between two parallel collapse scans, and a second between the page fault handler and a scan. r291576 amplified the effect, making the issues frequently reported by real users.

Assume that a scan was running, the object lock was dropped, and a parallel second collapse is started. What I saw live was a thread performing fork and calling collapse from copy_entry. We then get the first race: the winning collapse would bump the reference count on one of the backing objects, triggering the panic 'backing_object was somehow re-referenced' in the other scan. To fix this, I increment the paging-in-progress counter for the top object; the (slightly adapted) check at the start of the scan then prevents parallel scans.
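
A minimal sketch of the serialization idea, for illustration only; the function name and the placement of the check are illustrative, while the pip helpers are the existing vm_object ones:

static void
collapse_scan_sketch(vm_object_t object)
{

	VM_OBJECT_ASSERT_WLOCKED(object);

	/* Another scan already holds a pip reference: do not run in parallel. */
	if (object->paging_in_progress != 0)
		return;

	vm_object_pip_add(object, 1);	/* mark the scan as in progress */
	/*
	 * ... scan the backing object; the object lock may be dropped for
	 * sleeps here, which is what opened the races ...
	 */
	vm_object_pip_wakeup(object);	/* drop the reference, wake any waiter */
}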

Similarly, another thread may fault on or wire the entry served by the scanned object. Then, since the waitable scan sets the OBJ_DEAD flag, the fault handler returns EFAULT, causing spurious SIGSEGV in the processes. The fix is to wait until the scan finishes. Unfortunately, the situation is similar to the "vmo_de" sleep: the reference on the backing object held by the top of the shadow list is what prevents fs.object from becoming invalid, and we must not reference it directly to avoid triggering the panic in the scan. So I cannot sleep precisely until the scan finishes; instead I retry the fault handler after a pause.
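
For illustration, a sketch of the pause-and-retry shape inside the fault handler; the wait-channel name "vmf_de" and the exact placement are assumptions, while unlock_and_deallocate() and the RetryFault label are existing parts of vm_fault.c:

	bool dead;

	if ((fs.object->flags & OBJ_DEAD) != 0) {
		/* An OBJT_DEAD object is a real error; otherwise a scan owns it. */
		dead = fs.object->type == OBJT_DEAD;
		unlock_and_deallocate(&fs);
		if (dead)
			return (KERN_PROTECTION_FAILURE);
		pause("vmf_de", 1);	/* let the collapse scan make progress */
		goto RetryFault;
	}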

As usual, I have no good idea what to do with a wired invalid page found by either the fault handler or the scan. release_page() and the assert in the scan were modified to leave such pages alone.

This review also includes the fix for the access-after-free to a map entry issue reported by avg. It touches a related area, and I suspect that the fail points used would trigger it as well.

Also, at least three trivial bugs in the handlers for the improbable failure of vm_radix_insert() were fixed.

Test Plan

Peter wrote:

I have run all of the tests I have on amd64 + a buildworld. On i386 I have run a mix of tests and a buildworld. I have also run a buildworld on i386 with 256MB RAM and one CPU. I have not noticed any stray segmentation faults. All tests ran with the uma_zalloc_arg() fail point enabled.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib retitled this revision to Collapse/vm_fault races and small fixes for radix insertion failures.
kib updated this object.
kib edited the test plan for this revision. (Show Details)
kib added a reviewer: alc.
kib set the repository for this revision to rS FreeBSD src repository - subversion.
kib added subscribers: pho, cem, avg.

Just for the record, I have tested the part of this patch that fixes the possible unlocked access to a map entry manifested as changing its read_ahead field.

We have been running the stable/10 iteration of this patch at Limelight on a number of boxes over the past few weeks. We have had no ill effects, and these machines have not hit the process-stuck-in-vodead bug. We don't have a way to reliably reproduce it, and we hit it fairly rarely, so this isn't conclusive. Since this review has been fairly quiet, I did want to give some positive feedback, though. Thanks!

The snippet from vm_page_alloc_contig() looks right. Commit it.

kib edited edge metadata.

One chunk of vm_page.c diff committed.

Also contains the update to the patch for avg's issue: do not call vm_fault_dontneed() more than once per fault.

The snippet in vm_page_cache() looks correct. Please commit it.

sys/vm/vm_object.c
1676–1677 ↗(On Diff #16794)

Line 1700 below already implements this check. Is there a reason for duplicating it here?

sys/vm/vm_object.c
1676–1677 ↗(On Diff #16794)

I did not want to do pip_add() at all if OBJ_DEAD is set, AFAIR. Indeed it looks excessive.

Remove excessive check for OBJ_DEAD.

sys/vm/vm_object.c
1709 ↗(On Diff #16893)

I don't believe that the object "bypass" case ever releases the object locks, so the pip_add/wakeup could be restricted to the object "collapse" case.

sys/vm/vm_object.c
1709 ↗(On Diff #16893)

By the "bypass" case, do you mean the 'else' branch of if (backing_object->ref_count == 1)?

I confined pip_add to the 'then' part of the if. I left the style changes in; the code formatting is too ugly there (to be committed separately).

Confine pip_add to the collapse branch.

The changes to vm_object_collapse() look good. Please commit them.

Update the patch after commits of the vm_object_collapse() chunks.

Update after r300826 consumed one chunk by using vm_page_replace.

kib removed rS FreeBSD src repository - subversion as the repository for this revision.

Two more fixes:

  1. Lock the object flag assignment in _vm_object_allocate(). Although the allocated object is not yet used by the VM, it is still reachable by the vmtotal() iterator. Since the iterator sets and clears flags (OBJ_ACTIVE), we might end up with corrupted flags for the new object (found by pho).
  2. In vnode_destroy_vobject(), when OBJ_DEAD was already set for an unreferenced object, do not forget to check for pending wakeups. Also, move the zeroing of v_object to the place where it is needed, and for additional safety zero it under the object lock.
In D6085#140313, @kib wrote:

Two more fixes:

  1. Lock the object flag assignment in _vm_object_allocate(). Although the allocated object is not yet used by the VM, it is still reachable by the vmtotal() iterator. Since the iterator sets and clears flags (OBJ_ACTIVE), we might end up with corrupted flags for the new object (found by pho).

Wouldn't it suffice for vmtotal() to skip objects with the OBJ_DEAD flag set?

In D6085#140333, @alc wrote:
In D6085#140313, @kib wrote:

Two more fixes:

  1. Lock the object flag assignment in _vm_object_allocate(). Although the allocated object is not yet used by the VM, it is still reachable by the vmtotal() iterator. Since the iterator sets and clears flags (OBJ_ACTIVE), we might end up with corrupted flags for the new object (found by pho).

Wouldn't it suffice for vmtotal() to skip objects with the OBJ_DEAD flag set?

Also, vm_object_zinit() would need to initialize the flags to OBJ_DEAD.

In D6085#140333, @alc wrote:

Wouldn't it suffice for vmtotal() to skip objects with the OBJ_DEAD flag set?

Is your intent to save the lock/unlock in _vm_object_allocate()? The thing is, vmtotal() takes the object lock several distinct times, so its accuracy would be diminished even further. E.g., we would be unable to clear some OBJ_ACTIVE flags, and unable to set them as well.

Also, due to the unlocked store of the flags, we might miss an OBJ_DEAD cleanup. Hopefully we would not miss a set OBJ_DEAD flag, since it must always be set under the object lock.

kib set the repository for this revision to rS FreeBSD src repository - subversion.

Instead of locking _vm_object_allocate(), avoid modifying OBJ_DEAD objects, as suggested by alc.

Add missed chunk for zinit.

In D6085#140343, @kib wrote:
In D6085#140333, @alc wrote:

Wouldn't it suffice for vmtotal() to skip objects with the OBJ_DEAD flag set?

Is your intent to save the lock/unlock in _vm_object_allocate()? The thing is, vmtotal() takes the object lock several distinct times, so its accuracy would be diminished even further. E.g., we would be unable to clear some OBJ_ACTIVE flags, and unable to set them as well.

Yes, I don't think that we should be adding locking overhead to a frequently called function if we can address the problem in the not so frequently called function.

Also, due to the unlocked store of the flags, we might miss an OBJ_DEAD cleanup. Hopefully we would not miss a set OBJ_DEAD flag, since it must always be set under the object lock.

I don't think that the former case is a problem. The object initialization will also clear OBJ_ACTIVE.

Please consider whether the third phase of vmtotal(), where we iterate over the object list again, should be modified. Specifically, should

if (object->flags & OBJ_ACTIVE) {

be

if ((object->flags & (OBJ_DEAD | OBJ_ACTIVE)) == OBJ_ACTIVE) {

I'm not sure.

In D6085#140351, @alc wrote:

Please consider whether the third phase of vmtotal(), where we iterate over the object list again, should be modified. Specifically, should

if (object->flags & OBJ_ACTIVE) {

be

if ((object->flags & (OBJ_DEAD | OBJ_ACTIVE)) == OBJ_ACTIVE) {

I'm not sure.

I thought about adding OBJ_DEAD several lines above, to the OBJ_FICTITIOUS check, but then decided not to.

There are equal arguments for adding the check you suggested and for not adding it. OBJ_DEAD might be set after our OBJ_ACTIVE marking, in which case it is more or less fair to ignore the flag. Or it could indicate that we were not able to clear OBJ_ACTIVE initially. This is what I referred to in my initial response when I said that the approach would reduce the precision.

The ref_count == 0 check is done right before, so I do not think that checking for OBJ_DEAD there would give much more imprecision.

I see two other related problems with vmtotal():

  1. We allow concurrent calls to vmtotal(). The phase 1 clearing of OBJ_ACTIVE by the (temporally) second caller will undo the phase 2 setting of OBJ_ACTIVE by the (temporally) first caller.
  2. There are a few other places where we manipulate the flags, believing that it is safe because the object is thread-private:

device_pager.c

if (object == NULL) {
        /*                                                                                   
         * Allocate object and associate it with the pager.  Initialize                      
         * the object's pg_color based upon the physical address of the                      
         * device's memory.                                                                  
         */
        mtx_unlock(&dev_pager_mtx);
        object1 = vm_object_allocate(tp, pindex);
        object1->flags |= OBJ_COLORED;
        object1->pg_color = color;

vm_fault.c

                /*                                                                                   
                 * Create the top-level object for the destination entry. (Doesn't                   
                 * actually shadow anything - we copy the pages directly.)                           
                 */
                dst_object = vm_object_allocate(OBJT_DEFAULT,
                    OFF_TO_IDX(dst_entry->end - dst_entry->start));
#if VM_NRESERVLEVEL > 0
                dst_object->flags |= OBJ_COLORED;
                dst_object->pg_color = atop(dst_entry->start);
#endif

vm_object.c

        if (source != NULL) {
                VM_OBJECT_WLOCK(source);
                LIST_INSERT_HEAD(&source->shadow_head, result, shadow_list);
                source->shadow_count++;
#if VM_NRESERVLEVEL > 0
                result->flags |= source->flags & OBJ_COLORED;
                result->pg_color = (source->pg_color + OFF_TO_IDX(*offset)) &
                    ((1 << (VM_NFREEORDER - 1)) - 1);
#endif
                VM_OBJECT_WUNLOCK(source);
        }

That said, what should we do?

  1. Use an sx lock to serialize calls to vmtotal() (a sketch follows this list). As long as vmtotal() uses a simple mark-and-sweep approach, we need to serialize the callers.
  2. Stop using the object's flags field to mark active objects. I think that we have a spare byte or two in the object between the type and flags fields. The aforementioned sx lock would synchronize access to this field. (Do you see another way of implementing the active mark besides creating a new field?)
  3. As an optimization, eliminate the initial clearing phase. We can clear the mark in the final sweep phase.
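
A hypothetical sketch of suggestion 1 only; the lock name and placement are assumptions, while SX_SYSINIT and the sysctl handler shape are stock kernel facilities:

static struct sx vmtotal_lock;
SX_SYSINIT(vmtotal_lock, &vmtotal_lock, "vmtotal");

static int
vmtotal(SYSCTL_HANDLER_ARGS)
{
	struct vmtotal total;

	bzero(&total, sizeof(total));
	sx_xlock(&vmtotal_lock);
	/*
	 * Mark active objects, sweep the object and proc lists, and clear
	 * the marks during the final sweep (suggestion 3).
	 */
	sx_xunlock(&vmtotal_lock);
	return (sysctl_handle_opaque(oidp, &total, sizeof(total), req));
}
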
In D6085#144178, @alc wrote:
  1. Use an sx lock to serialize calls to vmtotal(). As long as vmtotal() uses a simple mark-and-sweep approach, we need to serialize the callers.
  2. Stop using the object's flags field to mark active objects. I think that we have a spare byte or two in the object between the type and flags fields. The aforementioned sx lock would synchronize access to this field. (Do you see another way of implementing the active mark besides creating a new field?)
  3. As an optimization, eliminate the initial clearing phase. We can clear the mark in the final sweep phase.

I implemented all three suggestions; the patch will be updated in a moment. But could we consider just another definition of an active object?

static bool
is_object_active(vm_object_t obj)
{

	return (obj->ref_count > obj->shadow_count + ((obj->flags &
	    (OBJ_TMPFS_NODE | OBJ_DEAD)) == OBJ_TMPFS_NODE ? 1 : 0));
}

The definition is not precise, of course, since we could have transient object references for other reasons. But the current mechanism is not precise either: it misses objects which became active, or stopped being active, after the mark pass. IMO the advisory nature of vmmeter makes my definition acceptable.

With that definition, we can avoid marking objects as active at all.
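
For illustration, a hypothetical sketch of how the vmtotal() sweep could consume such a predicate; totalp stands for the struct vmtotal being filled, and the locking and shared-object accounting are omitted:

	vm_object_t object;

	TAILQ_FOREACH(object, &vm_object_list, object_list) {
		if ((object->flags & (OBJ_FICTITIOUS | OBJ_DEAD)) != 0)
			continue;
		totalp->t_vm += object->size;
		totalp->t_rm += object->resident_page_count;
		if (is_object_active(object)) {
			totalp->t_avm += object->size;
			totalp->t_arm += object->resident_page_count;
		}
	}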

Use a new field, which utilizes padding in struct vm_object, to mark objects active in vm_meter. Implemented the above suggestions.

I'm thinking about the proposed is_object_active().

Can you please email me a stand-alone patch for vmtotal() using is_object_active()? I'm looking to finalize that patch.

sys/vm/vm_meter.c
212 ↗(On Diff #17736)

Delete this. This was subsumed by the recent commit.

215 ↗(On Diff #17736)

Ditto.

sys/vm/vm_object.h
97 ↗(On Diff #17736)

More to delete.

114 ↗(On Diff #17736)

Ditto.

The vm_page.c and vm_page.h changes look good. Please commit them.

sys/vm/vm_page.c
781 ↗(On Diff #17769)

"succeed" -> "succeeds"

791 ↗(On Diff #17769)

I would suggest moving this comment into vm_page_xunbusy_locked().

sys/vm/vm_page.h
555 ↗(On Diff #17769)

You can shorten this a little by writing "/* Note: page m's lock must not be owned by the caller. */".

Update after r302130, r302131.

Merge after r302236.

The only two changes left:

  • fix for avg's bug
  • wake up any potential waiter for an OBJ_DEAD vnode object if we unlocked the object after OBJ_DEAD was set.
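
A sketch of the second item's shape, for illustration only; the exact branch structure is an assumption, and wakeup(obj) matches vnode_pager_alloc() sleeping on the object pointer after setting OBJ_DISCONNECTWNT:

	VM_OBJECT_WLOCK(obj);
	if ((obj->flags & OBJ_DEAD) != 0) {
		/*
		 * Termination is in progress elsewhere; do not forget waiters
		 * that queued after the object lock was dropped.
		 */
		if ((obj->flags & OBJ_DISCONNECTWNT) != 0) {
			vm_object_clear_flag(obj, OBJ_DISCONNECTWNT);
			wakeup(obj);
		}
		vp->v_object = NULL;	/* zeroed under the object lock */
		VM_OBJECT_WUNLOCK(obj);
		return;
	}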

What was the trigger (or motivation) for the vnode_pager.c change?

In D6085#146922, @alc wrote:

What was the trigger (or motivation) for the vnode_pager.c change?

I remember that one of Peter's logs with the "vadead" hang had the vnode object with both OBJ_DEAD and OBJ_DISCONNECTWNT flags set.

The reasoning was that if we enter vnode_destroy_vobject() with the OBJ_DEAD flag set, then it was set by a previous vm_object_terminate() call in a previous object-lock-owned region, and a waiter for _this_ object could never be woken up.

The move of v_object = NULL into the else {} block and the final assert IMO make the code cleaner.

Ok. Here is the bit that I don't understand. Whatever thread is performing vm_object_terminate should already be doing the OBJ_DISCONNECTWNT wakeup. Moreover, it does so late in vm_object_terminate, specifically, after we've done any unlock and relock of the object, so I don't think that it's possible for a late setting of OBJ_DISCONNECTWNT to be missed.

Thinking out loud, I'm wondering if some other problem was blocking the completion of vm_object_terminate, and so it never got to the OBJ_DISCONNECTWNT wakeup.

In D6085#147193, @alc wrote:

Ok. Here is the bit that I don't understand. Whatever thread is performing vm_object_terminate should already be doing the OBJ_DISCONNECTWNT wakeup. Moreover, it does so late in vm_object_terminate, specifically, after we've done any unlock and relock of the object, so I don't think that it's possible for a late setting of OBJ_DISCONNECTWNT to be missed.

Thinking out loud, I'm wondering if some other problem was blocking the completion of vm_object_terminate, and so it never got to the OBJ_DISCONNECTWNT wakeup.

It could be, e.g., not that vm_object_terminate() cannot finish, but that it is executing a block that drops the object lock. A parallel vnode_destroy_vobject() must still be correct in this case: we clear vp->v_object in vnode_destroy_vobject(), so we must wake up the DISCONNECTWNT waiters.

In D6085#147197, @kib wrote:
In D6085#147193, @alc wrote:

Ok. Here is the bit that I don't understand. Whatever thread is performing vm_object_terminate should already be doing the OBJ_DISCONNECTWNT wakeup. Moreover, it does so late in vm_object_terminate, specifically, after we've done any unlock and relock of the object, so I don't think that it's possible for a late setting of OBJ_DISCONNECTWNT to be missed.

Thinking out loud, I'm wondering if some other problem was blocking the completion of vm_object_terminate, and so it never got to the OBJ_DISCONNECTWNT wakeup.

It could be, e.g., not that vm_object_terminate() cannot finish, but that it is executing a block that drops the object lock. A parallel vnode_destroy_vobject() must still be correct in this case: we clear vp->v_object in vnode_destroy_vobject(), so we must wake up the DISCONNECTWNT waiters.

My reading of the code for vnode_pager_dealloc is that it will be unaffected by vnode_destroy_vobject clearing vp->v_object. In other words, vnode_pager_dealloc will still issue the needed wakeup.

There are two places in vm_object_terminate where the object is unlocked: a call to vm_object_pip_wait() and the block of code where we clean the pages and flush the buffers. I do, however, wonder if a concurrent call to vnode_destroy_object, which is going to conclude by clearing vp->v_object, isn't going to screw up something in the block that cleans and flushes. In other words, can that block tolerate vp->v_object being concurrently cleared?

In D6085#147469, @alc wrote:

My reading of the code for vnode_pager_dealloc is that it will be unaffected by vnode_destroy_vobject clearing vp->v_object. In other words, vnode_pager_dealloc will still issue the needed wakeup.

There are two places in vm_object_terminate where the object is unlocked: a call to vm_object_pip_wait() and the block of code where we clean the pages and flush the buffers. I do, however, wonder if a concurrent call to vnode_destroy_object, which is going to conclude by clearing vp->v_object, isn't going to screw up something in the block that cleans and flushes. In other words, can that block tolerate vp->v_object being concurrently cleared?

bufobj_invalbuf() is careful enough to check v_object for NULL before doing anything that touches the object. But the only consequences of a NULL v_object are skipping some actions, so in the end we might free dirty pages at worst. I do not think that the functions flushing the buffer cache can block due to v_object == NULL.

Note that an argument there could be that all lifecycle manipulations of the vm object for the vnode must occur with the vnode exclusively locked, but I think this is not true in general. E.g., the vnode might be shared-locked for create (from open for ZFS, or for read on UFS); also note that vnode_destroy_vobject() asserts that the vnode is exclusively locked quite late, after the check for obj.

I can change the patch to replace the check for OBJ_DISCONNECTWNT with an assert that the flag is not set, in the OBJ_DEAD branch of vnode_destroy_vobject().

In D6085#147471, @kib wrote:
In D6085#147469, @alc wrote:

My reading of the code for vnode_pager_dealloc is that it will be unaffected by vnode_destroy_vobject clearing vp->v_object. In other words, vnode_pager_dealloc will still issue the needed wakeup.

There are two places in vm_object_terminate where the object is unlocked: a call to vm_object_pip_wait() and the block of code where we clean the pages and flush the buffers. I do, however, wonder if a concurrent call to vnode_destroy_object, which is going to conclude by clearing vp->v_object, isn't going to screw up something in the block that cleans and flushes. In other words, can that block tolerate vp->v_object being concurrently cleared?

bufobj_invalbuf() is careful enough to check v_object for NULL before doing anything that touches the object. But the only consequences of a NULL v_object are skipping some actions, so in the end we might free dirty pages at worst. I do not think that the functions flushing the buffer cache can block due to v_object == NULL.

The object lock will be released during vm_object_page_clean(). What about that case?

In D6085#147471, @kib wrote:

I can change the patch to replace the check for OBJ_DISCONNECTWNT with an assert that the flag is not set, in the OBJ_DEAD branch of vnode_destroy_vobject().

I'm not sure. It certainly couldn't hurt to have Peter test such a patch and see if he can trigger the KASSERT.

This is really the first time that I've given much thought to vnode_destroy_vobject(), and I find it rather odd that vnode_destroy_vobject() ever clears vp->v_object, because this is going to result in vp->v_object being cleared twice, once in vnode_destroy_vobject() and again either by the direct call to vm_pager_deallocate() in the ref count greater than zero case or indirectly by vm_object_terminate(). If not for the fact that we assert an exclusive lock on the vnode in vnode_destroy_vobject(), I would have said that it made more sense to me for the OBJ_DEAD case to wait for termination of the object to complete (before returning) rather than clearing vp->v_object. In other words, wait for vm_object_terminate() to call vnode_pager_dealloc(), which will clear vp->v_object.

In D6085#147511, @alc wrote:

I'm not sure. It certainly couldn't hurt to have Peter test such a patch and see if he can trigger the KASSERT.

Peter is apparently testing the assert.

This is really the first time that I've given much thought to vnode_destroy_vobject(), and I find it rather odd that vnode_destroy_vobject() ever clears vp->v_object, because this is going to result in vp->v_object being cleared twice, once in vnode_destroy_vobject() and again either by the direct call to vm_pager_deallocate() in the ref count greater than zero case or indirectly by vm_object_terminate(). If not for the fact that we assert an exclusive lock on the vnode in vnode_destroy_vobject(), I would have said that it made more sense to me for the OBJ_DEAD case to wait for termination of the object to complete (before returning) rather than clearing vp->v_object. In other words, wait for vm_object_terminate() to call vnode_pager_dealloc(), which will clear vp->v_object.

Yes, vnode_destroy_vobject() is almost always called from the VOP_RECLAIM() filesystem method, which must not drop the vnode lock, to avoid other threads finding and accessing half-deconstructed vnodes. As a result, such a wait would happen with the vnode exclusively locked, which would probably prevent the termination from concluding.

Change vnode_pager.c patch to KASSERT, as tested by pho.
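
For illustration, the assertion variant might look roughly like this (message wording assumed):

	KASSERT((obj->flags & OBJ_DISCONNECTWNT) == 0,
	    ("OBJ_DISCONNECTWNT set on dead object %p", obj));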

I'm happy with the proposed vnode_pager.c change. I would only suggest a comment before the OBJ_DISCONNECTWNT assertion that tries to explain why the assertion is correct.

Merge after r302343. The only patch left is for avg's issue.

In D6085#148213, @kib wrote:

Merge after r302343. The only patch left is for avg's issue.

I started thinking about the remaining piece over the weekend.

Re-merge after r302399.

I've been mulling this over for a while now, and I'd like to argue that we should refactor the code into two distinct parts: the piece which should happen once (nera/"dont need") and the piece which should happen on every getpages call (ahead/behind). I'll email you a patch shortly.

sys/vm/vm_fault.c
598

Without the initialization of nera (on the old line 584), this "+=" appears problematic.