Page MenuHomeFreeBSD

Handle the race between fork/vm_object_split() and faults.
ClosedPublic

Authored by kib on May 4 2018, 9:04 AM.

Details

Summary

If fault started before vmspace_fork() locked the map, and then during fork, vm_map_copy_entry()->vm_object_split() is executed, it is possible that the fault instantiate the page into the original object when the page was already copied into the new object (see vm_map_split() for the orig/new objects terminology). This can happen if split found a busy page (e.g. from the fault) and slept dropping the objects lock.

Handle this by checking that there is no page at the index, if there is, we can ignore the additional page in the original object.

Note that vmspace_fork() already bumped the timestamp of the map before calling split. This forces the fault handlers to try to lock the map and to retry fault handling. The later is needed to ensure that proper page from the new object is installed into the MMU tables. Otherwise userspace might see older incorrect copy.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

sys/vm/vm_object.c
1437 ↗(On Diff #42130)

Is this condition inverted?

Fix bug pointed by avg.

sys/vm/vm_map.c
3291 ↗(On Diff #42131)

vmspace_fork() calls vm_map_lock() on src_map, just spelled differently as old_map. vm_map_lock already increments src_map's timestamp, so I don't see why this increment is necessary.

sys/vm/vm_map.c
3291 ↗(On Diff #42131)

Indeed, I missed this. This chunk is excessive.

kib marked 2 inline comments as done.

Remove unneeded chunk.

We tested the earlier version of this patch and didn't get any regression.
We also didn't see any recurrence of the problem, but it was very rare without the patch too.

Can't we simply modify the restart so that it doesn't find_least from the original pindex, but instead at the busy page's pindex? In other words, ...

Index: vm/vm_object.c
===================================================================
--- vm/vm_object.c      (revision 334544)
+++ vm/vm_object.c      (working copy)
@@ -1411,8 +1411,9 @@ vm_object_split(vm_map_entry_t entry)
                    ("orig_object->charge < 0"));
                orig_object->charge -= ptoa(size);
        }
+       foo = offidxstart;
 retry:
-       m = vm_page_find_least(orig_object, offidxstart);
+       m = vm_page_find_least(orig_object, foo);
        for (; m != NULL && (idx = m->pindex - offidxstart) < size;
            m = m_next) {
                m_next = TAILQ_NEXT(m, listq);
@@ -1426,6 +1427,7 @@ retry:
                 */
                if (vm_page_busied(m)) {
                        VM_OBJECT_WUNLOCK(new_object);
+                       foo = m->pindex;
                        vm_page_lock(m);
                        VM_OBJECT_WUNLOCK(orig_object);
                        vm_page_busy_sleep(m, "spltwt", false);
In D15293#332786, @alc wrote:

Can't we simply modify the restart so that it doesn't find_least from the original pindex, but instead at the busy page's pindex? In other words, ...

Don't we need to handle the pages that could be instantiated at the already scanned addresses while the object lock was dropped ?

vm_object_split() is (and must be) performed with the map exclusively locked, so the hypothesized vm_fault() that starts before vm_object_split() must be dropping the map lock in order to perform I/O. Otherwise, the map lock would serialize vm_object_split() and vm_fault(), i.e., their executions could not overlap.

Given that, I think that there are two possible cases.

  1. The I/O is being performed on a backing object, not the top-level object. In the case of a read fault, the busy page inserted by vm_fault() into the top-level object is simply serving as a synchronization device, and it will be reclaimed at the end of the fault. In the case of a write fault, the busy page will eventually have the code/data from the backing object page that is undergoing I/O copied into it. In other words, this is a COW fault where the source page is not resident at the time of the fault. In this case, the busy page must persist and move to the new object in vm_object_split(). I don't think that there is a possible problem in this case. In other words, I think that the existing code handles this case, even if there were multiple page faults of this type in progress before vm_object_split() started. Once vm_object_split() has started, the map lock is going to block any further faults from inserting new busy pages into the top-level object.
  2. The I/O is being performed on an OBJT_SWAP top-level object. I think that this is the problematic case. Specifically, suppose that the fault occurs at pindex in the top-level object and that pages at offsets just before pindex were resident. I think that the problematic case involves vm_fault() dropping the map lock, vm_object_split() transferring all of the pages up to pindex and then blocking on that busy page, vm_fault() calling swap_pager_getpages(), which bogusly allocates new physical pages at pindices less than pindex and fills them with data, and finally vm_object_split() wakes up, find_least() returns one of these bogusly allocated and filled pages and tries to transfer it. Both of our proposed changes will stop the transfer of these bogusly allocated pages. That said, I don't think either "fix" is the right solution. We're not exactly leaking the bogusly allocated pages, because they should be reclaimed by the page daemon, but their allocation is a waste, but we're both masking the problem not fixing it.

I don't think that there is ever a problem with ahead pages. They can't be transferred by vm_object_split() until the page at pindex is unbusied. Take a look at swap_pager_getpages(). I think the problem arises because the so-called "Clip the readahead and readbehind ranges to exclude resident pages." snippet is happening after the vm object lock has been dropped and reacquired. Essentially, vm_object_split() runs while the vm object lock is dropped, and so the clipping happens too late. Moreover, you probably needed to block on getpbuf().

To be clear, the clipping of ahead and behind after the object lock is reacquired should likely remain, but we should also clip behind before the object lock is released. It's also worth considering whether this new clipping should occur in vm_fault_hold(). Currently, I believe that only the swap pager is vulnerable, but it still might be argued that vm_fault_hold() should perform the new clipping.

It could also be argued that vm_object_split() should wait out the paging-in-progress indication on the top-level object by vm_fault_hold().

In D15293#333055, @alc wrote:

I now think that I agree with your arguments in the sense that it really not critical that we left the unused pages behind the split range in vm_map_entry shadow chain. So if you prefer your fix (perhaps because it performs less work if raced), I am fine with it.

In D15293#333070, @alc wrote:

To be clear, the clipping of ahead and behind after the object lock is reacquired should likely remain, but we should also clip behind before the object lock is released. It's also worth considering whether this new clipping should occur in vm_fault_hold(). Currently, I believe that only the swap pager is vulnerable, but it still might be argued that vm_fault_hold() should perform the new clipping.

It could also be argued that vm_object_split() should wait out the paging-in-progress indication on the top-level object by vm_fault_hold().

I do not quite understand this. If we already copied the page from the split object into the new one, the page content is no longer our problem. The only issue is that other parallel faults must re-evaluate the final page, and this is done by 1) map generation bump 2) the fact that split thread owns map lock exclusively and vm_fault retry would block there.

So why do we need to do anything else ?

Here is my current patch.

Index: vm/swap_pager.c
===================================================================
--- vm/swap_pager.c     (revision 334544)
+++ vm/swap_pager.c     (working copy)
@@ -1103,6 +1103,19 @@ swap_pager_getpages(vm_object_t object, vm_page_t
 
        reqcount = count;
 
+       /*
+        * Obtain a buffer, and handle a possible race.  While the object is
+        * unlocked, a concurrent call to vm_object_split() could transfer
+        * pages at indices up to ma[0]->pindex from the object to another
+        * object.  Adjust *rbehind before unlocking the object so that these
+        * pages are not reallocated below.
+        */
+       if (rbehind != NULL && *rbehind > 0 &&
+           (mpred = TAILQ_PREV(ma[0], pglist, listq)) != NULL) {
+               pindex = ma[0]->pindex - mpred->pindex - 1;
+               if (pindex < *rbehind)
+                       *rbehind = pindex;
+       }
        VM_OBJECT_WUNLOCK(object);
        bp = getpbuf(&nsw_rcount);
        VM_OBJECT_WLOCK(object);
In D15293#333078, @kib wrote:
In D15293#333070, @alc wrote:

To be clear, the clipping of ahead and behind after the object lock is reacquired should likely remain, but we should also clip behind before the object lock is released. It's also worth considering whether this new clipping should occur in vm_fault_hold(). Currently, I believe that only the swap pager is vulnerable, but it still might be argued that vm_fault_hold() should perform the new clipping.

It could also be argued that vm_object_split() should wait out the paging-in-progress indication on the top-level object by vm_fault_hold().

I do not quite understand this. If we already copied the page from the split object into the new one, the page content is no longer our problem. The only issue is that other parallel faults must re-evaluate the final page, and this is done by 1) map generation bump 2) the fact that split thread owns map lock exclusively and vm_fault retry would block there.

So why do we need to do anything else ?

Let me try to explain my motivation for the last comment about paging in progress. Currently, we only use pip_wait when we are destroying an object. Essentially, I'm thinking out loud here, questioning if vm_object_split() shouldn't be treated akin to object destruction. It is a rather weird operation. However, as a practical matter, I think that a pip_wait call at the start of vm_object_split() would produce a deadlock between the fork and the page fault, because the page fault handler doesn't release its pip hold on the object until the fault completes and fork doesn't release the map lock until vm_object_split() completes.

For what it's worth, I believe that this race is fallout from the pager interface changes for sendfile. Before those changes, we would have adjusted readbehind and performed the page allocations before calling the pager and releasing the object lock.

In D15293#333432, @alc wrote:

For what it's worth, I believe that this race is fallout from the pager interface changes for sendfile. Before those changes, we would have adjusted readbehind and performed the page allocations before calling the pager and releasing the object lock.

I mostly agree with this. But I do not understand your change.

So we looked at the page index in the object queue, right before our ma[0], and claim that we can safely (WRT vm_object_split()) allocate pages between that index and ma[0] after relock. I do not see why. After we dropped the lock, why cannot a page be instantiated in this range by other means, even if split is going, and then split processed it, all while we are blocked on phys buf allocation and not yet owning the object lock.

In D15293#333441, @kib wrote:
In D15293#333432, @alc wrote:

For what it's worth, I believe that this race is fallout from the pager interface changes for sendfile. Before those changes, we would have adjusted readbehind and performed the page allocations before calling the pager and releasing the object lock.

I mostly agree with this. But I do not understand your change.

So we looked at the page index in the object queue, right before our ma[0], and claim that we can safely (WRT vm_object_split()) allocate pages between that index and ma[0] after relock. I do not see why. After we dropped the lock, why cannot a page be instantiated in this range by other means, even if split is going, and then split processed it, all while we are blocked on phys buf allocation and not yet owning the object lock.

In some cases, the exclusive map lock held by fork would block other allocations, e.g., another page fault. However, if some code path gets a reference to the object by some other means than the locked map, then I agree that nothing is stopping an (re)allocation of a page that precedes ma[0] by that code path. Moreover, I don't have an argument that no such code path exists.

My original proposed patch has a flaw. If the swap pager getpages function legitimately allocated behind pages prior to ma[0], in other words, pages that were not already transferred by vm_object_split(), then vm_object_split() would need to back up and copy those pages. Your most recent proposed patch doesn't have that flaw. However, underlying flaw is really in the swap page getpages function, and I'd prefer to address it there rather than workaround it by doing extra work in vm_object_split().

At the heart of the problem is that we are not allocating all of the behind pages before releasing the object lock. We could rearrange the start of the swap pager getpages function such that it allocates all of the behind pages before releasing the object lock to obtain the buffer. Then, vm_object_split() is going to stop at the first of the behind pages rather than ma[0], and there shouldn't be any illegitimately reallocated pages followed by I/O on them. This approach takes us back to something closer to what we had before the pager interface changes, and IMHO easy to reason about.

If we delay the allocation of the ahead pages until after the buffer is allocated, in other words, after we have dropped and reacquired the object lock, then I believe that we can still use the maxahead result from the swap_pager_haspages() call that was performed before the object lock was dropped. If any of that swap space was invalidated while the object was unlocked, there should now be a resident page at that pindex that stops us from attempting to read from the now invalid swap space.

In D15293#333512, @alc wrote:

If we delay the allocation of the ahead pages until after the buffer is allocated, in other words, after we have dropped and reacquired the object lock, then I believe that we can still use the maxahead result from the swap_pager_haspages() call that was performed before the object lock was dropped. If any of that swap space was invalidated while the object was unlocked, there should now be a resident page at that pindex that stops us from attempting to read from the now invalid swap space.

But then, why not move the buffer allocation after we drop the object lock anyway ? Even with the current place, we sleep for the available pbuf with m[0] busied. So it should not matter much if we sleep with all readahead/behind pages busied ?

diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
index 9755e6cb064..413b790484a 100644
--- a/sys/vm/swap_pager.c
+++ b/sys/vm/swap_pager.c
@@ -1096,6 +1096,7 @@ swap_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int *rbehind,
     int *rahead)
 {
 	struct buf *bp;
+	vm_page_t bma[btoc(MAXPHYS)];
 	vm_page_t mpred, msucc, p;
 	vm_pindex_t pindex;
 	daddr_t blk;
@@ -1103,14 +1104,8 @@ swap_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int *rbehind,
 
 	reqcount = count;
 
-	VM_OBJECT_WUNLOCK(object);
-	bp = getpbuf(&nsw_rcount);
-	VM_OBJECT_WLOCK(object);
-
-	if (!swap_pager_haspage(object, ma[0]->pindex, &maxbehind, &maxahead)) {
-		relpbuf(bp, &nsw_rcount);
+	if (!swap_pager_haspage(object, ma[0]->pindex, &maxbehind, &maxahead))
 		return (VM_PAGER_FAIL);
-	}
 
 	/*
 	 * Clip the readahead and readbehind ranges to exclude resident pages.
@@ -1143,24 +1138,23 @@ swap_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int *rbehind,
 			if (p == NULL) {
 				/* Shift allocated pages to the left. */
 				for (j = 0; j < i - 1; j++)
-					bp->b_pages[j] =
-					    bp->b_pages[j + shift - i + 1];
+					bma[j] = bma[j + shift - i + 1];
 				break;
 			}
-			bp->b_pages[shift - i] = p;
+			bma[shift - i] = p;
 		}
 		shift = i - 1;
 		*rbehind = shift;
 	}
 	for (i = 0; i < reqcount; i++)
-		bp->b_pages[i + shift] = ma[i];
+		bma[i + shift] = ma[i];
 	if (rahead != NULL) {
 		for (i = 0; i < *rahead; i++) {
 			p = vm_page_alloc(object,
 			    ma[reqcount - 1]->pindex + i + 1, VM_ALLOC_NORMAL);
 			if (p == NULL)
 				break;
-			bp->b_pages[shift + reqcount + i] = p;
+			bma[shift + reqcount + i] = p;
 		}
 		*rahead = i;
 	}
@@ -1172,14 +1166,17 @@ swap_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int *rbehind,
 	vm_object_pip_add(object, count);
 
 	for (i = 0; i < count; i++)
-		bp->b_pages[i]->oflags |= VPO_SWAPINPROG;
+		bma[i]->oflags |= VPO_SWAPINPROG;
 
-	pindex = bp->b_pages[0]->pindex;
+	pindex = bma[0]->pindex;
 	blk = swp_pager_meta_ctl(object, pindex, 0);
 	KASSERT(blk != SWAPBLK_NONE,
 	    ("no swap blocking containing %p(%jx)", object, (uintmax_t)pindex));
 
 	VM_OBJECT_WUNLOCK(object);
+	bp = getpbuf(&nsw_rcount);
+	for (i = 0; i < count; i++)
+		bp->b_pages[i] = bma[i];
 
 	bp->b_flags |= B_PAGING;
 	bp->b_iocmd = BIO_READ;
In D15293#333522, @kib wrote:
In D15293#333512, @alc wrote:

If we delay the allocation of the ahead pages until after the buffer is allocated, in other words, after we have dropped and reacquired the object lock, then I believe that we can still use the maxahead result from the swap_pager_haspages() call that was performed before the object lock was dropped. If any of that swap space was invalidated while the object was unlocked, there should now be a resident page at that pindex that stops us from attempting to read from the now invalid swap space.

But then, why not move the buffer allocation after we drop the object lock anyway ? Even with the current place, we sleep for the available pbuf with m[0] busied. So it should not matter much if we sleep with all readahead/behind pages busied ?

If we don't think that it's crucially important to minimize the number of allocated, busy pages while we sleep waiting for the buffer (modulo the need to allocate the before pages to address the race), then I see no reason not to move the buffer allocation. And, to be clear, I don't know of any reason that minimization here is crucially important.

Can we avoid the bma array, and just remember the first page, and use iteration over the object's memq to fill b_pages[]? The object can't be destroyed because of its non-zero PIP count, and the pages can't be removed from the object because they are busy, so isn't the relevant part of the object's memq "stable" without holding the object lock?

Swap the patch for the swap_pager.c patch.

I like it.

sys/vm/swap_pager.c
1098–1099

Pick either "m" or "p" and not have both.

1145

I'd be tempted to "recycle" the variable maxbehind here, and eliminate shift. maxbehind actually fits what is going on here. Or just do as below where the for loop condition is "i < *rahead".

1179

We assert that the object is locked inside vm_page_next(), so you will need to use TAILQ_NEXT directly.

kib marked 3 inline comments as done.

Fixed bug with the iteration without lock. Implement other suggestions too.

sys/vm/swap_pager.c
1105–1106

How about adding a comment here to explain the changes:

"Determine the final number of read-behind pages and allocate them BEFORE releasing the object lock. Otherwise, there can be a problematic race with vm_object_split(). Specifically, vm_object_split() might first transfer pages that precede ma[0] in the current object to a new object, and then this function incorrectly recreates those pages as read-behind pages in the current object."

1150–1151

Consider doing

p->oflags |= VPO_SWAPINPROG;

here and in the rahead snippet below. Then, the final "for" loop before releasing the object lock would only have to iterate over the ma[] array.

Remove use of vm_page_next(). Add comment.

This revision is now accepted and ready to land.Jun 13 2018, 3:58 PM

Peter, could you, please, test this patch ? I do not think that it is feasible to try to reproduce the original problem, but the generic testing would be useful.

In D15293#333833, @kib wrote:

Peter, could you, please, test this patch ? I do not think that it is feasible to try to reproduce the original problem, but the generic testing would be useful.

Sure. Running tests on three hosts running amd64.

In D15293#333873, @pho wrote:
In D15293#333833, @kib wrote:

Peter, could you, please, test this patch ? I do not think that it is feasible to try to reproduce the original problem, but the generic testing would be useful.

Sure. Running tests on three hosts running amd64.

I ran all of the stress2 tests + a buildworld. No problems seen.