
Implement M_NEXTFIT.
ClosedPublic

Authored by markj on Sep 18 2018, 6:18 PM.
Details

Summary

This is described in the vmem paper: "directs vmem to use the next free
segment after the one previously allocated." The implementation adds a
new boundary tag type, BT_TYPE_CURSOR, which is linked into the segment list
and precedes the segment following the previous M_NEXTFIT allocation.
The cursor is used to locate the next free segment satisfying the
allocation constraints.

This implementation isn't O(1) since busy tags aren't coalesced, so we
may potentially scan the entire segment list during an M_NEXTFIT
allocation.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

This revision is now accepted and ready to land. Sep 18 2018, 10:50 PM
sys/kern/subr_vmem.c
1029 ↗(On Diff #48193)

Consider performing a recursive call to vmem_xalloc() where M_NEXTFIT is replaced by M_FIRSTFIT. I believe that will gracefully handle M_WAITOK versus M_NOWAIT and the possibility of doing an import.

An edge case here is that the combined free space from prev and next satisfy the allocation request but vmem_fit() won't be able to discern this.

1044–1050 ↗(On Diff #48193)

It's possible that this coalesced boundary tag now corresponds to an entire span and the release function needs to be called. See vmem_xfree(). The cursor would have blocked an earlier call to vmem_xfree() from recognizing the entire span as being free.

1051–1057 ↗(On Diff #48193)

Why is a loop required here? Can't you just reinsert the cursor after the original t?

1204–1207 ↗(On Diff #48193)

While I'm here, I'll point out a bug in the quantum cache implementation. You can't use a quantum cache in conjunction with an arena that manages the address zero. If the cache returns zero, that zero is falsely interpreted here as an allocation failure. One of my students was bitten by this bug.

markj marked 4 inline comments as done and an inline comment as not done. Sep 19 2018, 7:04 PM
markj added inline comments.
sys/kern/subr_vmem.c
1029 ↗(On Diff #48193)

Regarding the edge case, how can we make use of prev to satisfy the allocation request without going backwards?

1051–1057 ↗(On Diff #48193)

I was trying to avoid depending on the internal details of vmem_clip(). In any case, reinserting after t isn't quite right: suppose vmem_clip() needs to split t into BUSY and FREE segments. The BUSY segment will be allocated and inserted before t in the seglist, so we'd want to insert the cursor before t.

markj marked an inline comment as done.
  • Address some comments.
This revision now requires review to proceed. Sep 19 2018, 7:25 PM
sys/kern/subr_vmem.c
1059 ↗(On Diff #48236)

Maybe add vmem_xalloc_locked() and do the call under the same locking region?

sys/kern/subr_vmem.c
1059 ↗(On Diff #48236)

I'm on the fence about this. If we failed to find a free segment, then we've traversed the entire segment list with the arena lock held. Releasing the lock here would give other threads a chance to free resources before we block or attempt an import, so there is a chance we will be able to satisfy the allocation from the freelists. I can't really see an advantage to holding the lock across the vmem_xalloc() call.

1204–1207 ↗(On Diff #48193)

The only easy solution I can see is to ensure that we never import the address zero into the cache in the first place. That is, modify qc_import() to require a minimum address of 1 instead of 0.

sys/kern/subr_vmem.c
1080 ↗(On Diff #48236)

There is a subtle problem here. In vmem_xfree(), the boundary tag given to vmem_try_release() is not in the free lists, but here it is.

sys/kern/subr_vmem.c
1029 ↗(On Diff #48193)

I looked at illumos to see how it deals with this edge case. When it fails to find a free boundary tag that is sufficiently large, it moves the cursor forward by one boundary tag, attempts coalescing (and release) on the previous boundary tags, and checks to see if the previous boundary tag is now large enough to satisfy the allocation.

One goofy thing about the illumos code is that it could perform a release and then immediately re-import to satisfy the allocation.

markj marked an inline comment as done.
  • Ensure that vmem_try_release() removes the segment from the freelists if needed.
  • Attempt allocation from the coalesced segment before falling back to vmem_xalloc(M_FIRSTFIT).
sys/kern/subr_vmem.c
1102–1104 ↗(On Diff #48467)

I was having second thoughts about whether the ENOMEM case can be as simple as a call to vmem_xalloc(). Specifically, do we need to update the cursor? If the current allocation request is resolved by an import, then the next allocation request will be appropriately next fit if there is no intervening free request. But, a just freed range might fall between the unmoved cursor and the unallocated remains of the import, which would result in the just freed range being prematurely allocated.

Other than this, I don't have any other high-level concerns.

markj marked an inline comment as done.

Don't fall back to vmem_xalloc(). Instead, factor out handling of
resource shortages into vmem_try_fetch() and use that if the nextfit
search fails. This way we ensure that the cursor is updated on all
M_NEXTFIT allocations, so the policy is applied strictly.

markj marked 2 inline comments as done. Oct 9 2018, 4:20 PM
markj added inline comments.
sys/kern/subr_vmem.c
1102–1104 ↗(On Diff #48467)

I changed the implementation so that both vmem_xalloc() and vmem_xalloc_nextfit() use the same fallback logic. In particular, the latter no longer calls the former.

This solution is potentially very expensive because of the linear scanning we do. I can't see a way to avoid that except by augmenting boundary tags to point to the next free item in segment order. One possibility, to avoid bloating the btag struct, would be to disallow mixing M_NEXTFIT with other policies for a given arena. Then, M_NEXTFIT arenas don't need to maintain freelists, and we can use the u_freelist pointers in the btag to find the next free tag in constant time.

BTW, I plan to write some tests for this code to exercise corner cases, though we don't have much precedent for in-kernel tests.

sys/kern/subr_vmem.c
1035 ↗(On Diff #48938)

I would suggest placing the vm_releasefn test before anything else, i.e., do

if (vm->vm_releasefn == NULL)
        return (0);
markj marked an inline comment as done.
  • Check for a null releasefn earlier.
  • Update man page.
  • Update bt_type_string().
sys/kern/subr_vmem.c
1102–1104 ↗(On Diff #51122)

I am worried about the case where prev and next can coalesce and t == next. In that case, I don't think that we should coalesce.

1118–1119 ↗(On Diff #51122)

We might be here in a case where we succeeded in finding a fit. However, that fit t is not next, and removing the cursor has allowed us to coalesce prev and next. In such a case, we shouldn't be reinserting the cursor here.

95–96 ↗(On Diff #50925)

Given that this expression now spans two lines, it might as well start on the same line as the #define.

1042 ↗(On Diff #50925)

I would suggest making this the very first statement of the function, even before the if (vm->vm_releasefn == NULL).

1049–1050 ↗(On Diff #50925)

I realize that this makes no functional difference, but it has always struck me as weird that bt is used here rather than prev given that prev is the actual span tag.

1058 ↗(On Diff #50925)

Curiously, there is a preexisting stylistic inconsistency between this call and the call to the "reclaim function". The latter doesn't use the redundant (*funcptr)() syntax; it simply uses funcptr(). Let's pick one style.

1093 ↗(On Diff #50925)

We need to set error = 0; here. Otherwise, we might repeat the vmem_fit() below in the coalescing case.

1284 ↗(On Diff #50925)

Shouldn't we disallow M_NEXTFIT when there is a quantum cache? In other words, assert ... || (strat == M_NEXTFIT && vm->vm_qcache_max == 0));

1337 ↗(On Diff #50925)

In vmem_alloc(), 0 is spelled VMEM_ADDR_MIN.

markj added inline comments.
sys/kern/subr_vmem.c
1102–1104 ↗(On Diff #51122)

If I understand correctly, that becomes a bug only if I set error = 0 as you suggested above.

Suppose that t == next and t fits the allocation request, but requires that we clip off some part of the beginning of t. In that case, don't we still need to coalesce prev with the newly created free segment?

1049–1050 ↗(On Diff #50925)

I remember noticing that too, so I'll change it while here.

1058 ↗(On Diff #50925)

I like the latter, and cscope doesn't recognize the former as a call (so a "c" query for vm_releasefn doesn't return anything).

1093 ↗(On Diff #50925)

To be clear, the code still functions correctly in that case, right?

1284 ↗(On Diff #50925)

I was thinking about this. Suppose I wanted to try to reduce lock contention on the memguard arena by using a quantum cache for allocations but not for frees. If I specified NEXTFIT for the allocations, I believe it would get passed through to qc_import(), which would then fill the UMA cache using the nextfit policy. Then vmem_alloc(M_NEXTFIT) would still give a reasonable approximation to the nextfit policy, and many allocations could be satisfied without touching the vmem lock.

memguard operations are still serialized by the kernel object lock, so I'm not particularly tempted to actually try this, but do you think it's a reasonable use-case for M_NEXTFIT+quantum caching?

markj marked 2 inline comments as done.
  • Address some but not all of Alan's comments.
sys/kern/subr_vmem.c
1093 ↗(On Diff #50925)

I would say, "No," that it doesn't. Here is an example. Suppose that [1, 4) and [6, 10) are free, and that the cursor sits between these two ranges. For example, we may have just allocated 5. Now, imagine that 5 is freed, and then an allocation is performed. After the above loop, t will point to [6, 10) as we would hope. In other words, we expect the allocation to return 6. However, as the code stands, the below conditions for coalescing will be met. So, after coalescing, we'll have a single free range [1, 10); we'll repeat the vmem_fit() on that coalesced range, and we'll wind up returning 1 rather than 6.

sys/kern/subr_vmem.c
1102–1104 ↗(On Diff #51122)

Yes. However, the code that currently follows the completion of the coalescing isn't going to handle this case correctly. In this case, we should neither repeat vmem_fit() nor call vmem_try_release().

markj marked 5 inline comments as done.
  • Attempt to handle all of the bugs that Alan pointed out.
markj added inline comments.
sys/kern/subr_vmem.c
1093 ↗(On Diff #50925)

I see now, thanks. I restructured things somewhat: we clip the segment before attempting coalescing. If we fail to find a suitable segment, and coalescing creates one, we will clip that one instead.

share/man/man9/vmem.9
170 ↗(On Diff #51565)

Once it wraps around, it is no longer monotonically increasing. :-)

I would define it as "Performs an address-ordered search for free addresses, beginning where the previous search ended."

sys/kern/subr_vmem.c
1069 ↗(On Diff #51565)

As a matter of style, I would suggest "bt" instead of "t". That is the common variable name used throughout this file.

1079–1080 ↗(On Diff #51565)

I would suggest using a "goto out;" here, and defining an "out:" label below. The value of error will be correct, even on a retry.

1121 ↗(On Diff #51565)

You could reduce the number of calls to vmem_clip() in this function from two to one by performing that one call inside the below if (error == 0).

1124–1126 ↗(On Diff #51565)

I still see one problematic edge case. To be clear, this is not it, but there was no better place to put my comment.

Suppose that an M_NOWAIT allocation fails and that we couldn't enter this block because prev and next can't coalesce. We could exit from this function without reinserting the cursor into the segment list.

I think that you should not remove the cursor from the segment list unconditionally above. Instead, remove it here, right before the TAILQ_INSERT_AFTER(), and again down below inside the if (error == 0). In other words, perform the removes in the same spots that you perform the inserts.

markj added inline comments.
sys/kern/subr_vmem.c
1121 ↗(On Diff #51565)

Wouldn't that reintroduce an earlier problem if next satisfies the allocation? We may coalesce before clipping, in which case we could return an address preceding the cursor.

1124–1126 ↗(On Diff #51565)

Hmm, if I do that I don't actually need the TAILQ_INSERT_AFTER()...

markj marked an inline comment as done.
  • Address most of the review feedback.

Upload the right version of the patch.

sys/kern/subr_vmem.c
1079 ↗(On Diff #51852)

Remove the VMEM_UNLOCK(vm); here.

sys/kern/subr_vmem.c
1104–1107 ↗(On Diff #51852)

As a micro-optimization, I would suggest expressing this as:

if ((next = TAILQ_NEXT(cursor, bt_seglist)) != NULL &&
    next->bt_type == BT_TYPE_FREE &&
    ((prev = TAILQ_PREV(cursor, vmem_seglist, bt_seglist)) != NULL &&
    prev->bt_type == BT_TYPE_FREE &&
    ...

Then, in the expected case, you will not have to compute prev, because vmem_clip() will have converted next to busy.

1121 ↗(On Diff #51565)

Actually, no, because vmem_clip() will use the address computed by vmem_fit(). That said, doing as I suggested could lead to a situation where vmem_try_release() could release the range that we want to allocate. So, I agree with your conclusion, i.e., that the vmem_clip() should not be deferred, but for a different reason.

markj marked 2 inline comments as done.
  • Address feedback.

I did a careful review of these changes a while ago, and I'm convinced that all of the issues that we discussed last fall are dealt with.

This revision is now accepted and ready to land. May 17 2019, 10:29 PM
This revision was automatically updated to reflect the committed changes.