Implement M_NEXTFIT.
Closed, Public

Authored by markj on Sep 18 2018, 6:18 PM.

Details

Summary

This is described in the vmem paper: "directs vmem to use the next free
segment after the one previously allocated." The implementation adds a
new boundary tag type, BT_TYPE_CURSOR, which is linked into the segment
list and precedes the segment following the previous M_NEXTFIT
allocation. The cursor is used to locate the next free segment
satisfying the allocation constraints.

This implementation isn't O(1), since busy tags aren't coalesced: an
M_NEXTFIT allocation may end up scanning the entire segment list.
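
For illustration, the cursor walk described above might look roughly like
the sketch below. This is not the committed code: the helper name is
hypothetical, bt_fill()/retry handling and the coalescing discussed in the
review are omitted, and field and helper names follow subr_vmem.c to the
best of my reading.

static int
vmem_xalloc_nextfit_sketch(vmem_t *vm, vmem_size_t size, vmem_size_t align,
    vmem_size_t phase, vmem_size_t nocross, vmem_addr_t *addrp)
{
        struct vmem_btag *bt, *cursor;
        int error;

        /* Assumes the arena lock is held. */
        cursor = &vm->vm_cursor;
        error = ENOMEM;

        /* Walk the segment list once, starting just after the cursor. */
        for (bt = TAILQ_NEXT(cursor, bt_seglist);;
            bt = TAILQ_NEXT(bt, bt_seglist)) {
                if (bt == NULL)
                        bt = TAILQ_FIRST(&vm->vm_seglist); /* wrap around */
                if (bt == cursor)
                        break;                          /* came full circle */
                if (bt->bt_type != BT_TYPE_FREE || bt->bt_size < size)
                        continue;
                if (vmem_fit(bt, size, align, phase, nocross,
                    VMEM_ADDR_MIN, VMEM_ADDR_MAX, addrp) != 0)
                        continue;

                /* Carve the allocation out of the free segment. */
                vmem_clip(vm, bt, *addrp, size);

                /*
                 * Re-park the cursor immediately after the allocated
                 * range.  If vmem_clip() split bt, bt is now the free
                 * remainder and the cursor goes in front of it;
                 * otherwise bt itself became busy and the cursor goes
                 * after it.
                 */
                TAILQ_REMOVE(&vm->vm_seglist, cursor, bt_seglist);
                if (bt->bt_type == BT_TYPE_FREE)
                        TAILQ_INSERT_BEFORE(bt, cursor, bt_seglist);
                else
                        TAILQ_INSERT_AFTER(&vm->vm_seglist, bt, cursor,
                            bt_seglist);
                error = 0;
                break;
        }
        return (error);
}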

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 19656
Build 19226: arc lint + arc unit

Event Timeline

This revision is now accepted and ready to land. Sep 18 2018, 10:50 PM
sys/kern/subr_vmem.c
1029

Consider performing a recursive call to vmem_xalloc() where M_NEXTFIT is replaced by M_FIRSTFIT. I believe that will gracefully handle M_WAITOK versus M_NOWAIT and the possibility of doing an import.

An edge case here is that the combined free space from prev and next satisfies the allocation request, but vmem_fit() won't be able to discern this.
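
In code, that suggestion amounts to something like the following (a sketch
only; as discussed further down, the final revision ends up not falling back
to vmem_xalloc() at all):

/*
 * Sketch of the proposed fallback: if the nextfit search comes up
 * empty, retry the whole request with the first-fit policy, which
 * already knows how to sleep for M_WAITOK and to import from the
 * backing source.
 */
if (error == ENOMEM)
        error = vmem_xalloc(vm, size, align, phase, nocross,
            VMEM_ADDR_MIN, VMEM_ADDR_MAX,
            (flags & ~M_NEXTFIT) | M_FIRSTFIT, addrp);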

1044–1050

It's possible that this coalesced boundary tag now corresponds to an entire span and the release function needs to be called. See vmem_xfree(). The cursor would have blocked an earlier call to vmem_xfree() from recognizing the entire span as being free.
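
For reference, the span-release check being referred to has roughly this
shape (a simplified fragment along the lines of vmem_xfree(); variable
declarations are omitted and the exact bookkeeping in vmem_try_release()
differs):

/*
 * bt covers a whole span when the BT_TYPE_SPAN tag preceding it in
 * the segment list has the same start and size; only then may the
 * span be handed back through vm_releasefn.
 */
prev = TAILQ_PREV(bt, vmem_seglist, bt_seglist);
if (vm->vm_releasefn != NULL && prev->bt_type == BT_TYPE_SPAN &&
    prev->bt_start == bt->bt_start && prev->bt_size == bt->bt_size) {
        spanaddr = prev->bt_start;
        spansize = prev->bt_size;
        bt_remfree(vm, bt);
        bt_remseg(vm, bt);
        bt_remseg(vm, prev);
        vm->vm_releasefn(vm->vm_arg, spanaddr, spansize);
}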

1051–1057

Why is a loop required here? Can't you just reinsert the cursor after the original t?

1204–1207

While I'm here, I'll point out a bug in the quantum cache implementation. You can't use a quantum cache in conjunction with an arena that manages the address zero. If the cache returns zero, that zero is falsely interpreted here as an allocation failure. One of my students was bitten by this bug.
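
The pattern in question is roughly the quantum-cache fast path in
vmem_alloc(), along these lines (simplified):

/*
 * The cache stores addresses as pointers, so a legitimately cached
 * address of 0 cannot be told apart from a failed uma_zalloc().
 */
if (size <= vm->vm_qcache_max) {
        qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
        *addrp = (vmem_addr_t)uma_zalloc(qc->qc_cache, flags);
        if (*addrp == 0)        /* address 0 misread as ENOMEM */
                return (ENOMEM);
        return (0);
}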

markj marked 4 inline comments as done and an inline comment as not done. Sep 19 2018, 7:04 PM
markj added inline comments.
sys/kern/subr_vmem.c
1029

Regarding the edge case, how can we make use of prev to satisfy the allocation request without going backwards?

1051–1057

I was trying to avoid depending on the internal details of vmem_clip(). In any case, reinserting after t isn't quite right: suppose vmem_clip() needs to split t into BUSY and FREE segments. The BUSY segment will be allocated and inserted before t in the seglist, so we'd want to insert the cursor before t.

markj marked an inline comment as done.
  • Address some comments.
This revision now requires review to proceed. Sep 19 2018, 7:25 PM
sys/kern/subr_vmem.c
1059

Maybe add vmem_xalloc_locked() and do the call under the same locking region?

sys/kern/subr_vmem.c
1059

I'm on the fence about this. If we failed to find a free segment, then we've traversed the entire segment list with the arena lock held. Releasing the lock here would give other threads a chance to free resources before we block or attempt an import, so there is a chance we will be able to satisfy the allocation from the freelists. I can't really see an advantage to holding the lock across the vmem_xalloc() call.

1204–1207

The only easy solution I can see is to ensure that we never import the address zero into the cache in the first place. That is, modify qc_import() to require a minimum address of 1 instead of 0.
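
Concretely, the vmem_xalloc() call in qc_import() would change along these
lines (a sketch; QC_ADDR_MIN is an illustrative name for the new minimum):

#define QC_ADDR_MIN     1       /* keep address 0 out of the quantum cache */

if (vmem_xalloc(qc->qc_vmem, qc->qc_size, 0, 0, 0,
    QC_ADDR_MIN, VMEM_ADDR_MAX, flags, &addr) != 0)
        break;
store[i] = (void *)addr;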

sys/kern/subr_vmem.c
1080

There is a subtle problem here. In vmem_xfree(), the boundary tag given to vmem_try_release() is not in the free lists, but here it is.

sys/kern/subr_vmem.c
1029

I looked at illumos to see how it deals with this edge case. When it fails to find a free boundary tag that is sufficiently large, it moves the cursor forward by one boundary tag, attempts coalescing (and release) on the previous boundary tags, and checks to see if the previous boundary tag is now large enough to satisfy the allocation.

One goofy thing about the illumos code is that it could perform a release and then immediately re-import to satisfy the allocation.

markj marked an inline comment as done.
  • Ensure that vmem_try_release() removes the segment from the freelists if needed.
  • Attempt allocation from the coalesced segment before falling back to vmem_xalloc(M_FIRSTFIT).
sys/kern/subr_vmem.c
1105–1107

I was having second thoughts about whether the ENOMEM case can be as simple as a call to vmem_xalloc(). Specifically, do we need to update the cursor? If the current allocation request is resolved by an import, then the next allocation request will be appropriately next-fit if there is no intervening free request. But a just-freed range might fall between the unmoved cursor and the unallocated remains of the import, which would result in the just-freed range being prematurely allocated.

Other than this, I don't have any other high-level concerns.

markj marked an inline comment as done.

Don't fall back to vmem_xalloc(). Instead, factor out handling of
resource shortages into vmem_try_fetch() and use that if the nextfit
search fails. This way we ensure that the cursor is updated on all
M_NEXTFIT allocations, so the policy is applied strictly.
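
In outline, the restructured flow looks something like this (simplified;
vmem_xalloc_nextfit_search() is a stand-in name for the cursor walk, and
vmem_try_fetch() is assumed to return nonzero when a retry is worthwhile,
i.e. after sleeping for M_WAITOK or importing a new span):

for (;;) {
        error = vmem_xalloc_nextfit_search(vm, size, align, phase,
            nocross, flags, addrp);
        if (error != ENOMEM)
                break;
        /* Sleep and/or import; give up if neither is possible. */
        if (!vmem_try_fetch(vm, size, align, flags))
                break;
}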

markj marked 2 inline comments as done. Oct 9 2018, 4:20 PM
markj added inline comments.
sys/kern/subr_vmem.c
1105–1107

I changed the implementation so that both vmem_xalloc() and vmem_xalloc_nextfit() use the same fallback logic. In particular, the latter no longer calls the former.

This solution is potentially very expensive because of the linear scanning we do. I can't see a way to avoid that except by augmenting boundary tags to point to the next free item in segment order. One possibility, to avoid bloating the btag struct, would be to disallow mixing M_NEXTFIT with other policies for a given arena. Then, M_NEXTFIT arenas don't need to maintain freelists, and we can use the u_freelist pointers in the btag to find the next free tag in constant time.

BTW, I plan to write some tests for this code to exercise corner cases, though we don't have much precedent for in-kernel tests.

sys/kern/subr_vmem.c
1038

I would suggest placing the vm_releasefn test before anything else, i.e., do

if (vm->vm_releasefn == NULL)
        return (0);
markj marked an inline comment as done.
  • Check for a null releasefn earlier.
  • Update man page.
  • Update bt_type_string().
sys/kern/subr_vmem.c
95–97

Given that this expression now spans two lines, it might as well start on the same line as the #define.

1045

I would suggest making this the very first statement of the function, even before the if (vm->vm_releasefn == NULL).

1052–1053

I realize that this makes no functional difference, but it has always struck me as weird that bt is used here rather than prev given that prev is the actual span tag.

1061

Curiously, there is a preexisting stylistic inconsistency between this call and the call to the "reclaim function". The latter doesn't use the redundant (*funcptr)() syntax; it simply uses funcptr(). Let's pick one style.

1096

We need to set error = 0; here. Otherwise, we might repeat the vmem_fit() below in the coalescing case.

1105–1107

I am worried about the case where prev and next can coalesce and t == next. In that case, I don't think that we should coalesce.

1121–1122

We might be here in a case where we succeeded in finding a fit. However, that fit t is not next, and removing the cursor has allowed us to coalesce prev and next. In such a case, we shouldn't be reinserting the cursor here.

1198

Shouldn't we disallow M_NEXTFIT when there is a quantum cache? In other words, assert ... || (strat == M_NEXTFIT && vm->vm_qcache_max == 0));

1247

In vmem_alloc(), 0 is spelled VMEM_ADDR_MIN.

markj added inline comments.
sys/kern/subr_vmem.c
1052–1053

I remember noticing that too, so I'll change it while here.

1061

I like the latter, and cscope doesn't recognize the former as a call (so a "c" query for vm_releasefn doesn't return anything).

1096

To be clear, the code still functions correctly in that case, right?

1105–1107

If I understand correctly, that becomes a bug only if I set error = 0 as you suggested above.

Suppose that t == next and t fits the allocation request, but requires that we clip off some part of the beginning of t. In that case, don't we still need to coalesce prev with the newly created free segment?

1198

I was thinking about this. Suppose I wanted to try to reduce lock contention on the memguard arena by using a quantum cache for allocations but not for frees. If I specified NEXTFIT for the allocations, I believe it would get passed through to qc_import(), which would then fill the UMA cache using the nextfit policy. Then vmem_alloc(M_NEXTFIT) would still give a reasonable approximation to the nextfit policy, and many allocations could be satisfied without touching the vmem lock.

memguard operations are still serialized by the kernel object lock, so I'm not particularly tempted to actually try this, but do you think it's a reasonable use case for M_NEXTFIT + quantum caching?

markj marked 2 inline comments as done.
  • Address some but not all of Alan's comments.
sys/kern/subr_vmem.c
1096

I would say, "No," that it doesn't. Here is an example. Suppose that [1, 4) and [6, 10) are free, and that the cursor sits between these two ranges. For example, we may have just allocated 5. Now, imagine that 5 is freed, and then an allocation is performed. After the above loop, t will point to [6, 10) as we would hope. In other words, we expect the allocation to return 6. However, as the code stands, the below conditions for coalescing will be met. So, after coalescing, we'll have a single free range [1, 10); we'll repeat the vmem_fit() on that coalesced range, and we'll wind up returning 1 rather than 6.

sys/kern/subr_vmem.c
1105–1107

Yes. However, the code that currently follows the completion of the coalescing isn't going to handle this case correctly. In this case, we should neither repeat vmem_fit() nor call vmem_try_release().

markj marked 5 inline comments as done.
  • Attempt to handle all of the bugs that Alan pointed out.
markj added inline comments.
sys/kern/subr_vmem.c
1096

I see now, thanks. I restructured things somewhat: we clip the segment before attempting coalescing. If we fail to find a suitable segment, and coalescing creates one, we will clip that one instead.

share/man/man9/vmem.9
170 (On Diff #51565)

Once it wraps around, it is no longer monotonically increasing. :-)

I would define it as "Performs an address-ordered search for free addresses, beginning where the previous search ended."

sys/kern/subr_vmem.c
1073

As a matter of style, I would suggest "bt" instead of "t". That is the common variable name used throughout this file.

1083–1084

I would suggest using a "goto out;" here, and defining an "out:" label below. The value of error will be correct, even on a retry.

1125

You could reduce the number of calls to vmem_clip() in this function from two to one by performing that one call inside the below if (error == 0).

1128–1130

I still see one problematic edge case. To be clear, this is not it, but there was no better place to put my comment.

Suppose that an M_NOWAIT allocation fails and that we couldn't enter this block because prev and next can't coalesce. We could exit from this function without reinserting the cursor into the segment list.

I think that you should not remove the cursor from the segment list unconditionally above. Instead, remove it here, right before the TAILQ_INSERT_AFTER(), and again down below inside the if (error == 0). In other words, perform the removes in the same spots that you perform the inserts.

markj added inline comments.
sys/kern/subr_vmem.c
1125

Wouldn't that reintroduce an earlier problem if next satisfies the allocation? We may coalesce before clipping, in which case we could return an address preceding the cursor.

1128–1130

Hmm, if I do that I don't actually need the TAILQ_INSERT_AFTER()...

markj marked an inline comment as done.
  • Address most of the review feedback.

Upload the right version of the patch.

sys/kern/subr_vmem.c
1083

Remove the VMEM_UNLOCK(vm); here.

sys/kern/subr_vmem.c
1108–1111

As a micro-optimization, I would suggest expressing this as:

if ((next = TAILQ_NEXT(cursor, bt_seglist)) != NULL &&
    next->bt_type == BT_TYPE_FREE &&
    ((prev = TAILQ_PREV(cursor, vmem_seglist, bt_seglist)) != NULL &&
    prev->bt_type == BT_TYPE_FREE &&
    ...

Then, in the expected case, you will not have to compute prev, because vmem_clip() will have converted next to busy.

1125

Actually, no, because vmem_clip() will use the address computed by vmem_fit(). That said, doing as I suggested could lead to a situation where vmem_try_release() could release the range that we want to allocate. So, I agree with your conclusion, i.e., that the vmem_clip() should not be deferred, but for a different reason.

markj marked 2 inline comments as done.
  • Address feedback.

I did a careful review of these changes a while ago, and I'm convinced that all of the issues that we discussed last fall are dealt with.

This revision is now accepted and ready to land. May 17 2019, 10:29 PM
This revision was automatically updated to reflect the committed changes.