Use per-domain locks for the zone layer.
ClosedPublic
Actions

Authored by jeff on Feb 13 2020, 9:34 PM.

Details

Reviewers

rlibby
markj
mjg

Commits

rS358128: Use per-domain locks for the bucket cache.

Summary

I apologize for the length and complexity of this review. I did this in a flurry of commits over two days that were not well isolated and churned eachother. I don't think it's worth taking apart now.

The main thrust is providing independent locking for the domains in the zone layer. This improved allocator throughput on a machine with 50 domains by 2x.

This moves all of the per-zone domain logic into zone_fetch_bucket() and zone_put_bucket(). This really simplifies callers and makes it easier to do other restructuring.

It moves drain into bucket_free() so that every caller doesn't have to do both.

It attempts to avoid the lock on the alloc path if we don't think we'll succeed.

I restructured zone and moved fields around where I thought was appropriate and to reduce gaps.

I now calculate the zone domain address rather than fetching a pointer.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jeff created this revision.Feb 13 2020, 9:34 PM

Harbormaster completed remote builds in B29364: Diff 68283.Feb 13 2020, 9:34 PM

jeff edited the summary of this revision. (Show Details)Feb 13 2020, 9:41 PM

jeff added reviewers: rlibby, markj.

jeff set the repository for this revision to rS FreeBSD src repository - subversion.

Herald added a subscriber: imp. · View Herald TranscriptFeb 13 2020, 9:41 PM

One more things, round-robin domains use all zone domains now but load balance them. This is for lock contention reasons on large machines.

jeff added a reviewer: mjg.Feb 13 2020, 9:56 PM

jeff added inline comments.Feb 13 2020, 10:32 PM

sys/vm/uma_core.c
564 ↗	(On Diff #68283)	Needs a comment.
619 ↗	(On Diff #68283)	Should be an inline.
661 ↗	(On Diff #68283)	Should be an inline.
820 ↗	(On Diff #68283)	comment
833 ↗	(On Diff #68283)	This nicely puts the poll outside the lock.
1204 ↗	(On Diff #68283)	Mark if you could review this section in particular. I think what I did is equivalent but maybe not.
1270 ↗	(On Diff #68283)	I find this slightly questionable on SMR. It may block for a long time if we do preemptive sections. So this should probably poll first.
sys/vm/uma_int.h
504 ↗	(On Diff #68283)	This one pains me slightly since it is written so often for mbufs. Maybe it should just be with garbage at the end still.
525 ↗	(On Diff #68283)	I was hoping to fit this in 256b total but it just doesn't quite work. This means there's a full 64 byte unused cacheline between here and the cpu array :(
586 ↗	(On Diff #68283)	Unused

Fix some imax bugs and address my own feedback.

Harbormaster completed remote builds in B29416: Diff 68394.Feb 16 2020, 7:41 AM

I am mostly happy with this now and it is doing well on stress after a few small errors.

sys/vm/uma_core.c
630 ↗	(On Diff #68394)	this is a bug. cnt should be new. I made an error refactoring into the function.
899–900 ↗	(On Diff #68394)	I should just move this up a level and not use the *
2247 ↗	(On Diff #68394)	I'm unsure whether I will keep this. ROUNDROBIN domains can now block on allocations when there are free items in the bucket caches of other 'domains'. It's unlikely due to the highest/lowest loops but it is possible. Even with a single domain a thread can block in page alloc while another frees to the domain layer. IIRC earlier versions of the allocator would check for this condition before re-entering vm_page_alloc(). This feature is the only reason for zone_domain().

markj added inline comments.Feb 17 2020, 4:26 PM

sys/vm/uma_core.c
555 ↗	(On Diff #68394)	These three lines could be written as `lockfail = ZDOM_OWNED(zdom)`.
559 ↗	(On Diff #68394)	The lack of precision seems ok, but this race makes it possible to exceed bucket_size_max. In practice it'll probably only go over by 1, but it could be more if you're unlucky. We should have a comment explaining why this is acceptable (or consider sharding this field as well, or ...).
714 ↗	(On Diff #68394)	I don't think we need this assertion anymore now that this function handles cache overflow.
3460 ↗	(On Diff #68394)	So, the reason we do this here is because we might lose the race and put the bucket back in the cache. There are no other failure modes. Instead of waiting until you're certain of success, why can't we re-adjust imin when putting the bucket back in the cache after losing the race? This could cause the WSS estimate to be artificially inflated, but only temporarily, and it would simplify imin/imax management a lot. In fact, if you call zone_put_bucket() with ws = true instead of false, I think we'll get the desired result. You just need to handle uz_bkt_max. If imax/imin are reset while the lock is dropped, we will end up with a larger WSS estimate than we should, but that's a narrow race and the estimate decays anyway.
sys/vm/uma_int.h
491 ↗	(On Diff #68394)	I dislike the mix of uz_bkt_ and uz_bucket_ in field names. You might also amend the comment, "Maxium per-domain bucket cache size".

jeff added inline comments.Feb 17 2020, 7:55 PM

sys/vm/uma_core.c
559 ↗	(On Diff #68394)	I'd rather not shard it because it makes for some weird perf scenarios especially since round-robin may bounce between them. I can also load it once and store a calculated value so it never exceeds max.
714 ↗	(On Diff #68394)	I guess it is possible for the overflow to come via cross domain flush where you don't know the value of ws.
3460 ↗	(On Diff #68394)	As discussed on irc; The issue is not the race but that we can call zone_alloc_bucket() now without ever having acquired a domain lock. This becomes increasingly important as we get more cores per domain.
sys/vm/uma_int.h
491 ↗	(On Diff #68394)	I will go ahead and rename it. I thought it odd as well.