Page MenuHomeFreeBSD

Use per-domain locks for the zone layer.
Needs ReviewPublic

Authored by jeff on Thu, Feb 13, 9:34 PM.



I apologize for the length and complexity of this review. I did this in a flurry of commits over two days that were not well isolated and churned eachother. I don't think it's worth taking apart now.

The main thrust is providing independent locking for the domains in the zone layer. This improved allocator throughput on a machine with 50 domains by 2x.

This moves all of the per-zone domain logic into zone_fetch_bucket() and zone_put_bucket(). This really simplifies callers and makes it easier to do other restructuring.

It moves drain into bucket_free() so that every caller doesn't have to do both.

It attempts to avoid the lock on the alloc path if we don't think we'll succeed.

I restructured zone and moved fields around where I thought was appropriate and to reduce gaps.

I now calculate the zone domain address rather than fetching a pointer.

Diff Detail

Lint OK
No Unit Test Coverage
Build Status
Buildable 29416
Build 27298: arc lint + arc unit

Event Timeline

jeff created this revision.Thu, Feb 13, 9:34 PM
jeff edited the summary of this revision. (Show Details)Thu, Feb 13, 9:41 PM
jeff added reviewers: rlibby, markj.
jeff set the repository for this revision to rS FreeBSD src repository.
jeff added a comment.Thu, Feb 13, 9:42 PM

One more things, round-robin domains use all zone domains now but load balance them. This is for lock contention reasons on large machines.

jeff added a reviewer: mjg.Thu, Feb 13, 9:56 PM
jeff added inline comments.Thu, Feb 13, 10:32 PM

Needs a comment.


Should be an inline.


Should be an inline.




This nicely puts the poll outside the lock.


Mark if you could review this section in particular. I think what I did is equivalent but maybe not.


I find this slightly questionable on SMR. It may block for a long time if we do preemptive sections. So this should probably poll first.


This one pains me slightly since it is written so often for mbufs. Maybe it should just be with garbage at the end still.


I was hoping to fit this in 256b total but it just doesn't quite work.

This means there's a full 64 byte unused cacheline between here and the cpu array :(



jeff updated this revision to Diff 68394.Sun, Feb 16, 7:41 AM
jeff edited the summary of this revision. (Show Details)

Fix some imax bugs and address my own feedback.

jeff added a comment.Sun, Feb 16, 8:59 AM

I am mostly happy with this now and it is doing well on stress after a few small errors.


this is a bug. cnt should be new. I made an error refactoring into the function.


I should just move this up a level and not use the *


I'm unsure whether I will keep this.

ROUNDROBIN domains can now block on allocations when there are free items in the bucket caches of other 'domains'. It's unlikely due to the highest/lowest loops but it is possible. Even with a single domain a thread can block in page alloc while another frees to the domain layer. IIRC earlier versions of the allocator would check for this condition before re-entering vm_page_alloc().

This feature is the only reason for zone_domain().

markj added inline comments.Mon, Feb 17, 4:26 PM

These three lines could be written as lockfail = ZDOM_OWNED(zdom).


The lack of precision seems ok, but this race makes it possible to exceed bucket_size_max. In practice it'll probably only go over by 1, but it could be more if you're unlucky. We should have a comment explaining why this is acceptable (or consider sharding this field as well, or ...).


I don't think we need this assertion anymore now that this function handles cache overflow.


So, the reason we do this here is because we might lose the race and put the bucket back in the cache. There are no other failure modes.

Instead of waiting until you're certain of success, why can't we re-adjust imin when putting the bucket back in the cache after losing the race? This could cause the WSS estimate to be artificially inflated, but only temporarily, and it would simplify imin/imax management a lot.

In fact, if you call zone_put_bucket() with ws = true instead of false, I think we'll get the desired result. You just need to handle uz_bkt_max. If imax/imin are reset while the lock is dropped, we will end up with a larger WSS estimate than we should, but that's a narrow race and the estimate decays anyway.


I dislike the mix of uz_bkt_ and uz_bucket_ in field names.

You might also amend the comment, "Maxium per-domain bucket cache size".