Page MenuHomeFreeBSD

per-domain page queue free locking
ClosedPublic

Authored by jeff on Jan 20 2018, 7:44 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 5, 3:08 PM
Unknown Object (File)
Sun, Jan 5, 2:41 PM
Unknown Object (File)
Fri, Jan 3, 4:13 PM
Unknown Object (File)
Mon, Dec 9, 8:56 PM
Unknown Object (File)
Sun, Dec 8, 10:57 AM
Unknown Object (File)
Nov 25 2024, 6:53 AM
Unknown Object (File)
Nov 23 2024, 8:33 PM
Unknown Object (File)
Nov 21 2024, 7:19 PM

Details

Summary

This diff primarily impacts reservations and paging. I took the opportunity to do some light refactoring for consistency. I made the vm_domain more of a first class structure and renamed a couple of functions and added many new ones to indicate that they operate on properties of the domain. I made some variable names more consistent across files where I was already making changes. I tried to make the API around domains more coherent since I expect it to become more stable. I use the integer 'domain' when crossing module boundaries so that we don't need to pollute more code with vm structures and 'struct vm_domain *vmd' in functions associated with paging.

The paging changes remove all counters except for wire_count from vmmeter. The paging targets and control variables are all properties of the domain. They are summed for presentation to the user. The atomics on global queue counts are now gone. There are global bitmasks of low and severe domains that are used as allocation hints in a few places. VM_WAIT() calls wait for !low now. The only VM_WAIT calls that remain are fault, pte allocation, and some drm code which I believe could be modified to use WAITFAIL. Everything else sleeps on a specific domain now.

I did make laundry per-domain. I realize this may be somewhat controversial although i did discuss with Mark first. My opinion is that we already have multiple I/O threads and we're likely to also have multiple I/O devices on machines with many domains. The code is structurally much simpler using multiple laundry threads in any event. So I chose to go this way.

There are a few cases where code really wants to know a global free count. This is done by looping over all domains and summing. With only three exceptions, swap, and tmpfs, I removed this potentially expensive loop from performance sensitive code. ZFS uses it but I believe only in irregularly called functions.

The reservation changes are as previously discussed. I added some more comments to detail the locking. I resolved the unlocked rv access Mark pointed out. Briefly, this protected the reservation structures with the free lock from the domain they belong to. The object's rvq is protected with a reservation object lock array. This also keeps the object/pindex stable in a reservation which is on the rvq so either this or the free lock may be used to inspect those members.

There are a few places where we have to iterate and swap free locks when pages are not all from the same domain. This really suggests that we should look at the round-robin code and give it a stride so non-reservation allocations don't switch domains for every page. I will continue to refine this in future patches.

I have early perf results from mjg_ that show double the pagefault performance on a microbenchmark and 1/3rd less system time on a 4 socket buildkernel. I am just now starting to benchmark myself.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/vm/vm_meter.c
218–222

I forgot that I need to fix this. There are now many wait channels. I'm open to suggestions.

445

These should be consolidated into a single iterator.

sys/vm/vm_page.c
1898–1899

This should probably be a vm_object inline.

2544

I believe this is true since we are operating on a contiguous region of memory. Should probably assert below in the loop instead.

2804–2817

Please review this in particular.

2882–2886

Could be an inline now?

sys/vm/vm_pageout.c
1743–1744

I'm not sure this is even necessary. The code in vm_page_free_wakeup() should be sufficient.

1855

interrupt is a misnomer. As far as I can tell it's only used by M_USE_RESERVE

sys/vm/vm_pagequeue.h
77

There is probably some perf to be gained by sorting and aligning this structure further.

sys/vm/vm_phys.c
657

If we push a new lock to protect reservations we could split the free lock for synchronization of sleeps and paging from the lock that protects the actual free queues.

sys/vm/vm_phys.h
104

I really dislike this name. I think the new scheme makes sense.

sys/vm/vm_reserv.c
254

divide and modulo may be expensive

685

This section of code is similar to the one in vm_reserv_alloc_page() can could likely be consolidated.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
382

This probably deserves a comment.

sys/vm/vm_meter.c
218–222

If we counted sleepers in vm_wait_domain() using a per-domain var, we could write t_pw as a sum of vm_min_waiters, vm_severe_waiters and the per-domain sleeper count. Such a count might be useful elsewhere (OOM killer logic perhaps).

445

You could make use of the "arg" parameter to SYSCTL_PROC.

sys/vm/vm_page.c
2762

This should be vm_page_count_severe().

2875

This seems overly conservative, though maybe not a problem in practice. Suppose that a thread's allocation policy restricts it to a single domain. Then it only makes sense to check that domain's free count. We could use the thread and object domain allocation policies to derive the set of domains that we care about, and restrict out attention to that set.

Also, this is identical to vm_wait_min(); we might consider parameterizing the prio and wmesg.

sys/vm/vm_pageout.c
1855

VM_ALLOC_INTERRUPT seems to be used in quite a few places? I agree that the name doesn't make sense.

1911

This should be per-domain too.

sys/vm/vm_pagequeue.h
159

Kind of a weird name since this function has nothing to do with pagequeues.

sys/vm/vm_page.c
1690

Just to be pedantic, I think we should re-evaluate this after dropping the object lock, i.e., move this to after the "again" label.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
382

I just mimic'd the existing setting from pageout_wakeup_thresh. I don't know why zfs used it in the first place.

sys/vm/vm_meter.c
218–222

Yes, that's probably good. I'll make that change.

sys/vm/vm_page.c
2875

You are right. I think if we sleep until any domain transitions above min we will be close to the original intent. I can wakeup on any bit being cleared in the set instead of every. Long term I would like to totally eliminate naked vm_wait() calls for something with more information. We could do better than sleeping for min if we knew the original 'req' parameter. It is only used in vm_fault and pmap now where locking is too complex to sleep in vm_page_alloc.

vm_wait_min() still wants every.

sys/vm/vm_pagequeue.h
118–120

With a level of indirection here we could place the struct vm_domain within the domain it is governing.

159

Well people didn't like simply vm_page_domain() and vm_phys_domain(). I don't like domidx(). It makes sense to me that the physical domain is simply a number. All of the public APIs refer to this number. I took vm_pagequeue_domain() to mean 'the domain structure that holds the pagequeue structures' and in that way the function name made sense to me and was more descriptive.

Long term we may want to be able to have multiple vm_domain structures per physical domain for performance reasons.

sys/vm/vm_page.c
1963

The issue with the unfair allocation after the wait is still there, and perhaps it is even exaggerated. After we did full iteration over all domains and all domains failed to provide an allocatable page, we sleep and try to allocate from the initial domain, forever.

I believe that vm_domainset_lock and vm_min_domains bitset do allow to solve this correctly.

2764

Should this wait performed on the &vm_severe_domains ?

2837

I suggest changing the return type to bool.

sys/vm/vnode_pager.c
1170

A random place to put this note: previous condition could be outdated, but it was at least the correct state, some nearby time ago. Now, with the lock-less access to the vm_min_domains mask, potentially we do evaluation of several words, and the cumulative state might 'too' invalid.

If it is reasonable to expect that domain mask fits into a single word, may be we should assert this ?

jeff marked 2 inline comments as done.Jan 30 2018, 6:38 PM
jeff added inline comments.
sys/vm/vm_page.c
1963

What I would like to do is further eliminate calls to VM_WAIT() and then pass the failing object in on the few that remain. If that is not safe I will pass the failing domainset which is. I didn't want to deviate too much from the existing code with this patch. I will refactor this entirely when I bring in the pid controller since you can rely on different behavior from the pagedaemon at the same time.

sys/vm/vnode_pager.c
1170

Can you explain why a torn write here is a problem? The previous check is inherently racey. We may miss a setting or clearing of the condition in either case.

The bitmask will be a single word up to 64 domains.

sys/vm/vm_page.c
1963

I initially thought that implementing the fairness would be possible but somewhat intensive. Apparently, it seems to be quite easy.

If you only put the thread to sleep when iterator restarts, instead of sleeping on each step after the first full iteration, I believe the issue of fairness will be solved. I mean that the thread would only sleep once between attempts to allocate from all possible domains, which is the same behavior as it has now with the non-numa config.

jeff marked 3 inline comments as done.Jan 30 2018, 6:52 PM
jeff added inline comments.
sys/vm/vm_page.c
1963

I agree. The addition of the bitsets makes this possible and it makes it possible to do precise sleeps on sets of domains. There is a global lock on memory exhaustion but I believe proper pageout regulation will make this quite rare in cases that are not I/O bound. If you're I/O bound and out of memory a global lock is inconsequential.

I will make this change when I bring in the pid control so I can test the changes to the regulation mechanism in concert.

sys/vm/vnode_pager.c
1170

We check the first word, and noted that all our domains are at min or at severe. We move to the next word, but between that the domains recorded in the first word are cleared, while second word gets our domains marked as min/severe. As result, the thread is going to sleep, while if the read of domainset was raceless by any technique, we would notice a suitable ready domain.

jeff marked an inline comment as done.Jan 30 2018, 7:05 PM
jeff added inline comments.
sys/vm/vnode_pager.c
1170

The cases that sleep use the bitset lock to ensure sleep/wake synchronization which has the side-effect of giving them a stable look at the bits.

In this case we're just deciding whether we do a synchronous write or not. We may miss some state transition but I still assert the same was true before. This is just trying to limit page consumption by throttling slightly.

This revision is now accepted and ready to land.Feb 5 2018, 11:09 PM
markj added inline comments.
sys/vm/vm_page.c
726

Extra semicolon.

sys/vm/vm_reserv.c
1014

We should assert that the object is locked.