Page MenuHomeFreeBSD

amd64 pmap: rework di removing global mutex
ClosedPublic

Authored by kib on Mar 18 2019, 9:50 PM.
Tags
None
Referenced Files
F133191390: D19630.diff
Thu, Oct 23, 8:09 PM
Unknown Object (File)
Sun, Oct 19, 3:26 AM
Unknown Object (File)
Sat, Oct 18, 4:32 AM
Unknown Object (File)
Sat, Oct 18, 12:51 AM
Unknown Object (File)
Sat, Oct 11, 1:40 PM
Unknown Object (File)
Sat, Oct 11, 1:27 AM
Unknown Object (File)
Thu, Oct 9, 3:22 AM
Unknown Object (File)
Sun, Oct 5, 1:32 AM
Subscribers

Details

Summary

For machines having cmpxcgh16b instruction, i.e. everything but very early Athlons, provide lockless implementation of delayed invalidation.

The implementation maintains lock-less single-linked list with the trick from the T.L. Harris article about volatile mark of the elements being removed. Double-CAS is used to atomically update both link and generation. New thread starting DI appends itself to the end of the queue, setting the generation to the generation of the last element +1. On DI finish, thread donates its generation to the previous element. The generation of the fake head of the list is the last passed DI generation.
Basically, the implementation is queued spinlock without spinlock.

On 32-thread machine, running make -j 32 buildworld on tmpfs gives the following numbers:

root@r-freeb43:~ # sysctl vm.pmap | grep invl
vm.pmap.invl_wait: 6
vm.pmap.invl_max_qlen: 8
vm.pmap.invl_finish_restart: 15209
vm.pmap.invl_start_restart: 77830

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 24114

Event Timeline

Complete rework, now it is almost like locked version, but unlocked.

kib added reviewers: alc, markj, dougm.

I can't see any correctness problems, but I plan to reread the diff once or twice more.

Do I understand correctly that invl_started always uses a locked cmpxchg to iterate through the full list? So, isn't there a significant scalability penalty incurred by an atomic operation on the list head? This is different from, say, MCS queue locks, which maintain a pointer to the tail of the queue.

sys/amd64/amd64/pmap.c
626

Extra newline.

703

"between the generation write and the update of next."

745

Why exactly is the fence needed?

kib marked 2 inline comments as done.Apr 3 2019, 11:23 AM

Do I understand correctly that invl_started always uses a locked cmpxchg to iterate through the full list? So, isn't there a significant scalability penalty incurred by an atomic operation on the list head? This is different from, say, MCS queue locks, which maintain a pointer to the tail of the queue.

After your question, I do not see a need in full struct loads while walking the list. We can only follow the next pointer until it is either blocked or NULL. The selected element must be loaded with 16-byte atomic for gen/next consistency.

I also thought about caching the pointer to the tail element. I do not see how can it be made race-free, but I can detect and handle the case where the cached tail is invalid, by failing back to the full list walk. That said, I have doubts that it would be beneficial, below is the stats values for the updated patch with make -j 48 buildworld on 32-threads machine. You can see that the max length of the queue is modest, which makes me somewhat reluctant to add complexity for tail caching.

vm.pmap.invl_wait: 7
vm.pmap.invl_max_qlen: 6
vm.pmap.invl_finish_restart: 3603
vm.pmap.invl_start_restart: 359735
sys/amd64/amd64/pmap.c
745

It is somewhat formal fence between pmap_di_load_invl() and pmap_di_store_invl(). On one hand, you can say that pmap_di_load/store_invl() are sequentially consistent and provide full fence. On the other hand, you can not state this and explicitly fence them with almost nop (compiler memory barrier only).

I added comment and added explicit fence in the start_u() code.

Use plain loads to walk the di list.

In D19630#424768, @kib wrote:

Do I understand correctly that invl_started always uses a locked cmpxchg to iterate through the full list? So, isn't there a significant scalability penalty incurred by an atomic operation on the list head? This is different from, say, MCS queue locks, which maintain a pointer to the tail of the queue.

After your question, I do not see a need in full struct loads while walking the list. We can only follow the next pointer until it is either blocked or NULL. The selected element must be loaded with 16-byte atomic for gen/next consistency.

That makes sense to me.

I also thought about caching the pointer to the tail element. I do not see how can it be made race-free, but I can detect and handle the case where the cached tail is invalid, by failing back to the full list walk. That said, I have doubts that it would be beneficial, below is the stats values for the updated patch with make -j 48 buildworld on 32-threads machine. You can see that the max length of the queue is modest, which makes me somewhat reluctant to add complexity for tail caching.

vm.pmap.invl_wait: 7

The small value here makes me less concerned about the loss of priority propagation.

vm.pmap.invl_max_qlen: 6
vm.pmap.invl_finish_restart: 3603
vm.pmap.invl_start_restart: 359735

Did you try the will-it-scale benchmark that mmacy was using? To run it, install the hwloc package and clone github.com/scalebsd/will-it-scale. Build with gmake and run mmap2_processes -t <num concurrent tasks>.

In D19630#424768, @kib wrote:
vm.pmap.invl_wait: 7

The small value here makes me less concerned about the loss of priority propagation.

I am not aware of any race-free way to make the unlocked spins release the CPU. I added adaptive cpu spins to the start_u/finish_u restarts.

I also tried the caching of the list tail, which we mentioned at the previous round of discussion. Suprisingly it degrades the will-it-scale numbers.

Did you try the will-it-scale benchmark that mmacy was using? To run it, install the hwloc package and clone github.com/scalebsd/will-it-scale. Build with gmake and run mmap2_processes -t <num concurrent tasks>.

This is on 32-threads/16 cores/2 sockets IvyBridge

Unlocked DI
# ./mmap2_processes -t 32 -s 5
testcase:Separate file mmap/munmap of 128MB
warmup
min: 21656 max: 63097 total: 1297981
min: 22493 max: 60667 total: 1298617
min: 21492 max: 59808 total: 1297241
min: 22830 max: 63553 total: 1296002
measurement
min: 23669 max: 60940 total: 1298581
min: 23124 max: 61635 total: 1299444
min: 21438 max: 62695 total: 1300682
min: 22033 max: 61215 total: 1298821
min: 22801 max: 61973 total: 1298015
average: 1299108
# sysctl vm.pmap | grep invl
vm.pmap.invl_wait: 0
vm.pmap.invl_max_qlen: 8
vm.pmap.invl_finish_restart: 55959
vm.pmap.invl_start_restart: 41367043

# BASELINE
# ./mmap2_processes -t 32 -s 5
testcase:Separate file mmap/munmap of 128MB
warmup
min: 22845 max: 31327 total: 843743
min: 21436 max: 32087 total: 842081
min: 22448 max: 32550 total: 841662
min: 21023 max: 31826 total: 841665
measurement
min: 22058 max: 31586 total: 841373
min: 22530 max: 31514 total: 841913
min: 21018 max: 30832 total: 842161
min: 21874 max: 30873 total: 840331
min: 21836 max: 31089 total: 841830
average: 841521

Add lock_delay() when restarting DI list iterations.

In D19630#425050, @kib wrote:
In D19630#424768, @kib wrote:
vm.pmap.invl_wait: 7

The small value here makes me less concerned about the loss of priority propagation.

I am not aware of any race-free way to make the unlocked spins release the CPU. I added adaptive cpu spins to the start_u/finish_u restarts.

I also tried the caching of the list tail, which we mentioned at the previous round of discussion. Suprisingly it degrades the will-it-scale numbers.

Did you try the will-it-scale benchmark that mmacy was using? To run it, install the hwloc package and clone github.com/scalebsd/will-it-scale. Build with gmake and run mmap2_processes -t <num concurrent tasks>.

This is on 32-threads/16 cores/2 sockets IvyBridge

How does it compare with single-threaded performance, i.e., with -t 1?

sys/amd64/amd64/pmap.c
660

Not very important, but I was always confused by the use of the past tense, e.g., "started" instead of "start" in the interface names. Is there a reason for it?

725

Extra return statement.

This revision is now accepted and ready to land.Apr 24 2019, 6:49 AM

How does it compare with single-threaded performance, i.e., with -t 1?

On the same machine

stock
# ./mmap2_processes -t 1 -s 5
random: unblocking device.
testcase:Separate file mmap/munmap of 128MB
warmup
min: 429978 max: 429978 total: 429978
min: 414983 max: 414983 total: 414983
min: 428964 max: 428964 total: 428964
min: 428707 max: 428707 total: 428707
measurement
min: 429682 max: 429682 total: 429682
min: 429065 max: 429065 total: 429065
min: 429034 max: 429034 total: 429034
min: 429629 max: 429629 total: 429629
min: 429285 max: 429285 total: 429285
average: 429339
		
patched
# ./mmap2_processes -t 1 -s 5
testcase:Separate file mmap/munmap of 128MB
warmup
min: 420380 max: 420380 total: 420380
min: 444420 max: 444420 total: 444420
min: 443576 max: 443576 total: 443576
min: 444095 max: 444095 total: 444095
measurement
min: 443143 max: 443143 total: 443143
min: 443740 max: 443740 total: 443740
min: 442747 max: 442747 total: 442747
min: 443354 max: 443354 total: 443354
min: 426515 max: 426515 total: 426515
average: 439899

Do you mean you want to rename _started() to _start() and _finished() to _finish() ? I propose to do this after the current patch lands.
(return removed, I update the review later).

In D19630#430752, @kib wrote:

How does it compare with single-threaded performance, i.e., with -t 1?

On the same machine
...

I did some basic benchmarking with the patch. With the mutex removed the main bottleneck appears to be atomic updates of the global swap_reserved and credentials (crhold() and crunhold()); the pmap does not show up in my profiling.

Do you mean you want to rename _started() to _start() and _finished() to _finish() ? I propose to do this after the current patch lands.
(return removed, I update the review later).

Yes, thanks.

kib added a subscriber: pho.

Cuurent snapshot of the working branch, no real changes.

This revision now requires review to proceed.Apr 29 2019, 5:54 PM
sys/amd64/amd64/pmap.c
880

Why is the algorithm predicated on CPUID_CX8 and not CPUID2_CX16?

This revision is now accepted and ready to land.Apr 29 2019, 9:04 PM

Fix a priority-invertion kind of: if a thread sets the PMAP_INVL_GEN_NEXT_INVALID bit on the element in di invl_gen queue, it must be allowed to finish it work. Otherwise, if de-scheduled, other threads start spinning on the still queued element and prevent the owner from removing it. Basically, splat critical section around bit set and remove, or insert and bit clear, which is same as 'put us at highest priority'.

This version passed whole Peter Holm' stress2 run.

This revision now requires review to proceed.May 3 2019, 5:52 PM
sys/amd64/amd64/pmap.c
772

It seems undesirable to spin in the critical section, here and below.

sys/amd64/amd64/pmap.c
772

Are you suggesting to remove lock_delay()s ? I cannot drop critical section without clearing _INVALID, and clearing it seems to be worse than silently retry.

Note that retries at finish() occur quite rarely, comparing with start().

sys/amd64/amd64/pmap.c
772

I'm not sure I see the purpose of the lock delays. When used in the lock primitives, their purpose is to ensure that waiting threads do not all simultaneously cmpxchg the lock word when the lock is released by its owner. With the queue, different threads will be updating different elements of the queue, so we should not see the same kind of contention for the same cache line.

sys/amd64/amd64/pmap.c
772

Lock delays in start() are definitely needed, I initially saw worse performance in the version that did not have them. Reason is that all incoming threads contend on the tail of the queue, so backing off is useful to allow one of them to make progress.

For finish() the contention is much smaller, but there is a chance of finding invalid element on queue. The cache line for the element is loaded in shared state, and must be invalidated by updater to obtain exclusive state. So I believe there is some sense in backing off. But I also understand your point.

Remove lock_delay()s from finished_u().

markj added inline comments.
sys/amd64/amd64/pmap.c
772

I agree that the lock_delay()s in start() make sense.

For finish() I would also worry that the delays become transitive. e.g., suppose threads 1, 2, 3 are on the queue in that order, and finish their blocks in that order as well. If thread 2 spins on thread 1's _INVALID flag and accumulates a large delay, thread 3 cannot proceed until thread 2 has finished spinning and completes its own queue update.

This revision is now accepted and ready to land.May 4 2019, 10:51 PM

Restore lock_delay in finished_u(), by clearing _INVALID bit and ending critical section before starting the delay loop. Restructure the code to simplify retry.

This revision now requires review to proceed.May 5 2019, 5:00 PM
This revision is now accepted and ready to land.May 5 2019, 5:10 PM
  1. Optimization: in pmap_delayed_invl_finished_u(), do not mark curthread' invl_gen entry invalid until we found the prev pointer. This allows much more parallelism for starting threads.
  2. Unexpected consequence of the item 1 was that the search for the prev pointer could fail now, because thread might finish its DI and start a new one, which makes the search loop to skip all block between old and position of that thread. Allow the search to fail and restart it.
  3. Bump the thread priority to PVM when entering DI block, by lending. Without it, the variance in some stress2 tests was too high, up to 600%.
  4. Added ddb 'show di_queue' command.

This version seems to pass whole stress2 and there is no strange timing issues.

This revision now requires review to proceed.May 7 2019, 4:19 PM

I ran the full stress2 test using two hosts. No problems seen.

sys/amd64/amd64/pmap.c
803

I would add a __predict_false here, it should be quite rare with the priority "lending."

822

Suppose the thread owns a mutex and a waiter has already propagated its priority to curthread. Won't we lose the priority boost here? I suspect that the original priority must be saved and restored in order to preserve the boost.

kib marked an inline comment as done.May 15 2019, 7:33 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
822

I was unable to make lending work properly except by putting a fake turnstile into the thread td_contested list. This is actually quite tricky, but the worst part is that it would require taking global td_contested_lock, which makes the whole exercise of lockless DI pointless (OTOH thread lock is fine to take).

So I falled back to the low-tech prio bumping without lending. There should be no other priority manipulations inside DI-marked regions, except for turnstile business, so it should be fine.

Add __predict_false().
Switch to use sched_prio() instead of sched_lend_prio().

The subr_turnstile.c chunk is completely unrelated, it is seemingly useful code duplication removal which I extracted from my lending experiments.

This revision is now accepted and ready to land.May 15 2019, 9:13 PM
This revision was automatically updated to reflect the committed changes.