Current UMA internals are not suited for efficient operation in multi-socket environments. In particular, there is very common use of MAXCPU arrays and other fields which are not always properly aligned and are not local to the target threads (apart from the first node, of course). It turns out the existing UMA_ALIGN macro can be used to mostly work around the problem until the code gets fixed. The current setting of 64 bytes runs into trouble when the adjacent cache line prefetcher kicks in.
Note that UMA_ALIGN had been set to 128 bytes for quite some time, until a few months ago when CACHE_LINE_SIZE was rightfully lowered to 64. Bumping CACHE_LINE_SIZE back would unnecessarily grow various structs.
It may be that a different macro would be appropriate to differentiate between the cache line size used for padding within objects not expected to be shared across sockets and the one used for global objects. It may also be usable for the __exclusive_cache_line area. In the meantime, go with the simple hammer of bumping UMA_ALIGN in place. I don't see a good header to add the new macro to - just adding it to param.h would require a tour over all the other archs to #define NEW_MACRO CACHE_LINE_SIZE.
Benchmarked as follows:
I have a WIP patch for scalability of POSIX locks, testable with the will-it-scale suite. In particular, it got to the point where there is no internal contention in the subsystem, and frequent malloc/free use reveals very serious sharing when running on a 4-socket Broadwell (128 threads).
Instruction samples were gathered with dtrace like this:

dtrace -w -n 'profile:::profile-4999 { @[arg0] = count(); } tick-5s { system("clear"); trunc(@, 10); printa("%40a %@16d\n", @); clear(@); }'
kernel`lf_advlockasync+0x43b     32940
kernel`malloc+0xe5               42380
kernel`bzero+0x19                47798
kernel`spinlock_exit+0x26        60423
kernel`0xffffffff80              78238
0x0                             136947
kernel`uma_zfree_arg+0x46       159594
kernel`uma_zalloc_arg+0x672     180556
kernel`uma_zfree_arg+0x2a       459923
kernel`uma_zalloc_arg+0x5ec     489910
after patching:
kernel`bzero+0xd                 46115
kernel`lf_advlockasync+0x25f     46134
kernel`lf_advlockasync+0x38a     49078
kernel`fget_unlocked+0xd1        49942
kernel`lf_advlockasync+0x43b     55392
kernel`copyin+0x4a               56963
kernel`bzero+0x19                81983
kernel`spinlock_exit+0x26        91889
kernel`0xffffffff80             136357
0x0                             239424
Prior to patching, there is huge fluctuation in performance:
min:435770 max:915190 total:102894626
min:371624 max:795878 total:84567294
min:444202 max:936194 total:105823088
min:460950 max:938158 total:108250116
min:428388 max:905290 total:101146182
min:395058 max:824672 total:90647822
min:423946 max:858306 total:96388120
min:378524 max:816704 total:87001472
min:530418 max:1050078 total:126711380
min:557534 max:1125526 total:135015810
min:447622 max:921036 total:104702698
min:459754 max:976386 total:115201488
After:
min:576860 max:1137782 total:134698878
min:577606 max:1134464 total:134707664
min:576492 max:1133410 total:134467790
min:578742 max:1129300 total:134747566
min:572880 max:1137094 total:134767962
min:567426 max:1127300 total:133970996
min:574816 max:1125308 total:134139574
min:570556 max:1129660 total:134514664
min:567004 max:1133980 total:133916864
Not only is it significantly faster, it is also stable.