Current UMA internals are not suited for efficient operation in multi-socket environments. In particular, there is very common use of MAXCPU arrays and other fields which are not always properly aligned and are not local to the target threads (apart from the first node, of course). It turns out the existing UMA_ALIGN macro can be used to mostly work around the problem until the code gets fixed. The current setting of 64 bytes runs into trouble when the adjacent cache line prefetcher kicks in.
Note that UMA_ALIGN had been set to 128 bytes for quite some time, until a few months ago when CACHE_LINE_SIZE was rightfully lowered to 64. Bumping CACHE_LINE_SIZE back would unnecessarily grow various structs.
It may be that a different macro would be appropriate to differentiate between the cache line size used for padding within objects not expected to be shared across sockets and the one used for global objects. It may also be usable for the __exclusive_cache_line area. In the meantime, go with the simple hammer of bumping UMA_ALIGN in place. I don't see a good header to add the new macro to - just adding it to param.h would require a tour over all the other archs to #define NEW_MACRO CACHE_LINE_SIZE.
Benchmarked as follows:
I have a WIP patch for scalability of POSIX locks, testable with the will-it-scale suite. In particular, it got to the point where there is no internal contention in the subsystem, and frequent malloc/free use reveals very serious sharing when running on a 4-socket Broadwell (128 threads).
Instruction samples were gathered with dtrace like this:

dtrace -w -n 'profile:::profile-4999 { @[arg0] = count(); } tick-5s { system("clear"); trunc(@, 10); printa("%40a %@16d\n", @); clear(@); }'
kernel`lf_advlockasync+0x43b     32940
kernel`malloc+0xe5               42380
kernel`bzero+0x19                47798
kernel`spinlock_exit+0x26        60423
kernel`0xffffffff80              78238
0x0                             136947
kernel`uma_zfree_arg+0x46       159594
kernel`uma_zalloc_arg+0x672     180556
kernel`uma_zfree_arg+0x2a       459923
kernel`uma_zalloc_arg+0x5ec     489910
after patching:
kernel`bzero+0xd                 46115
kernel`lf_advlockasync+0x25f     46134
kernel`lf_advlockasync+0x38a     49078
kernel`fget_unlocked+0xd1        49942
kernel`lf_advlockasync+0x43b     55392
kernel`copyin+0x4a               56963
kernel`bzero+0x19                81983
kernel`spinlock_exit+0x26        91889
kernel`0xffffffff80             136357
0x0                             239424
Prior to patching, there is huge fluctuation in performance:
min:435770 max:915190 total:102894626
min:371624 max:795878 total:84567294
min:444202 max:936194 total:105823088
min:460950 max:938158 total:108250116
min:428388 max:905290 total:101146182
min:395058 max:824672 total:90647822
min:423946 max:858306 total:96388120
min:378524 max:816704 total:87001472
min:530418 max:1050078 total:126711380
min:557534 max:1125526 total:135015810
min:447622 max:921036 total:104702698
min:459754 max:976386 total:115201488
After:
min:576860 max:1137782 total:134698878
min:577606 max:1134464 total:134707664
min:576492 max:1133410 total:134467790
min:578742 max:1129300 total:134747566
min:572880 max:1137094 total:134767962
min:567426 max:1127300 total:133970996
min:574816 max:1125308 total:134139574
min:570556 max:1129660 total:134514664
min:567004 max:1133980 total:133916864
Not only is it significantly faster, it is also stable.