Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.
Details
I've tested this on a FreeBSD VM (Skylake running VMware) through simply
building the kernel and updating the system from source, but it's not a
great environment in which to measure performance.
A similar patch (but against our 10.x derived system, which is much different
in the details) has run through much of our regression tests. I'm planning
to run performance tests on a variety of server systems from Ivy Bridge
through Broadwell with up to 32 cores or so, but focused on overall systems
reliability & performance.
If we have general many-core FreeBSD scaling tests, that would be useful.
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 5340 Build 5527: arc lint + arc unit
Event Timeline
I reviewed an earlier version of this patch and suggested getting approval from kib and mjg.
Do you have any measurements that demostrate the improvement due to the patch ? I do not object against this change going in if e.g. even micromeasurements like hwpmc or intel pcm tool show reduction of inter-core traffic.
Did you considered applying some 'advanced' locking algorithm where contention only exists between two cores, like some variant of the scalable rwlocks ?
sys/amd64/include/pcpu.h | ||
---|---|---|
68–69 | wouldn't this variable significantly benefit from avoiding false sharing ? I.e., ensure that the padding exists before the var ? | |
sys/x86/x86/mp_x86.c | ||
1404 | Do not use wmb() in FreeBSD native code. Besides, it is nop by all means on amd64, for the code in question. If your intent there is to ensure that the write to 'generation' variable is visible to other cores before IPI is sent, so that other core does write the correct value into pcpu area, it is already handled. For LAPIC xAPIC mode, APIC registers page is uncacheable, and IPI command register write flushes store buffer. For x2APIC mode, we issue neccessary fence explicitly, to handle such situations. Look at x86/local_apic.c:lapic_write_icr(). Note that other TLB handlers state variables, like smp_tlb_addr{1,2} and smp_tlb_pmap, would be vulnerable to the same issue. |
On a Westmere system (2 sockets x 2 cores x 2 threads), dtrace measurements show that smp_tlb_shootdown is about 50% faster with this patch; observations with VTune show that the percentage of time spent in invlrng_single_page on an interrupt (actually doing invalidation, rather than synchronization) increases from 31% with the old mechanism to 71% with the new one. (This is running a basic file server workload.)
I haven’t had a chance to test this change on a Broadwell system, where I originally observed the problem; that system had 16 cores, 32 threads, and had much worse contention. Once I get access to that system again, I'll have numbers from it as well.
Did you considered applying some 'advanced' locking algorithm where contention only exists between two cores, like some variant of the scalable rwlocks ?
Not really. Given the spinwait after a set of TLB interrupts are issued, would a more complex model help much? With this patch, the primary contention is always between two cores (the core initiating the TLB interrupt and each core executing it), and it's a pure reader/writer model; there are no cache lines written by more than one core.
sys/amd64/include/pcpu.h | ||
---|---|---|
68–69 | Thinking about it, I don’t think it would make much difference. It’s only written by the core completing the TLB request, and only read by the core waiting for it. The coherency traffic in the worst case should be something like
where "waiting core" = the core which requested the TLB invalidation, and "completing core" = one of the cores which executed an invalidation upon receiving an interrupt. There’s a potential for a later write by the completing core to another variable in the same line to cause invalidation of the line in the waiting core, but that’s the only transaction that I think would be saved. If the waiting core were fast enough, it’s possible that there could be some extra coherency traffic due to code on the completing core updating the line for unrelated per-cpu writes before the interrupt was handled, but that doesn’t seem likely to be excessive. Am I missing something, though? Given the amount of padding in the PCPU structure, it’d be easy to align this if it were useful. | |
sys/x86/x86/mp_x86.c | ||
1404 | Thanks — I had looked at some of that, but hadn’t realized that the memory-mapped xAPIC mode would flush the store buffer. I’ll take it out. My intent indeed was to ensure that the writes were guaranteed visible to other cores. |
Are you sure about 2 cores ? Might be 4 ?
Did you considered applying some 'advanced' locking algorithm where contention only exists between two cores, like some variant of the scalable rwlocks ?
Not really. Given the spinwait after a set of TLB interrupts are issued, would a more complex model help much? With this patch, the primary contention is always between two cores (the core initiating the TLB interrupt and each core executing it), and it's a pure reader/writer model; there are no cache lines written by more than one core.
Well, more complex solution would actually make each CPU to to only peek into one or two other's CPUs PCPU cache line, instead of current patch situation where, in the worst case, the leader has to access PCPUs of all other cores. This indeed might be not worth it if the load causes most of the invalidations happen for the user pmaps with limited pm_active masks. OTOH, for quite common ZFS usage a lot of invalidations occur for kernel pmap with the total pm_active mask.
I think that this patch could go in even if we decide later that some more optimizations are worth the efforts. Please update the current patch by removing wmb().
sys/amd64/include/pcpu.h | ||
---|---|---|
68–69 | Non-leader CPUs leave the TLB handler after the write to the line, and might start using the same cache line while leader still spins because it did not receive the invalidation message yet, or because leader still acts on other CPUs variables. So it is not hard to imagine that more exclusivity would help, but it is hard to predict whether the improvement worth it. |
Yes, it’s actually 4 cores, 1 thread each — I forgot we had hyperthreading disabled on BIOS on this system.
Move the read of smp_tlb_generation before serializing instructions.
Since several of these have two serializing instructions nearly back-to-back,
e.g. INVLPG followed by RETI, this allows the read to happen in parallel with
setup or execution of the serialized instruction; only the write need finish
before RETI completes. It ought to reduce latency slightly.
A very tiny microoptimization for invlrng_handler, while I'm there.
Since invlpg is serializing, the compare-against-memory is forced to
read smp_tlb_addr2 (albeit presumably from L1 cache) each time. By
loading it earlier, we speed up the loop a small amount. This could
be worse in cases where invlpg happens to invalidate any translation
structures affecting smp_tlb_addr2, though that's probably unlikely.
sys/amd64/amd64/mp_machdep.c | ||
---|---|---|
420 | Did you looked at the code generated ? Compiler might spill 'generation' into memory and then reload it after invpcid, not keeping the intermediary in a register. I expect it more from clang and less from gcc, the later usually generates better code. And such single-use variable deserves a comment, since my first action when looking at making any changes to such code, is to remove the once-used aliases. It sounds as if you want to rewrite the handlers back into assembler. |
sys/amd64/amd64/mp_machdep.c | ||
---|---|---|
420 | Yes, I did, at least with clang; I didn’t try building the kernel with gcc, but as you say, gcc generally does a better job with x86 code. We’re working with very few values here so they fit easily into the amd64 registers. I don’t think rewriting the handlers in assembly is worth it; the compiler does a decent job and in the absence of serializing instructions the processor can deal with almost any ordering of instructions reasonably well. However, for some of these handlers the actual invalidation cost is quite low (≈50 cycles) while the cost of serialization is high, and given the frequency of execution & simplicity of code, it seems worthwhile to me to do some of these micro-optimizations. Some of our workloads see 50K invalidations per second, which when multiplied by 31 cores is 1.5 million of these interrupts per second (ugh). I can add comments describing the use, something like “issue memory read before serializing instructions to avoid stalling afterwards with a read/write dependency”. (The IRET still stalls for the write to complete, unfortunately, but that's about half as long as for a dependent read and write.) |
Feel free. Note that this should be directly applicable to stable/11 and I do not want to allow this code to drift on stable/11 so early in its lifetime.
Note that this changes the shared x86 (i386 + amd64) TLB invalidation code without the support necessary in i386. Adding that wasn't totally trivial, so I've reverted the commit in head and reopened this review.
The bit which I do not like about the patch is the use of uint64_t for generation. Why cannot it be uint32_t ?
64bit type of i386 causes torn writes, which is not fatal, but easily avoided. The only issue I see is that you might want to avoid zero generation (and this is in fact not absolutely necessary), possible wraparound is fine because you only compare gen for equal. You could easily do generation = ++smp_tlb_generation; if (generation == 0) generation = ++smp_tlb_generation;
It could be, but I’m a bit concerned about the possibility that a CPU could be offline (or not running a particular process) for a very long time, for instance when using cpuset. With 2^32 generation numbers, in a situation where we produce 10K TLB invalidations per second, that’s about 5 days; or 6 weeks, at 1K/sec.
This is *extremely* unlikely, but in an appliance it seems conceivable to me — a UI might not be invoked for months while processes dedicated on particular cores are running. Given the difficulty of diagnosing the failure mode (running slightly too soon on one core due to an invalidation not having been processed yet on a remote CPU), the extra cost of using a 64-bit generation (a dependent add, and an additional test) seems low to me.
The torn writes don’t affect this particular code path, since we’re testing for equality and always incrementing by 1.
That said, if you feel this is not a plausible issue, I’m OK with using 32-bit generation numbers, at least on x86 where 64-bit incurs additional cost. In that case, would you rather (a) use 32-bit numbers for both i386 & amd64; (b) introduce a new typedef; or (c) conditionalize the declarations in mp_x86.c?
Thanks!
Generation numbers are global. so it does not matter if some pmap not get activated on the specific core for some time, it is only important that the core gets an invalidation IPI for some pmap. And, since kernel pmap is always active on all cores, I do not believe that you may get generation wraparound with some CPU still having stale (duplicated) generation number.
IMO it is reasonable to select the simplest approach and just use uint32_t without typedef on both arches.
I am fine with the code changes. As requested, please add a comment somewhere explaining why the generation variables are used.
@kib, I don't have a way to build-test i386 at this time. Will you build & commit when it's ready?
FYI, you can always do 'make TARGET=i386 kernel-toolchain && make TARGET=i386 buildkernel' to build test on an amd64 host.
That (or something similar) wasn't working for me due to some sort of Makefile issue that I didn't have time to dig into. I guess I need to get in the habbit of using cluster resources.