
lockprof: fix hangs under load when changing the state or dumping stats
Accepted · Public

Authored by mjg on Sep 21 2019, 7:17 PM.

Details

Reviewers
kib
jeff
Summary

quiesce_cpus schedules the current thread on all available CPUs in turn and is basically broken on contemporary machines, with no hope of getting the job done in a reasonable time. In my particular case, on a 104-way box, it failed to complete after 4 minutes. It should be eliminated, but it has yet another consumer.

lockprof can get away with a much cheaper requirement: we only need to know that everyone has left the code section. Stolen from rmlocks, this patch adds an IPI which posts a seq_cst fence. Paired with a release fence issued before the critical section is exited in the stats-updating code, this provides the synchronization necessary to reliably wait for everyone to leave.
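
A minimal sketch of the handshake, to make the pairing concrete. This is not the diff itself: prof_enabled, cpu_in_section() and the stats_* wrappers are placeholder names; critical_enter()/critical_exit(), CPU_FOREACH() and cpu_spinwait() are existing kernel primitives, and cpus_fence_seq_cst() is the helper added by this patch.

static volatile int prof_enabled;       /* placeholder for lock_prof_enable */

/* Hot path: runs on every CPU updating stats, with no hardware fence. */
static void
stats_update(void)
{

        critical_enter();
        if (__predict_true(prof_enabled)) {
                /* ... update per-CPU statistics ... */
        }
        atomic_thread_fence_rel();      /* paired with the IPI-posted fence */
        critical_exit();
}

/* Cold path: disable (or dump) and wait for everyone to leave. */
static void
stats_quiesce(void)
{
        int cpu;

        prof_enabled = 0;
        cpus_fence_seq_cst();   /* IPI all CPUs; each posts a seq_cst fence */
        CPU_FOREACH(cpu) {
                while (cpu_in_section(cpu))     /* placeholder per-CPU "still inside?" check */
                        cpu_spinwait();
        }
}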

With the patch all ops work with almost no delay.

Also note that the approach can be used with vfs_op_thread_enter/exit, but I find it a little iffy without a bitmap of the CPUs which ever used it, for reasons I'll explain in a different review.


Event Timeline

mjg created this revision. Sep 21 2019, 7:17 PM
mjg edited the summary of this revision. Sep 21 2019, 7:18 PM
kib added a comment. Sep 22 2019, 1:37 PM

Hm, having something like quiesce is useful. I understand your observation about the idle threads. But then we should create a highest-priority interrupt thread that gets bound to all CPUs sequentially and provide the service that way.

mjg added a comment (edited). Sep 22 2019, 1:43 PM

That would be avoidably wasteful for this code. Part of the motivation is to be able to get periodic snapshots of lockprof stats without disturbing the actual workload on top of the existing overhead.

Regardless, in the current kernel there are only 2 places using the function, and the other one probably won't need it either. Even if it does, I think fixing the current primitive so that it performs well is mostly orthogonal to this change. Note the other uses of the IPI barrier.

jeff added inline comments. Sep 23 2019, 8:21 PM
sys/kern/subr_lock.c
319

Seems weird to not cache the pcpu ptr.

sys/kern/subr_smp.c
940

Can we not replace quiesce_all_cpus with this mechanism entirely?

mjg added inline comments. Sep 23 2019, 8:31 PM
sys/kern/subr_lock.c
319

Sure, I can change that, but I won't be re-uploading the patch if that's all.

sys/kern/subr_smp.c
940

As I noted: no (or at least not the way it is used right now).

With lockprof out of the way you are left with ktrace. It would have to be modified to provide some form of "no longer in ktrace code" point, just like I'm abusing critical sections for lockprof.

Should a more general (and working) approach be desired (e.g., the one the current code claims to provide), the fix proposed by kib (high-priority threads) would do the trick, but it would also be too expensive for users like this one.

Finally, even if this is to become part of a general solution later on, it will still have to work standalone, as it does here, so as not to induce extra overhead where it can be avoided. I note once again that part of the motivation here is to be able to periodically dump stats under load with minimal extra disruption.

mjg updated this revision to Diff 62492. Sep 24 2019, 2:47 AM
  • stop re-reading the pcpu pointer
  • make use of cpus_fence_seq_cst_issue in !SMP case
  • force re-read of the curthread pointer
kib added inline comments. Sep 24 2019, 9:48 AM
sys/kern/subr_lock.c
317

Don't you need this function to execute in a critical section?

326

Are both casts needed?

sys/kern/subr_smp.c
945

The indent is wrong, and the lines can be packed more tightly.

mjg added inline comments. Sep 24 2019, 11:54 AM
sys/kern/subr_lock.c
317

No. Even if we migrate, it's harmless, since the loop below will find us not in the critical section. In fact, the check can be removed if it appears confusing.

326

The compiler complained in both cases. IMO these macros should be reworked to do this internally, but that's for another time; there are already users which cast.

sys/kern/subr_smp.c
945

It's copy-pasted from rmlocks; I did not realize I had reindented it later. Is 4 fine?

mjg updated this revision to Diff 62522. Sep 24 2019, 7:39 PM
  • drop the curcpu check, it is of no essence
  • add a comment explaining the loop
  • reindent cpus_fence_seq_cst
kib accepted this revision. Sep 25 2019, 12:40 PM
This revision is now accepted and ready to land. Sep 25 2019, 12:40 PM
jeff added inline comments. Sep 25 2019, 11:21 PM
sys/sys/smp.h
267

I don't really understand what the abbreviations mean. I understand the code. Can you document the functions with a one-liner? You should also detail the requirements for consumers, like requiring critical_enter(), etc.

jeff accepted this revision. Sep 25 2019, 11:22 PM
mjg added inline comments. Sep 26 2019, 12:15 AM
sys/sys/smp.h
267

They come from C11 and are already used elsewhere in the tree.

There are no strict requirements for any of the consumers, apart from replacing whatever fence they had with a compiler barrier. I can add a comment above the function itself. In fact, I have plans for a consumer which most definitely does not use critical_enter/exit.

kib added inline comments. Sep 26 2019, 7:35 AM
sys/sys/smp.h
267

seq_cst means sequentially consistent; it is a common abbreviation, seemingly originating with the C++/Java memory model working group.

I have some doubts that this form of the strongest possible fence semantics is really needed there (and mjg does not provide any reasoning, as usual). But even if a weaker form would suffice, a sequentially consistent fence is also correct. In fact, the recommendation is to use seq_cst by default and only weaken it when there are performance implications.

mjg added inline comments. Oct 5 2019, 8:57 PM
sys/sys/smp.h
267

It's the same scheme as used for vfs_op_* and has roughly the same amount of explanation, so I did not see the need to elaborate.

So how about this:
Force all CPUs to post a sequentially consistent fence. This can be used to eliminate the use of atomic_thread_fence_seq_cst in regular code at the expense of issuing an IPI if necessary; thus, this should only be used when the synchronization provided by the barrier is very rarely needed.

It allows you to convert the following:

foo = 1;
atomic_thread_fence_seq_cst();
bar = baz;

into this:

foo = 1;
__compiler_membar();
bar = baz;

mjg added a comment (edited). Oct 5 2019, 9:47 PM
This comment has been deleted.
sys/sys/smp.h
267

> I have some doubts that this form of the strongest possible fence semantic is really needed there

Both _rel and _acq expand to compiler barriers on amd64 which still allows the following to be reordered so that baz is loaded first:

foo = 1;
bar = baz;

So in this particular case, assuming lock_prof_enable == 1:

td->td_critnest++;
if (lock_prof_enable) ...

can turn into:

A                                       B
state = lock_prof_enable;
                                        lock_prof_enable = 0;
                                        if (td->td_critnest == 0)
                                                /* incorrectly concludes the thread is out */
td->td_critnest++;
if (state == 0)
        ... bail;
/* continues */

This can also fail the other way, where we read td_critnest before the store setting lock_prof_enable to 0 becomes visible.
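
For reference, a sketch of the classic store/load litmus test (illustrative only, not part of the patch): with nothing but compiler barriers, amd64 allows both loads to observe 0, because each CPU may execute its load before its own earlier store becomes visible. Forbidding that outcome needs a full fence on both sides, and the IPI handshake injects that fence on demand on the side which cannot afford to issue one unconditionally.

/* x and y are initially 0. */
/* CPU A */                     /* CPU B */
x = 1;                          y = 1;
__compiler_membar();            __compiler_membar();
r0 = y;                         r1 = x;

/* Allowed on amd64: r0 == 0 && r1 == 0 (each load passes the earlier store). */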

kib added inline comments. Oct 6 2019, 7:11 PM
sys/sys/smp.h
267

No, x86 does not reorder loads.

Also, IRETQ is serializing, so if you handle the interrupt, you are guaranteed that a seq_cst fence was injected.

mjg added inline comments. Oct 6 2019, 8:16 PM
sys/sys/smp.h
267

amd64 can reorder a load with an earlier store to a different location; this is precisely what I'm preventing with the fence.

The current code already happens to have several barriers stemming from the handling of IPIs; I inject this one for everyone since it's simple and does not require any hackery.