
lockprof: fix hangs under load when changing the state or dumping stats
Closed, Public

Authored by mjg on Sep 21 2019, 7:17 PM.
Details

Summary

quiesce_cpus schedules the current thread on all available CPUs in turn and is effectively broken on contemporary machines, with no hope of getting the job done in a reasonable time. In my particular case, on a 104-way box, it failed to complete after 4 minutes. It should be eliminated, but it has one other consumer.
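For reference, a minimal sketch of the bind-to-every-CPU pattern referred to above, assuming quiesce works by migrating the calling thread across CPUs with sched_bind (illustrative only; the real quiesce_cpus additionally handles priorities and absent CPUs):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/pcpu.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <sys/smp.h>

static void
visit_all_cpus_sketch(void)
{
        struct thread *td;
        int cpu;

        td = curthread;
        CPU_FOREACH(cpu) {
                /* Migrate ourselves onto 'cpu' before moving on. */
                thread_lock(td);
                sched_bind(td, cpu);
                thread_unlock(td);
        }
        thread_lock(td);
        sched_unbind(td);
        thread_unlock(td);
}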

lockprof can get away with a much cheaper requirement: we only need to know that everyone has left the code section. Borrowed from rmlocks, this patch adds an IPI which posts a seq_cst fence on each CPU. Paired with the release fence issued before the critical section is exited in the code updating the stats, this provides the synchronization needed to reliably wait for everyone to leave.
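A minimal sketch of the resulting wait, assuming the "left the code section" check is done by watching td_critnest of each CPU's curthread (identifiers and structure are illustrative, not the exact diff):

#include <sys/param.h>
#include <sys/pcpu.h>
#include <sys/proc.h>
#include <sys/smp.h>
#include <machine/atomic.h>
#include <machine/cpu.h>

static void
lockprof_wait_for_exit_sketch(void)
{
        struct thread *td;
        struct pcpu *pcpu;
        int cpu;

        /* Have every CPU post a sequentially consistent fence via an IPI. */
        cpus_fence_seq_cst();

        /*
         * The updated state is now globally visible; wait until no CPU is
         * running a thread inside a critical section, i.e. everyone has
         * left the stats-updating code.
         */
        CPU_FOREACH(cpu) {
                pcpu = pcpu_find(cpu);
                for (;;) {
                        /* Both casts: atomic_load_acq_ptr traffics in uintptr_t. */
                        td = (struct thread *)atomic_load_acq_ptr(
                            (uintptr_t *)&pcpu->pc_curthread);
                        if (td->td_critnest == 0)
                                break;
                        cpu_spinwait();
                }
        }
}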

With the patch, all operations complete with almost no delay.

Also note that the approach could be used for vfs_op_thread_enter/exit, but I find it a little iffy without a bitmap of the CPUs which ever used it, for reasons I'll explain in a different review.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint: Not Applicable
Unit Tests: Not Applicable

Event Timeline

Hm, having something like quiesce is useful. I understand your observation about the idle threads. But then we should create a highest-priority interrupt thread that gets bound to all CPUs sequentially and provides the service.

That would be needlessly wasteful for this code. Part of the motivation is to be able to take periodic snapshots of lockprof stats without disturbing the actual workload beyond the overhead already incurred.

Regardless, in the current kernel there are only 2 places using the function, and the other one probably won't need it either. Even if it is needed, I think fixing the current primitive to perform well is mostly orthogonal to this change. Note the other uses of the IPI barrier.

sys/kern/subr_lock.c
319 ↗(On Diff #62398)

Seems weird to not cache the pcpu ptr.

sys/kern/subr_smp.c
940 ↗(On Diff #62398)

Can we not replace quiesce_all_cpus with this mechanism entirely?

sys/kern/subr_lock.c
319 ↗(On Diff #62398)

Sure, I can change that, but I won't be re-uploading the patch if that's all.

sys/kern/subr_smp.c
940 ↗(On Diff #62398)

As I noted, no (or at least not the way it is used right now).

With lockprof out of the way you are left with ktrace. It would have to be modified to provide some form of "no longer in ktrace code" point, just like I'm abusing critical sections for lockprof.

Should a more general (and working) approach be desired (e.g. the one the current code claims to provide), the fix proposed by kib (high-priority threads) would do the trick, but it would also be too expensive for users like this one.

Finally, even if this is to become a part of a general solution later on, it will still have to work separately like here to not induce extra overhead if it can be avoided. I note once again that part of the motivation here is to be able to periodically dump stats under load with minimal extra disruption.

  • stop re-reading the pcpu pointer
  • make use of cpus_fence_seq_cst_issue in !SMP case
  • force re-read of the curthread pointer
sys/kern/subr_lock.c
317 ↗(On Diff #62492)

Don't you need this function to execute in a critical section?

326 ↗(On Diff #62492)

Are both casts needed?

sys/kern/subr_smp.c
945 ↗(On Diff #62492)

The indent is wrong, and the lines can be packed more tightly.

sys/kern/subr_lock.c
317 ↗(On Diff #62492)

No. Even if we migrate it's harmless, since we will find ourselves not in the critical section in the loop below. In fact, the check can be removed if it appears confusing.

326 ↗(On Diff #62492)

The compiler complained in both cases. IMO these macros should be reworked to do this internally, but that's for another time; there are already users which cast.

sys/kern/subr_smp.c
945 ↗(On Diff #62492)

It's copy-pasted from rmlocks; I did not realize I reindented it later. Is 4 fine?

  • drop the curcpu check, it is of no essence
  • add a comment explaining the loop
  • reindent cpus_fence_seq_cst
This revision is now accepted and ready to land. Sep 25 2019, 12:40 PM
sys/sys/smp.h
267 ↗(On Diff #62522)

I don't really understand what the abbreviations stand for. I understand the code. Can you document the functions with a one-liner? You should also detail the requirements for consumers, like requiring critical_enter(), etc.

sys/sys/smp.h
267 ↗(On Diff #62522)

They come from C11 and are already used elsewhere in the tree.

There are no strict requirements for any of the consumers, apart from replacing whatever fence they had with a compiler barrier. I can add a comment above the function itself. In fact, I have plans for a consumer which most definitely does not use critical_enter/exit.

sys/sys/smp.h
267 ↗(On Diff #62522)

seq_cst means sequentially consistent; it is a common abbreviation, seemingly originating from the C++/Java memory model working group.

I have some doubts that this form of the strongest possible fence semantics is really needed there (and mjg does not provide any reasoning, as usual). But even if a weaker form would suffice, a sequentially consistent fence is also correct. In fact, the recommendation is to use seq_cst by default and only weaken it when there are performance implications.

sys/sys/smp.h
267 ↗(On Diff #62522)

It's the same scheme as used for vfs_op_* and has roughly the same amount of explanation, so I did not see the need to elaborate.

So how about this:
Force all CPUs to post a sequentially consistent fence. This can be used to eliminate the use of atomic_thread_fence_seq_cst in regular code at the expense of issuing an IPI when necessary; thus it should only be used when the synchronization provided by the barrier is very rarely needed.

It allows you to convert the following:

foo = 1;
atomic_thread_fence_seq_cst();
bar = baz;

into this:

foo = 1;
__compiler_membar();
bar = baz;
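
For context, a minimal sketch of what such a primitive can look like, assuming it is built on smp_rendezvous() running a sequentially consistent fence on every CPU (the committed code may differ in detail):

#include <sys/param.h>
#include <sys/smp.h>
#include <machine/atomic.h>

static void
cpus_fence_seq_cst_issue_sketch(void *arg __unused)
{
        atomic_thread_fence_seq_cst();
}

static void
cpus_fence_seq_cst_sketch(void)
{
#ifdef SMP
        /* IPI all CPUs and have each of them post the fence. */
        smp_rendezvous(smp_no_rendezvous_barrier,
            cpus_fence_seq_cst_issue_sketch,
            smp_no_rendezvous_barrier, NULL);
#else
        cpus_fence_seq_cst_issue_sketch(NULL);
#endif
}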

sys/sys/smp.h
267 ↗(On Diff #62522)

> I have some doubts that this form of the strongest possible fence semantic is really needed there

Both _rel and _acq expand to compiler barriers on amd64, which still allows the following to be reordered so that baz is loaded first:

foo = 1;
bar = baz;

So in this particular case, assuming lock_prof_enable == 1:

td->td_critnest++;
if (lock_prof_enable) ...

can turn into:

(A is the thread acquiring the lock, B is the thread disabling profiling; time flows downward)

A: state = lock_prof_enable;
B: lock_prof_enable = 0;
B: if (td->td_critnest == 0)
B:     /* incorrectly concludes the thread is out */
A: td->td_critnest++;
A: if (state == 0)
A:     ... bail;
A: /* continues */

This can also fail the other way around, where td_critnest is read before lock_prof_enable is set to 0.

sys/sys/smp.h
267 ↗(On Diff #62522)

No, x86 does not reorder loads.

Also, IRETQ is serializing, so if you handle the interrupt, you are guaranteed that the seq_cst fence was injected.

sys/sys/smp.h
267 ↗(On Diff #62522)

amd64 can reorder a load with an earlier store to a different location; this is precisely what I'm preventing with the fence.

The current code already happens to have several barriers stemming from the handling of IPIs; I inject this for everyone since it's simple and does not require any hackery.
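
To tie the thread together, a rough sketch of the resulting pairing; the names suffixed with _sketch are stand-ins mirroring the examples above, not the actual diff:

#include <sys/param.h>
#include <sys/proc.h>
#include <sys/smp.h>
#include <machine/atomic.h>

/* Illustrative stand-in for the real enable flag. */
static int lock_prof_enable_sketch = 1;

static void
lp_record_sketch(struct thread *td __unused)
{
        /* record the acquisition */
}

/* Lock-user fast path: only a compiler barrier is paid locally. */
static void
lockprof_enter_sketch(struct thread *td)
{
        td->td_critnest++;
        /* The seq_cst fence is supplied remotely by the IPI when needed. */
        __compiler_membar();
        if (lock_prof_enable_sketch)
                lp_record_sketch(td);
}

/* Control path: clear the flag, then force the fence everywhere. */
static void
lockprof_disable_sketch(void)
{
        lock_prof_enable_sketch = 0;
        cpus_fence_seq_cst();
        /* Now it is safe to wait for td_critnest to drop on every CPU. */
}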

This revision was automatically updated to reflect the committed changes.