
lockprof: fix hangs under load when changing the state or dumping stats
Accepted · Public

Authored by mjg on Sep 21 2019, 7:17 PM.

Details

Reviewers
kib
jeff
Summary

quiesce_cpus schedules the current thread on all available CPUs in turn and is basically broken on contemporary machines, with no hope of getting the job done in a reasonable time. In my particular case, on a 104-way box, it failed to complete after 4 minutes. It should be eliminated, but it has yet another consumer.

lockprof can get away with a much cheaper requirement: we only need to know that everyone has left the code section. Stolen from rmlocks, this patch adds an IPI which posts a seq_cst fence. Paired with a release fence issued before the critical section is exited in the stats-updating code, this provides the synchronization necessary to reliably wait for everyone to leave.
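
A minimal sketch of the handshake, to make the pairing concrete. This is not the diff itself: prof_enabled, cpu_in_section() and the stats_* wrappers are placeholder names; critical_enter()/critical_exit(), CPU_FOREACH() and cpu_spinwait() are existing kernel primitives, and cpus_fence_seq_cst() is the helper added by this patch.

static volatile int prof_enabled;       /* placeholder for lock_prof_enable */

/* Hot path: runs on every CPU updating stats, with no hardware fence. */
static void
stats_update(void)
{

        critical_enter();
        if (__predict_true(prof_enabled)) {
                /* ... update per-CPU statistics ... */
        }
        atomic_thread_fence_rel();      /* paired with the IPI-posted fence */
        critical_exit();
}

/* Cold path: disable (or dump) and wait for everyone to leave. */
static void
stats_quiesce(void)
{
        int cpu;

        prof_enabled = 0;
        cpus_fence_seq_cst();   /* IPI all CPUs; each posts a seq_cst fence */
        CPU_FOREACH(cpu) {
                while (cpu_in_section(cpu))     /* placeholder per-CPU "still inside?" check */
                        cpu_spinwait();
        }
}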

With the patch all ops work with almost no delay.

Also note that the approach can be used with vfs_op_thread_enter/exit, but I find it a little iffy without a bitmap of the CPUs which ever used it, for reasons I'll explain in a different review.


Event Timeline

mjg created this revision. Sep 21 2019, 7:17 PM
mjg edited the summary of this revision. Sep 21 2019, 7:18 PM
kib added a comment. Sep 22 2019, 1:37 PM

Hm, having something like quiesce is useful. I understand your observation about the idle threads. But then we should create a highest-priority interrupt thread that gets bound to all CPUs sequentially and provide the service that way.

mjg added a comment (edited). Sep 22 2019, 1:43 PM

That would be avoidably wasteful for this code. Part of the motivation is to be able to get periodic snapshots of lockprof stats without disturbing the actual workload on top of the existing overhead.

Regardless, in the current kernel there are only 2 places using the function, and the other one probably won't need it either. Even if it does, I think fixing the current primitive so that it performs well is mostly orthogonal to this change. Note the other uses of the IPI barrier.

jeff added inline comments. Sep 23 2019, 8:21 PM
sys/kern/subr_lock.c
319

Seems weird to not cache the pcpu ptr.

sys/kern/subr_smp.c
940

Can we not replace quiesce_all_cpus with this mechanism entirely?

mjg added inline comments. Sep 23 2019, 8:31 PM
sys/kern/subr_lock.c
319

Sure, I can change that, but I won't be re-uploading the patch if that's all.

sys/kern/subr_smp.c
940

As I noted: no (or at least not the way it is used right now).

With lockprof out of the way you are left with ktrace. It would have to be modified to provide some form of "no longer in ktrace code" point, just like I'm abusing critical sections for lockprof.

Should a more general (and working) approach be desired (e.g., the one the current code claims to provide), the fix proposed by kib (high-priority threads) would do the trick, but it would also be too expensive for users like this one.

Finally, even if this is to become part of a general solution later on, it will still have to work standalone, as it does here, so as not to induce extra overhead where it can be avoided. I note once again that part of the motivation here is to be able to periodically dump stats under load with minimal extra disruption.

mjg updated this revision to Diff 62492. Sep 24 2019, 2:47 AM
  • stop re-reading the pcpu pointer
  • make use of cpus_fence_seq_cst_issue in !SMP case
  • force re-read of the curthread pointer
kib added inline comments. Sep 24 2019, 9:48 AM
sys/kern/subr_lock.c
317

Don't you need this function to execute in a critical section?

326

Are both casts needed?

sys/kern/subr_smp.c
945

The indent is wrong, and the lines can be packed more tightly.

mjg added inline comments. Sep 24 2019, 11:54 AM
sys/kern/subr_lock.c
317

No. Even if we migrate, it's harmless, since the loop below will find us not in the critical section. In fact, the check can be removed if it appears confusing.

326

The compiler complained in both cases. IMO these macros should be reworked to do this internally, but that's for another time; there are already users which cast.

sys/kern/subr_smp.c
945

It's copy-pasted from rmlocks; I did not realize I had reindented it later. Is 4 fine?

mjg updated this revision to Diff 62522. Sep 24 2019, 7:39 PM
  • drop the curcpu check, it is of no essence
  • add a comment explaining the loop
  • reindent cpus_fence_seq_cst
kib accepted this revision. Sep 25 2019, 12:40 PM
This revision is now accepted and ready to land. Sep 25 2019, 12:40 PM
jeff added inline comments. Sep 25 2019, 11:21 PM
sys/sys/smp.h
267

I don't really understand what the abbreviations mean. I understand the code. Can you document the functions with a one-liner? You should also detail the requirements for consumers, like requiring critical_enter(), etc.

jeff accepted this revision. Sep 25 2019, 11:22 PM
mjg added inline comments. Sep 26 2019, 12:15 AM
sys/sys/smp.h
267

They come from C11 and are already used elsewhere in the tree.

There are no strict requirements for any of the consumers, apart from replacing whatever fence they had with a compiler barrier. I can add a comment above the function itself. In fact, I have plans for a consumer which most definitely does not use critical_enter/exit.

kib added inline comments. Sep 26 2019, 7:35 AM
sys/sys/smp.h
267

seq_cst means sequentially consistent; it is a common abbreviation, seemingly originating with the C++/Java memory model working group.

I have some doubts that this form of the strongest possible fence semantics is really needed there (and mjg does not provide any reasoning, as usual). But even if a weaker form would suffice, a sequentially consistent fence is also correct. In fact, the recommendation is to use seq_cst by default and only weaken it when there are performance implications.

mjg added inline comments. Oct 5 2019, 8:57 PM
sys/sys/smp.h
267

It's the same scheme as used for vfs_op_* and has roughly the same amount of explanation, so I did not see the need to elaborate.

So how about this:
Force all CPUs to post a sequentially consistent fence. This can be used to eliminate the use of atomic_thread_fence_seq_cst in regular code at the expense of issuing an IPI if necessary; thus, this should only be used when the synchronization provided by the barrier is very rarely needed.

It allows you to convert the following:

foo = 1;
atomic_thread_fence_seq_cst();
bar = baz;

into this:

foo = 1;
__compiler_membar();
bar = baz;

mjg added a comment (edited). Oct 5 2019, 9:47 PM
This comment has been deleted.
sys/sys/smp.h
267

> I have some doubts that this form of the strongest possible fence semantic is really needed there

Both _rel and _acq expand to compiler barriers on amd64 which still allows the following to be reordered so that baz is loaded first:

foo = 1;
bar = baz;

So in this particular case, assuming lock_prof_enable == 1:

td->td_critnest++;
if (lock_prof_enable) ...

can turn into:

A                                       B
state = lock_prof_enable;
                                        lock_prof_enable = 0;
                                        if (td->td_critnest == 0)
                                                /* incorrectly concludes the thread is out */
td->td_critnest++;
if (state == 0)
        ... bail;
/* continues */

This can also fail the other way, where we read td_critnest before the store setting lock_prof_enable to 0 becomes visible.
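
For reference, a sketch of the classic store/load litmus test (illustrative only, not part of the patch): with nothing but compiler barriers, amd64 allows both loads to observe 0, because each CPU may execute its load before its own earlier store becomes visible. Forbidding that outcome needs a full fence on both sides, and the IPI handshake injects that fence on demand on the side which cannot afford to issue one unconditionally.

/* x and y are initially 0. */
/* CPU A */                     /* CPU B */
x = 1;                          y = 1;
__compiler_membar();            __compiler_membar();
r0 = y;                         r1 = x;

/* Allowed on amd64: r0 == 0 && r1 == 0 (each load passes the earlier store). */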

kib added inline comments. Oct 6 2019, 7:11 PM
sys/sys/smp.h
267

No, x86 does not reorder loads.

Also, IRETQ is serializing, so if you handle the interrupt, you are guaranteed that a seq_cst fence was injected.

mjg added inline comments. Oct 6 2019, 8:16 PM
sys/sys/smp.h
267

amd64 can reorder a load with an earlier store to a different location; this is precisely what I'm preventing with the fence.

The current code already happens to have several barriers stemming from the handling of IPIs; I inject this one for everyone since it's simple and does not require any hackery.