Page MenuHomeFreeBSD

KCS thread hogs the CPU
Needs ReviewPublic

Authored by lskodje_isilon.com on Oct 13 2016, 8:04 PM.

Details

Summary

The KCS driver can potentially block for minutes and starve other threads. We've had issues with panics and buf timelocks with at times unresponsive BMC firmware. This change will cause the kcs thread to yield when waiting for response from hardware.

jhibbits can push this in.

Test Plan

Tested cold resets of Intel BMC over a long time to make sure driver no longer watchdogs.
Stress tested sensor reads of a satellite controller via IPMB.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

lskodje_isilon.com retitled this revision from to KCS thread hogs the CPU.
lskodje_isilon.com updated this object.
lskodje_isilon.com edited the test plan for this revision. (Show Details)
lskodje_isilon.com set the repository for this revision to rS FreeBSD src repository - subversion.

Please upload a diff with context (use -U9999 or arc diff --create).

It seems like this busy-wait on KCS_CTL_STS could maybe be generalized into a single helper routine, but I don't think that's necessary for this patch.

mav last played in IPMI space with delays, maybe he has some input.

lskodje_isilon.com edited edge metadata.
lskodje_isilon.com removed rS FreeBSD src repository - subversion as the repository for this revision.

Updated diff to include context

mav last played in IPMI space with delays, maybe he has some input.

This might have to do with tickless counters and the like...?

Is the performance the same versus DELAY? Yielding is something that I wanted to do and locally we
use pause in the kcs_loop which makes IPMI take a long time. We also have code to dump kernel back
traces into the SEL. In that mode things have to work in polled mode. I haven't tried to integrate that with
recent changes.

Is the performance the same versus DELAY? Yielding is something that I wanted to do and locally we
use pause in the kcs_loop which makes IPMI take a long time. We also have code to dump kernel back
traces into the SEL. In that mode things have to work in polled mode. I haven't tried to integrate that with
recent changes.

In testing on our systems I ran loops on sensor gathering. This goes out over IPMB and reads ~30 some sensor values + the ones off the BMC.
With DELAY the average time to gather all sensors over 100 iterations was 1.236s
With pause_sbt it came out to 1.228s for the same test. So my results show they are pretty much the same. Is there a specific benchmark you are looking for?

In testing on our systems I ran loops on sensor gathering. This goes out over IPMB and reads ~30 some sensor values + the ones off the BMC.
With DELAY the average time to gather all sensors over 100 iterations was 1.236s
With pause_sbt it came out to 1.228s for the same test. So my results show they are pretty much the same. Is there a specific benchmark you are looking for?

One concern might be that in a loaded system, the ipmi threads may not get scheduled immediately after 100µs if their priority isn't as high as other threads. But this isn't much different from spin-wait with kernel preemption.

That test is exactly what I wanted. So it looks good to me. I assume the CPU
load also goes way down since DELAY is a CPU hog!

Getting scheduled after 100µs should not be a problem, It won't
make any I/O fail. If it was less then 100µs then it would be a
problem. Taking longer on a busy machine should be a good
thing since then it working better with with other things. The
watchdog resets could take longer but I doubt this will be a
problem with that.

I have no objections against ipmi_kcs_probe_align() part, but I don't see significant benefits from this change. It is executed only once during device attach, which is mostly single-threaded any way now.

But I see problem with the ibf/obf handling part of the patch. Those functions are used also during IPMI watchdog operations, and our watchdog patting functions are required to allow execution in any context (even core dumping), and so should not sleep or switch the context.

Actually I've recalled that pause_sbt() is now hacked to fallback to DELAY() in bad contexts, so may be it is not so wrong at the end.

In D8244#171409, @mav wrote:

But I see problem with the ibf/obf handling part of the patch. Those functions are used also during IPMI watchdog operations, and our watchdog patting functions are required to allow execution in any context (even core dumping), and so should not sleep or switch the context.

Note that pause_sbt falls back on DELAY in those contexts: http://fxr.watson.org/fxr/source/kern/kern_synch.c#L322

In D8244#171411, @cem wrote:
In D8244#171409, @mav wrote:

But I see problem with the ibf/obf handling part of the patch. Those functions are used also during IPMI watchdog operations, and our watchdog patting functions are required to allow execution in any context (even core dumping), and so should not sleep or switch the context.

Note that pause_sbt falls back on DELAY in those contexts: http://fxr.watson.org/fxr/source/kern/kern_synch.c#L322

That would be my only concern. I can't recall if dumping from non-debugger sets SCHEDULER_STOPPED. If it does then this is ok. It is also possible to use interrupts with KCS to wait for each IBF/OBF transition in which case you could axe the KCS thread entirely (the IPMI backend interface is designed to not mandate a kthread)

In D8244#172057, @jhb wrote:

That would be my only concern. I can't recall if dumping from non-debugger sets SCHEDULER_STOPPED. If it does then this is ok. It is also possible to use interrupts with KCS to wait for each IBF/OBF transition in which case you could axe the KCS thread entirely (the IPMI backend interface is designed to not mandate a kthread)

Looking at sys/kern/kern_shutdown.c, vpanic() sets curthread->td_stopsched (SCHEDULER_STOPPED()) if SMP is enabled (perhaps this should be fixed, either here or in the SCHEDULER_STOPPED() macro, or elsewhere).

Yes, vpanic() sets it, but not if you invoke 'dump' from ddb. Only working after a panic() might be good enough though.

In D8244#173501, @jhb wrote:

Yes, vpanic() sets it, but not if you invoke 'dump' from ddb. Only working after a panic() might be good enough though.

Maybe kdb_enter() should set td_stopsched, too? Use td_stopsched as a nesting level?

Did this stall out over kdb_enter concerns? It would be good to get this in.

In D8244#218622, @cem wrote:

Did this stall out over kdb_enter concerns? It would be good to get this in.

There are issues with this diff as is. I think the IPMI_IO_LOCK needs to be converted to a sleepable lock upstream for this to work.

@cem mentioned this today. It sounds like this was held up over concerns about the watchdog not getting petted while panicking; is that correct? I ran into the same problem recently, and I solved it by simply adding the following to /etc/ddb.conf:

script kdb.enter.default=call db_watchdog

That stops the watchdog when entering the debugger, which works for both interactive debug sessions and dumping core.

Additionally, r325024 included a change to continue patting the watchdog while dumping core.

If debugger/dump issues were preventing this from landing, I think we're past those now.

@cem mentioned this today. It sounds like this was held up over concerns about the watchdog not getting petted while panicking; is that correct? I ran into the same problem recently, and I solved it by simply adding the following to /etc/ddb.conf:

script kdb.enter.default=call db_watchdog

That stops the watchdog when entering the debugger, which works for both interactive debug sessions and dumping core.

Additionally, r325024 included a change to continue patting the watchdog while dumping core.

If debugger/dump issues were preventing this from landing, I think we're past those now.

To answer my own question, I should have actually tried the patch before asking. :-p

As-is, it panics on the way up (when the IPMI watchdog gets patted?), because of sleeping on non-sleepable IPMI_IO_LOCK.

Sleeping thread (tid 100131, pid 3) owns a non-sleepable lock
KDB: stack backtrace of thread 100131:
sched_switch() at sched_switch+0x596/frame 0xfffffe0b585d7660
mi_switch() at mi_switch+0x3c0/frame 0xfff[-- MARK -- Tue Oct 31 20:55:00 2017]
ffe0b585d7720
sleepq_switch() at sleepq_switch+0x2d4/frame 0xfffffe0b585d7780
sleepq_timedwait() at sleepq_timedwait+0x9f/frame 0xfffffe0b585d77c0
_sleep() at _sleep+0x5be/frame 0xfffffe0b585d78b0
pause_sbt() at pause_sbt+0x15e/frame 0xfffffe0b585d7910
kcs_wait_for_ibf() at kcs_wait_for_ibf+0x101/frame 0xfffffe0b585d7950
kcs_read_byte() at kcs_read_byte+0x1b/frame 0xfffffe0b585d7990
kcs_polled_request() at kcs_polled_request+0x35b/frame 0xfffffe0b585d7a00
kcs_loop() at kcs_loop+0xad/frame 0xfffffe0b585d7a30
fork_exit() at fork_exit+0x140/frame 0xfffffe0b585d7ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0b585d7ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
panic: sleeping thread
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2e/frame 0xfffffe0b58858f80
kdb_backtrace() at kdb_backtrace+0x58/frame 0xfffffe0b58859030
vpanic() at vpanic+0x1df/frame 0xfffffe0b588590a0
kproc_shutdown() at kproc_shutdown/frame 0xfffffe0b58859100
propagate_priority() at propagate_priority+0x2ca/frame 0xfffffe0b58859190
turnstile_wait() at turnstile_wait+0x8df/frame 0xfffffe0b58859210
__mtx_lock_sleep() at __mtx_lock_sleep+0x5ef/frame 0xfffffe0b58859360
__mtx_lock_flags() at __mtx_lock_flags+0x1ec/frame 0xfffffe0b588593d0
kcs_polled_request() at kcs_polled_request+0x3a/frame 0xfffffe0b58859440
kcs_driver_request() at kcs_driver_request+0x48/frame 0xfffffe0b58859470
ipmi_submit_driver_request() at ipmi_submit_driver_request+0x36/frame 0xfffffe0b588594a0
ipmi_set_watchdog() at ipmi_set_watchdog+0x155/frame 0xfffffe0b58859550
ipmi_wd_event() at ipmi_wd_event+0xb4/frame 0xfffffe0b588595a0
wdog_kern_pat() at wdog_kern_pat+0x321/frame 0xfffffe0b58859610
wd_ioctl_patpat() at wd_ioctl_patpat+0xca/frame 0xfffffe0b58859640
wd_ioctl() at wd_ioctl+0x219/frame 0xfffffe0b588596d0
devfs_ioctl_f() at devfs_ioctl_f+0x183/frame 0xfffffe0b58859770
fo_ioctl() at fo_ioctl+0x4c/frame 0xfffffe0b588597b0
kern_ioctl() at kern_ioctl+0x365/frame 0xfffffe0b58859860
sys_ioctl() at sys_ioctl+0x2b7/frame 0xfffffe0b58859940
syscallenter() at syscallenter+0x529/frame 0xfffffe0b588599b0
amd64_syscall() at amd64_syscall+0x1f/frame 0xfffffe0b58859ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0b58859ab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0xc00b7bf7a, rsp = 0x7fffffffeba8, rbp = 0x7fffffffecd0 ---
KDB: enter: panic
[ thread pid 116 tid 100152 ]
Stopped at      breakpoint+0x5: popq    %rbp
db:0:kdb.enter.default> set $lines 0
db:0:kdb.enter.default>  call db_watchdog

So, "never mind".