Page MenuHomeFreeBSD

ktrcsw(): should not be called when the thread is owning interlock or on sleepq
AcceptedPublic

Authored by kib on Wed, Jan 21, 6:22 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Feb 2, 6:24 PM
Unknown Object (File)
Sun, Feb 1, 12:04 PM
Unknown Object (File)
Sat, Jan 31, 1:30 AM
Unknown Object (File)
Fri, Jan 30, 9:54 AM
Unknown Object (File)
Thu, Jan 29, 5:42 PM
Unknown Object (File)
Tue, Jan 27, 7:32 AM
Unknown Object (File)
Mon, Jan 26, 7:04 PM
Unknown Object (File)
Mon, Jan 26, 2:15 AM
Subscribers

Details

Reviewers
markj
Summary
The issue is that for ktrcsw() we lock the ktrace_mtx mutex while owning
the interlock from a subsystem that called msleep().  In particular, the
VM subsystem might call msleep() if page allocation failed.  This
establishes order VM locks (e.g. domain free queue lock) -> ktrace_mtx.
Calling free() while owning ktrace_mtx gives the reverse order.

Worse, msleep_spin_sbt() call s ktrcsw() while the thread is put on
sleep queue.  Then, since the mutex might be contested, the thread needs
to be put on turnstil, which cannot work.

Move the ktrcsw() call for switch-out after the wakeup, when the thread
does not yet re-obtained any locks.  From there, we call a special
version of ktrcsw(), which is passed the actual time when the context
switch occured.

The drawback is that the switch-out record is only written in the
ktrace.out file after the switch-in occurred, but this is probably not
too serious.

Reported and tested by: pho

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision.Wed, Jan 21, 6:22 PM

i386 adjustments for kern_thread.c KBI asserts will be done after the review

The issue is that for ktrcsw() we lock the ktrace_mtx mutex while owning the interlock from a subsystem that called msleep().

Would another solution be to modify _sleep() to release the spinlock later, effectively copying what msleep_spin_sbt() does?

sys/kern/kern_ktrace.c
214

IMO ktrace_mtx_unlock() would be a better name, and some comment here would be nice.

kib marked an inline comment as done.Thu, Jan 22, 9:11 PM

The issue is that for ktrcsw() we lock the ktrace_mtx mutex while owning the interlock from a subsystem that called msleep().

Would another solution be to modify _sleep() to release the spinlock later, effectively copying what msleep_spin_sbt() does?

I do not quite follow, sorry? We either own the interlock, or we own the spinlock. ktrcsw() cannot be called under spinlock at all.

Rename to ktrace_mtx_unlock(), add explanation.

In D54816#1253403, @kib wrote:

The issue is that for ktrcsw() we lock the ktrace_mtx mutex while owning the interlock from a subsystem that called msleep().

Would another solution be to modify _sleep() to release the spinlock later, effectively copying what msleep_spin_sbt() does?

I do not quite follow, sorry? We either own the interlock, or we own the spinlock. ktrcsw() cannot be called under spinlock at all.

msleep_spin_sbt() drops the sleepqueue spinlock in order to call ktrcsw(). So, it is not susceptible to the problem fixed by this diff. Can we use the same strategy elsewhere instead of adding a new field to struct thread?

In D54816#1253403, @kib wrote:

The issue is that for ktrcsw() we lock the ktrace_mtx mutex while owning the interlock from a subsystem that called msleep().

Would another solution be to modify _sleep() to release the spinlock later, effectively copying what msleep_spin_sbt() does?

I do not quite follow, sorry? We either own the interlock, or we own the spinlock. ktrcsw() cannot be called under spinlock at all.

msleep_spin_sbt() drops the sleepqueue spinlock in order to call ktrcsw(). So, it is not susceptible to the problem fixed by this diff. Can we use the same strategy elsewhere instead of adding a new field to struct thread?

So ktrcsw() locks mutexes while msleep_spin() put the thread on sleepqueue. If the mutex is contested, the thread might be put on turnstile. How could it work?

Actually commit the comment before diffing.

In D54816#1253413, @kib wrote:
In D54816#1253403, @kib wrote:

The issue is that for ktrcsw() we lock the ktrace_mtx mutex while owning the interlock from a subsystem that called msleep().

Would another solution be to modify _sleep() to release the spinlock later, effectively copying what msleep_spin_sbt() does?

I do not quite follow, sorry? We either own the interlock, or we own the spinlock. ktrcsw() cannot be called under spinlock at all.

msleep_spin_sbt() drops the sleepqueue spinlock in order to call ktrcsw(). So, it is not susceptible to the problem fixed by this diff. Can we use the same strategy elsewhere instead of adding a new field to struct thread?

So ktrcsw() locks mutexes while msleep_spin() put the thread on sleepqueue. If the mutex is contested, the thread might be put on turnstile. How could it work?

Hmm, right, that seems broken. So then ktrace_mtx should perhaps be a spin mutex, but then we cannot call free() while holding it anyway.

sys/kern/kern_ktrace.c
549

Isn't it possible that ktr_freerequest_locked() will try to free more than one buffer?

If you really dislike this approach, I think a workable solution that also works for msleep_spin() is the following.

Lets call ktrcsw() twice after the thread is woren up, but before all locks are taken. But there, we need a special version of ktrcsw() to record proper switch-out time, we record the time before doing mi_switch(), and pass it to this special ktrcsw() so that the time is recorded correctly.

Do you agree with this?

In D54816#1253418, @kib wrote:

If you really dislike this approach, I think a workable solution that also works for msleep_spin() is the following.

Lets call ktrcsw() twice after the thread is woren up, but before all locks are taken. But there, we need a special version of ktrcsw() to record proper switch-out time, we record the time before doing mi_switch(), and pass it to this special ktrcsw() so that the time is recorded correctly.

Do you agree with this?

I somewhat dislike the idea of recording the switch-out event potentially much later than it actually happened. That is, if I am using kdump -l, then the switch-out may appear "late" even though the timestamp is correct. Maybe it is not a big problem, but it feels a bit sinful for a tracing tool to behave this way.

I think your approach is ok, I just saw the hack in msleep_spin_sbt() and wondered if it could be used elsewhere.

kib retitled this revision from ktrace: do not free(9) while owning ktrace_mtx to ktrcsw(): should not be called when the thread is owning interlock or on sleepq.
kib edited the summary of this revision. (Show Details)

So I went ahead and implemented the third (?) approach. IMO it is the only safe option there.

sys/kern/kern_synch.c
245

You'll get a compile error if the kernel is defined without options KTRACE. I guess we need a __ktraceused annotation.

272–273

This comment can be removed I think.

kib marked 3 inline comments as done.

Remove comment.
Add __ktrace_used to 'tv' decls,

In D54816#1253431, @kib wrote:

So I went ahead and implemented the third (?) approach. IMO it is the only safe option there.

We cannot convert ktrace_mtx to a spinlock?

This revision is now accepted and ready to land.Thu, Jan 22, 10:23 PM
In D54816#1253431, @kib wrote:

So I went ahead and implemented the third (?) approach. IMO it is the only safe option there.

We cannot convert ktrace_mtx to a spinlock?

Then we need to move free() calls out of it, at least, so we are back to the first attempt. Also, I do not like an idea of it being spinlock: it is taken quite often during execution of the traced process, which would affect system responsivness due to interrupts disabling.

In D54816#1253463, @kib wrote:
In D54816#1253431, @kib wrote:

So I went ahead and implemented the third (?) approach. IMO it is the only safe option there.

We cannot convert ktrace_mtx to a spinlock?

Then we need to move free() calls out of it, at least, so we are back to the first attempt.

Right, I pointed this out earlier.

Also, I do not like an idea of it being spinlock: it is taken quite often during execution of the traced process, which would affect system responsivness due to interrupts disabling.

The mutex is held for short durations only and masked interrupts will be handled once the lock is released, so I am not sure this is a serious problem in practice. But ok.