The issue is that in ktrcsw() we lock the ktrace_mtx mutex while holding the interlock of the subsystem that called msleep(). In particular, the VM subsystem might call msleep() if a page allocation failed. This establishes the lock order VM locks (e.g. the domain free queue lock) -> ktrace_mtx; calling free() while holding ktrace_mtx gives the reverse order. Worse, msleep_spin_sbt() calls ktrcsw() while the thread is already on the sleep queue. Then, since the mutex might be contested, the thread would need to be put on a turnstile, which cannot work. Move the ktrcsw() call for switch-out to after the wakeup, when the thread has not yet re-acquired any locks. From there, we call a special version of ktrcsw(), which is passed the actual time at which the context switch occurred. The drawback is that the switch-out record is written to the ktrace.out file only after the switch-in occurred, but this is probably not too serious. Reported and tested by: pho
Details
- Reviewers
markj
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint: Skipped
- Unit: Tests Skipped
Event Timeline
The issue is that for ktrcsw() we lock the ktrace_mtx mutex while owning the interlock from a subsystem that called msleep().
Would another solution be to modify _sleep() to release the spinlock later, effectively copying what msleep_spin_sbt() does?
| sys/kern/kern_ktrace.c | |
|---|---|
| 214 | IMO ktrace_mtx_unlock() would be a better name, and some comment here would be nice. |
I do not quite follow, sorry? We either own the interlock, or we own the spinlock. ktrcsw() cannot be called under a spinlock at all.
msleep_spin_sbt() drops the sleepqueue spinlock in order to call ktrcsw(). So, it is not susceptible to the problem fixed by this diff. Can we use the same strategy elsewhere instead of adding a new field to struct thread?
So ktrcsw() locks mutexes after msleep_spin() has put the thread on the sleep queue. If the mutex is contested, the thread might be put on a turnstile. How could it work?
Hmm, right, that seems broken. So then ktrace_mtx should perhaps be a spin mutex, but then we cannot call free() while holding it anyway.
| sys/kern/kern_ktrace.c | |
|---|---|
| 549 | Isn't it possible that ktr_freerequest_locked() will try to free more than one buffer? |
If you really dislike this approach, I think a workable solution that also works for msleep_spin() is the following.
Let's call ktrcsw() twice after the thread is woken up, but before any locks are taken. But then we need a special version of ktrcsw() to record the proper switch-out time: we record the time before doing mi_switch(), and pass it to this special ktrcsw() so that the time is recorded correctly.
Do you agree with this?
I somewhat dislike the idea of recording the switch-out event potentially much later than it actually happened. That is, if I am using kdump -l, then the switch-out may appear "late" even though the timestamp is correct. Maybe it is not a big problem, but it feels a bit sinful for a tracing tool to behave this way.
I think your approach is ok, I just saw the hack in msleep_spin_sbt() and wondered if it could be used elsewhere.
So I went ahead and implemented the third (?) approach. IMO it is the only safe option there.
Then we need to move the free() calls out of it, at least, so we are back to the first attempt. Also, I do not like the idea of it being a spin lock: it is taken quite often during execution of the traced process, which would affect system responsiveness due to interrupt disabling.
Right, I pointed this out earlier.
Also, I do not like the idea of it being a spin lock: it is taken quite often during execution of the traced process, which would affect system responsiveness due to interrupt disabling.
The mutex is held for short durations only and masked interrupts will be handled once the lock is released, so I am not sure this is a serious problem in practice. But ok.