The issue is that in ktrcsw() we lock the ktrace_mtx mutex while holding the interlock of the subsystem that called msleep(). In particular, the VM subsystem might call msleep() if a page allocation failed. This establishes the lock order VM locks (e.g. the domain free queue lock) -> ktrace_mtx; calling free() while holding ktrace_mtx gives the reverse order. Worse, msleep_spin_sbt() calls ktrcsw() while the thread is already on the sleep queue. Then, since the mutex might be contested, the thread would need to be put on a turnstile, which cannot work. Move the ktrcsw() call for switch-out to after the wakeup, when the thread has not yet re-acquired any locks. From there, we call a special version of ktrcsw(), which is passed the actual time at which the context switch occurred. The drawback is that the switch-out record is written to the ktrace.out file only after the switch-in occurred, but this is probably not too serious. Reported and tested by: pho
Details
- Reviewers
markj
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint: Skipped
- Unit: Tests Skipped
Event Timeline
The issue is that for ktrcsw() we lock the ktrace_mtx mutex while owning the interlock from a subsystem that called msleep().
Would another solution be to modify _sleep() to release the spinlock later, effectively copying what msleep_spin_sbt() does?
| sys/kern/kern_ktrace.c | |
|---|---|
| 214 | IMO ktrace_mtx_unlock() would be a better name, and some comment here would be nice. |
I do not quite follow, sorry? We either own the interlock, or we own the spinlock. ktrcsw() cannot be called under a spinlock at all.
msleep_spin_sbt() drops the sleepqueue spinlock in order to call ktrcsw(). So, it is not susceptible to the problem fixed by this diff. Can we use the same strategy elsewhere instead of adding a new field to struct thread?
So ktrcsw() locks mutexes after msleep_spin() has put the thread on the sleep queue. If the mutex is contested, the thread might be put on a turnstile. How could it work?
Hmm, right, that seems broken. So then ktrace_mtx should perhaps be a spin mutex, but then we cannot call free() while holding it anyway.
| sys/kern/kern_ktrace.c | |
|---|---|
| 549 | Isn't it possible that ktr_freerequest_locked() will try to free more than one buffer? |
If you really dislike this approach, I think a workable solution that also works for msleep_spin() is the following.
Let's call ktrcsw() twice after the thread is woken up, but before any locks are taken. But then we need a special version of ktrcsw() to record the proper switch-out time: we record the time before doing mi_switch(), and pass it to this special ktrcsw() so that the time is recorded correctly.
Do you agree with this?
I somewhat dislike the idea of recording the switch-out event potentially much later than it actually happened. That is, if I am using kdump -l, then the switch-out may appear "late" even though the timestamp is correct. Maybe it is not a big problem, but it feels a bit sinful for a tracing tool to behave this way.
I think your approach is ok, I just saw the hack in msleep_spin_sbt() and wondered if it could be used elsewhere.
So I went ahead and implemented the third (?) approach. IMO it is the only safe option there.
Then we need to move the free() calls out of it, at least, so we are back to the first attempt. Also, I do not like the idea of it being a spin lock: it is taken quite often during execution of the traced process, which would affect system responsiveness due to interrupt disabling.
Right, I pointed this out earlier.
Also, I do not like the idea of it being a spin lock: it is taken quite often during execution of the traced process, which would affect system responsiveness due to interrupt disabling.
The mutex is held for short durations only and masked interrupts will be handled once the lock is released, so I am not sure this is a serious problem in practice. But ok.