In some cases the DELAY implementation on amd64 can recurse on a spin
mutex in the i8254 early delay code. Detect when this is going to
happen and don't call delay in this case. It is safe to not delay here
with the only issue being KCSAN may not detect data races.
Details
- Reviewers
arichardson kib - Commits
- rG3fd63ddfdf35: Limit when we call DELAY from KCSAN on amd64
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
I don't known anything about this code but I can confirm that this fixes the panic for me when using KCSAN on QEMU.
What is the specific DELAY() call that caused the trouble?
If it is fine not to delay at that place, might be the call could be removed at all, instead?
The DELAY call is to wait for other CPUs to make memory accesses and check they are different from the one on the current CPU.
If DELAY also makes a memory access while holding a lock it can enter KCSAN that can then call DELAY for the above reason causing the kernel to try acquiring the lock when it wasn't expecting to. We could switch out the DELAY call for another mechanism, however I don't know the amd64 architecture well enough to make that change.
Are you referring to the DELAY() call in start_ap()? If yes, not doing delay there potentially makes the machine unbootable. We are sending startup IPI there, and then need some time before deciding that remote CPUs did not reacted which means that they are dead. And indeed TSC is not yet available in the moment.
In other words, this could only work on UP or under emulation that starts APs without any delay.
If DELAY also makes a memory access while holding a lock it can enter KCSAN that can then call DELAY for the above reason causing the kernel to try acquiring the lock when it wasn't expecting to. We could switch out the DELAY call for another mechanism, however I don't know the amd64 architecture well enough to make that change.
kcsan_access will be called from any memory access from a C file build with -fsanitize=thread unless the function has been marked as not to be sanitized. As long as we are far enough the boot (after SI_SUB_SMP, SI_ORDER_SECOND), kcsan_md_unsupported marks the address as valid to sanitize (always the case on amd64), and the kernel isn't in a panic we will check if the current memory access and any on the other CPUs in the system collide. If they do we report that as a possible race.
To allow for this check we add a short delay every 1024 memory accesses. This is implemented in kcsan_md_delay which calls DELAY on amd64. If the memory access that caused us to enter kcsan_access is within DELAY while holding a lock, as is the case on qemu under some conditions, we can cause a panic when it tries to acquire the same lock that doesn't expect to be recursive.