Ensure the case where timer is executing concurrently with cancel_delayed_work() is handled properly.
Found by: rstone@
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Differential D28017
Fix race in cancel_delayed_work() in the LinuxKPI • hselasky on Jan 7 2021, 11:13 AM. Authored by Tags None Referenced Files
Details Ensure the case where timer is executing concurrently with cancel_delayed_work() is handled properly. Found by: rstone@
linux_delayed_work_timer_fn() The following sequence of events is possible: A: Call linux_cancel_delayed_work() As a result, the delayed_work struct will never be invoked. This is Do you have a recommendation? Should we unconditionally
Diff Detail
Event TimelineComment Actions I don't think that spinning is the right answer. That could hold the CPU that the callout needs to make progress, resulting in a livelock. I'm testing this patch internally. It has survived 12 hours of testing so far, which is a really good sign (usually I would see IB ARP deadlock within an hour or so): https://people.freebsd.org/~rstone/patches/cancel-workqueue.diff Comment Actions @rstone: It is not spinning, it is only re-evaluating the state when the race is detected. There are a few murky things about Linux works which you should know about first: First of all the work callback function is allowed to free the work itself, which it was called on. Only if the client does a cancel request on the work, we know that the memory is still there. In the current code, issuing a cancel operation on a zero-initialized, but not initialized work, is supported. After your patch you depend on all works being cancelled have been init, due to the use of the callback's mutex. Yes, you can solve this by locking the callback mutex too, to serialize the operations, like you did in your patch. Comment Actions @rstone: You also patch linux_cancel_delayed_work_sync(), but I think that function is OK, because TIMER state goes to IDLE state, and then the timer CB has IDLE to IDLE, so that function is OK! Comment Actions This patch doesn't work. After the goto retry, nothing stops the cancelling thread from seeing state still in WORK_ST_CANCEL, and then the timer thread can come in and change it to TASK, causing the same bug. We absolutely need a lock here for proper synchronization. I'm looking at the Linux implementation and it appears to assert that the underlying timer object be initialized (well, specifically it unconditionally calls del_timer on it and that asserts that it is initialized). Comment Actions @rstone: If you have schedule delayed work racing with the cancel function by means of two threads, then yes you are right, but that is undefined behaviour (or an application problem). I don't see how the timer can still be running in WORK_ST_CANCEL, when you assume that scheduling and cancelling threads are serialized, because in WORK_ST_TIMER we call linux_cancel_timer() which takes the mutex of the callout, which means the callout is only allowed to execute between the switch() statement and until linux_cancel_timer()! When you stop a mutex protected callout when its mutex is locked, the callback cannot be executing after you drop the mutex after callout_stop() ! --HPS
Comment Actions
Yes it can! You are missing a piece of logic in the callout code: if (c_lock != NULL) { class->lc_lock(c_lock, lock_status); /* * The callout may have been cancelled * while we switched locks. */ if (cc_exec_cancel(cc, direct)) { class->lc_unlock(c_lock); goto skip; } /* The callout cannot be stopped now. */ cc_exec_cancel(cc, direct) = true; if (c_lock == &Giant.lock_object) { You see that cc_exec_cancel() is under the mutex, so this makes callout_stop() atomic and the callback will be stopped if it started to run! --HPS Comment Actions sorry I got distracted by a different critical issue at work. Will be able to return to this next week Comment Actions Sorry about the delay in coming back to this; I've had to deal with a number of critical issues at work. I ported the patch that we tested internally to main and posted it to this review: Comment Actions Already fixed by @rstone : https://cgit.freebsd.org/src/commit/?id=b58cf1cb35abc095d7fb9773630794b38bbc6b1d --HPS |