Page MenuHomeFreeBSD

Fix return value for callout_stop_safe(9).
AbandonedPublic

Authored by kib on Jul 17 2015, 12:03 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mar 14 2024, 12:04 AM
Unknown Object (File)
Mar 14 2024, 12:03 AM
Unknown Object (File)
Feb 11 2024, 9:47 PM
Unknown Object (File)
Feb 11 2024, 7:01 PM
Unknown Object (File)
Dec 28 2023, 8:35 AM
Unknown Object (File)
Dec 22 2023, 11:43 PM
Unknown Object (File)
Dec 22 2023, 5:24 AM
Unknown Object (File)
Dec 22 2023, 5:24 AM
Subscribers

Details

Reviewers
jhb
rrs
Summary

Here is a copy of the message from the PR 200992

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992

The issue is that callout_stop() call in the sleepq_check_timeout() returned zero, which made the thread to (in)voluntarily relinguish the CPU. But apparently the thread sleep callout was not really run, which means that there is nobody to make the thread runnable again.

The situation was already fixed once, it is the reason for the CALLOUT_DFRMIGRATION existence.

For me, it looks as if the r278469 could be a cause. The added block which calculates the not_on_a_list means that sometimes previously pending callout is not longer pending, but also that we return 0 in this case.I think that in case not_on_a_list == 1, we _must_ return 1 from callout_stop().

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib retitled this revision from to Fix return value for callout_stop_safe(9)..
kib updated this object.
kib edited the test plan for this revision. (Show Details)
kib added reviewers: rrs, jhb.
kib set the repository for this revision to rS FreeBSD src repository - subversion.

Hmm the problem here is the same I see in building the async-drain functionality. This
is the one case that is rather nebulous in the callout return when stopping. If you
return 1, you say the callout was stopped.. but was it? Not really, this block of
code will happen in the case you illustrate but also in the case where the
callout was already stopped. Returning 1 when it was already stopped (that I was able
to stop it) is not really what you want.

I wonder if it would not be better to have three different return values here.
-1 -- The callout was not running (i.e. it either completed or was stopped).
0 -- The callout is going to go off.
1 -- I was able to stop the callout.

This would make it so I don't have to have the ASSERT I put in that the callout_async_drain() cannot have been
called on a stopped timeout.. Since the user must know if the drain() is going to be called or not and 1
says no it won't and zero says yes it will.. with the -1 return then that would indicate no it won't since
it was *not* running.

Lets consider moving this to -1, though we would need to make sure no caller is
doing if (callout_stop()) to determine if it was stopped.. (I don't think so but let me check).

Yep that would be safe, no-one that cscope shows looks at the return of callout_stop :-)

rrs requested changes to this revision.Nov 6 2015, 12:11 PM
rrs edited edge metadata.

Oh I forgot to mark this :-)

Lets make it -1 on the return.

This revision now requires changes to proceed.Nov 6 2015, 12:11 PM
In D3115#85808, @rrs wrote:

Hmm the problem here is the same I see in building the async-drain functionality. This
is the one case that is rather nebulous in the callout return when stopping. If you
return 1, you say the callout was stopped.. but was it? Not really, this block of
code will happen in the case you illustrate but also in the case where the
callout was already stopped. Returning 1 when it was already stopped (that I was able
to stop it) is not really what you want.

No, subr_sleepqueue.c:sleepq_check_timeout() does not care about the callout already run, or rather, it handles the situation by other means. In other words, it only cares to not allow callout cause undefined behaviour by trying to wake up the thread after we already checked for timeout and do not need it anymore. As it, the code does not need to digest the difference between -1 and 1. I suppose most other consumers of callouts do not care either, because it is impossible to make any useful decision only on this information, you must consider the callout subroutine effect to decide.

I wonder if it would not be better to have three different return values here.
-1 -- The callout was not running (i.e. it either completed or was stopped).
0 -- The callout is going to go off.
1 -- I was able to stop the callout.

This would make it so I don't have to have the ASSERT I put in that the callout_async_drain() cannot have been
called on a stopped timeout.. Since the user must know if the drain() is going to be called or not and 1
says no it won't and zero says yes it will.. with the -1 return then that would indicate no it won't since
it was *not* running.

Lets consider moving this to -1, though we would need to make sure no caller is
doing if (callout_stop()) to determine if it was stopped.. (I don't think so but let me check).

Yep that would be safe, no-one that cscope shows looks at the return of callout_stop :-)

Ok, I suppose changing the not_on_a_list from 1 to -1 is what you mean ?

kib edited edge metadata.

Return -1 when callout was not found on a wheel.

kib edited edge metadata.

This is a rebase after r290664. Can we please, move forward with the patch ?

For some reason fabricator is not cooperating with me. sigh.

Line 1375 and 1384 *MUST NOT* be changed to a -1 return.

In this case we *need to know* that the callout could not be stopped!!

Why?

Well when you call callout_async_drain() you must count the 0 returns (not -1) to
figure out how many times your drain() function will be called.

If you return -1 for all three cases you will hose this up.

Please do *not* change line 1384 and line 1375 in *both* cases this should be a zero return
to indicate that
a) The callout could not be stopped (but has not ran)
and
b) If you called drain, expect a call to the drain() function.

Change this back two back to 0 return and I am more than happy with it.

In D3115#87005, @rrs wrote:

For some reason fabricator is not cooperating with me. sigh.

Line 1375 and 1384 *MUST NOT* be changed to a -1 return.

In this case we *need to know* that the callout could not be stopped!!

Why?

Well when you call callout_async_drain() you must count the 0 returns (not -1) to
figure out how many times your drain() function will be called.

If you return -1 for all three cases you will hose this up.

Please do *not* change line 1384 and line 1375 in *both* cases this should be a zero return
to indicate that
a) The callout could not be stopped (but has not ran)
and
b) If you called drain, expect a call to the drain() function.

Change this back two back to 0 return and I am more than happy with it.

Well, this means that no patch is needed at all.

Ok, let us take the step back from anything I proposed before. *The* guarantee the sleepqueue code needs from the callout_stop(9) interface is:

  • If callout_stop() returned zero, it must be *absolutely* guaranteed that the callout will run, after the lock is dropped
  • If the callout_stop() function returned non-zero, it is highly desirable that the callout will not run, otherwise spurious timeouts might happen.

Do the statements above hold for the callout_stop() code after the r290664 ? If yes, then stable/10 needs the change of
the return value for the !CALLOUT_PENDING && cc_exec_curr(cc, direct) != c) (line 1255) to -1.

The code currently in head will return one of three values:

  1. 1 -- I stopped the callout .. it *will not* be called.
  2. -1 -- The callout has already expired and has been called. You can get this by calling callout_stop() on an already stopped callout.. for example if you did: ret = callout_stop(); ret2 = callout_stop(); The value in ret2 *would* be -1. If it is not the already stopped case, then it is the case where you were doing something like: if(callout_active()) { mtx_lock(&my_mutex); ret = callout_stop(); } And you had to wait for the callout code to release the lock (it was running by the time you got to the mutex and you lost the race). So at that point of your stop it has already completed.
  1. 0 -- The callout could not be stopped and is about to fire.. i.e. when you release the lock you are holding it will complete. This only happens if the callout is waiting on the lock you hold to finish executing.

TCP shortly will be migrating to using
num_drains = 0;
ret = callout_async_drain(callout_1, cleanup_function);
if (ret == 0) {

num_drains++;

}
ret = callout_async_drain(callout_2, cleanup_function);
if (ret == 0) {

num_drains++;

}

if (num_drains == 0)

free(pcb)

else {

tp->drain_cnt = num_drains;

}
......
cleanup_function()
{

tp->drain_cnt--;
if (tp->drain_cnt == 0)
      free(tp)

}
(it has several callouts all using the same lock).

The drain_cnt of course will probably be a refcount but you get the idea

In D3115#87055, @rrs wrote:

The code currently in head will return one of three values:

I asked a reporter who is able to reproduce the issue from the PR 200992, to try the following change on stable/10:

Index: sys/kern/kern_timeout.c
===================================================================
--- sys/kern/kern_timeout.c	(revision 290743)
+++ sys/kern/kern_timeout.c	(working copy)
@@ -1204,7 +1204,7 @@ again:
 			CC_UNLOCK(cc);
 			if (sq_locked)
 				sleepq_release(&cc_exec_waiting(cc, direct));
-			return (0);
+			return (-1);
 		}
 
 		if (safe) {

I now agree with what you wrote, and the testing should be conclusive.

In D3115#87119, @kib wrote:

I now agree with what you wrote, and the testing should be conclusive.

Testing demonstrated that the problem is still there.

Could it be that the callout is not run in the not_on_a_list case, after the migration ? Or, do you have any other explanation for the problem.