Dummynet PIE AQM: Fix kernel panic due to a race condition in callout
Closed, Public

Authored by ralsaadi_swin.edu.au on Jun 23 2016, 3:30 AM.

Details

Summary

Dummynet PIE AQM uses a callout to calculate the drop probability periodically. When a queue managed by PIE is destroyed, the callout should be stopped and the allocated memory freed.
The current code frees the allocated memory when callout_stop() returns non-zero. However, callout_stop() sometimes returns non-zero (I guess when the callout function is about to be called), and this race causes the callout to use memory that has already been freed.

Based on truckman's (Don Lewis) suggestions, we solved this problem by removing the memory-freeing code and race-condition checks from calculate_drop_prob() and adding a pie_callout_cleanup() callout function that destroys the mutex and frees the memory.
In aqm_pie_cleanup(), instead of using callout_stop(), we reset aqm_pie_callout to call pie_callout_cleanup() 1us later. This cancels the scheduled calculate_drop_prob() callout and invokes pie_callout_cleanup(), which frees the memory.
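
A minimal sketch of the resulting teardown path, using the field and function names that appear later in this review (pst->lock_mtx, pst->aqm_pie_callout, pie_callout_cleanup()); this is an illustration, not the committed hunk:

static void
aqm_pie_cleanup(struct dn_queue *q)
{
    struct pie_status *pst = q->aqm_status;

    mtx_lock(&pst->lock_mtx);
    q->aqm_status = NULL;
    /*
     * Cancel any pending calculate_drop_prob() and schedule
     * pie_callout_cleanup() 1us from now. Because the callout
     * was set up with callout_init_mtx(), it cannot run until
     * we drop lock_mtx below.
     */
    callout_reset_sbt(&pst->aqm_pie_callout, SBT_1US, 0,
        pie_callout_cleanup, pst, 0);
    mtx_unlock(&pst->lock_mtx);
}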

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

ralsaadi_swin.edu.au retitled this revision from to Dummynet PIE AQM: Fix kernel panic due to a race condition in callout.
ralsaadi_swin.edu.au updated this object.
ralsaadi_swin.edu.au edited the test plan for this revision. (Show Details)
ralsaadi_swin.edu.au added reviewers: hiren, jch, rrs.
ralsaadi_swin.edu.au set the repository for this revision to rS FreeBSD src repository - subversion.
ralsaadi_swin.edu.au added a subscriber: network.
truckman added inline comments.
sys/netpfil/ipfw/dn_aqm_pie.c
214–226 ↗(On Diff #17787)

Do we need to use callout_pending(), callout_active(), and callout_deactivate() here (Avoiding Race Conditions #3) given that the callout was initialized with callout_init_mtx()?

It seems strange to free resources from within the callout.
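
For reference, a sketch of the timeout(9) "Avoiding Race Conditions" pattern being asked about, as it would look at the top of a handler initialized with plain callout_init() (illustrative only; the point of the question is that callout_init_mtx() makes it unnecessary):

mtx_lock(&pst->lock_mtx);
if (callout_pending(&pst->aqm_pie_callout)) {
    /* A newer callout_reset() is pending; skip this invocation. */
    mtx_unlock(&pst->lock_mtx);
    return;
}
if (!callout_active(&pst->aqm_pie_callout)) {
    /* callout_stop() ran while we waited for the mutex. */
    mtx_unlock(&pst->lock_mtx);
    return;
}
callout_deactivate(&pst->aqm_pie_callout);
/* ... handler body ... */
mtx_unlock(&pst->lock_mtx);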

643–647 ↗(On Diff #17787)

Instead of callout_stop(), should callout_drain() be used here, without locking the mutex?

sys/netpfil/ipfw/dn_aqm_pie.c
643–647 ↗(On Diff #17787)

I suspect that a flag will need to be set before callout_stop() to tell the callout not to reschedule itself, or maybe just retain the early return if !callout_active().

It would be better to associate the callout with a mutex and use callout_async_drain() to catch the final teardown.

--HPS

> It would be better to associate the callout with a mutex and use callout_async_drain() to catch the final teardown.

We are already using callout_init_mtx().

It seems like callout_async_drain() would be more cumbersome to use than callout_drain():

if (callout_async_drain(&pst->aqm_pie_callout, teardown) != 0)
    teardown(...);

vs:

mtx_unlock(...);
(void)callout_drain(&pst->aqm_pie_callout);
[body of teardown()]

This code is also present in FreeBSD 10, which does not have callout_async_drain().

By switching to callout_drain() and removing the cleanup code from the callout, it should be possible to further simplify the code by getting rid of CALLOUT_RETURNUNLOCKED and removing the mtx_unlock() calls from the callout.
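
In other words, the initialization would change roughly as follows (a sketch, assuming the callout field names above):

/* Current: the callout must release pst->lock_mtx itself. */
callout_init_mtx(&pst->aqm_pie_callout, &pst->lock_mtx,
    CALLOUT_RETURNUNLOCKED);

/* Simplified: the callout system locks and unlocks pst->lock_mtx
 * around calculate_drop_prob(); teardown moves to the main thread
 * behind callout_drain(). */
callout_init_mtx(&pst->aqm_pie_callout, &pst->lock_mtx, 0);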

sys/netpfil/ipfw/dn_aqm_pie.c
214–226 ↗(On Diff #17787)

So, the main objection to the patch is resource freeing within the callout?

643–647 ↗(On Diff #17787)

I used callout_drain() before on FreeBSD 10.1-STABLE but I got a kernel panic, "sleeping thread (tid 100130, pid 0) owns a non-sleepable lock", since Dummynet uses locks in many places. On FreeBSD 11-CURRENT this panic does not appear, but I don't know why. So the callout_drain() solution can be applied to PIE on FreeBSD 11 only.

Regarding the idea of using a flag, it could work, but the problem is that the variable holding the flag must also be freed. I cannot use one global variable because multiple instances of PIE can exist.
Using "the early return if !callout_active()" could also cause a problem: if the callout function is about to be called and we free the memory just before that, the timeout subsystem will use a struct callout instance that has already been freed.

sys/netpfil/ipfw/dn_aqm_pie.c
214–226 ↗(On Diff #17787)

Yes, that is a very strange thing to do, but it looks like it may be unavoidable on FreeBSD 10. It could also be done using a taskqueue, but that looks like it would add other complications.

Also, the callout_pending()/callout_active()/callout_deactivate() stuff can all go away since callout_init_mtx() avoids the race conditions that those are meant to avoid when using plain callout_init().

643–647 ↗(On Diff #17787)

Yeah, I think you were getting the panic because callout_drain() can sleep and it looks like dummynet can call the cleanup handler while holding a mutex. I don't know why you saw different behaviour between 10 and 11.

On 11 I think the proper fix would be to use callout_drain_async() as hselasky suggested. For FreeBSD 10, see my comment for lines 653-655 in the post-patch code. If callout_drain_async() is used and resources are not freed by a callout, then the CALLOUT_RETURNUNLOCKED flag to callout_init_mtx() and the mtx_unlock() calls in the callout can be eliminated.

653–655 ↗(On Diff #17787)

Instead of using calculate_drop_prob() as the callout handler, you could use a separate cleanup function. That would avoid the need for the new flag.

sys/netpfil/ipfw/dn_aqm_pie.c
214–226 ↗(On Diff #17787)

As you suggested, I removed the callout_pending()/callout_active()/callout_deactivate() stuff from the callout function and everything is working fine.

643–647 ↗(On Diff #17787)

It seems the callout_drain_async() patch commit (r287780) has been reverted until more developers have their say. So I cannot use it for now.

sys/netpfil/ipfw/dn_aqm_pie.c
643–647 ↗(On Diff #17787)

It got backed out by r288096 but was added back by r290664, so it is available in any version of FreeBSD 11 that has the PIE code (r300779 and newer).

That said, I'd concentrate on the FreeBSD 10-compatible version for now. We can use that to patch the code in the HEAD branch (after approval by re@ since we are currently in code freeze for 11.0-RELEASE), and then MFC it back to FreeBSD 10.3-STABLE. Sometime after 11.0-RELEASE we can update the code in the HEAD branch to use callout_async_drain() (I got the name wrong above). If that is after the 11.0-STABLE branch is created, then we can MFC the change to 11.0-STABLE.

Based on truckman's suggestion, the callout_pending()/callout_active()/callout_deactivate() stuff has been removed since it is not required when callout_init_mtx() is used.

sys/netpfil/ipfw/dn_aqm_pie.c
210–213 ↗(On Diff #17993)

Any thoughts on moving this to a separate function?

static void
pie_callout_cleanup(void *x)
{
    struct pie_status *pst = (struct pie_status *) x;

    /* Entered with lock_mtx held by the callout system
     * (callout_init_mtx with CALLOUT_RETURNUNLOCKED), so we
     * release it ourselves before tearing it down. */
    mtx_unlock(&pst->lock_mtx);
    mtx_destroy(&pst->lock_mtx);
    pie_desc.ref_count--;
}

BTW, the ref_count manipulation should probably be done with atomic ops. I thought about protecting it with the mutex, but that doesn't work if there are multiple PIE instances. If the dummynet lock serializes access to ref_count, then we don't need atomic ops, but then we need to move the ref_count manipulation out of the cleanup handler and move it into a context that is protected by the dummynet lock.

639–643 ↗(On Diff #17993)

The above change eliminates the need for PIE_STOP_CALLOUT, and lines 639-641 would change to

callout_reset_sbt(&pst->aqm_pie_callout,
        SBT_1US, 0, pie_callout_cleanup, pst, 0);

Using a separate cleanup function makes it easier to convert it to use callout_async_drain(). I think that lines 639-643 would change to:

q->aqm_status = NULL;
if (callout_async_drain(&pst->aqm_pie_callout, pie_callout_cleanup) == 0)
        mtx_unlock(&pst->lock_mtx);
else
        pie_callout_cleanup(pst);

Unfortunately it looks like we can't get rid of CALLOUT_RETURNUNLOCKED even in the latter version.

Are similar fixes needed elsewhere in the AQM code?

sys/netpfil/ipfw/dn_aqm_pie.c
210–213 ↗(On Diff #17993)

I can move the code to a separate function, but I have to call it from inside the callout function (see my next comments).
Regarding protecting ref_count, you are right, but I will leave it for now because ref_count is checked only when dummynet is kldunloaded, and if ref_count is not zero the unloading process stops without causing a panic.

639–643 ↗(On Diff #17993)

What if calculate_drop_prob() is about to be executed and pie_callout_cleanup() runs before it? I am not sure such a scenario can occur, but maybe it can, since callout_reset_sbt() performs the equivalent of callout_stop() at the beginning.

fq_pie also has a callout, but no kernel panic has been reported for fq_pie. After we finish fixing PIE we can have a look at fq_pie.

sys/netpfil/ipfw/dn_aqm_pie.c
639–643 ↗(On Diff #17993)

Interesting point. If calculate_drop_prob() is about to execute, it can't start because we hold the mutex at this point. When we call callout_reset_sbt(), it will update the function pointer stored in the callout; the callout will then proceed when we drop the mutex, and pie_callout_cleanup() should be executed instead of calculate_drop_prob(), assuming the callout thread doesn't look at the function pointer until after it grabs the mutex. I don't know why it would be otherwise.

The other potential hazard I see is that pie_callout_cleanup() could be called twice: once due to the already-pending callout, and once due to the new one scheduled by the callout_reset_sbt() call. If so, your patch has the same issue: the pending call to calculate_drop_prob() would see the PIE_STOP_CALLOUT flag and destroy things, and a new call to calculate_drop_prob() would be scheduled, which would then crash and burn.

The man page does say that callout_reset*() cancels any pending callout before scheduling a new one.

sys/netpfil/ipfw/dn_aqm_pie.c
639–643 ↗(On Diff #17993)

Very good points.
I don't have enough info about callout internals. However, I have another idea to solve the problem, which should work fine (I think (-; ):

What if calculate_drop_prob() does not free the memory but just returns (without rescheduling) when it sees the PIE_STOP_CALLOUT flag? Then we can schedule another callout (an additional callout variable would be added to the pie_status struct) for pie_callout_cleanup() to be called after tupdate time. That way we guarantee no calculate_drop_prob() is scheduled and nothing will use the memory.

The drawbacks of this solution are that the memory stays allocated for tupdate time (15 ms by default) and an additional callout struct member has to be added.

sys/netpfil/ipfw/dn_aqm_pie.c
210–213 ↗(On Diff #17993)

My concern with ref_count is that it could be possible to get into a state where the module could not be unloaded. Since the increment/decrement are not atomic ops, an increment executed by a user thread and a decrement executed by the callout thread could race, causing unpredictable results. It might also be possible to get a zero ref_count when there are still active PIE instances. If the ref_count decrement is done in the user thread, then races should be less likely (there would have to be user threads adding or deleting instances at the same time) or impossible (if the dummynet mutex blocks this).

639–643 ↗(On Diff #17993)

I don't think that amount of complexity is needed.

With callout_init_mtx(), callout_stop() would almost work here. The case where it doesn't work is when the time expires and the callout gets scheduled while we are holding the mutex. The callout thread will then be waiting in mtx_lock() for us to unlock the mutex. If we did a callout_stop(), unlocked the mutex, and did the cleanup in the main thread, then there would be a race between the callout thread sitting in mtx_lock() waiting for us to unlock the mutex, and the mtx_destroy() and freeing of memory in the main thread.
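
Spelled out as code, the racy variant described above would look roughly like this (illustrative sketch; M_DUMMYNET and the field names are assumed from context):

mtx_lock(&pst->lock_mtx);
/* Timer may expire here: the callout is dispatched and its thread
 * now blocks in mtx_lock(), waiting for us. */
callout_stop(&pst->aqm_pie_callout);   /* too late to stop it */
mtx_unlock(&pst->lock_mtx);            /* callout thread wakes up... */
mtx_destroy(&pst->lock_mtx);           /* ...and races with this */
free(pst, M_DUMMYNET);                 /* use after free -> panic */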

If we use callout_reset_sbt(), I think the pending callout can be cleanly cancelled. There's no reason that callout_reset_sbt() can't set some flags in the callout to indicate that it needs to be rescheduled. The callout thread will see those once it returns from mtx_unlock() and not actually execute the callout, but instead re-add it to the wheel at the new time. I'm willing to believe the claim in the man page here.

I've tried looking at the callout internals and the code is pretty difficult to understand. It would be helpful if the callout(9) man page had separate descriptions of what happens in the callout_init() case vs. the callout_init_mtx() case.

sys/netpfil/ipfw/dn_aqm_pie.c
639–643 ↗(On Diff #17993)
NOTE: pie_callout_cleanup()

callout_async_drain() will call the cleanup callback function unlocked.

--HPS

639–643 ↗(On Diff #17993)

> I've tried looking at the callout internals and the code is pretty difficult to understand. It would be helpful if the callout(9) man page had separate descriptions of what happens in the callout_init() case vs. the callout_init_mtx() case.

Hi,

+1

Maybe this implementation of the callout API is easier to understand:

https://svnweb.freebsd.org/base/projects/hps_head/sys/kern/kern_timeout.c?revision=299262&view=markup

--HPS

ralsaadi_swin.edu.au updated this object.

Based on truckman's (Don Lewis) suggestions, the race is fixed by resetting the callout in aqm_pie_cleanup() to call pie_callout_cleanup() and removing the memory freeing from calculate_drop_prob(). No additional flags are needed.
This patch was tested and no kernel panic occurs.

sys/netpfil/ipfw/dn_aqm_pie.c
608 ↗(On Diff #18054)

Either ref_count should be decremented by the user thread after the callout_reset_sbt() call, to avoid possible miscounts due to races between the user thread and the callout thread, or all ref_count manipulations should be done with atomic operations.

If PIE ref_count is decremented by the user thread after callout_reset_sbt(), how do we guarantee that pie_callout_cleanup() has finished its execution?
Thus, I used DN_BH_WLOCK() and DN_BH_WUNLOCK() (provided by Dummynet) to make the ref_count manipulation atomic.
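
A sketch of that arrangement inside the cleanup callout, assuming DN_BH_WLOCK()/DN_BH_WUNLOCK() behave as described in this thread (illustrative; the committed hunk may differ in detail):

static void
pie_callout_cleanup(void *x)
{
    struct pie_status *pst = (struct pie_status *) x;

    mtx_unlock(&pst->lock_mtx);
    mtx_destroy(&pst->lock_mtx);
    free(pst, M_DUMMYNET);

    /* Serialize with the ref_count check in unload_dn_sched(). */
    DN_BH_WLOCK();
    pie_desc.ref_count--;
    DN_BH_WUNLOCK();
}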

> If PIE ref_count is decremented by the user thread after callout_reset_sbt(), how do we guarantee that pie_callout_cleanup() has finished its execution?

I was thinking that wouldn't matter, and I don't think it does from the standpoint of the data, but I now see that we can't let the module get unloaded while the callout system still has a reference to code in the module.

> Thus, I used DN_BH_WLOCK() and DN_BH_WUNLOCK() (provided by Dummynet) to make the ref_count manipulation atomic.

That isn't sufficient unless all instances of the ref_count manipulation are protected by the same mutex. The instruction sequence for the increment/decrement probably fetches the current value, increments or decrements it, and then writes it back. If that sequence gets preempted in the middle by an interrupt, then another thread could manipulate ref_count, and when the original thread resumed, it would overwrite the value written by the second thread with an incorrect value. Locking an MTX_DEF mutex doesn't prevent the thread from being preempted, it only prevents another thread that uses the same mutex from updating ref_count until the first thread unlocks the mutex.

Protecting the other instances of ref_count manipulation with DN_BH_WLOCK() could be problematic. There is the danger that that lock could already be held in other code paths, causing recursion on that lock, and there could be lock ordering problems where different code paths could grab that lock and pst->lock_mtx in different orders, which could lead to a deadlock. This would require thorough testing with WITNESS enabled. It's probably easier to use atomic_add_int() everywhere.
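
For illustration, the atomic(9) alternative mentioned here would look something like this (a sketch, assuming ref_count is declared as a u_int; this is not what was ultimately committed):

#include <machine/atomic.h>

/* Instance created: */
atomic_add_int(&pie_desc.ref_count, 1);

/* Instance destroyed; safe from the callout thread, no lock needed: */
atomic_subtract_int(&pie_desc.ref_count, 1);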

There is still a small race that remains. Even decrementing ref_count at the end of the callout could allow another thread to see that ref_count is 0 and unload the module before the callout has completed. Unlikely, but possible. I think the only cure for that is to use callout_drain() and then do all the final cleanup in the main thread after callout_drain() returns. That would require finding the mutex that is being locked before the call to aqm_pie_cleanup() and either unlocking it or changing it to another type of lock that allows sleep operations. In this case we probably would not have to worry about atomic ops, assuming that there is only one user thread at a time playing with dummynet.

It looks like there would be some upside to changing the type of the dummynet lock. Because it is currently a mutex, it requires a lot of memory allocations to be done with M_NOWAIT. Since those can fail, that requires a bunch of additional error handling code. I'm not sure if there is any downside to making this an sx(9) lock.

Sorry if my thought is incorrect, but just to explain my idea of using DN_BH_WLOCK():
1- ref_count (namely pie_desc.ref_count) is updated only by the PIE module and checked (read access) by unload_dn_sched() in ip_dummynet.c.
2- ref_count in unload_dn_sched() is accessed with DN_BH_WLOCK() held.
3- ref_count in the PIE module is updated while DN_BH_WLOCK() is held (either explicitly by calling DN_BH_WLOCK() in the PIE module, or implicitly by Dummynet somewhere before calling PIE functions).
So, is the lock I added still insufficient?
Anyway, using atomic_add_int() is much easier than dealing with deadlocks ;-)

Regarding the "a small race that remains", I agree with you, this race could happen. However, changing Dummynet lock type is out of my knowledge.

truckman added a reviewer: truckman.

> Sorry if my thought is incorrect, but just to explain my idea of using DN_BH_WLOCK():
> 1- ref_count (namely pie_desc.ref_count) is updated only by the PIE module and checked (read access) by unload_dn_sched() in ip_dummynet.c.
> 2- ref_count in unload_dn_sched() is accessed with DN_BH_WLOCK() held.
> 3- ref_count in the PIE module is updated while DN_BH_WLOCK() is held (either explicitly by calling DN_BH_WLOCK() in the PIE module, or implicitly by Dummynet somewhere before calling PIE functions).
> So, is the lock I added still insufficient?
> Anyway, using atomic_add_int() is much easier than dealing with deadlocks ;-)
>
> Regarding the small race that remains: I agree with you, this race could happen. However, changing the Dummynet lock type is beyond my knowledge.

I was able to confirm #1 and #2 earlier.

I dug through the dummynet code and I now believe #3 to be true. Adding some lock assertions to verify this is true and remains true in the future would be nice.
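
Such an assertion might look like the following sketch; DN_BH_WLOCK_ASSERT() is hypothetical here and would have to be added to dummynet, e.g. as a wrapper around mtx_assert(9) with MA_OWNED:

/* Hypothetical: assert the dummynet writer lock is held wherever
 * pie_desc.ref_count is manipulated. */
DN_BH_WLOCK_ASSERT();
pie_desc.ref_count++;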

I now believe that the lock that you added is sufficient.

Changing the dummynet lock type is outside the scope of this fix. I haven't really looked deeply into the dummynet locking, but I suspect it is using RLOCK in the packet-processing path. Probably the scope of the WLOCK should be narrowed to cover just the code that directly impacts packet processing, with a new outer sx lock where some of the current WLOCK calls are now. Changing RLOCK/WLOCK to use an rmlock instead of a mutex might help performance by allowing more parallelism, but that's an entirely different topic.

This revision is now accepted and ready to land. (Jul 4 2016, 5:32 AM)

Should I remove DN_BH_WLOCK()/DN_BH_WUNLOCK() and use atomic_add_int() instead?
Is the patch reasonable now?

This revision was automatically updated to reflect the committed changes.

> Should I remove DN_BH_WLOCK()/DN_BH_WUNLOCK() and use atomic_add_int() instead?
> Is the patch reasonable now?

Since I was able to confirm that all other cases of ref_count manipulation are protected by DN_BH_WLOCK(), using it in the callout is probably the best thing to do. The downside is that pie_callout_cleanup() can't be used when callout_drain_async() detects that the callout is already stopped, since DN_BH_WLOCK() will already be held in that context.

I received permission from re@ to commit the patch as-is and it is now in the tree. I'll merge it to FreeBSD 10 in a few days.