Associated fix for arp/nd6 timer usage.
ClosedPublic
Actions

Authored by rrs on Feb 4 2015, 7:30 PM.

Details

Reviewers

kib
jhb
imp
lstewart
gnn
sbruno
bz
adrian
rwatson

Summary

There is a hole in the timer code that when you give it
a lock, it will lock that lock and then check to see if
its been canceled. If the API user actually deletes the memory
out from under the callout has its being processed this could
(and did in nd6/arp case) lead to a panic when you try to lock
(in the callout code) a lock thats been removed. If you use
the MPSAFE version, you don't get that since the non-stoppable
timers proceed and does the right thing.

Test Plan

After running witness and invariant I will put this on a couple of netflix caches and make
sure I see no adverse results (leaked nd6 or arp entries).

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

rrs updated this revision to Diff 3636.Feb 4 2015, 7:30 PM

rrs retitled this revision from to Associated fix for arp/nd6 timer usage..

rrs updated this object.

rrs edited the test plan for this revision. (Show Details)

rrs added reviewers: jhb, imp, sbruno, gnn, rwatson, lstewart, kib, adrian.

rrs added subscribers: • hselasky, julian, hiren and 2 others.

I'm still a little iffy about this kind of fix. It looks more like a lifecycle management thing that we should fix, rather than just moving to MPSAFE. But I unfortunately don't have the time to go look at the lle code and figure out how it should look. :(

It smells like a lle_refcnt bug to me and I have send a private email to errs to understand the problem better (and also if other optional kernel options might have been used while this was experienced).

I don't think this is a refcnt issue bz, the base of this is a hole in the way
the callout code works. Basically there is a window when

a) The callout_wheel is executing, it sees that a "lock" has been configured, so it goes to

release the callout wheel lock and then lock the callout init'd lock

b) At that time some other cpu has the lock (that was inited on the callout), and it then

runs a callout_stop (not drain). This cause the callout to  "stop" the callout from running
(which it can do). It sets a flag on the callout and returns to the caller. The caller (lle in this case)
proceeds to delete the ref cnt since the callout was stopped (and it is it won't be run). It then
in the end purges the memory.

c) Now we resume <a> above and it now de-ref's the lock.

This window is not avoidable with the way the current callout code is architected. It can only
be avoided by the caller getting the lock not the callout system. That way it won't de-ref
the lock and blow up when it hits deleted memory.

There may be other ways to fix this, but I don't know how we can change the callout
system to handle it.. Even Han's re-write has this same problem if you use the callout_stop and
not callout_drain*

This is just "How It Works". You are always supposed to do a callout_drain() before freeing the storage belonging to a callout. I don't understand how you are preventing the callout/lock being freed out from under the callout routine in this version either. Now you can have this sequence:

a) softclock dequeues callout to run

b) other thread grabs lle_lock

c) softclock blocks on lle_wlock above

d) other thread tears down structure, unlocks lock, zeros memory, 0xdeadc0de, etc.

e) softclock wakes up in mutex code and panics becuase the mutex is destroyed and it either triggers an assertion, follows a bad pointer trying to propagate priority or see if the "owner" is running, etc.

You have to drain the callout somehow. Hans other solution is to arrange to have a callback function do the free for you if you can't block in the context where you are trying to free the structure.

JHB:

The scenario you outline is *exactly* the panic that was seen by sbruno. I guess my description
was unclear.

The existing code in that other thread <b> right now does a callout_stop and
tests the return code. If its one its one (which says I canceled a callout) then it
lowers the reference count. Then goes on down a few lines later and does
a FREE_LLE_LOCKED macro which lowers the reference count again.

The one return happens because the callout has a lock associated with it. If you change
to MPSAFE then instead there is no lock so the callout_stop() will return zero since the
callout can *not* be stopped. This means that the code at <b> *will not* lower the reference
count. It then will call FREE_LLE_LOCKED() but it will find a reference of 2 not 1.. since it
did not do the extra lower. So it returns without freeing the lle.

When soft clock continues, the callout will run and since the reference was not lowered
the memory has not been freed.

.. except he said callout_drain(). What happens if that's put in as part of the teardown process?

Hmm thinking about your comment jhb, we could easily add the callout_drain_async to the
current callout code. If you think its worth while maybe we should add that to D1711

Jhb, if you think its worth doing add that to D1711 and I will work on it ;-)

Adrian:

I know he said callout_drain, but just like in TCP that is *not* always possible. In
the case of the arp/nd6 code lock are held (same as TCP) so you can't do a callout_drain. Thats
why the original author put ref-counting in with the idea that the timer would kill it if it had
to execute.. they just did not anticipate that by having the callout grab the lock, it would
then be making references to it after they deleted it in this one case.

Maybe I need to go back through the code and using jhb's a-n outline point out the lines of code
so everyone can follow along how this fixes it...

Jhb/Others

So lets go through your scenario with code in arp:

a) softclock dequeues callout to run

Which calls softclock_call_cc We make it to line:676 and see that "yes" the user (arp) init'd with a rw_mtx and run the next line 677 (to get the lock).

b) other thread grabs lle_lock

Our other thread is a call in to flush the table in net/if_llatble.c the lucky winner is line 181.

c) softclock blocks on lle_wlock above

from the call to line 677 right.

d) other thread tears down structure, unlocks lock, zeros memory, 0xdeadc0de, etc.

Now here is where its interesting, the other thread does if (callout_stop(&lle->la_timer)) LLE_REMREF(lle); lleentry_free

llentry_free is going to do:
remove the entry from the lists
and in the end call LLE_FREE_LOCKED()
LLE_FREE_LOCKED() is a macro that checks if (lle->lle_refcnt == 1) call free function else { LLE_REMREF(lle) LLE_WUNLOCK() }

Since we are a "stoppable callout" and the callout has a lock, it will fall through (even though its not safe) and return 1, yes the callout was stopped. So we hit the call free function which in this case in_lltable_free which does: LLE_WUNLOCK(lle) LLE_LOK_DESTROY(lle) free(lle, M_LLTABLE)

And we wake up and boom.

However, with the change from callout_init_rw(c,..) -> callout_init(c, 1) things change.

Instead you get a 0 return from callout_stop, since the callout can *not* be stopped at this point.
We don't do that first reference lower so we hit the else case in llentry_free which just reduces
the count to 1 and unlocks and returns.

Now our callout proceeds, getting the lock and it will then go through and check only
the pending flag (the active has been removed by the stop but we don't care). There
we now do the free and all is well.

That is how this fix avoids the issue.

Would it be better to have callout_async_drain()? Yes probably so, but then this
code would have to be restructured a lot more than this small change.

I hope that explains how it works here.. unless of course I am missing something???

Update from llnw world:

Things have been pretty stable here without any panics for 24+ hours with Stable10+D1711+D1777.

Thanks a lot, Randall!

Hiren, it only took us 4 years to trigger this? Can people actually easily/reliably reproduce it?

In D1777#16, @bz wrote:

Hiren, it only took us 4 years to trigger this? Can people actually easily/reliably reproduce it?

Heh, I am not sure about "people" but we @llnw can see this very reliably.

Do you have any other theories/patches that we can try? It'd be helpful to understand your reservations about this patch.

You said about some panics, do you have traces?

It all started with:
https://lists.freebsd.org/pipermail/freebsd-net/2014-September/039730.html

Last (conclusive) email in that thread:
https://lists.freebsd.org/pipermail/freebsd-net/2015-January/040895.html

That issue was fixed by: https://reviews.freebsd.org/D1438 i.e. https://svnweb.freebsd.org/base?view=revision&revision=277213

That got reverted as it was not entirely correct/complete. And rrs@ started working on a better approach with https://reviews.freebsd.org/D1711

After applying D1711, we started seeing a bunch of other panics:
panic #1 https://reviews.freebsd.org/D1711#54
panic #2 https://reviews.freebsd.org/D1711#55
panic #3 https://reviews.freebsd.org/D1711#56

And finally. after applying patch from this review D1777, we do not see any of the panics and machines seem happy.

Randall:

Do you want to close this review as committed? The commit hook doesn't seem to have fired to close this when comitted at svn r278472

sbruno accepted this revision.Mar 24 2015, 2:22 PM

sbruno edited edge metadata.

This revision is now accepted and ready to land.Mar 24 2015, 2:22 PM

Should this be closed?

I believe we can close this.

oleg added a subscriber: oleg.Aug 10 2017, 12:23 PM

Please correct me if i'm wrong but there is ref leakage. In particular:

"If the callout was restarted, the pending bit will be back on and"
agree.

"we just want to bail"
no, i would say we should do LLE_REMREF(lle) first, then unlock/return
because this one is wrong:

"since the callout_reset would return 1 and our reference would have been remove by nd6_llinfo_settimer_locked above since canceled would have been 1."

If we got callout_pending() !=0 in callout function it does mean callout_reset failed to stop us and callout_reset return value would be 0.