Page MenuHomeFreeBSD

Fixes on Bridge+CARP crashes/freezes
Needs RevisionPublic

Authored by eri on Jul 20 2015, 6:50 AM.

Details

Summary

PR #200319

The interaction between carp and bridge code makes the OS hang.

If there is a bridge with member an interface that has a CARP ip and no minimal amount of traffic flowing in the system result is a hang of the system and sometimes a crash.

Reference https://redmine.pfsense.org/issues/4607

After analysis seems that the carp code uses taskqueue_swi for scheduling demotion events.
The carp CIF lock had a lot of contention.
ether_input was doing duplicate checks of bridge_input causing even more contention on the CIF lock.

Attached patch converts CIF lock to RW lock.

  • avoids duplicate checks from ether_input in case of bridge
  • schedules task as taskqueue_thread rather than a SWI thread
Test Plan

Confirmed on forums and from many installations through pfSense!

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 6877

Event Timeline

eri updated this revision to Diff 7092.Jul 20 2015, 6:50 AM
eri retitled this revision from to Fixes on Bridge+CARP crashes/freezes.
eri updated this object.
eri edited the test plan for this revision. (Show Details)
eri added reviewers: glebius, gnn.
eri set the repository for this revision to rS FreeBSD src repository.
eri added a project: network.
eri added a subscriber: network.
eri added a reviewer: hrs.Jul 20 2015, 6:51 AM
gnn requested changes to this revision.Jul 24 2015, 1:27 PM
gnn edited edge metadata.
gnn added inline comments.
sys/net/if_ethersubr.c
616

I am confused by this additional else. It looks like it won't compile but maybe phabricator is confused?

sys/netinet/ip_carp.c
1130

Do not leave commented out locking code in the tree please.

1134

Do not leave commented out locking code in the tree please.

This revision now requires changes to proceed.Jul 24 2015, 1:27 PM
eri added a comment.Jul 24 2015, 5:19 PM

Will submit a new update taking into consideration comments.

sys/net/if_ethersubr.c
616

No probably i forgot to properly indent this one, its correct.
It basically skips checks that are not necessary if you go into bridge code.

sys/netinet/ip_carp.c
1130

I left it so it could be more easily reviewable but will remove it.

eri updated this revision to Diff 7347.Jul 26 2015, 7:08 PM
eri edited edge metadata.

Update to catch up with coments

This certainly appears to resolve the hangs I saw with doing CARP with if_bridge, myself. I had deadlocks occurring between 3 threads (swi, bridge input, and constituent ethernet device input) VERY quickly, resulting in a full system hang. After integrating this patch on my two router VMs, they are stable.

Is anyone taking on ownership of this so that if_bridge is compatible with CARP? We really ought to disable CARP on if_bridge completely if it is not fixed, because this deadlock ends up taking down everything, even the system console.

eri updated this revision to Diff 24185.Jan 19 2017, 5:38 AM
eri edited edge metadata.
eri changed the edit policy from "All Users" to "committers (Project)".

Update to include context.

eri marked 5 inline comments as done.Jan 19 2017, 5:40 AM
eri added a reviewer: ae.Feb 14 2017, 1:24 AM
ae added inline comments.Feb 15 2017, 7:21 AM
sys/netinet/ip_carp.c
146

Since you changed the lock type and all places from where it is invoked, maybe it will be better to change the name too. cif_lock for example.

1218

Note, you lost the lock protection here. Without this lock you can miss state changing. This is very unlikely and not so important, but anyway I think you need note this possibility at least in comment. Or resurrect the lock back again.

But in the carp_linkstate() you keep the lock in the similar loop.

ae edited edge metadata.Feb 15 2017, 7:30 AM

Also can you describe the cause of hang? From the patch it is not obviously to me.

ae added a reviewer: kp.Feb 15 2017, 7:33 AM
eri added a comment.Feb 15 2017, 6:30 PM
In D3133#198459, @ae wrote:

Also can you describe the cause of hang? From the patch it is not obviously to me.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200319 has some history and links.
Though the real cause is order or locks being taken is not correct in the path.

This is related to how bridge is hooked into the ethernet input code and the ugliness it tries to do handling carp.

ae added a comment.Feb 18 2017, 11:32 AM
In D3133#198740, @eri wrote:
In D3133#198459, @ae wrote:

Also can you describe the cause of hang? From the patch it is not obviously to me.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200319 has some history and links.
Though the real cause is order or locks being taken is not correct in the path.
This is related to how bridge is hooked into the ethernet input code and the ugliness it tries to do handling carp.

There was a number of fixes in the bridge code, that are related to locking issues. This is why I added Kristof the the review.
I'm not sure that changing of lock type really fixes the problem (it just can become harder to reproduce), since I still did not found the detailed analysis of the problem.
But I can imagine that rwlock reduces the lock contention, and I think this change should not be described as a fix for a deadlock.

This revision now requires changes to proceed.Sep 29 2017, 7:38 AM
loos added a subscriber: loos.Mar 14 2018, 5:39 PM
loos added inline comments.
sys/netinet/ip_carp.c
1218

Interesting you have noticed this :-)

This change is what actually fix this issue, all the other changes are a nop.

carp_forus() is used by bridge and there is a race between the moment that the bridge lock is held and the call of carp_forus().

If the a carp callout triggers, it will try to send a packet to the bridge and it now has to wait for the bridge lock with the carp lock held. The carp_forus() runs now and it is now waiting for the carp lock with the bridge lock held.

loos requested changes to this revision.Mar 14 2018, 5:40 PM