This deadlock can be reproduced relatively easily with UDP.
Several applications can try to send UDP datagrams to the same IPv6
destination address at the same time. When the UDP datagrams are
larger than the link MTU and the application has disabled IP
fragmentation, ip6_output() calls pfctlinput2(PRC_MSGSIZE) for the
given destination. pfctlinput2() then calls pr_ctlinput for all protocols.
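
For anyone trying to reproduce this, the sender-side setup might look
roughly like the sketch below. It assumes the application disables
fragmentation with the IPV6_DONTFRAG socket option from RFC 3542; the
helper name open_udp6_dontfrag() is made up for illustration. Sending a
datagram larger than the link MTU on such a socket, from several
processes at once, should then reach the path described above.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a UDP/IPv6 socket with fragmentation disabled.
 * Returns the descriptor, or -1 on any failure.
 * (Hypothetical helper, not part of the patch.)
 */
static int
open_udp6_dontfrag(void)
{
	int on = 1;
	int s = socket(AF_INET6, SOCK_DGRAM, 0);

	if (s < 0)
		return (-1);
#ifdef IPV6_DONTFRAG
	/* Ask the stack not to fragment datagrams sent on this socket. */
	if (setsockopt(s, IPPROTO_IPV6, IPV6_DONTFRAG, &on, sizeof(on)) == 0)
		return (s);
#endif
	close(s);
	return (-1);
}
```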
So, this is what we have at this point:
- several applications are trying to send UDP datagrams;
- app1 calls udp6_send(), takes INP_WLOCK() and INP_HASH_WLOCK(), and calls udp6_output()->ip6_output();
- app2 does the same;
- both threads are in ip6_output(), which determines that the MTU of the outgoing interface is too small for the datagram and calls pfctlinput2();
- now we are in in6_pcbnotify(), which takes INP_INFO_WLOCK() and then tries to enumerate all PCBs and take the INP_WLOCK() of each one. But some of these PCBs are already locked by the other thread, which is itself waiting for INP_INFO_WLOCK(), so neither thread can make progress.
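
The lock ordering in the steps above can be sketched with a toy,
single-threaded model. Nothing here is kernel code: the toy_* names are
invented, locks are reduced to "which thread id holds it", and the model
assumes that a PCB already held by the calling thread is skipped during
the walk.

```c
#include <stddef.h>

#define TOY_UNLOCKED 0	/* thread ids start at 1 */

enum toy_result { TOY_DONE, TOY_WAIT_INFO, TOY_WAIT_PCB };

struct toy_pcb {
	int inp_owner;	/* thread id holding this PCB's lock, or TOY_UNLOCKED */
};

/*
 * Model one thread entering the notify walk while it still holds its
 * own PCB lock. info_owner stands in for INP_INFO_WLOCK().
 */
static enum toy_result
toy_pcbnotify(int *info_owner, const struct toy_pcb *pcbs, size_t n, int tid)
{
	size_t i;

	if (*info_owner != TOY_UNLOCKED && *info_owner != tid)
		return (TOY_WAIT_INFO);	/* global lock held by another thread */
	*info_owner = tid;		/* take the global lock */

	for (i = 0; i < n; i++) {
		int owner = pcbs[i].inp_owner;

		/* Assumption: a PCB this thread already holds is skipped. */
		if (owner != TOY_UNLOCKED && owner != tid)
			return (TOY_WAIT_PCB);	/* still holding the global lock */
	}
	*info_owner = TOY_UNLOCKED;	/* walk finished, drop the global lock */
	return (TOY_DONE);
}
```

With two PCBs held by threads 1 and 2, thread 1 gets TOY_WAIT_PCB while
keeping the global lock, and thread 2 then gets TOY_WAIT_INFO: each
waits on a lock the other holds.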
Robert has suggested two solutions here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197059
- Use netisr to reinject this notification to avoid the deadlock.
- Don't notify all sockets, and let the others discover the MTU on demand.
RFC 3542, section 11.3, suggests notifying every application that wants
to receive IPV6_PATHMTU ancillary data for each ICMPv6 Packet Too Big
message, but it does not prescribe such behaviour for our case, where we
don't receive an ICMPv6 message.
This patch implements the second solution. I changed the
ip6_notify_pmtu() function so that it can be used from both
in6_pcbnotify() and ip6_output().
Now we don't notify all sockets when no ICMPv6 message was received.
Since we already hold the PCB locks when we are trying to send a
datagram, it should be safe to call ip6_notify_pmtu() directly from
ip6_output(). In the second case (ICMPv6 Packet Too Big received) we
do this from the input path.
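
The idea behind the second solution can be illustrated with another toy
model. This is a sketch under stated assumptions, not the patch itself:
struct toy_sock and toy_notify_sender() are invented names, and the
real ip6_notify_pmtu() takes different arguments. The point it shows is
that the output path touches only the sending socket, whose lock the
caller already holds, so no other lock has to be acquired.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_sock {
	int      lock_owner;	/* thread id holding the PCB lock, 0 if none */
	bool     wants_pathmtu;	/* application asked for path MTU notifications */
	uint32_t reported_mtu;	/* last MTU queued for the application, 0 if none */
};

/*
 * Output-path notification: touch only the sending socket, whose lock
 * the caller must already hold. Returns true if a notification was queued.
 */
static bool
toy_notify_sender(struct toy_sock *so, int tid, uint32_t mtu)
{
	if (so->lock_owner != tid)
		return (false);	/* caller does not hold the lock: refuse */
	if (!so->wants_pathmtu)
		return (false);	/* application did not ask for this */
	so->reported_mtu = mtu;
	return (true);
}
```

Two senders that each call toy_notify_sender() on their own socket never
look at each other's state, which is why the lock cycle from the
all-sockets walk cannot form in this model.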