lacp: short timeout erroneously declares link-flapping

Panasas was seeing a higher-than-expected number of link-flap events.
After joint debugging with the switch vendor, we determined there were
problems on both sides. Either one alone might cause the occasional
event, but together they caused lots of them.

On the switch side, an internal queuing issue was causing LACP PDUs --
which should be sent every second, in short-timeout mode -- to sometimes
be sent slightly later than they should have been. In some cases, two
successive PDUs were late, but we never saw three late PDUs in a row.

On the FreeBSD side, we saw a link-flap event every time there were two
late PDUs, while the spec says that it takes *three* seconds of downtime
to trigger that event. It turns out that if a PDU was received shortly
before the timer code ran, the timer would be decremented less than a
full second after the PDU arrived. Two delayed PDUs would then cause two
additional decrements, so the timer reached zero less than three seconds
after the most recent on-time PDU.

The solution is to note the time a PDU arrives, and only decrement if at
least a full second has elapsed since then.
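
The effect can be illustrated with a small user-space model. This is
only a sketch, not the kernel change itself: struct port, pdu_received(),
timer_tick() and first_expiry() are hypothetical names, time is kept in
milliseconds, and the timer code is assumed to run exactly once per
second.

```c
#include <stdbool.h>

#define TICK_MS		1000	/* assumed: timer code runs once per second */
#define TIMEOUT_TICKS	3	/* short timeout: three one-second periods */

/* Hypothetical per-port state; not the kernel's struct lacp_port. */
struct port {
	int	current_while;	/* ticks left before the partner expires */
	long	last_rx_ms;	/* arrival time of the most recent PDU */
};

/* A PDU re-arms the timer and records when it arrived. */
static void
pdu_received(struct port *p, long now_ms)
{
	p->current_while = TIMEOUT_TICKS;
	p->last_rx_ms = now_ms;
}

/*
 * One run of the timer code.  With check_elapsed set, the decrement is
 * skipped unless a full second has passed since the last PDU.
 * Returns true when the timer reaches zero.
 */
static bool
timer_tick(struct port *p, long now_ms, bool check_elapsed)
{
	if (p->current_while == 0)
		return (false);
	if (check_elapsed && now_ms - p->last_rx_ms < TICK_MS)
		return (false);
	return (--p->current_while == 0);
}

/*
 * Replay PDU arrival times against one-second ticks up to end_ms.
 * Returns the time of the first expiry, or -1 if the timer never expires.
 */
static long
first_expiry(const long *rx_ms, int nrx, long end_ms, bool check_elapsed)
{
	struct port p = { 0, 0 };
	int next = 0;

	for (long t = TICK_MS; t <= end_ms; t += TICK_MS) {
		/* Deliver every PDU that arrived before this tick. */
		while (next < nrx && rx_ms[next] < t)
			pdu_received(&p, rx_ms[next++]);
		if (timer_tick(&p, t, check_elapsed))
			return (t);
	}
	return (-1);
}
```

With PDUs arriving at 900 ms and 3100 ms (a 2.2 second gap), the model
without the elapsed-time check expires at 3000 ms, only 2.1 seconds after
the last PDU. With the check it never expires. If no PDU follows the one
at 900 ms, the checked version expires at 4000 ms, which is more than
three seconds after that PDU.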

Reported by: Greg Foster <gfoster@panasas.com>
Reviewed by: gallatin
Tested by: Greg Foster <gfoster@panasas.com>
MFC after: 3 days
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D35070

(cherry picked from commit 00a80538b4471b2978c5a1990f48189f2c692e24)


Greg Foster <gfoster@panasas.com> authored on Apr 26 2022, 6:38 AM
rpokala committed on May 1 2022, 7:16 PM