lacp: short timeout erroneously declares link-flapping


lacp: short timeout erroneously declares link-flapping

Panasas was seeing a higher-than-expected number of link-flap events.
After joint debugging with the switch vendor, we determined there were
problems on both sides; either of which might cause the occasional
event, but together caused lots of them.

On the switch side, an internal queuing issue was causing LACP PDUs --
which should be sent every second, in short-timeout mode -- to sometimes
be sent slightly later than they should have been. In some cases, two
successive PDUs were late, but we never saw three late PDUs in a row.

On the FreeBSD side, we saw a link-flap event every time there were two
late PDUs, while the spec says that it takes *three* seconds of downtime
to trigger that event. It turns out that if a PDU was received shortly
before the timer code was run, it would decrement less than a full
second after the PDU arrived. Then two delayed PDUs would cause two
additional decrements, causing it to reach zero less than three seconds
after the most-recent on-time PDU.

The solution is to note the time a PDU arrives, and only decrement if at
least a full second has elapsed since then.

Reported by: Greg Foster <gfoster@panasas.com>
Reviewed by: gallatin
Tested by: Greg Foster <gfoster@panasas.com>
MFC after: 3 days
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D35070


Greg Foster <gfoster@panasas.com>Authored on Apr 26 2022, 6:38 AM
rpokalaCommitted on Apr 27 2022, 7:41 PM
Differential Revision
D35070: LACP w/ short timeout erroneously declares link-flapping
rGfa160738bd18: firewire: Initialize firewire_devclass in fw_modevent.