Paths

Table of Contentst

lagg: Handle a port count of zero
ClosedPublic
Actions

Authored by markj on May 11 2026, 2:07 PM.

Details

Reviewers

zlei
adrian
pouria

Group Reviewers

network

Commits

rGc511c9eee2a1: lagg: Handle a port count of zero
rG49d90d9ddfc1: lagg: Handle a port count of zero

Summary

The sc_count check in lagg_transmit_ethernet() and
lagg_transmit_infiniband() is racy, as the lagg protocol handlers are
only synchronized by net_epoch. Handle a count of 0 in each protocol
handler where it's needed, namely in the RR and LB handlers.

PR: 289017

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

markj created this revision.May 11 2026, 2:07 PM

Herald added subscribers: glebius, melifaro, ae, imp. · View Herald TranscriptMay 11 2026, 2:07 PM

markj requested review of this revision.May 11 2026, 2:07 PM

Harbormaster completed remote builds in B72979: Diff 177582.May 11 2026, 2:07 PM

Actually there are other sc members ( and other places sc->sc_count invoved ) not synchronized properly, I think on architectures with weak memory order they are easier to spot. Found that while I was working on PR 289017 .

I planned to ask what is the best approach for drivers to handle those synchronizations on the mailing list, but at that time the 14.4-RELEASE was about to be released, and I thought that is a generic question, so I decide to postpone the asking.

In D56942#1304346, @zlei wrote:

Actually there are other sc members ( and other places sc->sc_count invoved ) not synchronized properly, I think on architectures with weak memory order they are easier to spot. Found that while I was working on PR 289017 .

Could you be more specific? I'm sure there are other problems, but this one is easy to trigger in practice.

I planned to ask what is the best approach for drivers to handle those synchronizations on the mailing list, but at that time the 14.4-RELEASE was about to be released, and I thought that is a generic question, so I decide to postpone the asking.

Here, it seems straightfoward to me that at a minimum we need to verify that the count isn't zero. A relaxed atomic_load_int()+check is sufficient.

sys/net/if_lagg.c
2430	We also need to make sure that `lp` isn't NULL.

Avoid a related potential null pointer dereference in lagg_rr_start().

Harbormaster completed remote builds in B72987: Diff 177592.May 11 2026, 3:55 PM

emaste added a subscriber: emaste.May 12 2026, 6:42 PM

In D56942#1304350, @markj wrote:

In D56942#1304346, @zlei wrote:

Actually there are other sc members ( and other places sc->sc_count invoved ) not synchronized properly, I think on architectures with weak memory order they are easier to spot. Found that while I was working on PR 289017 .

Could you be more specific? I'm sure there are other problems, but this one is easy to trigger in practice.

I think the design of switching protocols is not right.

Different protocols has different requirements, the whole idea to keep pushing traffic / receiving traffic to / from the lagg interface while switching protocols looks wrong to me. That's why I suggested stop the interface, wait a while, switch the protocol, bring up the interface as a workaround in the mail.

There're multiple possible solutions for that. A straight forward one is introducing a protocol_ready flag to sc. But that looks a bit overkill, as user rarely switch protocol. So I think maybe we can re-use IFF_DRV_RUNNING. On switching the protocol, stop the interface first, that is

ifp->if_drv_flags &= ~IFF_DRV_RUNNING;
lagg_proto_stop();
/* switch protocol */
...
lagg_proto_init();
ifp->if_drv_flags |= IFF_DRV_RUNNING;

Well, that still need proper synchronization of if_drv_flags. The fast path lacks that. ( Actually a lot of drivers rely if_drv_flags, that is not right. CC @glebius )

So I think, if we can extends NET_EPOCH_WAIT / NET_EPOCH_ENTER a bit, so that they establish a release / require semantic. Lots of drivers will benifet that I think. That is what I planned to ask on the mailing list.

I planned to ask what is the best approach for drivers to handle those synchronizations on the mailing list, but at that time the 14.4-RELEASE was about to be released, and I thought that is a generic question, so I decide to postpone the asking.

Here, it seems straightfoward to me that at a minimum we need to verify that the count isn't zero. A relaxed atomic_load_int()+check is sufficient.

We're racing with releasing of 15.1, I think maybe we can land a simple solid fix firstly. I'm preparing it now.

In D56942#1305280, @zlei wrote:

In D56942#1304350, @markj wrote:

In D56942#1304346, @zlei wrote:

Actually there are other sc members ( and other places sc->sc_count invoved ) not synchronized properly, I think on architectures with weak memory order they are easier to spot. Found that while I was working on PR 289017 .

Could you be more specific? I'm sure there are other problems, but this one is easy to trigger in practice.

I think the design of switching protocols is not right.

Different protocols has different requirements, the whole idea to keep pushing traffic / receiving traffic to / from the lagg interface while switching protocols looks wrong to me. That's why I suggested stop the interface, wait a while, switch the protocol, bring up the interface as a workaround in the mail.

How long do you wait?

There're multiple possible solutions for that. A straight forward one is introducing a protocol_ready flag to sc. But that looks a bit overkill, as user rarely switch protocol.

if_lagg already sets sc->sc_proto = LAGG_PROTO_NONE for the window where the current protocol is detaching and we're attaching the new protocol. This is racy of course, as you say, but I don't see why any new flag is needed.

So I think maybe we can re-use IFF_DRV_RUNNING. On switching the protocol, stop the interface first, that is
ifp->if_drv_flags &= ~IFF_DRV_RUNNING;
lagg_proto_stop();
/* switch protocol */
...
lagg_proto_init();
ifp->if_drv_flags |= IFF_DRV_RUNNING;
Well, that still need proper synchronization of if_drv_flags. The fast path lacks that. ( Actually a lot of drivers rely if_drv_flags, that is not right. CC @glebius )

So I think, if we can extends NET_EPOCH_WAIT / NET_EPOCH_ENTER a bit, so that they establish a release / require semantic. Lots of drivers will benifet that I think. That is what I planned to ask on the mailing list.

I do not see how protocol switching can be fixed without updating protocol pr_attach and pr_detach implementations to be epoch-aware. This means, at minimum:

sc_proto must be loaded/updated using acquire/release semantics.
pr_detach implementations must use deferred free(), i.e., NET_EPOCH_CALL().

I planned to ask what is the best approach for drivers to handle those synchronizations on the mailing list, but at that time the 14.4-RELEASE was about to be released, and I thought that is a generic question, so I decide to postpone the asking.

Here, it seems straightfoward to me that at a minimum we need to verify that the count isn't zero. A relaxed atomic_load_int()+check is sufficient.

We're racing with releasing of 15.1, I think maybe we can land a simple solid fix firstly. I'm preparing it now.

This patch is intended to exactly a "simple solid fix."

zlei added inline comments.May 15 2026, 9:09 PM

sys/net/if_lagg.c
882	Do we need a barrier here ?

markj added inline comments.May 17 2026, 5:20 PM

sys/net/if_lagg.c
882	CK_SLIST_INSERT_AFTER and CK_SLIST_INSERT_HEAD issue a store-store fence, so the data path should not be able to observe a partially constructed `lp` even today. In any case, fixing such bugs is not really the goal of this particular patch.
1843	We should check here too.

Update lookup_snd_tag_port() too.

Harbormaster completed remote builds in B73174: Diff 177988.May 17 2026, 10:53 PM

zlei added inline comments.May 18 2026, 7:41 PM

sys/net/if_lagg.c

882

For example the implementation of CK_SLIST_INSERT_HEAD on aarch64,

#define CK_SLIST_INSERT_HEAD(head, elm, field) do {                             \
        (elm)->field.csle_next = (head)->cslh_first;                            \
        ck_pr_fence_store();                                                    \
        ck_pr_store_ptr(&(head)->cslh_first, elm);                              \
} while (0)

and ck_pr_store_ptr,

#define CK_PR_STORE_SAFE(DST, VAL, TYPE)                        \
    ck_pr_md_store_##TYPE(                                      \
        ((void)sizeof(*(DST) = (VAL)), (DST)),                  \
        (VAL))

#define ck_pr_store_ptr(DST, VAL) CK_PR_STORE_SAFE((DST), (VAL), ptr)

#define CK_PR_STORE_64(S, M, T, I)                              \
        CK_CC_INLINE static void                                \
        ck_pr_md_store_##S(M *target, T v)                      \
        {                                                       \
                __asm__ __volatile__(I " %2, [%1]"              \
                                        : "=m" (*(T *)target)   \
                                        : "r" (target),         \
                                          "r" (v)               \
                                        : "memory");            \
                return;                                         \
        }

CK_PR_STORE_64(ptr, void, const void *, "str")
...

If I read the code correctly, the ck_pr_store_ptr only ensures compiler order, but there's no ( implicit or explicit ) fence introduced by it, then it is possible to get runtime disorder of following write to sc->sc_count on SMP ?

So that another thread see an increased sc->sc_count++ but the list sc->sc_ports does not contain the lp ?

Please correct me if wrong. I'm not expert on ARM, really.

markj added inline comments.May 18 2026, 7:56 PM

sys/net/if_lagg.c
882	The memory ordering is provided by ck_pr_fence_store(). But in any case it is indeed necessary to ensure that data path code handles the case where `sc->sc_count` and the port list are not in agreement. That's why I added the NULL check to the loop in lagg_rr_start().

In D56942#1305647, @markj wrote:

I do not see how protocol switching can be fixed without updating protocol pr_attach and pr_detach implementations to be epoch-aware. This means, at minimum:

sc_proto must be loaded/updated using acquire/release semantics.

pr_detach implementations must use deferred free(), i.e., NET_EPOCH_CALL().

Sorry for unrelated comment, but I have re-read this several times and it blew my mind until I realized you are not talking about protosw(9). I'd suggest to rename the lagg protocol method names to something different to avoid this confusion.

glebius mentioned this in D57023: lagg: Fix SIOCSLAGG ioctl.May 20 2026, 6:32 PM

Ping? I'd like to commit this if there are no objections. The bug it fixes is real, and this patch or something like it is required no matter how other if_lagg lifecycle issues are handled.

adrian accepted this revision.May 29 2026, 7:48 PM

This revision is now accepted and ready to land.May 29 2026, 7:48 PM

pouria accepted this revision.May 29 2026, 9:40 PM

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

In D56942#1314685, @zlei wrote:

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

A memory barrier isn't sufficient. For sc_count and length(sc_ports) to be consistent with each other, you need a mutex.

In D56942#1314704, @markj wrote:

In D56942#1314685, @zlei wrote:

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

A memory barrier isn't sufficient. For sc_count and length(sc_ports) to be consistent with each other, you need a mutex.

The writer side is lagg_port_create() and lagg_port_destroy(). They are serialized. The reader side always checks sc_count before accessing sc_ports, so this order should be sufficient for the writer side,

diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 0333162da0d4..226565633b3a 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -879,7 +879,7 @@ lagg_port_create(struct lagg_softc *sc, struct ifnet *ifp)
                CK_SLIST_INSERT_AFTER(tlp, lp, lp_entries);
        else
                CK_SLIST_INSERT_HEAD(&sc->sc_ports, lp, lp_entries);
-       sc->sc_count++;
+       atomic_add_rel_int(&sc->sc_count, 1);
 
        lagg_setmulti(lp);
 
@@ -962,8 +962,8 @@ lagg_port_destroy(struct lagg_port *lp, int rundelport)
        }
 
        /* Finally, remove the port from the lagg */
+       atomic_add_rel_int(&sc->sc_count, -1);
        CK_SLIST_REMOVE(&sc->sc_ports, lp, lagg_port, lp_entries);
-       sc->sc_count--;
 
        /* Update the primary interface */
        if (lp == sc->sc_primary) {

The reader side use atomic_load_acq_int() to get sc_count, then iterate sc_ports.

In D56942#1314723, @zlei wrote:
In D56942#1314704, @markj wrote:

In D56942#1314685, @zlei wrote:

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

A memory barrier isn't sufficient. For sc_count and length(sc_ports) to be consistent with each other, you need a mutex.

The writer side is lagg_port_create() and lagg_port_destroy(). They are serialized. The reader side always checks sc_count before accessing sc_ports, so this order should be sufficient for the writer side,
diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 0333162da0d4..226565633b3a 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -879,7 +879,7 @@ lagg_port_create(struct lagg_softc *sc, struct ifnet *ifp)
                CK_SLIST_INSERT_AFTER(tlp, lp, lp_entries);
        else
                CK_SLIST_INSERT_HEAD(&sc->sc_ports, lp, lp_entries);
-       sc->sc_count++;
+       atomic_add_rel_int(&sc->sc_count, 1);
 
        lagg_setmulti(lp);
 
@@ -962,8 +962,8 @@ lagg_port_destroy(struct lagg_port *lp, int rundelport)
        }
 
        /* Finally, remove the port from the lagg */
+       atomic_add_rel_int(&sc->sc_count, -1);
        CK_SLIST_REMOVE(&sc->sc_ports, lp, lagg_port, lp_entries);
-       sc->sc_count--;
 
        /* Update the primary interface */
        if (lp == sc->sc_primary) {
The reader side use atomic_load_acq_int() to get sc_count, then iterate sc_ports.

This is still not sufficient. The barrier just enforces an ordering of memory accesses.

Suppose a reader loads sc_count into a register, and then an interrupt arrives and the reader thread is preempted for a long time. Meanwhile, writer threads are updating the port list. When the reader resumes execution, the list might be completely different.

In D56942#1314738, @markj wrote:
In D56942#1314723, @zlei wrote:
In D56942#1314704, @markj wrote:

In D56942#1314685, @zlei wrote:

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

A memory barrier isn't sufficient. For sc_count and length(sc_ports) to be consistent with each other, you need a mutex.

The writer side is lagg_port_create() and lagg_port_destroy(). They are serialized. The reader side always checks sc_count before accessing sc_ports, so this order should be sufficient for the writer side,
diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 0333162da0d4..226565633b3a 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -879,7 +879,7 @@ lagg_port_create(struct lagg_softc *sc, struct ifnet *ifp)
                CK_SLIST_INSERT_AFTER(tlp, lp, lp_entries);
        else
                CK_SLIST_INSERT_HEAD(&sc->sc_ports, lp, lp_entries);
-       sc->sc_count++;
+       atomic_add_rel_int(&sc->sc_count, 1);
 
        lagg_setmulti(lp);
 
@@ -962,8 +962,8 @@ lagg_port_destroy(struct lagg_port *lp, int rundelport)
        }
 
        /* Finally, remove the port from the lagg */
+       atomic_add_rel_int(&sc->sc_count, -1);
        CK_SLIST_REMOVE(&sc->sc_ports, lp, lagg_port, lp_entries);
-       sc->sc_count--;
 
        /* Update the primary interface */
        if (lp == sc->sc_primary) {
The reader side use atomic_load_acq_int() to get sc_count, then iterate sc_ports.
This is still not sufficient. The barrier just enforces an ordering of memory accesses.

Suppose a reader loads sc_count into a register, and then an interrupt arrives and the reader thread is preempted for a long time. Meanwhile, writer threads are updating the port list. When the reader resumes execution, the list might be completely different.

Ahh, I see the problem.

I think lagg(4) deserves a better design. We want a mean to get consistent view of sc_count and sc_ports. That is a long work. Just go ahead and commit this firstly.

Closed by commit rG49d90d9ddfc1: lagg: Handle a port count of zero (authored by markj). · Explain WhyJun 1 2026, 6:30 PM

This revision was automatically updated to reflect the committed changes.

markj added a commit: rG49d90d9ddfc1: lagg: Handle a port count of zero.

markj added a commit: rGc511c9eee2a1: lagg: Handle a port count of zero.Jun 15 2026, 6:02 PM