Page MenuHomeFreeBSD

lagg: Handle a port count of zero
ClosedPublic

Authored by markj on May 11 2026, 2:07 PM.
Tags
None
Referenced Files
F159594872: D56942.id177582.diff
Tue, Jun 16, 1:45 AM
Unknown Object (File)
Wed, Jun 10, 3:18 PM
Unknown Object (File)
Sun, Jun 7, 1:50 PM
Unknown Object (File)
Thu, Jun 4, 4:12 AM
Unknown Object (File)
Tue, Jun 2, 2:20 AM
Unknown Object (File)
Tue, Jun 2, 2:16 AM
Unknown Object (File)
Sun, May 31, 5:39 AM
Unknown Object (File)
Sun, May 31, 5:36 AM

Details

Summary

The sc_count check in lagg_transmit_ethernet() and
lagg_transmit_infiniband() is racy, as the lagg protocol handlers are
only synchronized by net_epoch. Handle a count of 0 in each protocol
handler where it's needed, namely in the RR and LB handlers.

PR: 289017

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 73174
Build 70057: arc lint + arc unit

Event Timeline

markj requested review of this revision.May 11 2026, 2:07 PM

Actually there are other sc members ( and other places sc->sc_count invoved ) not synchronized properly, I think on architectures with weak memory order they are easier to spot. Found that while I was working on PR 289017 .

I planned to ask what is the best approach for drivers to handle those synchronizations on the mailing list, but at that time the 14.4-RELEASE was about to be released, and I thought that is a generic question, so I decide to postpone the asking.

Actually there are other sc members ( and other places sc->sc_count invoved ) not synchronized properly, I think on architectures with weak memory order they are easier to spot. Found that while I was working on PR 289017 .

Could you be more specific? I'm sure there are other problems, but this one is easy to trigger in practice.

I planned to ask what is the best approach for drivers to handle those synchronizations on the mailing list, but at that time the 14.4-RELEASE was about to be released, and I thought that is a generic question, so I decide to postpone the asking.

Here, it seems straightfoward to me that at a minimum we need to verify that the count isn't zero. A relaxed atomic_load_int()+check is sufficient.

sys/net/if_lagg.c
2430

We also need to make sure that lp isn't NULL.

Avoid a related potential null pointer dereference in lagg_rr_start().

Actually there are other sc members ( and other places sc->sc_count invoved ) not synchronized properly, I think on architectures with weak memory order they are easier to spot. Found that while I was working on PR 289017 .

Could you be more specific? I'm sure there are other problems, but this one is easy to trigger in practice.

I think the design of switching protocols is not right.

Different protocols has different requirements, the whole idea to keep pushing traffic / receiving traffic to / from the lagg interface while switching protocols looks wrong to me. That's why I suggested stop the interface, wait a while, switch the protocol, bring up the interface as a workaround in the mail.

There're multiple possible solutions for that. A straight forward one is introducing a protocol_ready flag to sc. But that looks a bit overkill, as user rarely switch protocol. So I think maybe we can re-use IFF_DRV_RUNNING. On switching the protocol, stop the interface first, that is

ifp->if_drv_flags &= ~IFF_DRV_RUNNING;
lagg_proto_stop();
/* switch protocol */
...
lagg_proto_init();
ifp->if_drv_flags |= IFF_DRV_RUNNING;

Well, that still need proper synchronization of if_drv_flags. The fast path lacks that. ( Actually a lot of drivers rely if_drv_flags, that is not right. CC @glebius )

So I think, if we can extends NET_EPOCH_WAIT / NET_EPOCH_ENTER a bit, so that they establish a release / require semantic. Lots of drivers will benifet that I think. That is what I planned to ask on the mailing list.

I planned to ask what is the best approach for drivers to handle those synchronizations on the mailing list, but at that time the 14.4-RELEASE was about to be released, and I thought that is a generic question, so I decide to postpone the asking.

Here, it seems straightfoward to me that at a minimum we need to verify that the count isn't zero. A relaxed atomic_load_int()+check is sufficient.

We're racing with releasing of 15.1, I think maybe we can land a simple solid fix firstly. I'm preparing it now.

Actually there are other sc members ( and other places sc->sc_count invoved ) not synchronized properly, I think on architectures with weak memory order they are easier to spot. Found that while I was working on PR 289017 .

Could you be more specific? I'm sure there are other problems, but this one is easy to trigger in practice.

I think the design of switching protocols is not right.

Different protocols has different requirements, the whole idea to keep pushing traffic / receiving traffic to / from the lagg interface while switching protocols looks wrong to me. That's why I suggested stop the interface, wait a while, switch the protocol, bring up the interface as a workaround in the mail.

How long do you wait?

There're multiple possible solutions for that. A straight forward one is introducing a protocol_ready flag to sc. But that looks a bit overkill, as user rarely switch protocol.

if_lagg already sets sc->sc_proto = LAGG_PROTO_NONE for the window where the current protocol is detaching and we're attaching the new protocol. This is racy of course, as you say, but I don't see why any new flag is needed.

So I think maybe we can re-use IFF_DRV_RUNNING. On switching the protocol, stop the interface first, that is

ifp->if_drv_flags &= ~IFF_DRV_RUNNING;
lagg_proto_stop();
/* switch protocol */
...
lagg_proto_init();
ifp->if_drv_flags |= IFF_DRV_RUNNING;

Well, that still need proper synchronization of if_drv_flags. The fast path lacks that. ( Actually a lot of drivers rely if_drv_flags, that is not right. CC @glebius )

So I think, if we can extends NET_EPOCH_WAIT / NET_EPOCH_ENTER a bit, so that they establish a release / require semantic. Lots of drivers will benifet that I think. That is what I planned to ask on the mailing list.

I do not see how protocol switching can be fixed without updating protocol pr_attach and pr_detach implementations to be epoch-aware. This means, at minimum:

  1. sc_proto must be loaded/updated using acquire/release semantics.
  2. pr_detach implementations must use deferred free(), i.e., NET_EPOCH_CALL().

I planned to ask what is the best approach for drivers to handle those synchronizations on the mailing list, but at that time the 14.4-RELEASE was about to be released, and I thought that is a generic question, so I decide to postpone the asking.

Here, it seems straightfoward to me that at a minimum we need to verify that the count isn't zero. A relaxed atomic_load_int()+check is sufficient.

We're racing with releasing of 15.1, I think maybe we can land a simple solid fix firstly. I'm preparing it now.

This patch is intended to exactly a "simple solid fix."

sys/net/if_lagg.c
882

Do we need a barrier here ?

sys/net/if_lagg.c
882

CK_SLIST_INSERT_AFTER and CK_SLIST_INSERT_HEAD issue a store-store fence, so the data path should not be able to observe a partially constructed lp even today. In any case, fixing such bugs is not really the goal of this particular patch.

1843

We should check here too.

Update lookup_snd_tag_port() too.

sys/net/if_lagg.c
882

For example the implementation of CK_SLIST_INSERT_HEAD on aarch64,

#define CK_SLIST_INSERT_HEAD(head, elm, field) do {                             \
        (elm)->field.csle_next = (head)->cslh_first;                            \
        ck_pr_fence_store();                                                    \
        ck_pr_store_ptr(&(head)->cslh_first, elm);                              \
} while (0)

and ck_pr_store_ptr,

#define CK_PR_STORE_SAFE(DST, VAL, TYPE)                        \
    ck_pr_md_store_##TYPE(                                      \
        ((void)sizeof(*(DST) = (VAL)), (DST)),                  \
        (VAL))

#define ck_pr_store_ptr(DST, VAL) CK_PR_STORE_SAFE((DST), (VAL), ptr)

#define CK_PR_STORE_64(S, M, T, I)                              \
        CK_CC_INLINE static void                                \
        ck_pr_md_store_##S(M *target, T v)                      \
        {                                                       \
                __asm__ __volatile__(I " %2, [%1]"              \
                                        : "=m" (*(T *)target)   \
                                        : "r" (target),         \
                                          "r" (v)               \
                                        : "memory");            \
                return;                                         \
        }

CK_PR_STORE_64(ptr, void, const void *, "str")
...

If I read the code correctly, the ck_pr_store_ptr only ensures compiler order, but there's no ( implicit or explicit ) fence introduced by it, then it is possible to get runtime disorder of following write to sc->sc_count on SMP ?

So that another thread see an increased sc->sc_count++ but the list sc->sc_ports does not contain the lp ?

Please correct me if wrong. I'm not expert on ARM, really.

sys/net/if_lagg.c
882

The memory ordering is provided by ck_pr_fence_store().

But in any case it is indeed necessary to ensure that data path code handles the case where sc->sc_count and the port list are not in agreement. That's why I added the NULL check to the loop in lagg_rr_start().

I do not see how protocol switching can be fixed without updating protocol pr_attach and pr_detach implementations to be epoch-aware. This means, at minimum:

  1. sc_proto must be loaded/updated using acquire/release semantics.
  2. pr_detach implementations must use deferred free(), i.e., NET_EPOCH_CALL().

Sorry for unrelated comment, but I have re-read this several times and it blew my mind until I realized you are not talking about protosw(9). I'd suggest to rename the lagg protocol method names to something different to avoid this confusion.

Ping? I'd like to commit this if there are no objections. The bug it fixes is real, and this patch or something like it is required no matter how other if_lagg lifecycle issues are handled.

This revision is now accepted and ready to land.Fri, May 29, 7:48 PM

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

A memory barrier isn't sufficient. For sc_count and length(sc_ports) to be consistent with each other, you need a mutex.

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

A memory barrier isn't sufficient. For sc_count and length(sc_ports) to be consistent with each other, you need a mutex.

The writer side is lagg_port_create() and lagg_port_destroy(). They are serialized. The reader side always checks sc_count before accessing sc_ports, so this order should be sufficient for the writer side,

diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 0333162da0d4..226565633b3a 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -879,7 +879,7 @@ lagg_port_create(struct lagg_softc *sc, struct ifnet *ifp)
                CK_SLIST_INSERT_AFTER(tlp, lp, lp_entries);
        else
                CK_SLIST_INSERT_HEAD(&sc->sc_ports, lp, lp_entries);
-       sc->sc_count++;
+       atomic_add_rel_int(&sc->sc_count, 1);
 
        lagg_setmulti(lp);
 
@@ -962,8 +962,8 @@ lagg_port_destroy(struct lagg_port *lp, int rundelport)
        }
 
        /* Finally, remove the port from the lagg */
+       atomic_add_rel_int(&sc->sc_count, -1);
        CK_SLIST_REMOVE(&sc->sc_ports, lp, lagg_port, lp_entries);
-       sc->sc_count--;
 
        /* Update the primary interface */
        if (lp == sc->sc_primary) {

The reader side use atomic_load_acq_int() to get sc_count, then iterate sc_ports.

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

A memory barrier isn't sufficient. For sc_count and length(sc_ports) to be consistent with each other, you need a mutex.

The writer side is lagg_port_create() and lagg_port_destroy(). They are serialized. The reader side always checks sc_count before accessing sc_ports, so this order should be sufficient for the writer side,

diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 0333162da0d4..226565633b3a 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -879,7 +879,7 @@ lagg_port_create(struct lagg_softc *sc, struct ifnet *ifp)
                CK_SLIST_INSERT_AFTER(tlp, lp, lp_entries);
        else
                CK_SLIST_INSERT_HEAD(&sc->sc_ports, lp, lp_entries);
-       sc->sc_count++;
+       atomic_add_rel_int(&sc->sc_count, 1);
 
        lagg_setmulti(lp);
 
@@ -962,8 +962,8 @@ lagg_port_destroy(struct lagg_port *lp, int rundelport)
        }
 
        /* Finally, remove the port from the lagg */
+       atomic_add_rel_int(&sc->sc_count, -1);
        CK_SLIST_REMOVE(&sc->sc_ports, lp, lagg_port, lp_entries);
-       sc->sc_count--;
 
        /* Update the primary interface */
        if (lp == sc->sc_primary) {

The reader side use atomic_load_acq_int() to get sc_count, then iterate sc_ports.

This is still not sufficient. The barrier just enforces an ordering of memory accesses.

Suppose a reader loads sc_count into a register, and then an interrupt arrives and the reader thread is preempted for a long time. Meanwhile, writer threads are updating the port list. When the reader resumes execution, the list might be completely different.

I'm OK with this revision, although I think an explicit memory barrier while increasing / decreasing sc->sc_count will make the code more clear than a NULL check in the reader side (lagg_rr_start).

A memory barrier isn't sufficient. For sc_count and length(sc_ports) to be consistent with each other, you need a mutex.

The writer side is lagg_port_create() and lagg_port_destroy(). They are serialized. The reader side always checks sc_count before accessing sc_ports, so this order should be sufficient for the writer side,

diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 0333162da0d4..226565633b3a 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -879,7 +879,7 @@ lagg_port_create(struct lagg_softc *sc, struct ifnet *ifp)
                CK_SLIST_INSERT_AFTER(tlp, lp, lp_entries);
        else
                CK_SLIST_INSERT_HEAD(&sc->sc_ports, lp, lp_entries);
-       sc->sc_count++;
+       atomic_add_rel_int(&sc->sc_count, 1);
 
        lagg_setmulti(lp);
 
@@ -962,8 +962,8 @@ lagg_port_destroy(struct lagg_port *lp, int rundelport)
        }
 
        /* Finally, remove the port from the lagg */
+       atomic_add_rel_int(&sc->sc_count, -1);
        CK_SLIST_REMOVE(&sc->sc_ports, lp, lagg_port, lp_entries);
-       sc->sc_count--;
 
        /* Update the primary interface */
        if (lp == sc->sc_primary) {

The reader side use atomic_load_acq_int() to get sc_count, then iterate sc_ports.

This is still not sufficient. The barrier just enforces an ordering of memory accesses.

Suppose a reader loads sc_count into a register, and then an interrupt arrives and the reader thread is preempted for a long time. Meanwhile, writer threads are updating the port list. When the reader resumes execution, the list might be completely different.

Ahh, I see the problem.

I think lagg(4) deserves a better design. We want a mean to get consistent view of sc_count and sc_ports. That is a long work. Just go ahead and commit this firstly.

This revision was automatically updated to reflect the committed changes.