Page MenuHomeFreeBSD

infiniband: Widen NET_EPOCH coverage
ClosedPublic

Authored by zlei on Mar 26 2023, 11:33 AM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Jan 8, 3:59 PM
Unknown Object (File)
Dec 16 2024, 7:08 PM
Unknown Object (File)
Nov 14 2024, 5:55 PM
Unknown Object (File)
Nov 13 2024, 8:23 PM
Unknown Object (File)
Nov 12 2024, 3:14 PM
Unknown Object (File)
Nov 6 2024, 1:20 PM
Unknown Object (File)
Oct 6 2024, 1:52 AM
Unknown Object (File)
Oct 3 2024, 6:22 AM

Details

Summary

From static code analysis, some device drivers (cxgbe, mthca, mlx4 and qlnx) do not enter net epoch before lagg_input_infiniband(). If IPoIB interface is a member of lagg(4), after return from lagg_input_infiniban() the receiving interface of mbuf is set to lagg(4) interface. When concurrently destroying the interface, there's a small window that the interface get destroyed and become invalid before re-enter net epoch, thus leading use-after-free.

MFC after: 1 week

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

zlei requested review of this revision.Mar 26 2023, 11:33 AM
static void
ether_input(struct ifnet *ifp, struct mbuf *m)
{
        struct epoch_tracker et;
        struct mbuf *mn;
        bool needs_epoch;

        needs_epoch = !(ifp->if_flags & IFF_KNOWSEPOCH);

This code should basically be similar to if_ethersubr.c, but I think IFF_KNOWSEPOCH is just for ethernet ? @jhb Should IFF_KNOWSEPOCH be marked as for ethernet protocols only? @glebius

--HPS

Changes look good basically!

This revision is now accepted and ready to land.Mar 26 2023, 5:52 PM

This code should basically be similar to if_ethersubr.c, but I think IFF_KNOWSEPOCH is just for ethernet ? @jhb Should IFF_KNOWSEPOCH be marked as for ethernet protocols only? @glebius

--HPS

I dont think IFF_KNOWSEPOCH should be marked as for ethernet protocols only, but for infiniband, it seems there's only one consumer IPoIB which is not marked with IFF_KNOWSEPOCH .

Yes, then the EPOCH enter in infiniband_input() should be conditional too. Could you do some code tracing, see what functions are calling into the functions you've patched, and why EPOCH is not entered beforehand, when IFF KNOWS EPOCH is set!

Is this something you've observed on a particular hardware driver?

but for infiniband, it seems there's only one consumer IPoIB which is not marked with IFF_KNOWSEPOCH .

Sorry that's not correct. lagg(4) also consumes infiniband.

Yes, then the EPOCH enter in infiniband_input() should be conditional too.

Reasonable, at least for lagg(4) ;) I think lagg(4) can be safely marked with IFF_KNOWSEPOCH.

Could you do some code tracing, see what functions are calling into the functions you've patched, and why EPOCH is not entered beforehand, when IFF KNOWS EPOCH is set!

Search entry point on current/14 or stable/13

# egrep -r 'infiniband_ifattach` sys
# ignore the definition
sys/net/if_lagg.c:		infiniband_ifattach(ifp, eaddr, sc->sc_bcast_addr);
sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c:	infiniband_ifattach(priv->dev, NULL, priv->broadcastaddr);

A reverse call tree (should be graph but hard to write with plain text)

infiniband_input(dev, mb)
    if_input(dev, mb)
        ipoib_cm_handle_rx_wc()
        ipoib_ib_handle_rx_wc()
            ipoib_poll()
                ipoib_ib_completion()
                    ib_create_cq(priv->ca, ipoib_ib_completion, NULL, priv, &cq_attr) ## device->create_cq(cq, cq_attr, NULL);
                        device->create_cq(cq, cq_attr, NULL);
                        c4iw_create_cq
                        qlnxr_create_cq
                        irdma_create_cq
            ipoib_drain_cq()
                ipoib_cm_dev_stop()
                ipoib_ib_dev_stop()
                    __ipoib_ib_dev_flush()
                        ipoib_ib_dev_flush_light()
                            INIT_WORK(&priv->flush_light,   ipoib_ib_dev_flush_light);
                        ipoib_ib_dev_flush_normal()
                            INIT_WORK(&priv->flush_normal,   ipoib_ib_dev_flush_normal);
                        ipoib_ib_dev_flush_heavy()
                            INIT_WORK(&priv->flush_heavy,   ipoib_ib_dev_flush_heavy);

Then check NET_EPOCH_ENTER around the receive path

if_input(dev, mb)
        ipoib_cm_handle_rx_wc()
        ipoib_ib_handle_rx_wc()
            ipoib_poll()
                ipoib_ib_completion()

Also check possible tasks with flag TASK_NETWORK (so that they will run under net epoch)

# egrep -r 'NET_TASK_INIT' sys/ofed sys/dev/cxgbe sys/dev/qlnx sys/dev/irdma
# egrep -r 'NET_GROUPTASK_INIT' sys/ofed sys/dev/cxgbe sys/dev/qlnx sys/dev/irdma

Is this something you've observed on a particular hardware driver?

No, I do not have IB hardware. I found this when doing https://reviews.freebsd.org/D39267 .

Yes, you are using the right tools for this task, tracing functions and such. That's good to see.

When this stuff is upstreamed to 14-current / main, send me an e-mail and/or CC some FreeBSD network mailing list, and I'll test it on a lab machine using mlx5ib compatible network cards, having WITNESS enabled.

I'll let it up to you to decide if IFF_KNOWSEPOCH should be respected on the infiniband side (IPoIB typically).

Recursive NET_EPOCH_ENTER() should be fine too.

IPoIB is not that critical in terms of performance. From what I know it is mostly used for maintainance.

--HPS

Yes, you are using the right tools for this task, tracing functions and such. That's good to see.

When this stuff is upstreamed to 14-current / main, send me an e-mail and/or CC some FreeBSD network mailing list, and I'll test it on a lab machine using mlx5ib compatible network cards, having WITNESS enabled.

Well I'm confident about the analysis, but to be sure, it is best to test first.
You can add an extra assert to verify IPoIB is actually not protected by net epoch.

static void
infiniband_input(struct ifnet *ifp, struct mbuf *m)
{
	struct infiniband_header *ibh;
	struct epoch_tracker et;
	int isr;

	/* XXX for verify only */
	NET_EPOCH_ASSERT();
	CURVNET_SET_QUIET(ifp->if_vnet);
...
}

@hselasky
Could you please verify it ?

I'll let it up to you to decide if IFF_KNOWSEPOCH should be respected on the infiniband side (IPoIB typically).

Recursive NET_EPOCH_ENTER() should be fine too.

My head is messed up with those if_input functions. I'll check that later ( also may be better to commit separately ).

IPoIB is not that critical in terms of performance. From what I know it is mostly used for maintainance.

This fix is focusing on correctness but not the performance. Anyway it have little side effect about the performance (expected to reduce a little cpu cycles)

--HPS

Could you please verify it ?

Yes! Can you send me a complete list of patches on top of 14-current / main as of today? And I'll give it a spin!

Sorry the device->create_cq(cq, cq_attr, NULL); analysis is not complete.

# egrep -r 'create_cq\s*=\s*\S+' sys/dev
sys/dev/cxgbe/iw_cxgbe/provider.c:	ibdev->create_cq = c4iw_create_cq;
sys/dev/mlx4/mlx4_ib/mlx4_ib_main.c:	ibdev->ib_dev.create_cq		= mlx4_ib_create_cq;
sys/dev/mlx5/mlx5_ib/mlx5_ib_main.c:	dev->ib_dev.create_cq		= mlx5_ib_create_cq;
sys/dev/qlnx/qlnxr/qlnxr_os.c:        ibdev->create_cq = qlnxr_create_cq;
sys/dev/mthca/mthca_provider.c:	dev->ib_dev.create_cq            = mthca_create_cq;
sys/dev/irdma/irdma_verbs.c:	dev_ops->create_cq = irdma_create_cq;

So we want to fully check those:

c4iw_create_cq()
mlx4_ib_create_cq()
qlnxr_create_cq()
mthca_create_cq()
irdma_create_cq()

As for mlx5_ib_create_cq(), mlx5 has been tested and verified by @hselasky. And the source code shows clearly that mlx5_cq_completion() has net epoch around ipoib_ib_completion().

Search all possible handler by

# egrep -r 'comp_handler\)?\s*\(' sys/dev

And generate a more complete reverse call tree:

infiniband_input(dev, mb)
    if_input(dev, mb)
        ipoib_cm_handle_rx_wc()
        ipoib_ib_handle_rx_wc()
            ipoib_poll()
                ipoib_ib_completion()
                    // idrma
                    irdma_comp_handler()
                        irdma_generate_flush_completions()
                            irdma_flush_worker()         // INIT_DELAYED_WORK
                    // mthca
                    mthca_cq_completion()
                        mthca_eq_int()
                            mthca_tavor_interrupt()       // registered as interrupt
                            mthca_tavor_msi_x_interrupt() // registered as interrupt
                            mthca_arbel_interrupt()       // registered as interrupt
                            mthca_arbel_msi_x_interrupt() // registered as interrupt
                    // qlnx
                    qlnxr_ll2_complete_tx_packet() // registered as interrupt
                    qlnxr_ll2_complete_rx_packet() // registered as interrupt
                    qlnxr_intr()                   // registered as interrupt
                    // mlx5
                    notify_soft_wc_handler()
                    mlx5_ib_cq_comp()
                        mlx5_cq_completion()  // has NET_EPOCH_ENTER / NET_EPOCH_EXIT
                            mlx5_eq_int()
                                mlx5_msix_handler() // registered as interrupt
                                mlx5_poll_interrupts()
                                    mlx5_en_debugnet_poll()
                    // mlx4
                    mlx4_ib_cq_comp()
                        mlx4_cq_completion()
                            mlx4_eq_int()
                                mlx4_interrupt()
                                    mlx4_poll_interrupts() // registered as interrupt
                                        mlx4_en_debugnet_poll()
                                mlx4_msi_x_interrupt()     // registered as interrupt
                    // cxgbe
                    post_qp_event()
                        c4iw_ev_dispatch()
                            process_err_cqes()
                                process_req() // INIT_WORK
                    c4iw_ev_handler()
                        service_iq()
                            t4_intr_evt()     // registered as interrupt
                                t4_intr_all() // registered as interrupt
                        service_iq_fl()
                            t4_intr()         // registered as interrupt
                    complete_sq_drain_wr()
                        c4iw_post_send()      // ibdev->post_send = c4iw_post_send
                    complete_rq_drain_wr()
                        c4iw_post_receive()   // ibdev->post_recv = c4iw_post_receive
                    __flush_qp()
                    flush_qp()
                        c4iw_modify_qp()
                            c4iw_ib_modify_qp() // ibdev->modify_qp = c4iw_ib_modify_qp
                            c4iw_destroy_qp()   // ibdev->destroy_qp = c4iw_destroy_qp 

            //
            ipoib_drain_cq()
                ipoib_cm_dev_stop()
                ipoib_ib_dev_stop()
                    __ipoib_ib_dev_flush()
                        ipoib_ib_dev_flush_light()
                            INIT_WORK(&priv->flush_light,   ipoib_ib_dev_flush_light);
                        ipoib_ib_dev_flush_normal()
                            INIT_WORK(&priv->flush_normal,   ipoib_ib_dev_flush_normal);
                        ipoib_ib_dev_flush_heavy()
                            INIT_WORK(&priv->flush_heavy,   ipoib_ib_dev_flush_heavy);

It seems only mlx5 has net epoch around ipoib_ib_completion(). @hselasky You are lucky ;)

Other devices mthca qlnx mlx4 and cxgbe do not (from the analysis source code).

I think it is safe to commit this. Even without testing on those devices, cxgbe, mthca, mlx4 and qlnx .
At worst case the analysis that the potential use-after-free is wrong but nothing bad will happen.

I'll commit this if there's no objection from network .

Thanks @hselasky for testing with mlx5 devices.

This revision was automatically updated to reflect the committed changes.

The static analysis missed 511d1afb6bfe (Enter the network epoch for interrupt handlers of INTR_TYPE_NET), but the change is still good.