Page MenuHomeFreeBSD

infiniband: Opt-in for net epoch
ClosedPublic

Authored by zlei on Apr 4 2023, 4:09 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jun 17, 11:45 PM
Unknown Object (File)
Mar 14 2024, 9:18 AM
Unknown Object (File)
Mar 14 2024, 9:18 AM
Unknown Object (File)
Mar 14 2024, 9:18 AM
Unknown Object (File)
Mar 14 2024, 9:17 AM
Unknown Object (File)
Mar 10 2024, 11:37 PM
Unknown Object (File)
Jan 3 2024, 3:22 PM
Unknown Object (File)
Jan 3 2024, 3:22 PM

Details

Summary

This is counterpart to e87c4940156c, which did the same for ethernet.

Reported by: hselasky
Tested by: hselasky
MFC after: 2 weeks

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

zlei requested review of this revision.Apr 4 2023, 4:09 PM

Just write:
Suggested by: hselasky@

This revision is now accepted and ready to land.Apr 4 2023, 4:23 PM

Just write:
Suggested by: hselasky@

I follow .git/hooks/prepare-commit-msg (from doc) which does not have meta Suggested by: .
Maybe the doc should be updated to reflect the reality ?

CC @carlavilla .

In D39405#897067, @zlei wrote:

Just write:
Suggested by: hselasky@

I follow .git/hooks/prepare-commit-msg (from doc) which does not have meta Suggested by: .
Maybe the doc should be updated to reflect the reality ?

CC @carlavilla .

Good question.
I'm gonna ask to Doceng@ and algo git@ team about that

kib added inline comments.
sys/net/if_infiniband.c
420

FreeBSD idiomatic way to write this would be

sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c
925

I do not understand this line. Should it be & instead of |?

sys/ofed/include/rdma/ib_verbs.h
248

Does the comment add any information that is not obvious from the cap name? And why should such trivial comment eat 4 lines of vertical space?

Just remove it.

sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_main.c
925

Yes, it should be & . I intended to check if ib_device has attribute IB_DEVICE_KNOWSEPOCH.
I must be drunk.

sys/ofed/include/rdma/ib_verbs.h
248

Reasonable.

This revision now requires review to proceed.Apr 5 2023, 2:12 AM
zlei marked 3 inline comments as done.Apr 5 2023, 2:13 AM

Did you considered how much efforts would be to convert all five or six ib drivers we have in tree, instead of adding this flag?

This revision is now accepted and ready to land.Apr 5 2023, 3:21 AM
In D39405#897353, @kib wrote:

Did you considered how much efforts would be to convert all five or six ib drivers we have in tree, instead of adding this flag?

Actually not too much.

The problem is I don't have infiniband devices and I'm not able to test (any of) them. The maintainer of qlnx driver does not response for quite a while and I hesitate to change it.

IIUC converting all ib drivers is the final goal (of widen net epoch), this (IFF_KNOWSEPOCH or IB_DEVICE_KNOWSEPOCH) is short term and will be eventually retired.
That also applies to ethernet.

So if you @kib encourage me to convert then I would like to do and prefer.

@hselasky What do you think ?

hselasky added a subscriber: np.

@zlei : It's ok if this is for mlx5ib currently. You would need to get @np into the loop to test it for chelsio.

Of course it is better to convert everything, but for now I think you can go ahead with the patch. The difference, in my opinion, is that we have several dozens of ethernet drivers, while only five or like ib drivers.

Did you considered how much efforts would be to convert all five or six ib drivers we have in tree, instead of adding this flag?

Actually not too much.

I think I was too optimistic.

Adding NET_EPOCH_ENTER() / NET_EPOCH_EXIT() around cq->comp() does not benefits from EPOCH(9) (reduce locks / refcounts in fast path).
@kib Did you mean that?

So which one should we do then?

  1. Keep current implementation. Drivers such as mlx5 is punished with extra overhead of NET_EPOCH (recursively enter net epoch).
  2. Step further, have this intermediate improvement for mlx5 committed into base, and eventually reverted when all ib drivers have been converted to employ EPOCH(9).
  3. Do not commit this but add NET_EPOCH_ENTER() / NET_EPOCH_EXIT() around cq->comp() for other ib drivers, and then remove NET_EPOCH from infiniband_input().

For the third one, ib drivers can be still improved to benefit EPOCH(9).

EPOCH() does not reduce locks contention just be fact that it is applied, so I am not sure what do you mean there. Entering epoch should fix bugs with accesses to freed resources protected by epoch guarantee.

Just commit the patch for now.

Found a possible flaw with this change while I was preparing to commit.

infiniband_input()
    if_input()
        ipoib_cm_handle_rx_wc()
        ipoib_ib_handle_rx_wc()
            ipoib_drain_cq()
                ipoib_cm_dev_stop()
                ipoib_ib_dev_stop()
                    __ipoib_ib_dev_flush()
                        ipoib_ib_dev_flush_light()  // INIT_WORK(&priv->flush_light, ipoib_ib_dev_flush_light);
                        ipoib_ib_dev_flush_normal() // INIT_WORK(&priv->flush_normal, ipoib_ib_dev_flush_normal);
                        ipoib_ib_dev_flush_heavy()  // INIT_WORK(&priv->flush_heavy, ipoib_ib_dev_flush_heavy);

If some ib driver is marked with IB_DEVICE_KNOWSEPOCH, on ipoib_ib_dev_flush_[light | normal | heavy]() the needs_epoch is false and net stack is not in net epoch, then the kernel will assert panic (with WITNESS), or there is a potential use-after-free (without WITNESS).

Maybe the right approach is moving the needs_epoch check to ipoib_cm_handle_rx_wc() and ipoib_ib_handle_rx_wc().

Maybe the right approach is moving the needs_epoch check to ipoib_cm_handle_rx_wc() and ipoib_ib_handle_rx_wc().

No. Just leave it like is for now. I think @glebius is the owner of that technique.

There is some magic cruft up in the IRQ handler too. The mlx5 core may process multiple events per completion, so it won't be per-packet. It's pretty clever that way.

This revision was automatically updated to reflect the committed changes.

Maybe the right approach is moving the needs_epoch check to ipoib_cm_handle_rx_wc() and ipoib_ib_handle_rx_wc().

No. Just leave it like is for now. I think @glebius is the owner of that technique.

There is some magic cruft up in the IRQ handler too. The mlx5 core may process multiple events per completion, so it won't be per-packet. It's pretty clever that way.

OK, committed.