
Remove use of 9k clusters from mlx4_en rx path
Needs ReviewPublic

Authored by rstone on Jul 11 2017, 10:59 PM.

Details

Reviewers
hselasky
Summary

FreeBSD revision r291699 introduced the use of busdma to mlx4,
and as a simplification removed the use of packet "fragments" in
the receive path. Instead, the driver attempts to allocate 9KB
mbuf clusters. However, under memory pressure allocating more
than 1 page's worth of contiguous physical memory can be extremely
CPU intensive, and in certain cases impossible. This means that
mlx4_en is prone to starving other threads of CPU time, and at
times network connectivity is lost entirely.

Fix this by restoring the use of packet fragments to mlx4_en.
This ensures that the driver never tries to allocate more than
1 page at a time, avoiding the pathological VM behaviour.
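As a rough userland illustration of the fix's core invariant (not driver code — `rx_frags_for_mtu` is a hypothetical name, and `PAGE_SIZE` is assumed to be the common 4 KB), the number of page-sized fragments needed per frame is just a ceiling division, so no single allocation ever exceeds one page:

```c
#include <stddef.h>

#define PAGE_SIZE 4096  /* typical amd64 page size; illustrative assumption */

/* Number of page-sized rx fragments needed to cover one MTU-sized frame;
 * the equivalent of howmany(mtu, PAGE_SIZE) in FreeBSD kernel terms. */
static size_t
rx_frags_for_mtu(size_t mtu)
{
	return ((mtu + PAGE_SIZE - 1) / PAGE_SIZE);
}
```

With a 9000-byte MTU this yields three 4 KB fragments, so the receive path never asks the VM for more than one page of contiguous physical memory.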

Also remove the allocation of a "spare" mbuf from the rx path.
Reconciling this feature with packet fragments was difficult, its
existence complicated the receive path, and it provides no benefit
outside of highly contrived scenarios.


Event Timeline

rstone created this revision. Jul 11 2017, 10:59 PM
mjoras added a subscriber: mjoras. Jul 11 2017, 11:27 PM
hselasky edited edge metadata. Jul 12 2017, 7:51 AM

Hi,

Don't pull the "old behaviour" into the driver. You should use a linked list of mbufs and load that instead of the mlx4_en_frag_info structure, now that we are using busdma!

There is a better way to implement this. Can Mellanox take over this issue?

--HPS

Hi,
Don't pull the "old behaviour" into the driver. You should use a linked list of mbufs and load that instead of the mlx4_en_frag_info structure, now that we are using busdma!
There is a better way to implement this. Can Mellanox take over this issue?
--HPS

Hm. The issue with this approach is that (if a 9K MTU is in use) you will always send a 9K mbuf chain up the stack, even if it only contains a 64-byte TCP ACK. This is less efficient memory-wise, and it means the driver will make many more calls into UMA to allocate mbufs and clusters (and UMA isn't very cheap).

Hi @rstone,

The other approach is not about sending 9K mbufs up the stack.

I think that this patch puts a penalty on the single fragment case and is a bit intrusive.

Using smaller-than-9K mbufs to receive 9K frames is counterproductive and is a special case which should be #if'ed. You could just as well use 2K mbufs and a 1500-byte MTU all the way!

To avoid fragmentation issues use 16K PAGE_SIZE.

I'll upload another patch later tonight so that you can see the other approach to this issue.

--HPS

kib added a subscriber: kib. Sep 14 2017, 4:14 PM
kib added a comment. May 15 2018, 11:06 PM

Is my reading of the patch correct that, for large MTUs, unused fragments and mbufs get reused in your patch without a free/new-alloc cycle?

sys/dev/mlx4/mlx4_en/mlx4_en_rx.c
595

Does this mean that we claim that the reception of the packet failed if we were unable to allocate the replacement mbuf or page for the used fragment? If yes, should we split that?

627

I would say that after such refactoring, the function is no longer useful. You can inline its call into process_rx_cq().

@kib: Is it possible to follow the approach of mlx5en(4)? When using the same methodology around 9k mbufs, the code is easier to understand.

I think @rstone's patch is simply bringing back the old code, sort of.

kib added a comment (edited). May 16 2018, 10:04 AM

@kib: Is it possible to follow the approach of mlx5en(4)? When using the same methodology around 9k mbufs, the code is easier to understand.
I think @rstone's patch is simply bringing back the old code, sort of.

Well, from my reading, your patch and this patch are quite similar, but this review provides two interesting optimizations.

  • Use page-size fragments if MTU >= PAGE_SIZE, already discussed on lists for mlx5.
  • Keep around mbufs and memory for unused fragments, if the received packet was shorter than the MTU.

If these changes are transplanted into your patch, I believe things will start to look almost identical. I want to confirm my understanding with Ryan.

@kib: Leaving the RX-mbuf dangling might require more busdma tags. I think in my version there is only one busdma tag, and the mbufs are always chained using m_nextpkt.

kib added a comment. May 16 2018, 10:20 AM

@kib: Leaving the RX-mbuf dangling might require more busdma tags. I think in my version there is only one busdma tag, and the mbufs are always chained using m_nextpkt.

DMA tags? What do you mean?

You need to have the memory for mbufs on the rings loaded into a busdma map anyway. The optimization there is to avoid the unload/free/alloc/load cycle when the buffer was kept unused (at least this is my current understanding).

My point is when you organize the mbufs in a linked list, you get away with one DMA map and tag. Else you will need one BUSDMA map for each mbuf.

This patch uses one DMA map per frag. Mine only needs one DMA map per series of mbuf frags.
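The map-count trade-off hselasky describes can be illustrated with a toy userland model (the `struct frag` type and function names below are hypothetical, standing in for mbufs chained via m_nextpkt — this is not driver or busdma code): mapping per fragment costs one DMA map for every buffer on the ring, while mapping a linked chain costs a single map load per chain.

```c
#include <stddef.h>

/* Toy stand-in for an mbuf linked via m_nextpkt; illustrative only. */
struct frag {
	struct frag *next;
};

/* Per-fragment mapping: one DMA map load for every fragment in the chain. */
static size_t
maps_per_frag(const struct frag *head)
{
	size_t n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return (n);
}

/* Per-chain mapping: a single map load covers the whole linked chain. */
static size_t
maps_per_chain(const struct frag *head)
{
	return (head != NULL ? 1 : 0);
}
```

For a 9K frame split across three page-sized fragments, the per-fragment scheme needs three maps where the per-chain scheme needs one, which is the saving hselasky is pointing at.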

In D11560#326054, @kib wrote:

Well, from my reading, your patch and this patch are quite similar, but this review provides two interesting optimizations.

  • Use page-size fragments if MTU >= PAGE_SIZE, already discussed on lists for mlx5.
  • Keep around mbufs and memory for unused fragments, if the received packet was shorter than the MTU.

If these changes are transplanted into your patch, I believe things will start to look almost identical. I want to confirm my understanding with Ryan.

Those are the two relevant optimizations in this code. However, I believe that the second isn't very important anymore. If I recall correctly, mlx4 now has an optimization to copy very small packets that fit within a single (clusterless) mbuf into a new mbuf and leave the allocated mbuf chain on the ring. In most network traffic, the overwhelming majority of packets are either MTU-sized or ACKs with no payload. Therefore, copying very small packets into an mbuf effectively subsumes the second optimization, so implementing it is probably not worthwhile.
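The copy-break idea rstone recalls can be sketched as follows (a minimal userland illustration under stated assumptions: `rx_copy_break` and `SMALL_COPY_THRESHOLD` are hypothetical names, and the threshold value is not the driver's): small frames are copied into a fresh small buffer so the large ring buffer stays in place and the ring slot needs no reallocation.

```c
#include <string.h>

#define SMALL_COPY_THRESHOLD 128  /* illustrative cutoff, not the driver's value */

/*
 * Copy-break sketch: if the received frame is small, copy it into a small
 * buffer and leave the large ring buffer on the ring; otherwise the caller
 * hands the ring buffer itself up the stack and must refill the slot.
 * Returns 1 if the copy path was taken, 0 otherwise.
 */
static int
rx_copy_break(const char *ring_buf, size_t len, char *small_buf,
    size_t small_len)
{
	if (len > SMALL_COPY_THRESHOLD || len > small_len)
		return (0);	/* large frame: pass the ring buffer up instead */
	memcpy(small_buf, ring_buf, len);
	return (1);
}
```

Since most traffic is either MTU-sized data or tiny ACKs, the copy path catches the ACKs cheaply while full-size frames still travel zero-copy, which is why rstone argues it subsumes the keep-unused-fragments optimization.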

The first optimization, to use PAGE_SIZE clusters for MTU >= PAGE_SIZE, should be fairly trivial to implement without bringing in this whole patch.

kib added a comment. May 17 2018, 12:51 PM

If I recall correctly, mlx4 now has an optimization to copy very small packets that fit within a single (clusterless) mbuf into a new mbuf and leave the allocated mbuf chain on the ring. In most network traffic, the overwhelming majority of packets are either MTU-sized or ACKs with no payload. Therefore, copying very small packets into an mbuf effectively subsumes the second optimization, so implementing it is probably not worthwhile.

I do not see such code in mlx4_en_rx.c. Could you point it out, please?

In D11560#326314, @kib wrote:

I do not see such code in mlx4_en_rx.c. Could you point it out, please?

Ah, that's my mistake. hps sent me a patch privately about a year ago. I can forward it to you if you like.

kib added a comment. May 18 2018, 12:27 PM
In D11560#326314, @kib wrote:

I do not see such code in mlx4_en_rx.c. Could you point it out, please?

Ah, that's my mistake. hps sent me a patch privately about a year ago. I can forward it to you if you like.

No need, we will discuss it internally. Thanks.

Ping - is this patch still relevant?

Adding Slava to this review as well.