Page MenuHomeFreeBSD

Iflib changes to handle multiple netmap fragments
ClosedPublic

Authored by rajesh1.kumar_amd.com on Dec 28 2020, 12:21 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 5:21 PM
Unknown Object (File)
Tue, Jan 14, 11:29 PM
Unknown Object (File)
Tue, Jan 14, 11:00 PM
Unknown Object (File)
Tue, Jan 14, 11:19 AM
Unknown Object (File)
Sat, Jan 4, 11:03 AM
Unknown Object (File)
Dec 10 2024, 12:42 AM
Unknown Object (File)
Nov 30 2024, 1:55 AM
Unknown Object (File)
Nov 20 2024, 6:21 PM

Details

Summary

The current netmap implementation in Iflib seems to have limitation with respect
to handling a packet split across multiple buffers.

This scenario is common when the hardware supports split header feature, where
the header and data will be in seperate buffers.

This patch has the changes to netmap path in Iflib to handle packets with
multiple fragments.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 35835
Build 32724: arc lint + arc unit

Event Timeline

vmaffione added inline comments.
sys/net/iflib.c
992

Maybe pkt_start could be called nic_i_start for better consistency?
Similarly, maybe seg_idx is a better name for pkt_slot, since the term "slot" in this scope is also used for netmap slots (which follow an independent numbering)?

1204–1205

Maybe a for (i=0; i < ri.iri_nfrags; i++) is more readable?

1209

We can drop this comment.

1214

We should set this to NS_MOREFRAG rather than ORing its previous content, because we are initializing this field.

2651–2652

This change looks unrelated to the NAF_MOREFRAG, and should be evaluated with a different patch.

tools/tools/netmap/bridge.c
39 ↗(On Diff #81263)

This function was separated in two versions rx_slots_avail and tx_slots_avail for better readability (and to avoid the conditional branch). Please do not revert it.

101 ↗(On Diff #81263)

NS_MOREFRAG support for bridge should go in a separate patch.

This revision now requires changes to proceed.Dec 28 2020, 9:34 PM
rajesh1.kumar_amd.com marked 7 inline comments as done.

V2 patch with previous review comments addressed.

Thanks.
I tested it on my side on em(3) and vmx(3) in a VM to look for regressions.
What kind of tests have you done (I guess you are using axgbe)?
Have you tried the pkt-gen -F and -M options on the transmitter side (pkt-gen supports multifrag), and see if a pkt-gen receiver receives the correct traffic?
Do you want me to commit this change?

sys/net/iflib.c
2651

This still needs to be removed.

Hi @vmaffione

Thanks for your time and comments.

I tested it on my side on em(3) and vmx(3) in a VM to look for regressions.
What kind of tests have you done (I guess you are using axgbe)?

This patch has been tested with Netmap bridge utility and axgbe driver. With Netmap briding, we see the basic ping failing when we have split header feature enabled in driver (where the packet is split into header and data fragments). On debugging, I observed the bridge utility and iflib code doesn't seem to support multiple fragments. So, written this patch.

Have you tried the pkt-gen -F and -M options on the transmitter side (pkt-gen supports multifrag), and see if a pkt-gen receiver receives the correct traffic?

I have tested pkt-gen utility sometime earlier, but not with this patch. I have tried tx and rx with source and destinations MAC's provided. I haven't tried with -F and -M option. I just did a quick run of simple Tx and Rx with pkt-gen and Tx seems to be good, but seeing kernel panic on Rx. Is that the same behavior you observe? If not, can you please give details of the issue you observe. Also, is there any examples of using the -F and -M options?

In relation to this discussion, axgbe driver has been initially committed with 2 Freelists per receive queue for supporting split header. But Netmap stack, doesn't seem to support 2 Freelists per receive queue. So, we made changes to our axgbe driver to use 1 Freelist (rather than two) when trying Netmap on top of it (Driver changes are submitted in seperate patch - D27797). We need to discuss this topic with community. But do you have any idea whether Netmap stack has future plans related to this?

Do you want me to commit this change?

I will do more testing with pkt-gen as well, see what going wrong there in the Rx path and give an updated patch.

sys/net/iflib.c
1209

Does any other drivers have impact with respect to the change corresponding to CRC len?

2651

Looks like my previous comment is not posted.

I will remove this change from this patch.

2651–2652

This change has been submitted along with another review D27797. But, since this looks generic, I made this changes part of this review.

If it's insisted to be removed from this patch to avoid confusion. I shall remove it.

tools/tools/netmap/bridge.c
39 ↗(On Diff #81263)

Reverting this change and submitted the bridge changes for NS_MOREFRAG as seperate patch.

101 ↗(On Diff #81263)

Submitted the bridge changes for NS_MOREFRAG as seperate patch D27877

Il giorno mar 5 gen 2021 alle ore 07:55 rajesh1.kumar_amd.com (Rajesh
Kumar) <phabric-noreply@freebsd.org> ha scritto:

rajesh1.kumar_amd.com marked an inline comment as done.
rajesh1.kumar_amd.com added a comment.

Hi @vmaffione

Thanks for your time and comments.

> I tested it on my side on em(3) and vmx(3) in a VM to look for

regressions.

> What kind of tests have you done (I guess you are using axgbe)?

This patch has been tested with Netmap bridge utility and axgbe driver.

With Netmap briding, we see the basic ping failing when we have split
header feature enabled in driver (where the packet is split into header and
data fragments). On debugging, I observed the bridge utility and iflib
code doesn't seem to support multiple fragments. So, written this patch.

> Have you tried the `pkt-gen` `-F` and `-M` options on the transmitter

side (pkt-gen supports multifrag), and see if a pkt-gen receiver receives
the correct traffic?

I have tested pkt-gen utility sometime earlier, but not with this

patch. I have tried tx and rx with source and destinations MAC's
provided. I haven't tried with -F and -M option. I just did a quick run
of simple Tx and Rx with pkt-gen and Tx seems to be good, but seeing kernel
panic on Rx. Is that the same behavior you observe? If not, can you please
give details of the issue you observe. Also, is there any examples of
using the -F and -M options?

The -F and -M options are described by the pkt-gen help (pkt-gen -h), and
they only affect transmission (-f tx):

-F num_frags
        Send multi-slot packets, each one with num_frags fragments.  A

multi-slot packet is repre-

sented by two or more consecutive netmap slots with the

NS_MOREFRAG flag set (except for the

last slot).  This is useful to transmit or receive packets

larger than the netmap buffer

        size.

-M frag_size
        In multi-slot mode, frag_size specifies the size of each

fragment, if smaller than the packet

length divided by num_frags.

You could then try something like:

pkt-gen -i ax0 -f tx -F 3

or

pkt-gen -i ax0 -f tx -F 2 -M 100

and check that the receiver indeed receives what you expect to receive.
Some useful pkt-gen options are -R (to rate limit the transmitter), and `-f
txseq` (on the sender) and -f rxseq (on the receiver) to let the
sender/receiver use sequence numbers and check if you lose packets.

I did not observe issues. The panic on RX is something that we should look
at. Is that a regression introduced by this patch? What steps to reproduce?

In relation to this discussion, axgbe driver has been initially

committed with 2 Freelists per receive queue for supporting split header.
But Netmap stack, doesn't seem to support 2 Freelists per receive queue.
So, we made changes to our axgbe driver to use 1 Freelist (rather than two)
when trying Netmap on top of it (Driver changes are submitted in seperate
patch - D27797 https://reviews.freebsd.org/D27797). We need to discuss
this topic with community. But do you have any idea whether Netmap stack
has future plans related to this?

Netmap has no idea about the "freelists" concept. Netmap uses a simple RX
ring (circular buffer of descriptors), exposed to the userspace
application, and uses the driver-specific RXSYNC operation to align that
ring (the "netmap ring") to the real hardware ring. Also, all the netmap
buffers in the RX ring must have the same size.
The "freelist" concept is something introduced by iflib to match
functionality that is present in some newer drivers/NICs, where each
hardware "RX queue" is typically implemented by a completion queue (where
one would extract received packets) plus one or more freelists (where one
would submit buffers of different lengths to be used to receive the
upcoming packets). AFAIK, having more than one freelist is an optimization
useful to save memory and improve buffer allocation, and it implies that
each freelist is specialized on buffers of different length (e.g. freelist
#0 holds 64 bytes buffers to be used for packet headers, whereas freelist
#1 holds 2048 bytes buffer to be used for packet payload, and maybe
freelist #2 holds 256 bytes packets for short payload packets).
However, netmap buffers have all the same length by design, and therefore a
single free list is enough. And indeed this is what netmap support in iflib
does: only the first free list is used.

I do not see a way for netmap to be able to use multiple free lists. And
this is in the very end a consequence of its keep-it-simple design that had
the goal of maximize packet-rate throughput even with short packets.
However, note that also the if_vmx driver uses 2 free lists, and netmap
works on that (using only the first free list).
Maybe the same can be done for axgbe?

> Do you want me to commit this change?

I will do more testing with pkt-gen as well, see what going wrong there

in the Rx path and give an updated patch.

INLINE COMMENTS

vmaffione wrote in iflib.c:2651
This still needs to be removed.

Looks like my previous comment is not posted.

I will remove this change from this patch.

vmaffione wrote in iflib.c:1186
We can drop this comment.

Does any other drivers have impact with respect to the change
corresponding to CRC len?

The "iflib_crcstrip" is an old sysctl knob that is useful for benchmarks. I
may remove it soon.
In any case, "crclen" is by default always 0.

vmaffione wrote in iflib.c:2652
This change looks unrelated to the NAF_MOREFRAG, and should be evaluated

with a different patch.

This change has been submitted along with another review D27797 <
https://reviews.freebsd.org/D27797>. But, since this looks generic, I
made this changes part of this review.

If it's insisted to be removed from this patch to avoid confusion. I shall
remove it.

Yes, please remove it as it is unrelated.

vmaffione wrote in bridge.c:39
This function was separated in two versions rx_slots_avail and

tx_slots_avail for better readability (and to avoid the conditional
branch). Please do not revert it.

Reverting this change and submitted the bridge changes for NS_MOREFRAG as
seperate patch.

vmaffione wrote in bridge.c:101
NS_MOREFRAG support for bridge should go in a separate patch.

Submitted the bridge changes for NS_MOREFRAG as seperate patch D27877 <
https://reviews.freebsd.org/D27877>

REPOSITORY

rS FreeBSD src repository - subversion

CHANGES SINCE LAST ACTION

https://reviews.freebsd.org/D27799/new/

REVISION DETAIL

https://reviews.freebsd.org/D27799

EMAIL PREFERENCES

https://reviews.freebsd.org/settings/panel/emailpreferences/

To: rajesh1.kumar_amd.com, shurd, iflib, vmaffione, mmacy, manu
Cc: stephan.dewt_yahoo.co.uk, imp, ae, melifaro,
Contributor Reviews (src), krzysztof.galazka_intel.com

You could then try something like:

pkt-gen -i ax0 -f tx -F 3

or

pkt-gen -i ax0 -f tx -F 2 -M 100

and check that the receiver indeed receives what you expect to receive.
Some useful pkt-gen options are -R (to rate limit the transmitter), and `-f
txseq` (on the sender) and -f rxseq (on the receiver) to let the
sender/receiver use sequence numbers and check if you lose packets.

Thanks for the examples here. I will do some tests with these flags and let you know how it goes.

I did not observe issues. The panic on RX is something that we should look
at. Is that a regression introduced by this patch? What steps to reproduce?

Axgbe side :

pkt-gen -i ax0 -f rx -S <MAC of ax0 interface>

Peer side :

pkt-gen -i <iface name> -f tx -S <MAC of iface> -D <MAC of ax0 interface on axgbe side>

Running this panics. If I swap the tx and rx sides (where axgbe will be tx side), I am not seeing any problem.

Netmap has no idea about the "freelists" concept. Netmap uses a simple RX
ring (circular buffer of descriptors), exposed to the userspace
application, and uses the driver-specific RXSYNC operation to align that
ring (the "netmap ring") to the real hardware ring. Also, all the netmap
buffers in the RX ring must have the same size.
The "freelist" concept is something introduced by iflib to match
functionality that is present in some newer drivers/NICs, where each
hardware "RX queue" is typically implemented by a completion queue (where
one would extract received packets) plus one or more freelists (where one
would submit buffers of different lengths to be used to receive the
upcoming packets). AFAIK, having more than one freelist is an optimization
useful to save memory and improve buffer allocation, and it implies that
each freelist is specialized on buffers of different length (e.g. freelist
#0 holds 64 bytes buffers to be used for packet headers, whereas freelist
#1 holds 2048 bytes buffer to be used for packet payload, and maybe
freelist #2 holds 256 bytes packets for short payload packets).
However, netmap buffers have all the same length by design, and therefore a
single free list is enough. And indeed this is what netmap support in iflib
does: only the first free list is used.

I do not see a way for netmap to be able to use multiple free lists. And
this is in the very end a consequence of its keep-it-simple design that had
the goal of maximize packet-rate throughput even with short packets.
However, note that also the if_vmx driver uses 2 free lists, and netmap
works on that (using only the first free list).
Maybe the same can be done for axgbe?

Thanks for the detailed explanation. Actually, as you stated here, split header is the reason why we went for 2 Freelists approach earlier. But since netmap just use freelist 0, we now made changes (D27797) to the driver to accomodate netmap (without disturbing 2 freelist approach on non netmap case).

I just had a quick glance at the if_vmx driver, it seems that driver uses two rx queues(with 1 freelist each) per rx queue set. I am not sure whether if_vmx driver has this design for split header (or similar feature). But I remember, we faced some issues when using two rx queues per queue set, that's why we stick to one 1 rxq with 2 freelist approach. Anyway, I will study and test making similar changes to axgbe driver and see how it goes. Please let me know if you have any comments.

Il giorno mer 6 gen 2021 alle ore 20:16 rajesh1.kumar_amd.com (Rajesh
Kumar) <phabric-noreply@freebsd.org> ha scritto:

rajesh1.kumar_amd.com marked an inline comment as done.
rajesh1.kumar_amd.com added a comment.

> You could then try something like:
>
> pkt-gen -i ax0 -f tx -F 3
> =========================
>
> or
>
> pkt-gen -i ax0 -f tx -F 2 -M 100
> ================================
>
> and check that the receiver indeed receives what you expect to receive.
> Some useful pkt-gen options are -R (to rate limit the transmitter),

and `-f

> txseq` (on the sender) and `-f rxseq` (on the receiver) to let the
> sender/receiver use sequence numbers and check if you lose packets.

Thanks for the examples here. I will do some tests with these flags and

let you know how it goes.

> I did not observe issues. The panic on RX is something that we should

look

> at. Is that a regression introduced by this patch?  What steps to

reproduce?

Axgbe side :

pkt-gen -i ax0 -f rx -S <MAC of ax0 interface>

Peer side :

pkt-gen -i <iface name> -f tx -S <MAC of iface> -D <MAC of ax0 interface

on axgbe side>

Running this panics.  If I swap the tx and rx sides (where axgbe will be

tx side), I am not seeing any problem.

Ok, so the axgbe netmap receive datapath panics. Does this happen with or
without D27797?
If the panic is not introduced by your changes (e.g. if it is in HEAD) you
should submit a bug to the bugzilla.

> Netmap has no idea about the "freelists" concept. Netmap uses a simple

RX

> ring (circular buffer of descriptors), exposed to the userspace
> application, and uses the driver-specific RXSYNC operation to align

that

> ring (the "netmap ring") to the real hardware ring. Also, all the

netmap

> buffers in the RX ring must have the same size.
> The "freelist" concept is something introduced by iflib to match
> functionality that is present in some newer drivers/NICs, where each
> hardware "RX queue" is typically implemented by a completion queue

(where

> one would extract received packets) plus one or more freelists (where

one

> would submit buffers of different lengths to be used to receive the
> upcoming packets). AFAIK, having more than one freelist is an

optimization

> useful to save memory and improve buffer allocation, and it implies

that

> each freelist is specialized on buffers of different length (e.g.

freelist

> #0 holds 64 bytes buffers to be used for packet headers, whereas

freelist

> #1 holds 2048 bytes buffer to be used for packet payload, and maybe
> freelist #2 holds 256 bytes packets for short payload packets).
> However, netmap buffers have all the same length by design, and

therefore a

> single free list is enough. And indeed this is what netmap support in

iflib

> does: only the first free list is used.
>
> I do not see a way for netmap to be able to use multiple free lists.

And

> this is in the very end a consequence of its keep-it-simple design

that had

> the goal of maximize packet-rate throughput even with short packets.
> However, note that also the if_vmx driver uses 2 free lists, and netmap
> works on that (using only the first free list).
> Maybe the same can be done for axgbe?

Thanks for the detailed explanation. Actually, as you stated here, split

header is the reason why we went for 2 Freelists approach earlier. But
since netmap just use freelist 0, we now made changes (D27797 <
https://reviews.freebsd.org/D27797>) to the driver to accomodate netmap
(without disturbing 2 freelist approach on non netmap case).

It looks reasonable.

I just had a quick glance at the if_vmx driver, it seems that driver

uses two rx queues(with 1 freelist each) per rx queue set. I am not sure
whether if_vmx driver has this design for split header (or similar
feature). But I remember, we faced some issues when using two rx queues per
queue set, that's why we stick to one 1 rxq with 2 freelist approach.
Anyway, I will study and test making similar changes to axgbe driver and
see how it goes. Please let me know if you have any comments.

If you look at vmxnet3_isc_rxd_refill() it looks like freelist #0 (and rxq
#0) is for BTYPE_HEAD descriptors, whereas freelist #1 (and rxq#1) is for
BTYPE_BODY, for each qset. This seems to point at a sort of split
descriptor strategy. However I don't have the vmxnet3 specifications, so I
don't know for sure.

To be honest I'm quite a bit confused about what combinations of isc_nfl
and isc_nrxqs are valid, and what their meaning is in general, except for
the simple case where isc_nfl == isc_nrxqs == 1. It looks like drivers
where isc_nrxqs > 1 have one completion ring plus 1 or 2 command rings..
But this is something that should depend on the NIC specifications. So I
don't understand how you can "decide" to go for isc_nfl == 2 && isc_nrxqs

1 rather than isc_nfl == 2 && isc_nrxqs == 2. If you have a better

understanding than mine please do let me know.

REPOSITORY

rS FreeBSD src repository - subversion

CHANGES SINCE LAST ACTION

https://reviews.freebsd.org/D27799/new/

REVISION DETAIL

https://reviews.freebsd.org/D27799

EMAIL PREFERENCES

https://reviews.freebsd.org/settings/panel/emailpreferences/

To: rajesh1.kumar_amd.com, shurd, iflib, vmaffione, mmacy, manu
Cc: stephan.dewt_yahoo.co.uk, imp, ae, melifaro,
Contributor Reviews (src), krzysztof.galazka_intel.com

If you look at vmxnet3_isc_rxd_refill() it looks like freelist #0 (and rxq
#0) is for BTYPE_HEAD descriptors, whereas freelist #1 (and rxq#1) is for
BTYPE_BODY, for each qset. This seems to point at a sort of split
descriptor strategy. However I don't have the vmxnet3 specifications, so I
don't know for sure.

To be honest I'm quite a bit confused about what combinations of isc_nfl
and isc_nrxqs are valid, and what their meaning is in general, except for
the simple case where isc_nfl == isc_nrxqs == 1. It looks like drivers
where isc_nrxqs > 1 have one completion ring plus 1 or 2 command rings..
But this is something that should depend on the NIC specifications. So I
don't understand how you can "decide" to go for isc_nfl == 2 && isc_nrxqs

1 rather than isc_nfl == 2 && isc_nrxqs == 2. If you have a better

understanding than mine please do let me know.

Just gonna slide in here real quick...

Check to see if my understanding is correct:
It seems to me that a NIC has (according to the definitions of iflib) as many isc_nrxqsets as the amount of CPU cores on the system (as determined by hardware-specific code). In case of an 8-core CPU this would be:
isc_nrxqsets = 8

During iflib queue allocation, an (iflib_rxq_t * isc_nrxqsets) memory pool gets allocated. This struct basically wraps the entire RX side of the driver and iflib.
First, the RX descriptor DMA ring(s) are allocated. The amount of rings is determined by isc_nrxqs. The size of each individual ring is dictated by the driver. In the case of axgbe, this is

(scctx->isc_nrxd[i] * sizeof(struct xgbe_ring_desc))

which should in the default case equal 512 * 32B.
Based on isc_nfl, a (iflib_fl_t * isc_nfl) memory pool gets allocated and assigned to the iflib_rxq_t, should for example isc_nfl == 2, each iflib_rxq_t gets assigned two freelist pointers.
The freelists are then DMA aligned to the necessary mbuf structures.

Iflib now hands over control to the driver in order for it to allocate it's own RX rings and align it to iflib.
The descriptor base address for each ring gets assigned the virtual address provided by iflib, the same goes for the physical address which eventually makes its way into the hardware as part of the TXRX path.

At this point, we have two RX descriptor rings for two iflib freelists. The rings and freelists are both identical.

Knowing axgbe, the descriptor structure consists of 4 uint32_t's, called desc0, desc1, desc2, desc3 respectively.
During iflib_init, (isc_nrxqsets * isc_nfl) freelists get filled with empty mbufs (up to a count of 128) which means driver is now prepared to handle packets.
This initialization calls the drivers rxd_refill() and rxd_flush() functions up to count times, respectively for each freelist. The driver picks up on this and decides that:

  • should the freelist id be 0, set desc0 and desc1 (upper and lower 32 bits) to the physical buffer address provided by iflib. ---> HEADER
  • should the freelist id be 1, set desc2 and desc3 to the physical buffer address provided by iflib. ---> DATA/BODY

This is hardware-dependent code which works when the Split Header NIC function is enabled. I am unsure of what the NIC does when the SPH function is disabled.

Upon pkt_get(), the driver reads the header buf length and data buf length from the same descriptor where it placed the physical header/data address, where:

ri->iri_frags[0].irf_flid = 0 --> header
ri->iri_frags[1].irf_flid = 1 --> data

The driver then relies on iflib to assemble the segments in iflib_rxd_pkt_get() to form an mbuf.

One solution:
It seems if_vmx relies on IFLIB_HAS_RXCQ, implementing a third RX (completion) queue, representing a single received packet per descriptor (I think?). This completion queue is allocated at flid[0], meaning it works with netmap. I'm not entirely sure how IFLIB_HAS_RXCQ works, maybe someone could shed some light on this?

EDIT:
The completion queue is not at flid[0]. It seems to be a seperate completion queue implemented by the driver. How does it communicate with iflib?

Another solution to this problem:
I suppose you could specify isc_nfl == 1 && isc_nrxqs == 2, but you would have to DMA-align this in some way, kind of defeating the purpose of iflib.

In any case, the problem has a workaround in the submitted code (https://reviews.freebsd.org/D27797) by disabling the split header feature via a sysctl

Please correct me if I'm wrong

Ok, so the axgbe netmap receive datapath panics. Does this happen with or
without D27797?
If the panic is not introduced by your changes (e.g. if it is in HEAD) you
should submit a bug to the bugzilla.

Yes, But issues seems not with respect to driver changes in D27797, but with the iflib changes (specifically netmap_rxsync) in this patch. I will do some more testing and confirm that.

If you look at vmxnet3_isc_rxd_refill() it looks like freelist #0 (and rxq
#0) is for BTYPE_HEAD descriptors, whereas freelist #1 (and rxq#1) is for
BTYPE_BODY, for each qset. This seems to point at a sort of split
descriptor strategy. However I don't have the vmxnet3 specifications, so I
don't know for sure.

To be honest I'm quite a bit confused about what combinations of isc_nfl
and isc_nrxqs are valid, and what their meaning is in general, except for
the simple case where isc_nfl == isc_nrxqs == 1. It looks like drivers
where isc_nrxqs > 1 have one completion ring plus 1 or 2 command rings..
But this is something that should depend on the NIC specifications. So I
don't understand how you can "decide" to go for isc_nfl == 2 && isc_nrxqs
1 rather than isc_nfl == 2 && isc_nrxqs == 2. If you have a better
understanding than mine please do let me know.

First of all, sorry about my wrong and misleading statement. Axgbe infact uses isc_nrxq=2 and isc_nfl=2.

But 1 rxq (per queue set) with 2 free list is sufficient for axgbe. But because of the following code in iflib.c (iflib_queue_alloc function), we faced issue and sticked to isc_nrxq=2 and isc_nfl=2. Looks like each freelist expects a seperate descriptor DMA buffer.

for (j = 0; j < nfree_lists; j++) {
                        fl[j].ifl_rxq = rxq;
                        fl[j].ifl_id = j;
                        fl[j].ifl_ifdi = &rxq->ifr_ifdi[j + rxq->ifr_fl_offset];
                        fl[j].ifl_rxd_size = scctx->isc_rxd_size[j];
}

From my understanding, isc_nrxq corresponds to the number of descriptor rings in the hardware and isc_nfl corresponds to the number of memory buffers list needed. And iflib maps them 1x1. Please correct me if I am wrong.

Looks like, vmx uses seperate descriptor rings for header and data. Descriptor in each ring hold one buffer. "vmxnet3_isc_rxd_flush" writes the pidx to different offsets for flid 0 and 1. So it uses isc_nrxq and isc_nfl to 2. Please correct me if I am wrong.

But in case of axgbe, we have a single descriptor ring, with each descriptor holding two buffers (one for header and one for data). And "axgbe_isc_rxd_flush" writes the tail pointer only when flid is 1 (on 2 freelist approach). But, since we can't use 2 freelist when isc_nrxq is 1, we set it to 2 and ignore the second descriptor ring.

This approach works good with regular data traffic. But when tried with netmap, we faced issue and made the driver changes for single freelist (D27797). But in that case, using NS_MOREFRAG for split header fragments (iflib_netmap_rxsync) is messing up things with pkt-gen. I am trying to make sure about that.

One more thing I noticed is, current iflib code doesn't touch(set or get) the NS_MOREFRAG bit. But this patch revision infact has the changes for that only. And netmap pkt-gen, set and get the NS_MOREFRAG bit. So, is touching NS_MOREFRAG bit purposely ignored in iflib with the intent to handle that in application level?

Thanks to both for the clarification.

Il giorno ven 8 gen 2021 alle ore 12:08 rajesh1.kumar_amd.com (Rajesh
Kumar) <phabric-noreply@freebsd.org> ha scritto:

rajesh1.kumar_amd.com added a comment.

> Ok, so the axgbe netmap receive datapath panics. Does this happen with

or

> without D27797 <https://reviews.freebsd.org/D27797>?
> If the panic is not introduced by your changes (e.g. if it is in HEAD)

you

> should submit a bug to the bugzilla.

Yes, But issues seems not with respect to driver changes in D27797 <

https://reviews.freebsd.org/D27797>, but with the iflib changes
(specifically netmap_rxsync) in this patch. I will do some more testing
and confirm that.

> If you look at vmxnet3_isc_rxd_refill() it looks like freelist #0 (and

rxq

> #0) is for BTYPE_HEAD descriptors, whereas freelist #1 (and rxq#1) is

for

> BTYPE_BODY, for each qset. This seems to point at a sort of split
> descriptor strategy. However I don't have the vmxnet3 specifications,

so I

> don't know for sure.
>
> To be honest I'm quite a bit confused about what combinations of

isc_nfl

> and isc_nrxqs are valid, and what their meaning is in general, except

for

> the simple case where isc_nfl == isc_nrxqs == 1. It looks like drivers
> where isc_nrxqs > 1 have one completion ring plus 1 or 2 command

rings..

> But this is something that should depend on the NIC specifications. So

I

> don't understand how you can "decide" to go for isc_nfl == 2 &&

isc_nrxqs

> 1 rather than isc_nfl == 2 && isc_nrxqs == 2. If you have a better
> understanding than mine please do let me know.

First of all, sorry about my wrong and misleading statement.  Axgbe

infact uses isc_nrxq=2 and isc_nfl=2.

But 1 rxq (per queue set) with 2 free list is sufficient for axgbe.  But

because of the following code in iflib.c (iflib_queue_alloc function), we
faced issue and sticked to isc_nrxq=2 and isc_nfl=2. Looks like each
freelist expects a seperate descriptor DMA buffer.

for (j = 0; j < nfree_lists; j++) {
                        fl[j].ifl_rxq = rxq;
                        fl[j].ifl_id = j;
                        fl[j].ifl_ifdi = &rxq->ifr_ifdi[j +

rxq->ifr_fl_offset];

                          fl[j].ifl_rxd_size = scctx->isc_rxd_size[j];
  }

From my understanding, isc_nrxq corresponds to the number of descriptor

rings in the hardware and isc_nfl corresponds to the number of memory
buffers list needed. And iflib maps them 1x1. Please correct me if I am
wrong.

Unfortunately my knowledge here is quite limited. I think isc_nrxq
corresponds to the number of descriptor rings and/or completion queues: for
instance bnxt has a completion ring, a descriptor ring and an "AG" ring
(whatever it means). if_vmx has a completion ring plus two descriptor
rings).
I agree that from the code it looks like it must be isc_nfl <= isc_nrxqs,
because rxq->ifr_ifdi has isc_nrxqs elements, and the line you outlined

fl[j].ifl_ifdi = &rxq->ifr_ifdi[j + rxq->ifr_fl_offset];

would cause out of bound access if isc_nfl > isc_nrxqs.
Now I also understand that rxq->ifr_fl_offset is used for cases where the
driver has a completion queue, which must be the first one of the "rxqs".
I think I will add an assert to check that (isc_nfl<=isc_nrxqs), so that at
least we avoid issues with future drivers.

Looks like, vmx uses seperate descriptor rings for header and data.

Descriptor in each ring hold one buffer. "vmxnet3_isc_rxd_flush" writes
the pidx to different offsets for flid 0 and 1. So it uses isc_nrxq and
isc_nfl to 2. Please correct me if I am wrong.

But in case of axgbe, we have a single descriptor ring, with each

descriptor holding two buffers (one for header and one for data). And
"axgbe_isc_rxd_flush" writes the tail pointer only when flid is 1 (on 2
freelist approach). But, since we can't use 2 freelist when isc_nrxq is 1,
we set it to 2 and ignore the second descriptor ring.

Yes, that makes sense.

This approach works good with regular data traffic.  But when tried with

netmap, we faced issue and made the driver changes for single freelist
(D27797 https://reviews.freebsd.org/D27797). But in that case, using
NS_MOREFRAG for split header fragments (iflib_netmap_rxsync) is messing up
things with pkt-gen. I am trying to make sure about that.

One more thing I noticed is, current iflib code doesn't touch(set or

get) the NS_MOREFRAG bit. But this patch revision infact has the changes
for that only. And netmap pkt-gen, set and get the NS_MOREFRAG bit. So,
is touching NS_MOREFRAG bit purposely ignored in iflib with the intent to
handle that in application level?

No.
NS_MOREFRAG follows the same rules as the other fields in the netmap_slot,
such as "len".
On receive rings, NS_MOREFRAG may be set by the kernel (iflib, the driver,
or netmap core, depending on the case), and should only be read by the
userspace application.
On transmit rings, NS_MOREFRAG may be set by the application (e.g. pkt-gen)
and can only read by the kernel.
Current iflib drivers simply do not currently support NS_MOREFRAG
(NAF_MOREFRAG not set). That's why iflib does not touch that flag.
With your patch, NS_MOREFRAG will become supported, and that's useful in
general.
In any case, if I understand correctly, with D27797 NS_MOREFRAG should not
be needed anymore for your axgbe deployment.

REPOSITORY

rS FreeBSD src repository - subversion

CHANGES SINCE LAST ACTION

https://reviews.freebsd.org/D27799/new/

REVISION DETAIL

https://reviews.freebsd.org/D27799

EMAIL PREFERENCES

https://reviews.freebsd.org/settings/panel/emailpreferences/

To: rajesh1.kumar_amd.com, shurd, iflib, vmaffione, mmacy, manu
Cc: stephan.dewt_yahoo.co.uk, imp, ae, melifaro,
Contributor Reviews (src), krzysztof.galazka_intel.com

The kernel panic issue which I reported with pkt-gen here is root caused to an issue in the axgbe driver. Now that issue is fixed in the driver and I will submit the driver patches seperately.

Initially, the stack trace pointed to the netmap_rxsync path in Iflib. So, I suspected the issue could be with the changes made in netmap_rxsync. But the issue here is because of the wrong receive index written after refilling the buffers in the driver. Now, with this changes Netmap pkt-gen and bridge has been validated with axgbe driver with both split header enabled and disabled.

So, request the reviewers to review/try the patch here and let me know your comments. Please let me know if any info needed.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 24 2021, 9:22 PM
This revision was automatically updated to reflect the committed changes.

Thanks @vmaffione for reviewing and accepting the changes.