
netmap: Fix queue stalls on generic interfaces
Closed, Public

Authored by markj on Jan 15 2023, 5:04 PM.
Details

Summary

In emulated mode, the FreeBSD netmap port attempts to perform zero-copy
transmission. This works as follows: the kernel ring is populated with
mbuf headers to which netmap buffers are attached. When transmitting,
the mbuf refcount is initialized to 2, and when the counter has been
decremented to 1, netmap infers that the driver has freed the mbuf and
thus that transmission is complete.

This scheme does not generalize to the situation where netmap is
attaching to a software interface which may transmit packets among
multiple "queues", as is the case with bridge or lagg interfaces. In
that case, we would be relying on backing hardware drivers to free
transmitted mbufs promptly, but this isn't guaranteed; a driver may
reasonably defer freeing a small number of transmitted buffers
indefinitely. If such a buffer ends up at the tail of a netmap transmit
ring, further transmits will end up blocked.
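
For illustration, a minimal user-space model of the scheme and of the stall (hypothetical names; the real code manipulates the mbuf external-storage refcount through netmap's own macros):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* One TX slot: the ring holds one reference, the driver the other. */
    struct tx_slot_model {
        atomic_int refcnt;      /* 2 = ring + driver, 1 = ring only */
    };

    static void
    slot_submit(struct tx_slot_model *s)
    {
        atomic_store(&s->refcnt, 2);    /* hand one reference to the driver */
    }

    static bool
    slot_done(struct tx_slot_model *s)
    {
        /* "Done" means the driver has dropped its reference via m_free(). */
        return (atomic_load(&s->refcnt) == 1);
    }

    /* Ring cleaning stops at the first slot whose driver reference is
     * still outstanding, so one deferred free at the tail blocks every
     * slot behind it, even if those were completed long ago. */
    static unsigned int
    clean_ring_model(struct tx_slot_model *slots, unsigned int tail,
        unsigned int head, unsigned int nslots)
    {
        while (tail != head && slot_done(&slots[tail]))
            tail = (tail + 1) % nslots;
        return (tail);  /* new hwtail; unchanged if the oldest slot is stuck */
    }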

Fix the problem by removing the zero-copy scheme in the FreeBSD port.
Emulated mode is not intended to provide the highest possible
performance, so this is acceptable. Meanwhile, this change allows
netmap to attach to software interfaces without running into the
deadlock described above. In the future it should be possible to
reintroduce zero-copy transmits for hardware interfaces running in
emulated mode, and the transmit path still attempts to recycle mbufs
without an extra trip into UMA per transmit.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

sys/dev/netmap/netmap_generic.c
634

This expands to nothing on FreeBSD and I can't see why it's needed, so I removed it. If it is in fact required, let's add a comment explaining why and ensure that it does the right thing on FreeBSD.

sys/dev/netmap/netmap_generic.c
780

This case isn't handled properly. generic_set_tx_event() sets a custom destructor on the oldest mbuf in the kernel ring and prevents further TX cleaning until that mbuf is freed. But again, with software interfaces we cannot guarantee that the mbuf will be freed promptly, so it is still possible to hang.

I don't see any solution other than to disable this mechanism on FreeBSD. In other words, if if_transmit returns an error, we drop the packet and keep going.

Remove calls to generic_set_tx_event(). This mechanism also assumes
that drivers will free mbufs after transmission within some short
interval, but this is not something we can assume. So upon a transmit
error, simply drop the packet and keep going.

des added inline comments.
sys/dev/netmap/netmap_generic.c
731
sys/dev/netmap/netmap_kern.h
2396–2400

It's only ever called through a pointer (see right below), so the inline is pointless.

Ok with dropping the zerocopy, but I think we should try to reuse the mbufs.

sys/dev/netmap/netmap_freebsd.c
437

Is a memcpy() good enough, or should we use something more generic like m_append?

sys/dev/netmap/netmap_generic.c
554

I think this m_freem(m) here makes the whole refcounting scheme useless (i.e., setting the refcount to 2 in nm_os_generic_xmit_frame()).

The whole point of the scheme is to try to reuse mbufs as much as possible. But if you look at generic_netmap_txsync(), in the first part you advance nr_hwcur by submitting the corresponding mbufs to the driver. In the second part, generic_netmap_tx_clean() is called to advance nr_hwtail towards nr_hwcur. However, the latter call will find (with very high probability) all the mbufs submitted during the first part with the refcount still set to 2, because the driver hasn't m_freed them yet. Therefore, with this change we would decrease the refcount to 1, and the driver m_free would deallocate the buffer later. The end result is that we pay one mbuf allocation+deallocation per packet transmitted, which is not nice.
But if this is the end result, why bother setting the refcount to 2? We could just set it to 1 from the beginning and leave it to the driver.

A better strategy would be to perform this operation opportunistically, for example only when we are running out of TX space, say when nm_kr_txspace() is less than a quarter of the ring size. In this way we get a chance to reuse the mbufs.
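
A sketch of that opportunistic cleaning, assuming the usual netmap_kern.h context and that generic_netmap_tx_clean() keeps its (kring, txqdisc) signature from netmap_generic.c:

    /* Clean only when free TX space is getting low, so mbufs already
     * handed to the driver get a chance to be freed and reused. */
    static void
    generic_tx_clean_opportunistic(struct netmap_kring *kring, int txqdisc)
    {
        if (nm_kr_txspace(kring) < kring->nkr_num_slots / 4)
            generic_netmap_tx_clean(kring, txqdisc);
    }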

745

The generic_set_tx_event() mechanism is used on Linux. Maybe define generic_set_tx_event() as a NOP on FreeBSD.

795

Same as above, this can be a NOP on FreeBSD, but we would like to keep it for Linux.
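
One way to spell that out (a sketch; the final patch may express the no-op differently):

    /* Keep the Linux code path, compile the mechanism out on FreeBSD. */
    #if defined(__FreeBSD__)
    #define generic_set_tx_event(_kring, _hwcur)    do { } while (0)
    #endif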

This revision now requires changes to proceed. Jan 31 2023, 10:21 PM
markj marked 4 inline comments as done. Feb 2 2023, 4:35 PM
markj added inline comments.
sys/dev/netmap/netmap_freebsd.c
437

I think using m_copydata() is reasonable. It's not strictly necessary today since this function is only called from the generic txsync routine and the mbuf layout is predictable, but I agree that this should be done more generically.

sys/dev/netmap/netmap_generic.c
554

I tried adding counters; in my simple experiment with virtio-net and if_bridge, "very high probability" is actually 50-60%. :)

I had tried this before and found a similar result, so I left the refcounting scheme in place. I do agree in general, though: there are lots of factors which could make mbuf reuse less likely, especially on physical hw, so it's not a reliable optimization anymore.

What do you mean by "this operation"? TX ring cleaning? Alternatively, I wonder if we should try cleaning the ring before transmitting rather than after?

745

I removed it entirely since txqdisc is set to 1 on Linux in the upstream netmap sources. But I see now that it's a tunable which can be changed, so I'll restore it.

sys/dev/netmap/netmap_kern.h
2396–2400

Without that qualifier we get -Wunused-function warnings from every file which includes netmap_kern.h but doesn't reference void_mbuf_dtor().
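
A minimal illustration of the warning, independent of netmap:

    /* header.h: a plain static function defined in a header triggers
     * -Wunused-function in every translation unit that includes the
     * header without calling it; "static inline" suppresses the warning
     * even when the function is only ever referenced through a pointer
     * in one particular file. */
    static void plain_helper(void *p) { (void)p; }          /* may warn */
    static inline void inline_helper(void *p) { (void)p; }  /* never warns */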

markj marked 3 inline comments as done.

Address some review feedback:

  • Use m_copyback() instead of memcpy() for copying the netmap buffer.
  • Restore generic_set_tx_event() and make it a no-op on FreeBSD.
sys/dev/netmap/netmap_generic.c
554

Yes, I mean TX cleaning. I don't think changing the order will have an effect, because it is very common for txsync routines to be called twice in a very short timeframe.

Regarding physical hw, I was actually saying the contrary: on physical hw you have better chances of optimization (mbuf reuse), because transmission is not instantaneous. For virtio-net it's different: it's easier to hit the case where transmission is effectively synchronous, because a TX virtqueue kick will likely wake up the consumer thread in the hypervisor.
So I think the optimization is worthwhile, and we should try to clean the TX ring opportunistically (e.g. when running out of space).
However, we can leave this optimization for a later changeset.

Can I ask what kind of tests you performed? I guess you have set sysctl dev.netmap.admode=2 (see netmap(4)) and tried on a vtnet0 interface.
If not done yet, could you please perform some tests on an em0 interface (e.g. emulated by qemu or bhyve)?

Can I ask what kind of tests you performed? I guess you have set sysctl dev.netmap.admode=2 (see netmap(4)) and tried on a vtnet0 interface.
If not done yet, could you please perform some tests on an em0 interface (e.g. emulated by qemu or bhyve)?

I've tested lagg(4) on em(4) previously in VMware with this and it worked without stalls or visible slowdown. Now, following your suggestion, with plain em(4) and the sysctl set to 2, I'm seeing very slow downloads that get slower still and eventually emit:

[zone: mbuf_cluster] kern.ipc.nmbclusters limit reached

Testing without this patch ... shows the same issue. It could be the way Suricata handles netmap(4), in this case em1 -> em1^ and em1^ -> em1 to listen to the bidirectional traffic flow of a single interface, or it could be an inherent issue. I'm not sure. I only know that on FreeBSD 11 or 12, when the out-of-tree Intel em(4) driver was the default choice, this likely wasn't present and everything worked as expected.

Can I ask what kind of tests you performed? I guess you have set sysctl dev.netmap.admode=2 (see netmap(4)) and tried on a vtnet0 interface.

Yes, I tried this admode=2 with vtnet and also tested with if_lagg and if_bridge, mostly just with the bridge program.

If not done yet, could you please perform some tests on an em0 interface (e.g. emulated by qemu or bhyve)?

Hmm, if I try with bhyve-emulated em0 and bridge netmap:em0 and netmap:em0^, I see messages like "netmap_transmit em0 drop mbuf that needs checksum offload" and no traffic passes. This is with admode=0 or 2.

If I disable the rxcsum and txcsum flags on the interface, then it works, but with very low throughput (~1Mbit/s) when admode=2. This also happens without my patch, though, and I don't see an mbuf leak.

It may be because of the bhyve em0 emulation; I'm not sure it's bug-free.
QEMU emulation (on a Linux KVM box) should be very reliable.


It may be because of the bhyve em0 emulation; I'm not sure it's bug-free.
QEMU emulation (on a Linux KVM box) should be very reliable.

Could be, but I don't see any problems in netmap native mode. I will test with some hardware igb interfaces.

It may be because of the bhyve em0 emulation; I'm not sure it's bug-free.
QEMU emulation (on a Linux KVM box) should be very reliable.

Could be, but I don't see any problems in netmap native mode. I will test with some hardware igb interfaces.

Sorry for the delay. I tested a hardware interface:

markj@biggie> pciconf -lv igb0
igb0@pci0:5:0:0:        class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x1533 subvendor=0x15d9 subdevice=0x1533
    vendor     = 'Intel Corporation'
    device     = 'I210 Gigabit Network Connection'
    class      = network
    subclass   = ethernet

and I see the same problem: very low throughput from iperf when admode=2, with or without my patch. Again, I am just running sudo ./bridge -w 1 -i netmap:igb0 -i netmap:igb0^, and with admode=0 there is no problem, I hit line rate. So it seems that there has been a regression, probably somewhere in iflib? I do not see this problem with vtnet, it is fine with admode=2.

Yeah, it is very possible that a regression has been introduced with iflib.
Note that it does not make sense to use emulated mode with iflib, since iflib has native support. So it's not surprising that the regression has gone unnoticed.

Would you mind also testing pkt-gen -f tx -i netmap:vtnet0? That's a different workload w.r.t. bridge.

This revision is now accepted and ready to land. Feb 26 2023, 3:35 PM

Issue a tx wakeup after if_transmit fails in emulated mode. Also count drops
caused by if_transmit failures.

Testing with pkt-gen revealed a problem: if if_transmit fails and the tx ring
is full, the netmap application may poll() for additional space in the ring.
However, because we no longer use mbuf destructors to issue tx wakeups, this
situation can cause the application to stall.

Fix the problem by issuing a tx wakeup immediately.

This revision now requires review to proceed. Feb 27 2023, 4:19 PM

Yeah, it is very possible that a regression has been introduced with iflib.
Note that it does not make sense to use emulated mode with iflib, since iflib has native support. So it's not surprising that the regression has gone unnoticed.

Would you mind also testing pkt-gen -f tx -i netmap:vtnet0? That's a different workload w.r.t. bridge.

This revealed a bug in the patch, I've updated it. With that, I get about 1.9Mpps from vtnet with admode=2, versus 2.4Mpps before. I presume that this is due to overhead from mbuf allocations, but since we're in emulated mode, and have some ideas to improve the situation, this is not a fatal problem?

Thanks for testing.
I think that's not a fatal problem, since (1) one is supposed to use native mode where available, and (2) this patch enables more use cases for emulated adapter, as you are suggesting.
In this regard, I would be curious to have an example where "a driver may reasonably defer freeing a small number of transmitted buffers indefinitely" (quoting your commit message above).
indefinitely"

For sure this performance drop is due to mbuf allocation, as I was suggesting before when commenting on the refcounting scheme and opportunistic TX cleaning. As you can see, this issue is better exposed by tests without the host rings, since host rings are "slower" by nature.
We can always improve in later patches.

vmaffione added inline comments.
sys/dev/netmap/netmap_generic.c
591

Is this the latest fix you were mentioning? What is the logic?

You are apparently waking up userspace from the TXSYNC routine, which will trigger another TXSYNC...

This revision now requires changes to proceed. Feb 28 2023, 5:28 PM

Thanks for testing.
I think that's not a fatal problem, since (1) one is supposed to use native mode where available, and (2) this patch enables more use cases for emulated adapter, as you are suggesting.
In this regard, I would be curious to have an example where "a driver may reasonably defer freeing a small number of transmitted buffers indefinitely" (quoting your commit message above).

iflib may do this, for example. iflib_if_transmit() enqueues packets in a ring buffer which is drained by iflib_txq_drain(). iflib_encap() is responsible for queuing packets in a hardware TX ring. It asks the driver to raise completion interrupts by setting IPI_TX_INTR, but it doesn't do this for every packet.

Similarly, virtio-net only enables TX interrupts when a certain number of completions are pending. See vtnet_txq_enable_intr().

With these drivers, if one sends a single packet, the mbuf refcount might not be decremented after transmission. In principle, the problem could also arise with drivers that distribute packets among multiple TX rings, though this is unlikely since nm_os_generic_xmit_frame() uses a fixed flowid.
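
For reference, pinning a frame to one queue with a constant flowid uses the standard mbuf packet-header fields; the value shown here is illustrative, not necessarily what netmap sets:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    static void
    set_fixed_flowid(struct mbuf *m)
    {
        m->m_pkthdr.flowid = 0;                 /* constant: always the same TX queue */
        M_HASHTYPE_SET(m, M_HASHTYPE_OPAQUE);   /* hash type is opaque to drivers */
    }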

sys/dev/netmap/netmap_generic.c
591

Sorry, this is a hack and not really appropriate to commit.

The problem is that we failed to transmit a packet at the tail of the TX ring. The packet is not dropped, and hwcur is not advanced. But we cannot rely on the mbuf destructor to trigger a wakeup, so if the tx ring is full, a poll() will sleep forever.

The hack is to issue a wakeup so that the poll() will keep retrying until transmission succeeds. This fixes a hang that I can trigger occasionally with pkt-gen. But it's not the right general solution.

We could alternatively use a callout to periodically schedule a wakeup.

Thanks for testing.
I think that's not a fatal problem, since (1) one is supposed to use native mode where available, and (2) this patch enables more use cases for emulated adapter, as you are suggesting.
In this regard, I would be curious to have an example where "a driver may reasonably defer freeing a small number of transmitted buffers indefinitely" (quoting your commit message above).

iflib may do this, for example. iflib_if_transmit() enqueues packets in a ring buffer which is drained by iflib_txq_drain(). iflib_encap() is responsible for queuing packets in a hardware TX ring. It asks the driver to raise completion interrupts by setting IPI_TX_INTR, but it doesn't do this for every packet.

Yes, iflib does not set IPI_TX_INTR for each packet in order to moderate TX interrupts and descriptor writeback overhead, but I'm pretty sure it guarantees that all the mbufs submitted to the hardware TX rings are m_free()d in a bounded amount of time.
Where have you observed indefinite m_free() deferral?

Similarly, virtio-net only enables TX interrupts when a certain number of completions are pending. See vtnet_txq_enable_intr().

I know virtio and virtio-net quite well, and I'm pretty sure the backend (e.g. qemu, bhyve) virtqueue implementation does consume the buffers in a timely manner, so that the if_vtnet driver can m_free() them in a bounded time.
Have you observed indefinite m_free() deferral with if_vtnet?

In any case, the multiqueue argument looks more convincing, but for a different reason. In the case of software network interfaces that dispatch mbufs to independent sub-interfaces, packets may be completed out of order, which breaks the scheme's assumptions.

With these drivers, if one sends a single packet, the mbuf refcount might not be decremented after transmission. In principle, the problem could also arise with drivers that distribute packets among multiple TX rings, though this is unlikely since nm_os_generic_xmit_frame() uses a fixed flowid.

Thanks for testing.
I think that's not a fatal problem, since (1) one is supposed to use native mode where available, and (2) this patch enables more use cases for emulated adapter, as you are suggesting.
In this regard, I would be curious to have an example where "a driver may reasonably defer freeing a small number of transmitted buffers indefinitely" (quoting your commit message above).

iflib may do this, for example. iflib_if_transmit() enqueues packets in a ring buffer which is drained by iflib_txq_drain(). iflib_encap() is responsible for queuing packets in a hardware TX ring. It asks the driver to raise completion interrupts by setting IPI_TX_INTR, but it doesn't do this for every packet.

Yes, iflib does not set IPI_TX_INTR for each packet in order to moderate TX interrupts and descriptor writeback overhead, but I'm pretty sure it guarantees that all the mbufs submitted to the hardware TX rings are m_free()d in a bounded amount of time.

In iflib, transmitted mbufs are freed only by iflib_completed_tx_reclaim(), which does not do any work in the reclaim <= thresh case. So if the interface is idle, nothing will force the reclamation of a small number of transmitted mbufs.
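
A simplified model of that pattern (not the actual iflib code; names are hypothetical):

    /* If too few completed descriptors are pending, the reclaim is a
     * no-op, so a lone transmitted mbuf can sit unfreed on an idle queue. */
    struct txq_model {
        int completed;      /* descriptors the NIC has finished with */
    };

    static int
    tx_reclaim_model(struct txq_model *txq, int thresh)
    {
        int n = txq->completed;

        if (n <= thresh)
            return (0);     /* skip: nothing is freed this time around */
        txq->completed = 0; /* pretend we m_free()d all of them */
        return (n);
    }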

Where have you observed indefinite m_free() deferral?

When testing the if_bridge patch in the other review, also in testing if_lagg. When the software interface has multiple member ports, we can hit this problem when a single packet is transmitted on one member port, and then a large number of packets are transmitted on another member port, since nothing triggers a m_free() of the first packet.

Similarly, virtio-net only enables TX interrupts when a certain number of completions are pending. See vtnet_txq_enable_intr().

I know virtio and virtio-net quite well, and I'm pretty sure the backend (e.g. qemu, bhyve) virtqueue implementation does consume the buffers in a timely manner, so that the if_vtnet driver can m_free() them in a bounded time.

if_vtnet can free transmitted mbufs promptly, but it doesn't. Transmitted mbufs are freed by vtnet_txq_eof(), which is called in two places: from vtnet_tick(), called once per second, and when transmitting mbufs. So if I transmit a single mbuf, it might take up to one second before it is freed. This can be observed using dtrace -n fbt::m_free:entry {stack();} and transmitting a single UDP packet in an otherwise idle VM.

Have you observed indefinite m_free() deferral with if_vtnet?

Sorry, it is not quite indefinite: vtnet_tick() will free transmitted mbufs, but this only happens once per second, which is not far from indefinite. :)

But yes, I saw this in the same scenario described above, with if_bridge and if_lagg.

Thanks for testing.
I think that's not a fatal problem, since (1) one is supposed to use native mode where available, and (2) this patch enables more use cases for emulated adapter, as you are suggesting.
In this regard, I would be curious to have an example where "a driver may reasonably defer freeing a small number of transmitted buffers indefinitely" (quoting your commit message above).

iflib may do this, for example. iflib_if_transmit() enqueues packets in a ring buffer which is drained by iflib_txq_drain(). iflib_encap() is responsible for queuing packets in a hardware TX ring. It asks the driver to raise completion interrupts by setting IPI_TX_INTR, but it doesn't do this for every packet.

Yes, iflib does not set IPI_TX_INTR for each packet in order to moderate TX interrupts and descriptor writeback overhead, but I'm pretty sure it guarantees that all the mbufs submitted to the hardware TX rings are m_free()d in a bounded amount of time.

In iflib, transmitted mbufs are freed only by iflib_completed_tx_reclaim(), which does not do any work in the reclaim <= thresh case. So if the interface is idle, nothing will force the reclamation of a small number of transmitted mbufs.

In theory you are right, although it looks like in practice thresh is always set to 0, so no deferral should happen.

Where have you observed indefinite m_free() deferral?

When testing the if_bridge patch in the other review, also in testing if_lagg. When the software interface has multiple member ports, we can hit this problem when a single packet is transmitted on one member port, and then a large number of packets are transmitted on another member port, since nothing triggers a m_free() of the first packet.

Maybe the problem here is the out of order completion due to the packet redistribution on different ports?

Similarly, virtio-net only enables TX interrupts when a certain number of completions are pending. See vtnet_txq_enable_intr().

I know virtio and virtio-net quite well, and I'm pretty sure the backend (e.g. qemu, bhyve) virtqueue implementation does consume the buffers in a timely manner, so that the if_vtnet driver can m_free() them in a bounded time.

if_vtnet can free transmitted mbufs promptly, but it doesn't. Transmitted mbufs are freed by vtnet_txq_eof(), which is called in two places: from vtnet_tick(), called once per second, and when transmitting mbufs. So if I transmit a single mbuf, it might take up to one second before it is freed. This can be observed using dtrace -n fbt::m_free:entry {stack();} and transmitting a single UDP packet in an otherwise idle VM.

Correct.

Have you observed indefinite m_free() deferral with if_vtnet?

Sorry, it is not quite indefinite: vtnet_tick() will free transmitted mbufs, but this only happens once per second, which is not far from indefinite. :)

Agreed.

But yes, I saw this in the same scenario described above, with if_bridge and if_lagg.

sys/dev/netmap/netmap_generic.c
591

Yeah, a callout would be better. But a mixed approach could be the best solution... an mbuf destructor wakeup may arrive before the callout (improving the TX rate), and at the same time the callout gives you a guaranteed wakeup (no deadlock).

Thanks for testing.
I think that's not a fatal problem, since (1) one is supposed to use native mode where available, and (2) this patch enables more use cases for emulated adapter, as you are suggesting.
In this regard, I would be curious to have an example where "a driver may reasonably defer freeing a small number of transmitted buffers indefinitely" (quoting your commit message above).

iflib may do this, for example. iflib_if_transmit() enqueues packets in a ring buffer which is drained by iflib_txq_drain(). iflib_encap() is responsible for queuing packets in a hardware TX ring. It asks the driver to raise completion interrupts by setting IPI_TX_INTR, but it doesn't do this for every packet.

Yes, iflib does not set IPI_TX_INTR for each packet in order to moderate TX interrupts and descriptor writeback overhead, but I'm pretty sure it guarantees that all the mbufs submitted to the hardware TX rings are m_free()d in a bounded amount of time.

In iflib, transmitted mbufs are freed only by iflib_completed_tx_reclaim(), which does not do any work in the reclaim <= thresh case. So if the interface is idle, nothing will force the reclamation of a small number of transmitted mbufs.

In theory you are right, although it looks like in practice thresh is always set to 0, so no deferral should happen.

Sorry, my mistake. But look at the call path: iflib_completed_tx_reclaim() is only called from iflib_txq_drain(), which is a callback invoked from ifmp_ring_enqueue() after some items are enqueued. There are two places that insert items: iflib_if_transmit(), i.e., the iflib interface's if_transmit routine, and _task_fn_tx(), which is the task scheduled from the TX interrupt handler. A dtrace script shows that _task_fn_tx() is not scheduled after every transmission:

dtrace -n 'fbt::_task_fn_tx:entry {@ = count(); printa(@);}' -c "ping 169.254.0.10"

In this case, I see roughly one task per 2 packets, which corresponds to my reading of txq_max_rs_deferred(). So I infer that after transmitting a single packet on an otherwise idle tx ring, nothing will reclaim the mbuf until a second mbuf is transmitted.

Where have you observed indefinite m_free() deferral?

When testing the if_bridge patch in the other review, also in testing if_lagg. When the software interface has multiple member ports, we can hit this problem when a single packet is transmitted on one member port, and then a large number of packets are transmitted on another member port, since nothing triggers a m_free() of the first packet.

Maybe the problem here is the out of order completion due to the packet redistribution on different ports?

Even if completion occurs out of order, I would still expect the netmap application to make progress so long as transmitted mbufs are freed promptly.

markj marked an inline comment as done. Mar 7 2023, 7:18 PM
markj added inline comments.
sys/dev/netmap/netmap_generic.c
591

I ended up implementing this, and I think it's a much better solution. I'll upload a new version of the patch and describe the changes.

markj marked an inline comment as done.

When cleaning tx rings, go back to waiting for the mbuf refcount to drop to 1
before allowing the ring slot to be reused. This has two benefits: it restores
the old behaviour of reusing mbufs, which gives a throughput improvement, and
it limits the number of mbufs which can be queued up for transmission at a
given point in time, so the driver is much less likely to return ENOBUFS.

To handle "stuck" mbufs, use a callout with 1ms period. If transmission fails,
or the TX ring is full, the callout handler will treat the mbuf as transmitted
and wake up the netmap application. This will require allocation of a new
mbuf, but this is a relatively rare scenario.
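
A rough sketch of that fallback (hypothetical structure and helper names, not the committed code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/callout.h>

    struct txq_stall_sketch {
        struct callout  stall_callout;  /* callout_init()ed at attach time */
        void           *kring;          /* the generic adapter's TX kring */
    };

    /* Stand-in for whatever wakeup the real code issues on the kring. */
    static void wake_netmap_app(void *kring);

    static void
    txq_stall_handler(void *arg)
    {
        struct txq_stall_sketch *txq = arg;

        /* The stuck slot is treated as transmitted; on the next txsync
         * the application attaches a freshly allocated mbuf to it. */
        wake_netmap_app(txq->kring);
    }

    static void
    txq_arm_stall_callout(struct txq_stall_sketch *txq)
    {
        /* ~1ms; a normal mbuf-destructor wakeup usually arrives first,
         * so the callout only matters for "stuck" mbufs. */
        callout_reset(&txq->stall_callout, imax(1, hz / 1000),
            txq_stall_handler, txq);
    }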

Fix a regression in the previous commit: after transmission of an mbuf,
M_PKTHDR may be cleared, so it needs to be set again before the mbuf is reused.

nm_os_generic_xmit_frame() handled this before; it affects at least vlan
interfaces, which may prepend an mbuf to the chain and thus clear M_PKTHDR on
the mbuf allocated by netmap.

However, setting M_PKTHDR is not sufficient. Other fields in the mbuf header
need to be reinitialized. For instance, M_VLANTAG may be left set. Thus
introduce nm_os_mbuf_reinit() and use it in txsync before transmission. This
reinitializes the mbuf header completely and so solves the problem properly.
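
A hedged sketch of what such a reinit helper could look like (the committed nm_os_mbuf_reinit() may differ in detail; the point is that packet-header state from the previous transmission is discarded while the attached netmap buffer is kept):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    static void
    mbuf_reinit_sketch(struct mbuf *m)
    {
        /* Keep only the external-storage flag; this clears M_VLANTAG and
         * friends, and restores M_PKTHDR which drivers may have cleared. */
        m->m_flags = (m->m_flags & M_EXT) | M_PKTHDR;

        /* Drop any tags attached during the previous trip down the stack. */
        m_tag_delete_chain(m, NULL);

        m->m_data = m->m_ext.ext_buf;       /* rewind to the netmap buffer */
        m->m_len = 0;
        m->m_pkthdr.len = 0;
        m->m_pkthdr.flowid = 0;
        m->m_pkthdr.csum_flags = 0;
        m->m_pkthdr.ether_vtag = 0;
        m->m_pkthdr.rcvif = NULL;
    }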

sys/dev/netmap/netmap_kern.h
2437

Can we avoid the per-packet mbuf reinit in cases where there is no need (e.g. hw ifnets...)?

sys/dev/netmap/netmap_kern.h
2437

The proposed check is not sufficient. We might need to clear M_VLANTAG, for example.

I don't see a good way to avoid the reinit. In my testing this patch is still much faster than the older version which allocates mbufs in the txsync path.

sys/dev/netmap/netmap_kern.h
2437

I do not follow. If M_VLANTAG is set for m, the proposed check will not apply and reinit will happen...
I think we should make an optimization like this to address simple cases... (maybe in a separate commit)

sys/dev/netmap/netmap_kern.h
2437

Sorry, I see, you are right.

I am still a bit leery of this - there are other fields besides m_flags where state can persist. I'd prefer to handle that in a separate commit once I've done a bit more testing. I'm a bit skeptical that the overhead is very significant compared to copying each packet.

Looks good.
How are the tests going?

sys/dev/netmap/netmap_kern.h
2396

Maybe rename this to nm_generic_mbuf_dtor().

This revision is now accepted and ready to land. Mar 22 2023, 9:41 PM
  • Rename void_mbuf_dtor() to nm_generic_mbuf_dtor().
  • Rename generic_mbuf_destructor to generic_mbuf_dtor().
  • Fix an mbuf leak that can happen when generic_mbuf_dtor() returns without freeing the cluster.
This revision now requires review to proceed. Apr 5 2023, 3:57 PM
markj marked an inline comment as done. Apr 5 2023, 4:01 PM

Looks good.
How are the tests going?

Please see https://reviews.freebsd.org/D39431

This revision was not accepted when it landed; it landed in state Needs Review. Apr 5 2023, 4:20 PM
This revision was automatically updated to reflect the committed changes.