
svc_vc.c: Add support for an xp_extpg boolean
Abandoned, Public

Authored by rmacklem on Feb 9 2026, 10:58 PM.

Details

Summary

This patch adds a boolean called xp_extpg to the
SVCXPRT structure, which indicates that the NIC
that will be used for RPC messages sent on the
socket can handle M_EXTPG mbufs.

If xp_extpg gets set when the outbound NIC
cannot handle M_EXTPG mbufs, the code in
ip_output() will call mb_unmapped_to_ext() to
handle the situation.

xp_extpg will be used by the NFS server to
decide if Read replies can be stored in M_EXTPG
mbufs.

Test Plan

Tested by using this patch with the NFS server
patch on both systems that have Mellanox NICs
(which set IFCAP_MEXTPG) and localhost, which
does not set IFCAP_MEXTPG.

Thanks go to Greg Becker <becker.greg@att.net>
for doing the testing using Mellanox NICs.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/rpc/svc_vc.c
321

Oh I'm not an expert on /sys/rpc, really :)

You can use rib_lookup() to ease the code a bit :)

375

The routing table is dynamic. So, to make it perfect, you may want to register for rib (routing table) change events via rib_subscribe() and update xp_extpg accordingly, so that you can benefit when a new nexthop has IFCAP_MEXTPG enabled.

Probably another idea is to emit an IFNET_EVENT_CAPEN event when the enabled capabilities of an interface change, so that consumers can receive an IFCAP_MEXTPG change and react to it, instead of polling, which wastes many resources.

sys/rpc/svc_vc.c
321

Maybe. I think I still have to do a
in6_splitscope() call for IPv6, so it
still ends up as an IPv4 vs IPv6 switch
statement.

rib_lookup() also requires NET_EPOCH().
Does that just mean that the call needs
to be wrapped between:

NET_EPOCH_ENTER(et);
rib_lookup(..);
NET_EPOCH_EXIT(et);

or does it require more than that?

If I still need to do an in6_splitscope()
call and NET_EPOCH, I'm not sure the code
gets simpler, but I will do so, if you think
it is better?
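
As a rough illustration of the wrapping asked about above, here is a small userspace model, not kernel code. The names MODEL_EPOCH_ENTER(), MODEL_EPOCH_EXIT(), model_fib_lookup(), check_extpg() and CAP_MEXTPG are hypothetical stand-ins for NET_EPOCH_ENTER(), NET_EPOCH_EXIT(), the fib lookup call, the patch's check function and IFCAP_MEXTPG. It assumes the lookup result is unreferenced, so the capability bit is copied out before the epoch section ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP_MEXTPG 0x1u    /* hypothetical: models IFCAP_MEXTPG */

struct ifnet_model { unsigned capenable; };
struct nhop_model { struct ifnet_model *ifp; };

/* Models the net epoch: nonzero while inside an epoch section. */
static int epoch_depth;
#define MODEL_EPOCH_ENTER()    (epoch_depth++)
#define MODEL_EPOCH_EXIT()     (epoch_depth--)

/* Models the fib lookup: only legal inside the epoch section. */
static struct nhop_model *
model_fib_lookup(struct nhop_model *table_entry)
{
    assert(epoch_depth > 0);
    return (table_entry);
}

/* Returns the hint; the nexthop pointer never escapes the epoch section. */
static bool
check_extpg(struct nhop_model *table_entry)
{
    struct nhop_model *nh;
    bool extpg = false;

    MODEL_EPOCH_ENTER();
    nh = model_fib_lookup(table_entry);
    /* Read the capability bit before leaving; nh is unreferenced. */
    if (nh != NULL && (nh->ifp->capenable & CAP_MEXTPG) != 0)
        extpg = true;
    MODEL_EPOCH_EXIT();
    return (extpg);
}
```

In this model the enter/exit pair is all that is needed because only a boolean leaves the section; keeping the nexthop pointer around afterwards would need a reference instead.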

375

xp_extpg is just a hint and nothing breaks if
it is set erroneously. As such, I don't think this needs
to worry about routing changes.

Also, NFS servers won't typically have routing
table changes (they tend to be leaf nodes that
aren't changed, at least w.r.t. networking for months
or years).

To be honest, I recall glebius@ suggesting I just
enable use of M_EXTPG always and let ip_output()
call mb_unmapped_to_ext() to deal with it.
--> I just felt that doing the mapping for most
NICs wasn't ideal until most NIC drivers can
handle M_EXTPG mbufs.

I'm not sure what you are referring to w.r.t.
IFNET_EVENT_CAPEN, but all that happens
is that xp_extpg is set when the TCP connection
is made (not on every RPC) and checked once/RPC,
so I don't think overhead is an issue?

Added NET_EPOCH_ENTER()/NET_EPOCH_EXIT() and
cleaned up the code a bit.

I stayed with fib_lookup()/fib6_lookup() since I had
to call in6_splitscope() for IPv6 and the returned
arguments match what fib_lookup() needs.
(For rib_lookup(), the in6_addr would have to be
copied into sin6_addr.)

sys/rpc/svc_vc.c
321

Hmm, it looks like NET_EPOCH is required.
NET_EPOCH_ENTER()/NET_EPOCH_EXIT()
has been added.

I didn't change it to rib_lookup(), since
the code still needs to call in6_splitscope()
and it returns arguments appropriate for
fib6_lookup().

zlei added inline comments.
sys/rpc/svc_vc.c
375

To be honest, I recall glebius@ suggesting I just enable use of M_EXTPG always and let ip_output() call mb_unmapped_to_ext() to deal with it.

Do you have benchmarks for these two setups over an interface without IFCAP_MEXTPG enabled?

  1. set xprt->xp_extpg = FALSE
  2. set xprt->xp_extpg = TRUE

If the benchmark shows no noticeable performance degradation, then @glebius's suggestion is much better and simpler.

375

xp_extpg is just a hint and nothing breaks if it is set erroneously. As such, I don't think this needs to worry about routing changes.

Also, NFS servers won't typically have routing table changes (they tend to be leaf nodes that aren't changed, at least w.r.t. networking for months or years).

Indeed.

375

I'm not sure what you are referring to w.r.t. IFNET_EVENT_CAPEN, but all that happens is that xp_extpg is set when the TCP connection is made (not on every RPC) and checked once/RPC, so I don't think overhead is an issue?

Currently no such event as IFNET_EVENT_CAPEN exists. If always using M_EXTPG mbufs causes a noticeable performance degradation over an interface without IFCAP_MEXTPG enabled, then this event would be valuable.

sys/rpc/svc_vc.c
375

No, I do not have that. I only have very slow
NICs (100Mbps) and they just run at wire speed.

A tester did find a 14% improvement with the
patch (and associated NFS server patch) using
a 100Gbps Mellanox NIC that sets IFCAP_MEXTPG.

My concern with always enabling it would be
resource exhaustion on a small system, since
mb_unmapped_to_ext() uses the sf_buf interface
to map the pages. (I don't know if there is a
limit on how much of that can be done?)

I do still have a 256Mbyte i386 at home.
I can test using that in a couple of weeks
to see what happens if you enable M_EXTPG
for it (it has a 100Mbps broadcom NIC).

Could you please take another look at this?
kib@ seems to think this is preferred to just
enabling it for all NICs.

I understand that it is only a "hint", since
routing can change at any time.
However, for an NFS server, routing is
unlikely to change without an NFS server
restart.

Typically a highly available NAS server wants link aggregation or ECMP routes. The latter may introduce a route change if one of the links fails (planned or unplanned).

I have a two-port Chelsio T520-CR Ethernet card which supports MEXTPG. I'd like to set up an NFS server to test first.

Sure. After you have applied this patch, apply the patch here:
https://people.freebsd.org/~rmacklem/new-extpg.patch

I had one person test it who had Mellanox NICs, but no one
who has Chelsio.

Thanks, rick

Oh, and as I think I noted, if _svc_vc_checkextpg() returns
the wrong answer, the only effect for most is a slight difference
(<= 15% from what I've seen) in performance.

My main concern is the corner case where enabling it causes
a big performance hit (that the user would not realize can be
fixed by setting vfs.nfsd.enable_mextpg=0).
--> Since M_EXTPG mbufs are never used without the patch,
_svc_vc_checkextpg() returning false when it should not
is not a serious problem, since there won't be a regression.

I don't think this is the right fix. A route lookup now doesn't guarantee the same interface will be used in the future. There is dynamic routing, weighted routing, policy routing, etc.

There should be some generic gate that would convert mbufs; otherwise, we will need to add code like this to every module that generates mbufs. Again, code that is not correct when routing isn't static.

As Rick said in the description, this is just a hint, and the conversion routine in ip_output() will handle any case where the egress NIC changes. However, just using extpg mbufs all the time seems like a better solution. That's what SW ktls offload does. mb_unmapped_to_ext() is not super expensive, and most high-ish performance NICs are aware of extpgs these days (iflib + mlx5 + cxgbe covers most 10Gb or more NICs in practice).

Kostik preferred not enabling it all the time.
See this:
https://lists.freebsd.org/archives/freebsd-net/2026-May/008766.html

The case of concern is some corner case, where
the "always enabled" causes a regression. I asked on freebsd-net@
to try and determine if such a case exists?

Kostik didn't indicate if he thought such a case exists, but was
concerned that the user wouldn't understand why the regression
happened (and the NFS patch includes a sysctl that turns it off).
--> Right now, it is never enabled, so not enabling it does not
introduce a regression.
Enabling it unconditionally makes a regression more likely
than only enabling it when this hint thinks the NIC can do it
and that hint is incorrect.

I doubt NFS servers have frequent routing changes, but I do not
know that for certain?

You guys can debate it. If there is no consensus, I'll just leave
it disabled as it is now.

I think assuming a performance reduction from mb_unmapped_to_ext() may be a bad assumption. E.g., in my experience from ~2017 or so (when Netflix had enough non-https traffic to matter), using M_EXTPG mbufs for sendfile and doing the conversion at the edge in ip_output() was *faster* than using plain mbufs. I attributed this to avoiding cold-cache pointer chasing in socket buffers. Note that sendfile's use of ext pgs is gated by kern.ipc.mb_use_ext_pgs, which has been 1 since its inception at Netflix. We should probably default it to 1 upstream as well.

I didn't assume that. In fact, for the limited testing I've done using slow networking
(1Gbps), I see wire speed for both and lower CPU overheads for m_extpg mbufs.
The fear is that someone will hit a corner case that causes severe problems.
(ZFS can be very good at sucking up all the pages in a system and m_extpg
mbufs are using them as well, instead of mbuf clusters from a dedicated pool.)

I really want to avoid more issues like bugzilla PR#292884, 293127. (If you haven't
looked at these and have any insight w.r.t. them, hearing about that would be
appreciated. I think they're the same underlying problem, which is somehow,
somewhere, the socket gets sorele()'d/sofree()'d prematurely. I've now gone
through the krpc code in detail and cannot spot anything in it that could cause
this.)

Anyhow, maybe enabling it for all (with a knob to turn it off), is the way to go?

I think it could be set to 'enabled' for machines with DMAP, i.e. amd64/arm64, and maybe risc-v, if anybody ever uses an NFS server on it. For other arches, the knob almost surely should be kept disabled.
mb_unmapped_to_ext() uses non-privately mapped sfbufs for all extents of mbufs. On DMAP systems, this is free. On other arches, allocating such an sfbuf causes a global IPI, and the whole chain of sfbufs is freed only after the mbufs are released by the network card. Besides the cost of allocation, this would make sfbufs a scarce resource for other consumers and even for the NFS server itself. There are around 1K sfbufs, and mb_unmapped_to_ext() seems to drop the packet if an sfbuf cannot be allocated immediately.

It's so far in the past that I no longer have the results, but I recall doing a test on an i386 kernel and being surprised that the overhead from extpgs was still lower on 32-bit. I was expecting the same thing as you.

Drew

Sorry for widening the discussion, but I think it is better to discuss this more rather than end up with a quick fix. At Netflix Drew is working on a module that is very different from NFS, but is also a fat source of traffic. We have a very similar function in there.

IMHO, in our network stack there are two pulling directions - FreeBSD kernel as a router and FreeBSD kernel as a fat source of traffic. We want to excel at both of course. Ideally, we want to excel at both at the same time. Minimal requirement though is that improving one side shall not make other side dysfunctional, e.g. everything works with a decent performance w/o any tuning. Medium solution would be that with some tuning you can switch for the best performance of either case.

Should I start email thread on that somewhere? Better do that before we have several functions like this one in several different subsystems.

P.S. Writing comment on this particular NFS patch in next message.

For this particular patch, I'd suggest the following.

_svc_vc_checkextpg() should be made a generic function living somewhere in sys/net/route. It should accept a socket address and return a referenced struct nhop_object *. It should call fib[46]_lookup() with the NHR_REF flag. It should also be passed a u_int * argument that it fills with the current route table generation number obtained via rt_tables_get_gen(). Alternatively, the generic function may assert the epoch, and then it would be the caller's choice whether to pull the route generation number together with the lookup.

The svc_vc code should call this generic function at setup and store the returned struct nhop_object * and rtgen number in its xprt.

On packet generation it should first call rt_tables_get_gen() to validate that the routing table did not change. If it did, then drop the reference on the stored nhop and call the generic function again to get a new one. After that it can consult xprt->nh->nh_ifp->if_capenable to decide whether to use extended mbufs. AFAIK, we are always in the net epoch at packet generation.

The above is not 100% beautiful, as the rt_*, nhop_* and fib_* KPIs are slightly decoupled even though they describe the same backend. However, the above is exactly what the generic ip_output() does today.
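
The setup/revalidate scheme described above can be modeled in userspace roughly as follows. This is only a sketch and not kernel code: route_gen, lookup_nhop(), struct xprt_hint, hint_init(), hint_use_extpg() and CAP_MEXTPG are hypothetical stand-ins for rt_tables_get_gen(), the proposed generic lookup function, the fields stored in the xprt, and IFCAP_MEXTPG. Reference counting is reduced to a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP_MEXTPG 0x1u    /* hypothetical: models IFCAP_MEXTPG */

struct ifnet_model { unsigned capenable; };
struct nhop_model { struct ifnet_model *ifp; int refs; };

/* Models the route table generation number. */
static unsigned route_gen;
/* Models the nexthop a fresh lookup would currently return. */
static struct nhop_model *current_nhop;

/* Models the generic helper: returns a referenced nexthop and the generation. */
static struct nhop_model *
lookup_nhop(unsigned *genp)
{
    assert(genp != NULL);
    *genp = route_gen;
    if (current_nhop != NULL)
        current_nhop->refs++;
    return (current_nhop);
}

struct xprt_hint {
    struct nhop_model *nh;    /* cached, referenced nexthop */
    unsigned gen;             /* generation seen at lookup time */
};

/* Called once at connection setup. */
static void
hint_init(struct xprt_hint *h)
{
    h->nh = lookup_nhop(&h->gen);
}

/* Called per reply: revalidate the cache, then consult the capability bit. */
static bool
hint_use_extpg(struct xprt_hint *h)
{
    if (h->gen != route_gen) {
        if (h->nh != NULL)
            h->nh->refs--;    /* drop the stale reference */
        h->nh = lookup_nhop(&h->gen);
    }
    return (h->nh != NULL && (h->nh->ifp->capenable & CAP_MEXTPG) != 0);
}
```

The per-reply cost in the common case is a single integer comparison; a new lookup happens only after the generation number has moved.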

P.S. Alexander had left a promising comment for bright future in nhop.h:

* TODO: subscribe for the interface notifications and update the nexthops
*  with NHF_INVALID flag.

So some day this is going to be improved.

I personally think NFS (or anything else) should *NOT* do this, and that it should just use mb_use_ext_pgs like sendfile does.

In 2021, this was defaulted to enabled for platforms with a direct map.

In D55203#1309629, @kib wrote:

I think it could be set to 'enabled' for machines with DMAP, i.e. amd64/arm64, and maybe risc-v, if anybody ever uses an NFS server on it. For other arches, the knob almost surely should be kept disabled.
mb_unmapped_to_ext() uses non-privately mapped sfbufs for all extents of mbufs. On DMAP systems, this is free. On other arches, allocating such an sfbuf causes a global IPI, and the whole chain of sfbufs is freed only after the mbufs are released by the network card. Besides the cost of allocation, this would make sfbufs a scarce resource for other consumers and even for the NFS server itself. There are around 1K sfbufs, and mb_unmapped_to_ext() seems to drop the packet if an sfbuf cannot be allocated immediately.

Actually, that is already in the NFS patch (I know, I didn't put that
one up because I didn't think anyone would care), since the code
I did for NFS-over-TLS uses m_extpg mbufs and needs the DMAP.
(I had thought all 64bit arches had DMAP, but I'll admit I didn't look.)

Ok, so I see that, for PMAP_HAS_DMAP arches, all it does is set up
the page as a cluster (sf_buf_alloc() is essentially a no-op).
It looks like the only overhead is a bunch of mget() calls that allocate
the new mbuf list (one for each page) for this case?

This might actually be an improvement over allocating 2K mbuf clusters
(about 1/2 the mbufs in the chain). I did once try a patch using 4K clusters,
but it caused fragmentation problems in the mbuf allocation stuff (I think
that that fragmentation problem is now resolved?).
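
To make the "about 1/2 the mbufs" estimate concrete, here is a small arithmetic sketch. It assumes 2K clusters, 4K pages, page-aligned data, and one mbuf per page after conversion (as described above); it ignores the mbufs that hold RPC headers. MODEL_MCLBYTES, MODEL_PAGE_SIZE and bufs_needed() are hypothetical names used only for this example.

```c
#include <stddef.h>

#define MODEL_MCLBYTES   2048u    /* assumed 2K mbuf cluster size */
#define MODEL_PAGE_SIZE  4096u    /* assumed 4K page size */

/* Number of fixed-size buffers needed to hold len bytes, rounded up. */
static size_t
bufs_needed(size_t len, size_t bufsize)
{
    return ((len + bufsize - 1) / bufsize);
}
```

For a 64 KiB Read reply, bufs_needed(65536, MODEL_MCLBYTES) is 32 while bufs_needed(65536, MODEL_PAGE_SIZE) is 16, so the converted chain is half as long under these assumptions.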

So, it seems to come down to...

  • Use 4K clusters if the NIC doesn't support IFCAP_MEXTPG and use m_extpg mbufs if it does.

OR

  • Just always use m_extpg mbufs and accept the overhead of allocating a bunch of new mbufs in mb_unmapped_to_ext() for the cases where the NIC doesn't support IFCAP_MEXTPG (only Mellanox and Chelsio fast NICs appear to do so now).

Does that sound reasonable? rick

I personally think NFS (or anything else) should *NOT* do this, and that it should just use mb_use_ext_pgs like sendfile does.

I also think we should *NOT* do this.
nh can be one of the nexthops from a nexthop group.
Even if you pass the inp_flowid, that flowid or the final nhop slot can change via RSS sysctl or nhgrp subsystem.
I can write a more generic route helper for these use cases, if you want.
For example, a function like yours (_svc_vc_checkextpg) but that also takes the desired IF_CAP to verify whether (all) the next hop(s) supports it or not.

But I strongly suggest *NOT* going down this path.
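
For reference, the more generic helper mentioned above (one that takes the desired capability and checks every next hop) might look roughly like this userspace model. It is a sketch only: struct nhgrp_model, nhgrp_all_support() and CAP_MEXTPG are hypothetical and do not correspond to the real nexthop group KPI.

```c
#include <stdbool.h>
#include <stddef.h>

#define CAP_MEXTPG 0x1u    /* hypothetical: models IFCAP_MEXTPG */

struct ifnet_model { unsigned capenable; };
struct nhop_model { struct ifnet_model *ifp; };
/* Models a nexthop group: any member may carry a given packet. */
struct nhgrp_model { struct nhop_model **nhops; size_t count; };

/* True only if every nexthop in the group has all bits in cap enabled. */
static bool
nhgrp_all_support(const struct nhgrp_model *grp, unsigned cap)
{
    if (grp == NULL || grp->count == 0)
        return (false);
    for (size_t i = 0; i < grp->count; i++) {
        const struct nhop_model *nh = grp->nhops[i];

        if (nh == NULL || nh->ifp == NULL ||
            (nh->ifp->capenable & cap) != cap)
            return (false);
    }
    return (true);
}
```

Requiring every member to support the capability is the conservative choice here, since the slot actually used for a flow can change after the check.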

I'll just leave things in their current state.

I agree with kib@ that enabling M_EXTPG
mbufs could result in a regression that the
user would not be able to track down.

Since the gain is 5-15%, I think the risk
outweighs the gain for enabling by default.