Page MenuHomeFreeBSD

Replace if_addr_lock rwlock with epoch + mutex
ClosedPublic

Authored by mmacy on May 9 2018, 6:50 AM.
Tags
None
Referenced Files
F107763243: D15366.id42392.diff
Sat, Jan 18, 1:01 AM
Unknown Object (File)
Fri, Jan 17, 2:55 PM
Unknown Object (File)
Sun, Jan 5, 12:59 AM
Unknown Object (File)
Thu, Jan 2, 7:32 PM
Unknown Object (File)
Sun, Dec 29, 7:12 PM
Unknown Object (File)
Dec 2 2024, 7:48 AM
Unknown Object (File)
Dec 2 2024, 7:48 AM
Unknown Object (File)
Dec 2 2024, 7:48 AM

Details

Summary

Mostly mechanical change to use the CK_STAILQ macros and defer frees to an epoch_call

Depends on D15365.

Test Plan

Run on NFLX and LLNW canaries

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

Not allwinner specific, hoping this will remove my blocking.

Looks awesome. My only issue is that I'd strongly prefer that the STAILQ_HEAD / STAILQ_INIT / STAILQ_ENTRY macros be prefixed with CK_ so that readers realize that these lists are used with epochs, and cannot be safely used with the normal STAILQ macros. I tried to point out most of them, but I may have missed a few.

The other issue is that it looks like there is at least one occurrence of a hand-rolled STAILQ_FOREACH that may be missing some epoch safety.

This is a huge public service; thanks for doing this!

sys/contrib/ipfilter/netinet/ip_compat.h
230

Should be CK_STAILQ_HEAD

sys/net/if_var.h
94

CK_STAILQ_HEAD

521

CK_STAILQ_ENTRY

sys/netinet/in_var.h
83

For clarity, it seems like this should be CK_STAILQ_ENTRY (yes, I realize definition is the same, but it gives a hint to humans what's going on with this list)

110

For clarity, it seems like this should be CK_STAILQ_HEAD

176

This looks almost like CK_STAILQ_FOREACH, except that does a ck_pr_fence_load(). Do you need that?

sys/netinet6/in6_var.h
130

CK_STAILQ_ENTRY

sys/netinet6/ip6_input.c
225

CK_STAILQ_INIT

sys/contrib/ipfilter/netinet/ip_compat.h
230

This is strictly for use by user, I'm hesitant to break buildworld for the sake of consistency there. But with a 96-way I can give it a whirl without much opportunity cost.

User land includes if_var.h - fix buildworld

  • Change sa_family to AF_UNSPEC after removal to indicate to readers that the address has been logically deleted
  • fix up compile after merging D15365

Under a heavy UDP packet flood this helps quite a bit.

Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

On my dual 8160 with the equivalent workload I see in_broadcast go from 0.86% of samples to 0.24% of samples. However in_pcblookup_hash goes from 0.78% to 1.36%. I can't actually generate enough load to get a performance improvement because FreeBSD UDP tx doesn't actually do any queue hashing:
mmacy@pandemonium [~/FlameGraph|21:44|19] sysctl dev.cc.0 | grep enqueues
dev.cc.0.txq.15.r_enqueues: 843
dev.cc.0.txq.14.r_enqueues: 899
dev.cc.0.txq.13.r_enqueues: 886
dev.cc.0.txq.12.r_enqueues: 887
dev.cc.0.txq.11.r_enqueues: 918
dev.cc.0.txq.10.r_enqueues: 891
dev.cc.0.txq.9.r_enqueues: 910
dev.cc.0.txq.8.r_enqueues: 896
dev.cc.0.txq.7.r_enqueues: 662834
dev.cc.0.txq.6.r_enqueues: 896
dev.cc.0.txq.5.r_enqueues: 855
dev.cc.0.txq.4.r_enqueues: 874
dev.cc.0.txq.3.r_enqueues: 902
dev.cc.0.txq.2.r_enqueues: 891
dev.cc.0.txq.1.r_enqueues: 897
dev.cc.0.txq.0.r_enqueues: 905543423

With the following patch I get uniform packet distribution on the sender

diff --git a/sys/netinet/udp_usrreq.c b/sys/netinet/udp_usrreq.c
index 670182ece8b..1a279382cab 100644
--- a/sys/netinet/udp_usrreq.c
+++ b/sys/netinet/udp_usrreq.c
@@ -1592,6 +1592,8 @@ udp_attach(struct socket *so, int proto, struct thread *td)
        inp = sotoinpcb(so);
        inp->inp_vflag |= INP_IPV4;
        inp->inp_ip_ttl = V_ip_defttl;
+       inp->inp_flowid = arc4random();
+       inp->inp_flowtype = M_HASHTYPE_OPAQUE;
 
        error = udp_newudpcb(inp);
        if (error) {

But the profile is only marginally less embarrassing:
https://github.com/mattmacy/profiling/blob/master/2018.05.11/udpsender2.svg

On going progress on fixing the brokenness in UDP transmit. Shelving this for the day. I'll probably just move to using pkt-gen on Monday.
https://github.com/ScaleBSD/scalebsd.github.io/blob/master/_drafts/2018-05-12-UDP-On-FreeBSD-is-anyone-using-this-at-scale.md

This revision is now accepted and ready to land.May 16 2018, 12:12 AM
hselasky added inline comments.
sys/net/if.c
983

Setting AF_UNSPEC must be after epoch_wait() ? Else you introduce a race.

sys/net/if.c
1001

epoch_wait() first.

1175

epoch_wait() first. Then AF_UNSPEC.

3781

Same here. epoch_wait() must happen before this write.

3794

epoch_wait() first.

It looks like those assignments didn't actually make it in to the branch that's been tested. Look here:
https://github.com/mattmacy/networking/tree/projects/epochifaddrnew