Page MenuHomeFreeBSD

Add link header precomputation for ethernet/infiniband. Make arp/ndp/bpf/flowtable use it.
ClosedPublic

Authored by melifaro on Nov 7 2015, 7:26 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Jan 8, 9:28 AM
Unknown Object (File)
Mon, Jan 6, 1:03 AM
Unknown Object (File)
Sun, Dec 22, 6:41 PM
Unknown Object (File)
Dec 2 2024, 3:07 PM
Unknown Object (File)
Dec 2 2024, 3:18 AM
Unknown Object (File)
Nov 28 2024, 5:48 PM
Unknown Object (File)
Nov 28 2024, 5:48 PM
Unknown Object (File)
Nov 28 2024, 3:07 AM

Details

Summary

This is projects/routing pre-requisite.

Add if_requestencap() interface method which is capable of calculating various link headers for given interface. Right now there is support for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation (L2 multipath lagg nexthops, tunnel encap nexthops, etc..).

Reshape 'struct route' to be able to pass additional data (with is length) to prepend to mbuf.

These two changes permits routing code to pass cached nexthop data (like L2 header for route w/gateway) down the stack eliminating the need for other lookups.
It also brings us closer to more complex scenarios like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable code (ro_lle) and simplifies handling of existing if_output consumers.

ARP/NDP lookups:
There is free room inside first 64 bytes of struct lle sufficient to store full ethernet/ipoib header. Make arp/ndp stack pre-calculate link header upon installing/updating lle record. Interface link address change are handled by re-calculating headers for all lles based on if_lladdr event. After these changes, arpresolve()/nd6_resolve() returns full pre-calculated header for supported interfaces this simplifying if_output().
Move these lookups to separate (inlined) <ether|ipoib>_resolve_addr() function which ether returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr() compat versions to return link addresses instead of pre-calculated data.

ARP requests:
Arp is control plane protocol, so we don't care about performance that much. Use if_requestencap() to pre-calculate link header for each packet and provide if_output() with full prepend data eliminating the need for AF_ARP case.

BPF:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only difference is that interface source mac has to be filled by OS for AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside BPF and not pollute if_output() routines. After that change, convert BPF to pass prepend data via new 'struct route' mechanism. Note that it does not change for non-optimized families: ro+prepend handling is purely optional.

Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI. It is not needed for ethernet anymore. The only remaining FDDI user is dev/pdq mostly untouched since 2007. DLT_FDDI was eliminated on OpenBSD in 2014: http://openbsd-archive.7691.n7.nabble.com/bpf-4-obsolete-data-link-levels-td246688.html .

Flowtable:
Flowtable violates layering by saving (and not correctly managing) rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated header data from that lle.

Test Plan

ARP/NDP:
checked IPv4/IPv6 connectivity. Tried to change interface mac address - everything worked as expected.
Tried to insert static arp record - works correctly.

BPF: seem to work. See example application I used to check BIOCGHDRCMPLT handling:

1#include <stdio.h>
2#include <stdlib.h>
3#include <string.h>
4#include <sys/types.h>
5#include <sys/time.h>
6#include <unistd.h>
7#include <err.h>
8#include <fcntl.h>
9#include <sysexits.h>
10#include <sys/ioctl.h>
11#include <sys/socket.h>
12#include <netinet/in.h>
13#include <arpa/inet.h>
14#include <net/bpf.h>
15#include <net/ethernet.h>
16#include <net/if.h>
17#include <net/if_arp.h>
18
19#define BPF_PATH "/dev/bpf"
20
21static void
22send_arp(int fd)
23{
24 char buf[128];
25 struct ether_header *eh;
26 struct arphdr *ah;
27 struct in_addr sip, tip;
28 int len, ret;
29
30 memset(buf, 0, sizeof(buf));
31
32 eh = (struct ether_header *)buf;
33 ah = (struct arphdr *)(eh + 1);
34
35 eh->ether_type = htons(ETHERTYPE_ARP);
36 memcpy(eh->ether_shost, ether_aton("00:01:00:02:00:03"), sizeof(eh->ether_shost));
37 memcpy(eh->ether_dhost, ether_aton("00:02:00:04:00:04"), sizeof(eh->ether_dhost));
38
39 inet_pton(AF_INET, "1.2.3.4", &sip);
40 inet_pton(AF_INET, "2.3.4.5", &tip);
41
42 ah->ar_hrd = ARPHRD_ETHER;
43 ah->ar_pro = htons(ETHERTYPE_IP);
44 ah->ar_hln = 6;
45 ah->ar_pln = sizeof(struct in_addr);
46 ah->ar_op = htons(ARPOP_REQUEST);
47 bcopy(eh->ether_shost, ar_sha(ah), ah->ar_hln);
48 bcopy(&sip, ar_spa(ah), ah->ar_pln);
49 bcopy(&tip, ar_tpa(ah), ah->ar_pln);
50
51 len = sizeof(*eh) + sizeof(*ah) + 2 * sizeof(struct in_addr) + 2 * 6;
52
53 if ((ret = write(fd, buf, len)) != len)
54 warn("write() error");
55}
56
57int
58main(int ac, char **av)
59{
60 int fd;
61 struct ifreq ifr;
62 u_int val;
63
64 if (ac < 2)
65 errx(EX_USAGE, "usage: %s iface [hdrcmplt]", av[0]);
66
67 if ((fd = open(BPF_PATH, O_RDWR)) == -1)
68 err(EX_OSFILE, "failed to open %s\n", BPF_PATH);
69
70 memset(&ifr, 0, sizeof(ifr));
71 strlcpy(ifr.ifr_name, av[1], sizeof(ifr.ifr_name));
72 if (ioctl(fd, BIOCSETIF, &ifr) != 0)
73 err(EX_DATAERR, "failed to bind to interface %s", av[1]);
74
75 if (ac > 2) {
76 val = 1;
77 if (ioctl(fd, BIOCSHDRCMPLT, &val) != 0)
78 err(EX_OSERR, "failed to set BIOCSHDRCMPLT");
79 }
80
81 send_arp(fd);
82
83 return (0);
84}

Flowtable:
Build kernel with 'FLOWTABLE' option, ensured it was turned on, tried to establish TCP sessions for IPv4 local/remote destination. Worked.

INFINIBAND: I did no checks except checking compilation.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 1618
Build 1624: arc lint + arc unit

Event Timeline

melifaro retitled this revision from to Add link header precomputation for ethernet/infiniband. Make arp/ndp/bpf/flowtable use it..
melifaro updated this object.
melifaro edited the test plan for this revision. (Show Details)
melifaro edited the test plan for this revision. (Show Details)

Hi,

Please wait until infiniband is tested.

Thank you for your patch!

--HPS

Of course. Testing is really required here, thanks for doing that :)

Hans, any progress on testing infiniband? :)

Hi,

Mellanox is currently working on an ib_core update and need some more time to test/integrate your patches into our patchset, else they might get lost. Do you have a deadline for when you want to push this change?

--HPS

There are other things depending on that change. However, I don't really need IB changes at this stage - converting any specific networking technology is purely optional. Since I proposed sort of generic API that should suit for all cases, it would be strange to convert only ethernet - that's why I needed some other part.
So, if code updating is supposed to take more than several weeks, I will just split this review into ethernet part and IB part.

Hi,

If you can keep the infiniband changes in a separate differential patch, which can be integrated after the coming infiniband changes, that would be great.

--HPS

kmacy added subscribers: rwatson, gnn.

First off - I like the approach this takes. Passing around a pre-computed header is much cleaner than passing an llentry around, obviating the corresponding reference counting and locking.

That said, I apologize in advance if I'm overlooking something fundamental, but this doesn't address or fundamentally conflict with the incpb route caching. In this implementation we still need to go through the overhead of finding the rtentry to get the ifp for every packet and we still need to call lla_lookup (locking the ifp) on every packet. I'm sure eventually that will be cleaned up and we need to pass the rtentry around any more. As that evolves the inpcb route caching can be updated correspondingly.

As things stand the route caching patch will need to be changed to store prepend instead of an lle reference.

I suggest we push to get this patch in and then update the pcb caching patch(es). Nonetheless, the routing patch is smaller and updating it after the fact would hardly interfere with your work. Do you have some other major set of changes ready for _immediate_ import that would explain the strength of your past and present opposition?

Thanks.

Thanks for reviewing that one.

There are no fundamental conflicts with inpcb patch in D4364 - it is a very clean and non-invasive way of doing caching.
I'm okay with your suggestion on merging both D4102 and D4364.

Regarding "opposition": the discussion in last November stopped on doing tests and I wouldn't say I was strictly against this change. My position was that after updated locking situation can be improved to reach the same levels of performance (or be very close to) as in cached version. Unfortunately, we didn't manage to measure the numbers in both approaches. (Also, I made some promises on timescale I didn't keep and I'm sorry for that).

In the current discussion in D4306 I was against that specific implementation (ip_output+arpresolve() parts) which is different from your version.
And, again, there was no performance tests.

To make performance part a bit more specific, I've done some quick measurements in fastforwarding code w/ simple patch (attached) on current -head to measure current situation.
fastforwarding code is not exactly the same case as TCP (combined in/out w/ run to completion, no tso, no TCP per-session locks) but can give pretty good estimate on different approaches. So, here is the graph:

lock_fwd (600×800 px, 65 KB)

"rwlock" line is the current -head without rte locking (comment define in route.c and if_ether.c)
"rmlocked" line is what projects/routing targets (rmlock for radix and lltable) (define _USE_RM in route.c and if_ether.c)
"unlocked" line is radix/lltable w/o lock (define _USE_UNLOCKED in route.c and if_ether.c)
"cached" line is always passing single rte/lle from global "struct route" in ip_findroute() (use route -n get "addr" -proto1 to set rte/lle , route -n get 0.0.0.0/1 -proto1 to clear)
"linear" line is linear approximation of single-core result for "cached" option (the fastest).

Quick conclusion: cached approach is 7% better than head after projects/routing merge can be and this is significant difference.

Actual patch: F323935

Raw data:
Cores,rwlocked,rmlocked,unlocked,cached,linear
1,0.85,0.92,0.96,1.01,1.01
2,1.47,1.84,1.93,1.99,2.02
4,2.74,3.74,3.85,4.03,4.04
8,4.05,6.95,7.13,7.43,8.08
16,2.46,11.2,11.5,12.2,16.16

I like the direction of this patch. It will require me extra work to merge it into projects/ifnet, but it is worth it.

I've put a bunch of minor comments. Please address them and upload updated patch.

sys/net/ethernet.h
401

Why is this declaration needed? As far as I see it should be static to if_ethersubr.c.

sys/net/flowtable.c
703

Please, please, write this down in readable way.

if (lle->la_flags & LLE_IFADDR)
ro->ro_flags |= RT_L2_ME;

sys/net/if_ethersubr.c
314

"inline" here is superfluous, compiler knows better what is optimal.

sys/net/if_llatbl.c
359

This function should either be void, or needs to be fixed. It always returns 0.

sys/net/route.c
1242

Why in route.c?? Should sit static in net/if.c.

sys/net/route.h
396

Please provide more documentation for the structure usage. E.g.:

/*

  • Structure passed to interface if_requestencap method by
  • IP protocols (?) layer. *
  • Input fields: ...
  • Output fields: ...
  • ...
sys/netinet/if_ether.c
1089

Why do we still call ifp->if_output() in case if error is EAFNOSUPPORT?

1285

I believe __noinline here is superfluous. You reference the function by pointer in EVENTHANDLER_REGISTER(), compiler can not inline it.

melifaro edited edge metadata.

Update diff to exclude infiniband part and address comments.

melifaro added inline comments.
sys/net/ethernet.h
401

Initially I thought about some subsystems not using ether_ifattach() but using the same encapsulation type.
It looks like there are no real consumers for that now, so I made it static as you suggested.

sys/net/route.c
1242

Well, actually it should be in rt_nhops.c (nexthops handling routines from projects/routing) but this file does not exists in base yet. Moved to if.c.

sys/netinet/if_ether.c
1089

Well, ARP is extendable protocol and there really are some new proposals on extending like: https://tools.ietf.org/html/draft-kompella-larp-00

Idea was the following: if we're using some loadable module for new protocol/subsystem, we might lack encapsulation for that in base methods, but we can do runtime-based encapsulation in if_output() routine.

Probably this should be handled by further enhancing encap request code, so I removed that part.

I haven't reviewed this as closely as I should. However, if I read this correctly, this is a new caching mechanism that:

  • is not currently used for caching
  • has no cache validation or invalidation

Note that the header, if cached, needs to be invalidated if the L2 destination changes, or if the local MAC changes (yes, we have a mechanism for that). Also, if the L3 destination changes, any L2 cache would need to be invalidated. I don't see mechanisms for any of that.

As such, I don't think this patch can be approved.

Update patch to reflect recent netinet6/ lltable changes.

In D4102#95769, @mike-karels.net wrote:

I haven't reviewed this as closely as I should. However, if I read this correctly, this is a new caching mechanism that:

  • is not currently used for caching

Mike,
Thanks for the comments.
I probably should have written more detailed summary. Let me try to rephrase it.

First of all, I wouldn't say that it is "caching" mechanism. The goal is to get the ability to pass arbitrary prepend via packet output routine.
if_output() doesn't care if this prepend is cached version of something or not.
Right now this is not the case - each protocol has to be supported in if_output() and for AF_INET[6] it _does_ have to know something about lle caching. (And this change eliminates lle depend).

Since this change among other things alters some lltable internals (and if we consider lltable as a sort of cache), then yes, it definitely has some connection with caching,
but I'm not sure if this really matters.

As for lle changes, I prefer the term "precalculated" since the headers are calculated in control plane upon installation/change and not in data path.
These headers are also primary data storage for destination link-layer addresses (at least for now).

  • has no cache validation or invalidation

Not sure what do you mean. lle link-layer headers are pre-calculated prior installing new entry or on dst mac address change.
Invalidation will happen automatically when lle expires.

Note that the header, if cached, needs to be invalidated if the L2 destination changes, or if the local MAC changes (yes, we have a mechanism for that). Also, if the L3 destination changes, any L2 cache would need to be invalidated. I don't see mechanisms for any of that.

You're right, we have working local mac address change notification mechanism since D4004 && D4019 . Please take a look on arp_iflladdr() and nd6_iflladdr() in this diff.

The only consumer using ro_prepend right now (old ro_lle) for L3 is flowtable in AF_INET. There is a line clearing ro_prepend in ip_input, which basically retains the same behaviour for flowtable. Similar line could be added to ip6_output(), that's not a big deal.

As such, I don't think this patch can be approved.

Thanks! More comments :)

sys/net/ethernet.h
390

Now this declaration isn't needed, either.

sys/net/if_ethersubr.c
262

Please write this in a readable way, too :)

361

Is this change just improved spacing? If yes, can you please just commit it before the patch.

sys/net/if_llatbl.h
64

A question not very related to the changeset itself. Why do we keep spares in llentry? Is it exported to userland in any way? Do drivers care about it?

sys/net/route.c
1237

Looks like this file shouldn't be touched by changeset at all.

sys/net/route.h
431

Is it possible for this declaration and for ife_type enum to be moved to if_var.h? The structure is argument to interface method. So it is part of ifnet KPI, not routing KPI.

In fact it is very similar to configuration structures of projects/ifnet, which live in if.h. However, in head the if.h and if_var.h still are very entangled, so for you it'll be easier to put them into if_var.h, rather than in if.h.

P.S. Well, all structure members are simple types, so it can live in if.h even in head. It'll be a trouble to find a proper contextual place for it, however. So if_var.h or if.h - you decide.

sys/netinet/if_ether.c
1089

But now we still call ifp->if_output() after arp_fillheader() failed. Is that correct?

melifaro marked 2 inline comments as done.

Address glebius@ comments and sync to recent HEAD.

Finally fix ip_arpintr() reply error handling.

sys/net/if_ethersubr.c
361

Not exactly. Newly-added RT_L2_ME lle flag makes it possible to avoid src/dst mac comparison and simplify this condition.

sys/net/if_llatbl.h
64

The only reason is to explicitly indicate remaining free space from the first 64-byte on amd64.

sys/net/route.h
431

No problem with that. Let it be if_var.h for now.

sys/netinet/if_ether.c
1089

No, t's not. There is a very similar fragment in arprequest() which has return, so I thought I already fixed that one. Fixed now.

Update once again to clarify arp_fillheader() behavior.

sys/netinet/if_ether.c
1089

Sorry, I was answering wrong question. Twice.

Generally speaking, default encapsulation routine serves 2 purposes:

  1. it ensures we always have some sort of handler (not to bother w/ NULL check)
  2. it should provide compatibility for not (yet) converted parts of stack (IPOIB, FDDI, ARC, etc..).

EAFNOSUPPORT serves the following case:
one needs to send ARP packet via IB/FDDI network. Default encapsulation returns this particular error to indicate that encapsulation is done later, in "traditional way" e.g. interface output routine.

Given that, it is perfectly OK to proceed after EAFNOSUPPORT - output route will ignore all _prepend magic (since it is not converted) and attach ARP header itself.

glebius edited edge metadata.
glebius added inline comments.
sys/net/route.h
390

Spurious blank lines left.

This revision is now accepted and ready to land.Dec 30 2015, 6:48 PM
melifaro edited edge metadata.

Do pre-commit sync.

This revision now requires review to proceed.Dec 31 2015, 4:58 AM
This revision was automatically updated to reflect the committed changes.