[WIP] netlink: add basic netlink support
Needs Review, Public

Authored by melifaro on Sun, Jul 31, 10:49 AM.

Details

Reviewers
None
Group Reviewers
network
Summary

What is netlink?

Netlink is a communication protocol currently used in the Linux kernel to modify, read, and subscribe to nearly all networking state. Interface state, addresses, routes, firewall, rules, FIBs, etc. are controlled via netlink.
It is an asynchronous, TLV-based protocol that provides 1-1 and 1-many communication.
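
To make "TLV-based" concrete, here is a minimal userland sketch of the wire layout as Linux defines it: a 16-byte message header followed by attributes, each with a 4-byte length/type header and padding to a 4-byte boundary. The helper names (`pack_attr`, `pack_msg`, `parse_attrs`) are made up for illustration and are not part of this revision.

```python
import struct

# nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
# nlattr: nla_len, nla_type
NLA_HDR = struct.Struct("=HH")


def align4(n):
    """Round n up to the next multiple of 4 (netlink alignment)."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Encode one attribute; nla_len covers header + payload, not padding."""
    length = NLA_HDR.size + len(payload)
    padding = b"\0" * (align4(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + padding


def pack_msg(msg_type, flags, seq, body):
    """Prepend a netlink header to an already encoded body (pid left as 0)."""
    return NLMSG_HDR.pack(NLMSG_HDR.size + len(body), msg_type, flags, seq, 0) + body


def parse_attrs(buf):
    """Walk a buffer of attributes and return (type, payload) pairs."""
    attrs, off = [], 0
    while off + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, off)
        if length < NLA_HDR.size or off + length > len(buf):
            break  # truncated or malformed attribute, stop parsing
        attrs.append((attr_type, buf[off + NLA_HDR.size:off + length]))
        off += align4(length)
    return attrs
```

Because every attribute carries its own length and type, a receiver can skip attribute types it does not know, which is what makes the format extensible.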

Why is netlink important for FreeBSD?

POSIX defines an API for base functions and system calls. There is no such standard for the plethora of protocol-level, device-level, and subsystem-level ioctls. Each subsystem or driver invents its own protocol, message format handling, and compatibility scheme.
Netlink changes that by providing a standard communication layer and basic extensible message formatting. It can serve as a "broker", automatically combining requested data from different sources in a single request (example: interface state dump).

For example, devd can easily be switched to netlink, retiring its one-off protocol. Tools like jail and pfilctl can be converted to use netlink instead of a bunch of private ioctls. It will be easier for app developers to interact with our network stack.

Immediate drivers for netlink

Nexthop and nexthop-group-related changes in the routing stack opened the way for more effective and feature-rich route-related interaction between userland and the kernel. Extending our existing protocol, rtsock(4), is not easy: to provide efficient multipath signalling, one needs to introduce a new type of message, other than RTM_ADD/RTM_DEL, for signalling route changes. I did a test implementation with extended rtsock and net/bird. De facto, it ended up being its own TLV-based protocol, sharing nothing with rtsock except the socket and the base message header. Pushing out a new protocol that is not even shared by other BSDs doesn't look promising.
Instead, netlink was chosen as the transport.

Implementation overview

The initial implementation was written during GSoC 2021, based on Luigi's work from 2015. While it delivered some working code, it lacked support for group communication, sending large dumps, rtsock specifics, and vnets. As a result, most of the code has been rewritten.

Netlink is implemented as a loadable/unloadable kernel module that touches few other kernel parts.
To support asynchronous handling of operations such as interface creation, a dedicated taskqueue is created for each netlink socket. All message processing is handled within these taskqueues.
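
As a rough mental model of the per-socket queue (a userland analogy only, not the kernel taskqueue(9) API and not the code in this revision), each socket can be pictured as owning a FIFO plus one worker that drains it, so the sender never blocks on a slow operation. `SocketTaskQueue` and its methods are hypothetical names.

```python
import queue
import threading


class SocketTaskQueue:
    """Toy model: one FIFO and one worker thread per 'socket'."""

    def __init__(self, handler):
        self._handler = handler
        self._queue = queue.Queue()
        self.replies = []  # results in processing order
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, message):
        """Queue a message and return immediately; the worker handles it later."""
        self._queue.put(message)

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:  # sentinel posted by close()
                break
            self.replies.append(self._handler(message))

    def close(self):
        """Drain the remaining messages, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

A single worker per socket keeps messages from one socket ordered while letting different sockets make progress independently.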

Handling messages to/from Linux processes requires modifying them (address family and rtableid rewrites), so there is a transparent intercept layer that allows messages to be rewritten, up to full message reconstruction.
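
The address family rewrite is needed because the numeric values differ between the two systems: AF_INET6 is 10 on Linux and 28 on FreeBSD. The sketch below shows only the simplest case, patching the family byte that leads the payload of rtnetlink messages. `rewrite_family` is a hypothetical helper; the real intercept layer also has to handle rtableid and attribute contents.

```python
import struct

# nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")

# AF_INET6 is 10 on Linux and 28 on FreeBSD; AF_INET is 2 on both.
LINUX_TO_FREEBSD_AF = {10: 28}
FREEBSD_TO_LINUX_AF = {28: 10}


def rewrite_family(message, family_map):
    """Return a copy of message with the leading family byte translated.

    Assumes the payload starts with a one-byte address family, as in
    rtmsg, ifaddrmsg and ifinfomsg.
    """
    if len(message) <= NLMSG_HDR.size:
        return message  # no payload, nothing to rewrite
    family = message[NLMSG_HDR.size]
    new_family = family_map.get(family, family)
    return (message[:NLMSG_HDR.size] + bytes([new_family])
            + message[NLMSG_HDR.size + 1:])
```

Requests from a Linux process would go through the Linux-to-FreeBSD map on the way in, and replies through the reverse map on the way out.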

What works

  • dumping interface state and interface addresses
  • interface / interface address notifications
  • add/del/get/dump routes
  • gate between netlink and rtsock (currently only rtsock->netlink)
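
For reference, a route dump such as `ip r show` boils down to one RTM_GETROUTE request with the dump flag set. The sketch below only builds the request bytes using the Linux constant values (assumed here to carry over unchanged); it does not open a socket, and `build_route_dump_request` is an illustrative name.

```python
import struct

# nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
# rtmsg: family, dst_len, src_len, tos, table, protocol, scope, type, flags
RTMSG = struct.Struct("=BBBBBBBBI")

RTM_GETROUTE = 26
NLM_F_REQUEST = 0x001
NLM_F_DUMP = 0x300  # NLM_F_ROOT | NLM_F_MATCH
AF_UNSPEC = 0


def build_route_dump_request(seq, family=AF_UNSPEC):
    """Build an RTM_GETROUTE dump request for the given address family."""
    body = RTMSG.pack(family, 0, 0, 0, 0, 0, 0, 0, 0)
    header = NLMSG_HDR.pack(NLMSG_HDR.size + len(body), RTM_GETROUTE,
                            NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + body
```

The kernel answers such a request with a sequence of route messages sharing the same sequence number, terminated by a "done" message.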

Next steps

  • To add nhop/nhg support
  • To add interface change notifications
  • To add rtsock -> netlink notifications

Questions

General feasibility

Does it look reasonable to include this functionality (after further polishing/testing/etc.) in FreeBSD (and GENERIC)?

Test Plan

Commands:

11:02 [1] m@devel2 ip -V
ip utility, iproute2-5.18.0
11:02 [1] m@devel2 s kldload netlink
11:02 [1] m@devel2 ip -6 r sh
prohibit ::/96 nhid 6 via ::1 dev lo0 proto static
::/0 nhid 7 via 2a01:4f8:13a:70c:ffff::1 dev vtnet0 proto static
::1 nhid 1 dev lo0 proto static
prohibit ::ffff:0.0.0.0/96 nhid 6 via ::1 dev lo0 proto static
2a01:4f8:13a:70c:ffff::/96 nhid 5 dev vtnet0 proto kernel
2a01:4f8:13a:70c:ffff::8 nhid 4 dev lo0 proto static
prohibit fe80::/10 nhid 6 via ::1 dev lo0 proto static
fe80::/64 nhid 5 dev vtnet0 proto kernel
fe80::5054:ff:fe14:e319 nhid 4 dev lo0 proto static
fe80::/64 nhid 3 dev lo0 proto kernel
fe80::1 nhid 2 dev lo0 proto static
prohibit ff02::/16 nhid 6 via ::1 dev lo0 proto static
11:02 [1] m@devel2 ip r show
0.0.0.0/0 nhid 4 via 10.0.0.1 dev vtnet0 proto static
10.0.0.0/24 nhid 2 dev vtnet0 proto kernel
10.0.0.8 nhid 3 dev lo0 proto static
127.0.0.1 nhid 1 dev lo0 proto kernel
11:02 [1] m@devel2 s ip r add 11.0.0.0/24 via 10.0.0.9
11:03 [1] m@devel2 ip r show
0.0.0.0/0 nhid 4 via 10.0.0.1 dev vtnet0 proto static
10.0.0.0/24 nhid 2 dev vtnet0 proto kernel
10.0.0.8 nhid 3 dev lo0 proto static
11.0.0.0/24 nhid 5 via 10.0.0.9 dev vtnet0 proto kernel
127.0.0.1 nhid 1 dev lo0 proto kernel
11:03 [1] m@devel2 s ip r del 11.0.0.0/24
11:03 [1] m@devel2 ip r show
0.0.0.0/0 nhid 4 via 10.0.0.1 dev vtnet0 proto static
10.0.0.0/24 nhid 2 dev vtnet0 proto kernel
10.0.0.8 nhid 3 dev lo0 proto static
127.0.0.1 nhid 1 dev lo0 proto kernel
11:03 [1] m@devel2
11:03 [1] m@devel2
11:03 [1] m@devel2 ip a sh
1: vtnet0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:14:e3:19
    inet 10.0.0.8 peer 10.0.0.255/24 scope global dynamic
    inet6 fe80::5054:ff:fe14:e319/64 scope link dynamic
    inet6 2a01:4f8:13a:70c:ffff::8/96 scope global dynamic
2: lo0: <NO-CARRIER,LOOPBACK,MULTICAST,UP> mtu 16384 qdisc noqueue state UNKNOWN qlen 1000
    link/ieee1394 08
    inet6 ::1/128 scope host dynamic
    inet6 fe80::1/64 scope link dynamic
    inet 127.0.0.1/8 scope global dynamic
11:03 [1] m@devel2 s kldunload netlink
11:03 [1] m@devel2

Events:

-> ifconfig vtnet0.2 create
3: vtnet0.2: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 08

-> ifconfig vtnet0.2 inet 10.11.0.1/24
3: vtnet0.2    inet 10.11.0.1 peer 10.11.0.255/24 scope global dynamic vtnet0.2
3: vtnet0.2    inet6 fe80::5054:ff:fe14:e319/64 scope link dynamic vtnet0.2

-> ifconfig vtnet0.2 inet6 2a02:6b8::35/64
3: vtnet0.2    inet6 2a02:6b8::35/64 scope global dynamic vtnet0.2

-> ifconfig vtnet0.2 inet6 2a02:6b8::35 delete
Deleted 3: vtnet0.2    inet6 2a02:6b8::35/64 scope global dynamic vtnet0.2

-> ifconfig vtnet0.2 -alias 10.11.0.1
Deleted 3: vtnet0.2    inet 10.11.0.1 peer 10.11.0.255/24 scope global dynamic vtnet0.2

-> ifconfig vtnet0.2 destroy
Deleted 3: vtnet0.2    inet6 fe80::5054:ff:fe14:e319/64 scope link dynamic vtnet0.2
Deleted 3: vtnet0.2: <BROADCAST,SLAVE,DYNAMIC,200000> mtu 1500 state UNKNOWN
    link/[135] 52:54:00:14:e3:19
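
Notifications like the ones above are delivered to sockets subscribed to multicast groups. Assuming the Linux rtnetlink group numbers apply here too, the legacy bind-time bitmask for link and address events can be computed as in this sketch (`group_mask` is an illustrative helper, not part of any API):

```python
# Linux rtnetlink multicast group numbers (assumed unchanged on FreeBSD).
RTNLGRP_LINK = 1
RTNLGRP_IPV4_IFADDR = 5
RTNLGRP_IPV6_IFADDR = 9


def group_mask(*groups):
    """Convert group numbers into the legacy nl_groups bitmask."""
    mask = 0
    for group in groups:
        mask |= 1 << (group - 1)  # group N maps to bit N-1
    return mask
```

A listener interested in interface and address changes would pass `group_mask(RTNLGRP_LINK, RTNLGRP_IPV4_IFADDR, RTNLGRP_IPV6_IFADDR)` as the group mask when binding its socket.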

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Warnings
Severity  Location  Code  Message
Warning  sys/netlink/netlink.h:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_ctl.h:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:200  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:204  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:277  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:279  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:370  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:387  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:396  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_domain.c:424  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_iface.c:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_io.c:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_linux.c:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_module.c:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_route.c:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_route.h:4  SPELL1  Possible Spelling Mistake
Warning  sys/netlink/netlink_var.h:4  SPELL1  Possible Spelling Mistake
Unit
No Unit Test Coverage
Build Status
Buildable 46911
Build 43800: arc lint + arc unit

Event Timeline

melifaro retitled this revision from netlink: add basic netlink support to [WIP] netlink: add basic netlink support. Sun, Jul 31, 11:05 AM
melifaro edited the summary of this revision.
melifaro edited the test plan for this revision.
melifaro added a reviewer: network.
melifaro edited the summary of this revision.

Sync to latest HEAD.

sys/modules/netlink/Makefile
5

First of all, thanks for doing this. NL is a much-needed piece of Linux in the Linuxulator, since glibc moved some of the if_XXX code to NL.
Second, would it be better to move the Linux specific code to the Linuxulator? And commit it separately?

sys/modules/netlink/Makefile
5

LinuxKPI has no trouble if it is native code for as long as the KPI stays the same I would hope; it'll also have to interface with it from other (driver) parts and LinuxKPI functions.

I assume Linuxolator needs the KBI parts of it?

So it seems we kind-of need it all -- Linuxolator (KBI) or LinuxKPI (KPI) and we also want it native to do the same as Linux and use it for things instead of overloading more routing sockets (and probably DPDK people want it that way too)?

sys/modules/netlink/Makefile
5

Second, would it be better to move the Linux specific code to the Linuxulator? And commit it separately?

I have mixed feelings around it. As far as I understand, we have the code here to (a) colocate all Linux-related business logic there, (b) avoid affecting native code and (c) have it loadable on demand thus not having it in kernel by default.

Speaking of netlink, there is (1) some Linux-specific "core" code, which provides in/out hooks allowing transparent message rewrites, and (2) the actual message converters, which mostly rewrite families and attributes. The latter reside in netlink_linux.c.
I'm fine with moving netlink_linux.c to compat and committing it separately if that's the preference.
Still, I'd prefer to have this code present in the netlink module by default, as (a) dealing with module load dependencies and function pointers would bring more complexity than needed and (b) at this moment I see the amount of code there being relatively small compared to the current (and upcoming) code in the netlink module.

Thoughts?

5

Re KPI: it will change in the short term (e.g. support for dynamically adding new commands is not here yet), but should be relatively stable afterwards. The only caveat is that the kernel KPI will be different from the Linux kernel KPI.

sys/modules/netlink/Makefile
5

If the KPI, structs, etc. differ from Linux, then I fear it's going to be a big name clash for the LinuxKPI bits we need, and that means double work and converting everything pointlessly.

That said I see we still have "nla_put", "struct nlattr", .. just with different arguments. I.e. nla_put obviously does not take an skbuff, but why is it called nla_put then as in Linux? It's been a long time but I cannot remember that being part of the RFC? So if we make the KPI differ I kindly ask to make it differ and not just kind-of-the-same-yet-different. I do understand that was work done before your time on this but I hope we can sort those things (even if it is quite a bit of search-replace work) to avoid these conflicts then.

sys/modules/netlink/Makefile
5

I'm intentionally not looking into Linux kernel bits.

Re name clash: that's a very valid concern. There was quite a clash in the routing headers that required some renaming. I don't want to make life harder for the LinuxKPI and am happy to change the naming to avoid clashes. The kernel part is not part of the RFC, so nothing prevents naming our functions whatever we prefer. struct nlattr is part of the public user-visible headers, so it should stay the same. I will rename nla_put() to something like nla_append(). Any other potential naming clashes that should be resolved?

sys/modules/netlink/Makefile
5

Maybe avoid the nla_ prefix for non-public parts to distinguish between RFC and "us". "netl_" or something maybe? I don't know about rtnl*() yet. We have one or of those too.

melifaro edited the test plan for this revision.
  • Add support for interface/ifaddr events
  • Add event group hook to convert messages to Linux format
  • Rename nla_put*() set of functions to nlattr_add
  • Rename nlmsg_put() to nlmsg_add() for consistency
  • Move all public KPI to netlink_ctl.h
  • Sync to latest HEAD
sys/modules/netlink/Makefile
5

nla was a short form of "netlink attribute", so nla_put*() was a pretty convenient name prefix, as these functions are the most commonly used part of the KPI.
I renamed all these to nlattr_add*(). I also moved all public KPI parts to netlink_ctl.h to simplify things for consumers.

Hope it is better now :-)

  • Fix panics when handling events in vnet w/o netlink sockets
  • Switch to the new nexthop KPI
  • Sync to recent HEAD