netlink: Add sync path in user-kernel interface
Needs ReviewPublic

Authored by zishun.yi.dev_gmail.com on Wed, Jun 10, 8:36 AM.
Tags
None

Details

Reviewers
melifaro
obiwac
emaste
pouria
markj
Group Reviewers
network
Summary

My GSoC 2026 project is to implement udmabuf, which needs to pass
variable-length structures (flexible arrays) to the kernel. Neither FreeBSD's
native ioctl nor linuxkpi ioctl supports this well. Therefore, we plan to use
generic netlink, which natively supports flexible arrays.

The problem is that FreeBSD's netlink is asynchronous. It uses a taskqueue,
meaning the callback is not executed in the user's thread context. Because of
this, operations that depend on the user's process, such as translating an fd
to a file structure, cannot be implemented.

A similar issue was solved in Linux, see commit cd40b7d3983c
("[NET]: make netlink user -> kernel interface synchronious").

To solve this problem, this diff adds a sync path in nl_sosend and introduces
a new socket option NETLINK_SND_SYNC to turn it on.

The success of a sync send() does not depend on whether the receive buffer is full. For example, if the receive buffer has room for only 2 replies but the user submits 5 messages in a single send(), the send() still succeeds and all 5 messages are processed. Only 2 replies are queued; the remaining 3 are dropped, and the number of dropped replies is recorded. The next recv() call then returns ENOBUFS.
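
The accounting described above can be illustrated with a small user-space model. This is only a sketch, not code from the diff: `struct toy_sock`, `toy_send()` and `toy_recv()` are hypothetical names, the receive buffer is counted in whole replies rather than bytes, and delivering the queued replies after the ENOBUFS report is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy user-space model; none of these names come from the diff. */
struct toy_sock {
	size_t rcv_space;	/* replies the receive buffer can still hold */
	size_t rcv_queued;	/* replies waiting for recv() */
	size_t dropped;		/* replies dropped since the last report */
	size_t processed;	/* requests handled so far */
};

/* Sync send: every request is processed, even if its reply is dropped. */
static int
toy_send(struct toy_sock *s, size_t nreq)
{
	assert(s != NULL);
	for (size_t i = 0; i < nreq; i++) {
		s->processed++;
		if (s->rcv_space > 0) {
			s->rcv_space--;
			s->rcv_queued++;
		} else {
			s->dropped++;
		}
	}
	return (0);	/* success does not depend on receive space */
}

/* recv: report dropped replies first, then hand out queued ones. */
static int
toy_recv(struct toy_sock *s)
{
	assert(s != NULL);
	if (s->dropped > 0) {
		s->dropped = 0;
		return (ENOBUFS);
	}
	if (s->rcv_queued == 0)
		return (EAGAIN);
	s->rcv_queued--;
	s->rcv_space++;
	return (0);
}
```

With `rcv_space = 2` and a `toy_send()` of 5 requests, the model processes all 5, queues 2 replies, counts 3 drops, and the first `toy_recv()` returns ENOBUFS.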

Test Plan
root@ /u/t/s/netlink# kyua test netlink_socket
netlink_socket:membership  ->  passed  [0.002s]
netlink_socket:overflow  ->  passed  [1.089s]
netlink_socket:peek  ->  passed  [0.002s]
netlink_socket:sizes  ->  passed  [0.002s]
netlink_socket:sync  ->  passed  [1.060s]

Results file id is usr_tests_sys_netlink.20260619-102534-840477
Results saved to /root/.kyua/store/results.usr_tests_sys_netlink.20260619-102534-840477.db

5/5 passed (0 broken, 0 failed, 0 skipped)

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 73992
Build 70875: arc lint + arc unit

Event Timeline

fix potential deadlock

Deadlock:

  • When the recv buffer is full, the async path stops processing and waits for a signal to resume, but it does not trigger a wakeup.
  • The sync path sleeps indefinitely waiting for the async path to wake it up, so it cannot signal the async path.

Solution:

  • Add a timeout to the sync path sleep and return an ENOBUFS error.
  • Signal the async path before sleeping, so it can resume processing upon the user's next retry.

The feature needs some tests, see the existing ones in tests/sys/netlink.

sys/netlink/netlink_domain.c
617

Having a hard-coded timeout like this is weird. Is it really necessary?

625

ENOBUFS is kind of a weird error for this case though. Maybe EDEADLK. But you don't know for sure that the receive buffer is full.

sys/netlink/netlink_io.c
137

This should only be called if a thread is actually waiting, otherwise we're issuing a spurious wakeup() call for every single message.

Perhaps the sender can set SB_WAIT before going to sleep, and this code path can check for it.
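
A minimal user-space sketch of that suggestion, assuming a single flag stands in for SB_WAIT and a counter stands in for the wakeup() call. The names `struct toy_sockbuf`, `toy_sleep_prepare()` and `toy_msg_done()` are hypothetical; in the kernel both sides would presumably run under the socket buffer lock, which this model omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model; the flag stands in for SB_WAIT, the counter for wakeup(). */
struct toy_sockbuf {
	bool waiting;		/* set by a sender that is about to sleep */
	unsigned wakeups;	/* number of wakeups actually issued */
};

/* Sender side: mark the buffer before going to sleep. */
static void
toy_sleep_prepare(struct toy_sockbuf *sb)
{
	assert(sb != NULL);
	sb->waiting = true;
}

/* Message-processing side: wake only when a thread is waiting. */
static void
toy_msg_done(struct toy_sockbuf *sb)
{
	assert(sb != NULL);
	if (!sb->waiting)
		return;		/* nobody is asleep: skip the wakeup */
	sb->waiting = false;
	sb->wakeups++;		/* a real kernel would call wakeup() here */
}
```

Processing any number of messages while no sender is waiting issues no wakeups in this model; one is issued only after `toy_sleep_prepare()` has run.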

After investigating the behavior of netlink send when the buffer is full, I found:
FreeBSD async send path behavior:

  • Send buffer full: Deadlocks (this indicates the recv buffer is also full).
  • Recv buffer full: Send doesn't depend on it, so it will never block. It just keeps filling the send buffer.

Linux send path behavior:

  • Send buffer will never fill up because of sync.
  • Recv buffer full: Send doesn't depend on it, so it will never block, but it will fail on the next recv.

Therefore, I think the FreeBSD sync send path behavior should be:

  • Send buffer full: Deadlock.
  • Recv buffer full: Same as Linux.

Is this the expected behavior? If so, I will add a test for it.

Add the test mentioned in my previous comment, but without changing the kernel
code yet, so it currently fails as expected.

  • The test now passes
  • Add SB_WAIT
  • Remove timeout
  • Refactor nl_process_nbuf_sync. The difference from nl_process_nbuf is that the sync version removes nlmsg_ignore_limit and returns the actual error.
  • Make recv return ENOBUFS when it sees nl_dropped_messages.

Thank you for working on this!
I added a few comments on the MR itself, plus a few more general questions:

  1. What is the proposed processing mechanism in sync case? Can we clearly state the logic in the description?

For example, if the caller sent multiple netlink messages in a single send / write call, do we assume the caller has allocated enough RX buffer for all of the replies? Or, to put it differently, what should happen when we don't have enough space in the receive buffer (a) to hold all of the replies, (b) to send a netlink header with an error, and (c) when we have, say, processed and replied to 2 netlink messages out of 5?

  2. If the problem is the requirement around association with the process, it doesn't necessarily require sync - the other option would be, for example, creating kernel threads associated with the process in question. See https://reviews.freebsd.org/D39180 for more details. I'm not necessarily suggesting this direction, just pointing out that there may be more than one solution to the problem.
sys/netlink/netlink_domain.c
598

can we move logic inside this branch to a separate function?

606

Why do we still need taskqueue if we're operating in sync mode?

744

Could you explain why we need it? Also, are we doing a lock/unlock just to write a syslog message in a corner-case scenario?

  1. What is the proposed processing mechanism in sync case? Can we clearly state the logic in the description?

For example, if the caller sent multiple netlink messages in a single send / write call, do we assume the caller has allocated enough RX buffer for all of the replies? Or, to put it differently, what should happen when we don't have enough space in the receive buffer (a) to hold all of the replies, (b) to send a netlink header with an error, and (c) when we have, say, processed and replied to 2 netlink messages out of 5?

Linux is sync. There, this send() succeeds and processes all requests (5, for example), but the replies that exceed the receive buffer (3 replies) are dropped, and the next recv() returns ENOBUFS to indicate that some replies were dropped.
In other words, the success of send() does not depend on whether the receive buffer is full. I aligned with that behavior; the corresponding part is the nl_dropped_bytes handling in nl_soreceive().

  2. If the problem is the requirement around association with the process, it doesn't necessarily require sync - the other option would be, for example, creating kernel threads associated with the process in question. See https://reviews.freebsd.org/D39180 for more details. I'm not necessarily suggesting this direction, just pointing out that there may be more than one solution to the problem.

Yes, my requirement is to associate with the process. I see that your option drops the taskqueue and creates a kernel thread associated with the process, which I think can solve the problem too. So maybe both options are OK.
But sync might bring some additional benefits:

  • send() can return error, for example when nlmsghdr is invalid.
  • async will deadlock when the send buffer is full. sync doesn't have send buffer, so we don't have this problem.
sys/netlink/netlink_domain.c
606

This handles the case where previous requests were sent in async mode and are still pending in the send buffer. We must let the async path drain the buffer before executing the current sync request to prevent out-of-order execution.
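
The ordering concern can be sketched with a toy model. This is an illustration only: `struct toy_nl` and the `toy_*` functions are hypothetical, and the model drains the pending async requests inline for simplicity, whereas the comment above describes waiting for the async path to drain them.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXREQ 8

/* Toy model; none of these names come from the diff. */
struct toy_nl {
	int pending[TOY_MAXREQ];	/* async requests not yet processed */
	size_t npending;
	int order[2 * TOY_MAXREQ];	/* request ids in execution order */
	size_t nexec;
};

/* Record that a request has been executed. */
static void
toy_execute(struct toy_nl *nl, int id)
{
	assert(nl->nexec < 2 * TOY_MAXREQ);
	nl->order[nl->nexec++] = id;
}

/* Async send: only queue the request; a worker would run it later. */
static void
toy_send_async(struct toy_nl *nl, int id)
{
	assert(nl->npending < TOY_MAXREQ);
	nl->pending[nl->npending++] = id;
}

/* Sync send: older async requests run first, then this one. */
static void
toy_send_sync(struct toy_nl *nl, int id)
{
	for (size_t i = 0; i < nl->npending; i++)
		toy_execute(nl, nl->pending[i]);
	nl->npending = 0;
	toy_execute(nl, id);
}
```

If requests 1 and 2 are queued asynchronously and request 3 is then sent synchronously, the model executes them in the order 1, 2, 3. Skipping the drain step would run 3 ahead of the older requests.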

744

In sync send mode, we don't call nlmsg_ignore_limit(). So we can record how many replies we drop and return ENOBUFS in recv().
And sorry about the lock, I should acquire it inside the branch.

  • Move some logic in sync path to a separate function.
  • Move the lock acquisition inside the branch

Do you see a use case for setting sync mode mid-session? If not, I’d suggest doing the conversion in the setsockopt handler, tearing down the async thread and returning an error if some messages are pending processing.

  • async will deadlock when the send buffer is full. sync doesn't have send buffer, so we don't have this problem.

Could you share the async deadlock reproducer? It shouldn’t really happen.

Do you see a use case for setting sync mode mid-session? If not, I’d suggest doing the conversion in the setsockopt handler, tearing down the async thread and returning an error if some messages are pending processing.

I think you're right, it's weird to set sync mid-session. I will change it later.

  • async will deadlock when the send buffer is full. sync doesn't have send buffer, so we don't have this problem.

Could you share the async deadlock reproducer? It shouldn’t really happen.

Sorry, maybe "deadlock" is inappropriate terminology to use here.

	fd = fullsocket();

	/* Both buffers full: block. */
	timer_done = 0;
	ATF_REQUIRE(sigaction(SIGALRM, &sigact, NULL) == 0);
	ATF_REQUIRE(setitimer(ITIMER_REAL, &itv, NULL) == 0);
	ATF_REQUIRE(send(fd, &hdr, sizeof(hdr), 0) == -1);
	ATF_REQUIRE(errno == EINTR);
	ATF_REQUIRE(timer_done == 1);

When the send buffer is full, I think send will block indefinitely, waiting for space in the send buffer.
However, draining the send buffer depends on the recv buffer having space. The recv buffer is waiting for the user to call recv(), but the user is blocked in send().

Do the sync conversion in the setsockopt handler, tearing down the taskqueue and returning an error if some messages are pending processing.