Page MenuHomeFreeBSD

unix: new implementation of unix/stream & unix/seqpacket
ClosedPublic

Authored by glebius on Feb 29 2024, 7:03 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jan 6, 7:04 AM
Unknown Object (File)
Wed, Dec 25, 1:23 PM
Unknown Object (File)
Mon, Dec 16, 4:47 PM
Unknown Object (File)
Mon, Dec 9, 11:08 AM
Unknown Object (File)
Dec 3 2024, 5:39 PM
Unknown Object (File)
Nov 30 2024, 10:51 PM
Unknown Object (File)
Nov 30 2024, 12:46 AM
Unknown Object (File)
Nov 28 2024, 6:34 AM
Subscribers

Details

Summary

Provide protocol specific pr_sosend and pr_soreceive for PF_UNIX
SOCK_STREAM sockets and implement SOCK_SEQPACKET sockets as an extension
of SOCK_STREAM. The change meets three goals: get rid of unix(4) specific
stuff in the generic socket code, provide a faster and robust unix/stream
sockets and bring unix/seqpacket much closer to specification. Highlights
follow:

  • The send buffer now is truly bypassed. Previously it was always empty,

but the send(2) still needed to acquire its lock and do a variety of
tricks to be woken up in the right time while sleeping on it. Now the
only two things we care about in the send buffer is the I/O sx(9) lock
that serializes operations and value of so_snd.sb_hiwat, which we can read
without obtaining a lock. The sleep of a send(2) happens on the mutex of
the receive buffer of the peer. A bulk send/recv of data with large
socket buffers will make both syscalls just bounce between owning the
receive buffer lock and copyin(9)/copyout(9), no other locks would be
involved.

  • The implementation uses new mchain structure to manipulate mbuf chains.

Note that this required converting to mchain two functions that are shared
with unix/dgram: unp_internalize() and unp_addsockcred() as well as adding
a new shared one uipc_process_kernel_mbuf(). This induces some non-
functional changes in the unix/dgram code as well. There is a space for
improvement here, as right now it is a mix of mchain and manually managed
mbuf chains.

  • unix/seqpacket previously marked as PR_ADDR & PR_ATOMIC and thus treated

as a datagram socket by the generic socket code, now becomes a true stream
socket with record markers.

  • unix/stream loses the sendfile(2) support. This can be brought back,

but requires some work. Let's first see if there is any interest in this
feature, except purely academical.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Note that the review has several dependencies that add new mbuf(9) primitives. Feel free to add yourself as reviewers to them, too.

I have very lame early benchmark data. Just running a simple 2-threaded program in a virtual machine, while the host is practically idle (read: running a ton of desktop apps):

  • ping-pong program, where both thread send a message only after reading a message from peer - 13% speedup
  • bulk send/receive, where one thread will send at maximum speed, while other just receives:
    • 8k buffers - 37% speedup
    • 16k buffers - 47% speedup
    • 128k buffers - 41% speedup

I subscribed Olivier as reviewer, hoping he will be interested to get some more solid data, then a run in bhyve.

Aside from the performance gains, the SOCK_SEQPACKET actually starts to work as it should. See D43775. It will pass on this implementation and miserably fail on the old one.

And after this isolation of unix(4) code to this file, lots of cleanups open up in the generic socket code.

Lame benchmarks used:
{F78217920}
{F78217919}

This revision is now accepted and ready to land.Feb 29 2024, 9:45 AM

On 2 identical servers (Intel Xeon CPU E5-2697A v4 @ 2.60GHz), one running the unpatched and the other this patched review, here are the results of this script (extracting the "real" in seconds from the time command):

#!/bin/sh
set -eu
lenght=100000000000
sizes="8000 32000 64000 128000 256000 512000 1000000"
cmd=""
for s in $sizes; do
        echo "Benching with size $s (lenght: $lenght)..."
        for i in $(seq 3); do
                time -p -o bulk.$s.$i.txt ./unix_bulk $lenght $s $s
        done
        grep real bulk.$s.*.txt | cut -d ' ' -f 2 > bulk.$s.data
        cmd="$cmd bulk.$s.data"
done
ministat -w 74 -s $cmd

Then comparing the results, which give with default 8k sockbuf size:

$ ministat -s -w 74 unpatched.socket/bulk.8000.data patched.socket/bulk.8000.data
x unpatched (real time in seconds)
+ patched (real time in seconds)
+--------------------------------------------------------------------------+
| +                                                                       x|
|++                                                                       x|
|                                                                         A|
||A                                                                        |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3        112.04        112.36        112.22     112.20667    0.16041613
+   3         28.64         29.81          29.3         29.25    0.58660038
Difference at 95.0% confidence
        -82.9567 +/- 0.97468
        -73.932% +/- 0.840008%
        (Student's t, pooled s = 0.430019)

Wow: - 73% time spend !

And with 128k sokbuf size:

x unpatched (real time in seconds)
+ patched (real time in seconds)
+--------------------------------------------------------------------------+
| +                                                                       x|
|++                                                                      xx|
|                                                                        |A|
||A                                                                        |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3         30.88         31.07         31.06     31.003333    0.10692677
+   3         15.27         15.51         15.38     15.386667    0.12013881
Difference at 95.0% confidence
        -15.6167 +/- 0.257768
        -50.3709% +/- 0.67895%
        (Student's t, pooled s = 0.113725)

-50% here.

And with 1M sockbuf size:

x unpatched (real time in seconds)
+ patched (real time in seconds)
+--------------------------------------------------------------------------+
|+++                                                                   xx x|
|                                                                      |A_||
||A|                                                                       |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3         26.88          27.4            27     27.093333    0.27227437
+   3         13.69         13.99         13.87         13.85    0.15099669
Difference at 95.0% confidence
        -13.2433 +/- 0.498994
        -48.8804% +/- 1.21482%
        (Student's t, pooled s = 0.220151)

> -48%

So those results on hardware are equivalent to Gleb’s results on a VM.

This revision now requires review to proceed.Mar 13 2024, 11:14 PM

This diff is hard to read. :( For instance, why not split the changes to unp_internalize() and unp_addsockcred() into different patches?

Do you have a git branch with the changes available?

sys/kern/uipc_usrreq.c
145
146
866

Is this logically related to the rest of the change?

910

IMO, "reset" is more precise than "process" and so would be a better name.

1019–1020

Why not handle the c != NULL case unconditionally? I understand that m != NULL implies c == NULL today, but there's no obvious reason to assume that. If there's an unobvious reason, there should be a comment.

1026
1093
1133
1163
1232

This can be STAILQ_FOREACH_FROM as well.

1349

I don't understand this - how can you set STAILQ_FIRST of a non-empty queue to control without also updating control?

1369

Can this mostly self-contained operation be moved into a separate subroutine? There is too much indentation here.

2468

We don't need the blank line anymore.

3191

It would be better to add MCHAIN_FOREACH(_SAFE) or so instead of open-coding like this.

This diff is hard to read. :( For instance, why not split the changes to unp_internalize() and unp_addsockcred() into different patches?

I understand and sympathize :) Unfortunately we can't split unp_internalize() and unp_addsockcred() out, cause these KPIs are also used by the old code. That would require coding down an intermediate step where uipc_send() and uipc_rcvd() would work on mchains. Not a trivial code, that is destined to be ditched on next step. Worse than that - unp_internalize() prototype is dictated by dom_internalize typedef. And the latter can be removed only after this revision lands. The snake bites its tail.

The unp_internalize(), unp_addsockcred() and uipc_process_kernel_mbuf() MUST be a no-op. The whole change MUST be a no-op for unix/dgram.

Do you have a git branch with the changes available?

Yes! Looking at the new function as whole is much more easy, than going with phabricator. The code is shared here: https://github.com/glebius/FreeBSD/tree/sockbuf-squashed Rebased on already updated mchain reviews. You may question word squashed in the name, asking for a non-squashed version. But this is not what you want :) The non-squashed version is an implementation of uipc_so*_stream_or_seqpacket() without use of mchains, which ended in very long and hairy code, and then several steps to add mchains - a step to add mc_split, a step to add mc_uiotomc(), etc.

glebius marked 10 inline comments as done.

Review from Mark.

sys/kern/uipc_usrreq.c
866

It is. Previously this was done by the generic socket code when a socket switches from a true socket with buffers to a listening. This is first protocol that is PR_SOCKBUF and can listen(). See also 7ee47c3bb7a6d85744e2747ec969161eed5bbaab.

1019–1020

To me it seems more logical and easier to read to separate the user write and kernel write into two separated blocks. I did the same with unix/dgram.

1026

Nope :( MSG_EOR and M_EOR are two different things.

1349

Although we drop the lock and the take it back, we are the only thread that may drain the queue. Exclusion is shutdown(2), but it is covered above. That means that the very last mbuf in the chain we were planning for copyout, but now we are gluing back due to failure, still has m_stailq pointing to the very first mbuf in the queue.

P.S. Got a panic after all fixups and rebases, so definitely broke something. Need to analyze. Will update.

glebius marked an inline comment as not done.Mar 18 2024, 8:08 PM
glebius added inline comments.
sys/kern/uipc_usrreq.c
1232

This can be STAILQ_FOREACH_FROM as well.

Unfortunately no. Entering with first == NULL the existing cycle will not be executed, but STAILQ_FOREACH_FROM() will start from the beginning.

sys/kern/uipc_usrreq.c
974

Call it uipc_stream_sbspace(), for consistency with uipc_dgram_sbspace()?

1049

This might unlock unp. Do we need to check so->so_error != 0 again?

The locking protocol for so_error is not too clear to me. Some code uses the PCB lock to serialize updates, some uses the socket lock.

1055

Don't we want to hold the unp lock while reading the state?

1121

There are more instances of this below.

1122

Hmm, if mc is empty, then after copying this way the tail pointer of mcnext will be incorrect. Does that matter? It seems like a subtle source of bugs.

1160

So why do we add the virtual send buffer size?

1162

The access of sb_hiwat is unlocked. If this is intentional, it should be atomic_load_int(&so->so_snd.sb_hiwat).

1223

Don't we want to check this with the PCB or socket lock held?

1234
1310

Here we are removing a chain of mbufs from the socket buffer. But, isn't the last mbuf in the chain still pointing to the first mbuf still in the buffer? I cannot see where the m_next field of the last mbuf is set to NULL.

Below there is an error path where we m_freem(control), but then it looks like it might free the mbufs still lingering in the socket buffer.

glebius marked 6 inline comments as done and an inline comment as not done.Apr 1 2024, 8:43 PM
glebius added inline comments.
sys/kern/uipc_usrreq.c
1049

Thanks, I will rewrite these two checks to check so_error and so_state after acquiring the second unp lock.

The locking protocol for so_error remains the same as before the change - not very consistent, but working, IMHO. What happens here is that so_error is set either in other syscall on the same socket or in unp_drop(). The latter is protected by unp lock. The race between two syscalls is practically impossible due to socket I/O lock.

1122

This is how any other code operating on STAILQ or TAILQ headers work. It is incorrect to dereference STAILQ_LAST on a queue that is not known to be not empty.

1160

Because now we know that there is space in the receive buffer and our write is guaranteed to commit at least space bytes to receive buffer. Then most likely we will be put to sleep. And we will be sleeping with so->so_snd.sb_hiwat bytes in our hands, which effectively is the send buffer, although it isn't linked to anywhere. Note that very first "fast path" mc_uiotomc() at the beginning of the function is done only with send buffer space size, cause we don't yet know about receive buffer.

1162

I don't agree. If we got two threads chaotically doing two syscalls: setsockopt(SO_SNDBUF) and send() they can race anyway, either in the userland or when locking the socket.

If we are talking about a normal application that sets buffer and only then does sends, it should see the updated value in the send(). Because setsockopt() updated hiwat with socket buffer lock, and then send() acquired the same mutex at least once before it came to this place.

P.S. unix/dgram isn't special here. This applies to all other protocols.

1223

Using unp lock in this function would definitely affect performance. The check is basically against a socket that was never connected at all. So only an application that chaotically does 3 syscalls: socket(2), connect(2), recv(2) would ever see a problem here. A normal application that would do a recv() only after previous connect() returned, wouldn't observe any problem here. In the later lifetime of a socket this check is always false. Even if we do not see the clearing of SS_ISCONNECTED and raising of SS_ISDISCONNECTED by unp_soisdisconnected(), we would still behave correct here. So the race with unp_soisdisconnected() is harmless, and end result is the same - we see SBS_CANTRCVMORE few lines later and return 0.

For reference: the soreceive_generic() does this check under socket receive buffer lock, however it does it after SBS_CANTRCVMORE check. End result for application is the same - ENOTCONN only on a never connected socket. Also, generic code doesn't protect these bits of so_state with socket receive buffer lock. We can do that easily in unp_soisdisconnected(), but we don't really need to.

The SS_ISCONNECTED doesn't need locking as it is 0 on a new born socket and is set on connect(). The SS_ISDISCONNECTED is set in unp_soisdisconnected(). If we miss update of the flag here than we will see SBS_CANTRCVMORE later. So nothing critical is going to happen except we return 0 instead of ENOTCONN. Potentially I can move this check down into the cycle and check with receive buffer lock. This is actually what soreceive_generic() does! But, soreceive_generic() checks that after SBS_CANTRCVMORE check, so it will also return 0 instead of ENOTCONN in case of a race.

1310

Thanks, that's a good find! As I was working on a fix, I realized that the code that covers race with shutdown(2) and does m_freem(control) was added here after I did an incorrect 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3 and before I reverted it. Now this race can't happen and the error handling code disappears.

glebius marked 2 inline comments as done.
  • Rename uipc_sbspace() -> uipc_stream_sbspace()
  • send: check so_error and so_state after unp_pcb_lock_peer()
  • recv: the race with shutdown can't happen
sys/kern/uipc_usrreq.c
927

I'm a bit worried that this is too expensive. Did you try the patch on a desktop system?

It might be worth adding a sysctl to disable this checking.

1160

Could I ask you to explain this in a comment? IMO the one that is there is too terse.

1162

My suggestion is just to acknowledge the existence of races here by using atomic_load_int(). It is not required for correctness, but it IMO makes the code easier to understand and also helps annotate "harmless" races for concurrency sanitizers.

1223

I think some summary of this explanation should be added as a comment, and the check should be written:

if (__predict_false((atomic_load_short(&so->so_state) &
	    (SS_ISCONNECTED|SS_ISDISCONNECTED)) == 0))

It's too difficult for readers to understand why these kinds of races are safe, so IMHO we should try harder to annotate them in new code.

1223–1228
1374

Oh, that's weird. Isn't this also simply incorrect, given that the sizes of an internalized message and its externalized form are not always the same?

1391
1394

You could write this loop like this to make it simpler:

error = 0;
for (m = first; m != last; m = next) {
    next = STAILQ_NEXT(m, m_stailq);
    if (__predict_true(error == 0))
        error = uiomove(mtod(m, char *), m->m_len, uio);
    if (!peek)
        m_free(m);
}
if (__predict_false(error != 0)) {
    SOCK_IO_RECV_UNLOCK(so);
    return (error);
}

This way, you hold on to the lock in the error case, but I think that's ok.

1404

Can this be written last->m_data - M_START(last) instead?

glebius added inline comments.
sys/kern/uipc_usrreq.c
927

Right now typing on a system running with the patch and INVARIANTS. But that's Threadripper PRO. But I agree. I will just move this under SOCKBUF_DEBUG, cause it does exactly same function.

1160

There is a wordy comment on that above the declaration of socket buffer space sysctls. I can refer from here to up there.

1162

I can't agree with that. Adding the atomic is not just a documentation, it affects performance. It also doesn't look like a correct way to fix the race if the race wasn't harmless. It may help to shut up sanitizers, but it will also mislead a new human reader of the code. The existence of atomic would make them think that a) there is a race here that must be addressed b) it is addressed in the right way. I'd call that anti-documentation.

1223

I will provide a comment.

1374

Absolutely! Let's fix this after.

1394

I did that, looked at and returned back. May I leave as is? IMHO, a human reader prefers to see in the code the streamlined non-error fast path with stub error branches. A cycle that covers both requires certain mind wrap. I guess compiler would make it his way in either case.

glebius marked 3 inline comments as done.
  • Hide uipc_stream_sbcheck() under SOCKBUF_DEBUG
  • More comments on virtual send buffer space
  • More comments on lockless access to socket fields
  • Use M_START()

I don't have any more comments on the diff. I didn't find any bugs while fuzzing the patch.

IMO, unix.4 should have some explanation of the new semantics. We should also document the loss of PF_UNIX+sendfile() support in RELNOTES.

sys/kern/uipc_usrreq.c
1162

atomic_load_int() is just a volatile load from the supplied address, if it has any impact on performance at all it will be extremely minimal.

I think your take is exactly backwards. Without the use of atomic_load or a volatile qualifier, a human reader has to think and decide whether the race is harmless or not; they cannot see whether the author understood that there is a race at all. In many parts of the tree, we use atomic_load_* to 1) demonstrate that the author acknowledges the existence of a race, 2) to help automated tools like sanitizers. The point is not to "fix" the race, but rather to document its existence in a consistent way. Again, we do this quite a lot in some parts of the tree.

A comment explaining why the race is good is even better, but we should do both.

1224
This revision is now accepted and ready to land.Apr 5 2024, 3:43 PM
sys/kern/uipc_usrreq.c
1162

I has always assumed that any atomic(9) adds a lock prefix and I was wrong. I tried compiling with atomic and difference was minimal, before:

42eb: 03 90 ac 02 00 00             addl    0x2ac(%rax), %edx

after:

42eb: 8b 80 ac 02 00 00             movl    0x2ac(%rax), %eax
42f1: 01 c2                         addl    %eax, %edx

I will add atomic_load_int() here and atomic_load_short() where we check (SS_ISCONNECTED|SS_ISDISCONNECTED).