Page MenuHomeFreeBSD

Ensure strict error handling of tcp_usr_send
AbandonedPublic

Authored by jason_eggnet.com on Sep 25 2017, 4:06 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Apr 21, 9:29 PM
Unknown Object (File)
Jan 23 2024, 1:45 AM
Unknown Object (File)
Dec 31 2023, 9:31 PM
Unknown Object (File)
Dec 19 2023, 8:31 PM
Unknown Object (File)
Nov 26 2023, 3:59 PM
Unknown Object (File)
Oct 4 2023, 11:46 PM
Unknown Object (File)
Sep 29 2023, 12:27 AM
Unknown Object (File)
Sep 4 2023, 9:45 PM
Subscribers

Details

Reviewers
jtl
glebius
sbruno
rrs
bz
Group Reviewers
transport
Summary

Callers, critically sendfile, to tcp_usr_send expect it to either
successfully add the mbufs to the send buffer or to return an error.
Not both. This patch ensures that tcp_usr_send either adds data to
the socket buffer, or returns an error. Errors encountered after
data has been added to the socket are stored in so->so_error
which can be checked by socket user functions as appropriate.
Since sendfile can return both errors and bytes written simultaneously,
sendfile can check this on every call to tcp_usr_send. write checks
so_error only before adding data to the socket. Errors encountered
after adding data will have to be handled on the next call to write
or other i/o syscall.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 11736
Build 12080: arc lint + arc unit

Event Timeline

Can you explain how these are different? Both functionally set error to so->so_error, and then set so->so_error to 0. If (so) is properly locked, it looks like these should be equivalent. If (so) is not properly locked, then the new formulation is not guaranteed to overcome this deficiency.

But, I could be missing something obvious... (It is one of "those days" today. :-) )

In D12492#258950, @jtl wrote:

Can you explain how these are different? Both functionally set error to so->so_error, and then set so->so_error to 0. If (so) is properly locked, it looks like these should be equivalent. If (so) is not properly locked, then the new formulation is not guaranteed to overcome this deficiency.

But, I could be missing something obvious... (It is one of "those days" today. :-) )

I think you are correct. I've reverted that change.

Higher up in this same function is the following code:

} else if (so->so_error) {
        error = so->so_error;
        so->so_error = 0;
        SOCKBUF_UNLOCK(&so->so_snd);
        goto done;

Which is also performed without acquiring a lock.

The problem is, a bunch of code in the kernel modifies so_error without acquiring SOCK_LOCK(so), probably because the declaration of struct socket basically says you don't need to acquire a lock because so_error is an int. Which doesn't make sense to me. Even if we acquired locks in vn_sendfile, it wouldn't help unless the rest of the readers and writers of vn_sendfile also acquired locks.

In that spirit, I'll make sure this diff is focused on the problem I'm trying to solve which is the consistency of tcp_usr_send, adding to buffers and returning errors.

Edit: due to problems I was having understanding how arc worked, there is a period of time where the only visible change in this diff was this minor so_error issue. But that was a patch on top of a much larger patch... which is now the only patch up for review here.

bz requested changes to this revision.Sep 25 2017, 9:46 PM
bz added a subscriber: bz.

Just asking for the whitespace changes ;)

sys/kern/kern_sendfile.c
873

Minor nit: comments start upper case and end with punctuation. Also above and below.

This revision now requires changes to proceed.Sep 25 2017, 9:46 PM
jtl requested changes to this revision.Sep 25 2017, 11:27 PM

I'm not sure I understand the problem this patch is trying to solve. Can you explain the problem in sufficient detail so we can understand how this patch addresses it? Thanks!

Specific notes in-line.

sys/netinet/tcp_usrreq.c
892

There is no need to initialize this. It is set before its use in any branch where it is used.

930

I believe this should have the same condition as above (m && (flags & PRUS_NOTREADY) == 0). @glebius can confirm. (This is relevant to another comment below.)

955

You need to free m in the error path. Given that this is now three times we need to do this, it might be worth combining the logic in one place. (See my comment above about the correct logic.)

1001

You need to free m. See above.

1026

I understand why this makes sense in the sendfile case: sendfile actually checks so->so_error. However, I'm not sure this change is obviously good for ever other caller. Absent other changes, other callers of pru_send might need to wait for a further syscall or do further probing to discover the error.

In fact, I think what you are proposing is correctly categorized as an API change for the pru_send function. That requires modifying every caller to check so->so_error after the pru_send call and take appropriate action. (It might also require modifying other pru_send functions to behave similarly.)

@jtl

The primary goal is to ensure that userspace and the kernel are in lock step as far as how many bytes were actually added to the socket, as long as sendfile returns no error or an error that is not fatal to the connection.

This means callers of pru_send aka tcp_usr_send for tcp need to somehow know if sbappendstream was called or not.

In the case of pru_send, sendfile assumes that as long as errno != 0, sbappendstream was called.

On further investigation, here is a rough call map:

send*->kern_sendit->sosend->sosend_generic->tcp_usr_send
write*->dofilewrite->soo_write->sosend->sosend_generic->tcp_usr_send
sendfile*->sendfile->vn_sendfile->tcp_usr_send

kern_sendit and dofilewrite have a special condition that might result in a more focused fix.

It's not a simple errno == 0 check in those functions. It's this:

if (auio.uio_resid != len && (error == ERESTART ||
    error == EINTR || error == EWOULDBLOCK))
        error = 0;

looking at tcp_usr_send, I do see a case where EAGAIN / EWOULDBLOCK could be returned. Specifically tcp_usr_send -> tcp_connect -> in_pcbbind. If that happened, vn_sendfile would misreport sbytes, user space would receive EAGAIN, and continue, thinking sbytes is correct... resulting in corruption. With, as far as I can tell, no counters incremented or other evidence it occurred.

Maybe a simple patch in vn_sendfile to handle receiving EAGAIN from tcp_usr_send / pru_send is enough? I don't think it's possible for pru_send to return ERESTART or EINTR, are you aware of such a possibility?

I've created another diff for the simpler approach. If that looks better, we can abandon the more extensive changes.

https://reviews.freebsd.org/D12501

The primary goal is to ensure that userspace and the kernel are in lock step as far as how many bytes were actually added to the socket, as long as sendfile returns no error or an error that is not fatal to the connection.

As I wrote elsewhere this morning, I finally understand the problem: User space calls send() to send 100 bytes. tcp_usr_send() adds it to the send buffer. tcp_output() produces a non-fatal error which is passed on to the user. The user then retries sending the 100 bytes.

I think the key distinction here is fatal vs. non-fatal errors.

We want to return fatal errors to the user right away. (And, I am concerned that this patch does not do that.) As I understand it, the current tcp_output() in upstream code only returns errors that appear fatal. (However, I understand this will be harder to manage across the various tcp_output() functions used for different TCP stacks.) If we believe the error is fatal, we also should actually go ahead and tear down the connection at this point, too.

On the other hand, we probably want to completely suppress non-fatal errors. tcp_output() already updates tp->t_softerror when it encounters some non-fatal errors. We could consider extending that here, too, although the main reason to do that would be to ensure consistency between the various TCP stacks.

sys/netinet/tcp_usrreq.c
930

A control shall never be M_NOTREADY

Based on evaluating this more, feedback in https://reviews.freebsd.org/D12501, and some production analysis, I think the only potential exposure here is tcp_usr_send adding bytes to the socket with sbappendstream followed by tcp_connect returning EAGAIN.

But, that can't happen with vn_sendfile because a destination address is never supplied. And if it happened with sendto , kern_sendit would clean it up, here:

len = auio.uio_resid;
error = sosend(so, mp->msg_name, &auio, 0, control, flags, td);
if (error != 0) {
        if (auio.uio_resid != len && (error == ERESTART ||
            error == EINTR || error == EWOULDBLOCK))
                error = 0;

Because we do add data to the socket before calling tcp_connect.

In short... I think in both this and the other review, I didn't find any issues that would be exposed in practice... unless there is some kind of "soft" error that isn't one of ERESTART, EINTR, or EAGAIN/EWOULDBLOCK. But that doesn't appear to be the case.

This might have just been an exercise in validating the current behavior. I will say, the parts of code that decide how many bytes were added to the socket buffer, actually adding them to the socket buffer, and telling userspace about those additions, are pretty far from each other, requiring a rather comprehensive view of what's going on to (maybe) understand. But it does appear that I didn't find any issues in the kernel. Which is a good thing :)

Abandoning this due to insufficient proof of a problem.