Page MenuHomeFreeBSD

count bytes in vn_sendfile when pru_send returns EAGAIN
AbandonedPublic

Authored by jason_eggnet.com on Sep 26 2017, 6:54 AM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Jan 15, 6:37 AM
Unknown Object (File)
Oct 26 2024, 12:01 AM
Unknown Object (File)
Oct 26 2024, 12:00 AM
Unknown Object (File)
Oct 26 2024, 12:00 AM
Unknown Object (File)
Oct 3 2024, 6:17 AM
Unknown Object (File)
Oct 3 2024, 12:55 AM
Unknown Object (File)
Oct 2 2024, 5:34 PM
Unknown Object (File)
Oct 1 2024, 8:42 AM

Details

Reviewers
jtl
glebius
Group Reviewers
transport
Summary

There appears to be at least one code path where pru_send, in
particular tcp_usr_send, can return EAGAIN.

tcp_connect -> in_pcbbind

If that happens, we should make sure that sbytes reflects the
data pru_send added. There does not appear to be a way for
tcp_usr_send to return EAGAIN without having added the bytes.

This is generally consistent with the send*() and write*() family
of functions. In particular, kern_sendit() and dofilewrite() both
check for EAGAIN and clear it if any data was sent, to ensure
user space is notified of the data added to the socket send
buffer.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 11757
Build 12101: arc lint + arc unit

Event Timeline

I'm not sure what I think of this change. It seems like an API change again: now EAGAIN always means that pru_send added the bytes instead of freeing them, while other errors are either fatal or mean that the bytes were lost.

This might already be true, but it seems like it will be a bit of work to ensure this is the case.

I believe that you've identified a valid problem here (or, perhaps, more than one problem). Its just a case of finding the best solution...

Ok, let's dig in. pru_send, aka tcp_usr_send calls both tcp_connect in some cases, and tcp_output in most cases. Both calls are after sbappendstream() is called. tcp_connect has a pretty clear if rare path to EAGAIN via in_pcbbind and might be the source of the issues we are seeing.

The call chains up to pru_send:

send*->kern_sendit->sosend->sosend_generic->pru_send
write*->dofilewrite->soo_write->sosend->sosend_generic->pru_send
sendfile*->vn_sendfile->pru_send

sendfile's handling of pru_send hasn't changed since at least 10.3, for the record. Probably much farther back than that. So we don't really have to talk about the 12 sendfile vs before.

There are two main questions:

Is sendfile inconsistent with the send* and write* handling of pru_send now and in 10.3? My assertion is yes. Fixing this would improve correctness. I assert, that's my quick EAGAIN check.

The second question is, is there a problem in the send* and write* paths, and potentially also in a "fixed" vn_sendfile.

Let's deal with the first question first.

To answer that question, we basically have to describe how the send* and write* paths decide how many bytes were "sent" and if an error should be returned or not. Returning an error hides any bytes sent. If an error is returned when bytes are sent, then the connection is burned.

On progressing through the call chain, sosend_generic calls m_uiotombuf(). This function creates the mbuf chain to append to the socket (but doesn't do the append) as well as subtracting from uio->uio_resid. It is this subtraction that kern_sendit and dofilewrite use to determine how many bytes were sent, and whether to mask an error.

Let's look at the code. In both kern_sendit and dofilewrite, in the return path, there is this logic:

len = auio.uio_resid;
error = sosend(so, mp->msg_name, &auio, 0, control, flags, td);
if (error != 0) {
        if (auio.uio_resid != len && (error == ERESTART ||
            error == EINTR || error == EWOULDBLOCK))
                error = 0;

If m_uiotombuf() copied at least some of the data to mbufs, then ERESTART, EINTR, and EAGAIN/EWOULDBLOCK are masked, no matter where the errors came from. Which would include pru_send.

In the vn_sendfile case, the equivalent of "masking" is to make sure that sbytes is updated. We don't actually have to hide the error from userspace, unlike the write* and send* cases, because sendfile can describe bytes moved and an error at the same time.

I don't think pru_send can return EINTR or ERESTART, which just leaves EAGAIN.

This is the inspiration of this patch, which I assert is not an implied change to the pru_send API.

I think it is enough of a patch for sendfile to gain parity in behavior with the send* and write* paths.

What's left, then, if we applied the small patch. Where are we still vulnerable.

  1. In the send* and write* paths, if m_uiotombuf() alters uio->uio_resid, and EAGAIN, EINTR, or ERESTART is returned by m_uiotombuf() or before pru_send() calls sbappendstream().
  2. in the sendfile* path if pru_send returns EAGAIN without calling sbappendstream()

I don't believe either of those two things can happen.

If m_uiotombuf() copied at least some of the data to mbufs, then ERESTART, EINTR, and EAGAIN/EWOULDBLOCK are masked, no matter where the errors came from. Which would include pru_send.

In the vn_sendfile case, the equivalent of "masking" is to make sure that sbytes is updated. We don't actually have to hide the error from userspace, unlike the write* and send* cases, because sendfile can describe bytes moved and an error at the same time.

I don't think pru_send can return EINTR or ERESTART, which just leaves EAGAIN.

This is the inspiration of this patch, which I assert is not an implied change to the pru_send API.

OK, thanks! I agree this concept appears consistent with the current user-space send equivalent.

I'm not positive that pru_send is guaranteed to not return EINTR or ERESTART, so it might be worth considering them.

Also, in this particular context (TCP behavior), note that the EAGAIN about which you are concerned is probably not even possible, since it only occurs in the implicit connect case, which sendfile doesn't support.

sys/kern/kern_sendfile.c
870

Why 'goto done' instead of completing the write?

In fact, shouldn't you actually overwrite EAGAIN here with 0? You were able to add all bytes to the socket buffer, so this does not actually require an EAGAIN for sendfile's syscall. As far as sendfile is concerned this is a successful send.

In D12501#259391, @jtl wrote:

OK, thanks! I agree this concept appears consistent with the current user-space send equivalent.

I'm not positive that pru_send is guaranteed to not return EINTR or ERESTART, so it might be worth considering them.

Also, in this particular context (TCP behavior), note that the EAGAIN about which you are concerned is probably not even possible, since it only occurs in the implicit connect case, which sendfile doesn't support.

Yeah, it wouldn't be hard to additionally handle EINTR and ERESTART, I'll do that.

And, yes, if nam is null, then pru_send won't do an implicit connect. I missed that, ugh.

sys/kern/kern_sendfile.c
870

Yeah, I think we can do better, new patch coming.

  • detect when EINTR and EAGAIN can be squelched in sendfile

I think a reasonable question here is, why the gate on rem <= 0 for goto noerr;

The answer is, we don't know why pru_send would return EINTR, EAGAIN, or ERESTART. If we're going to ignore the error condition, we should make sure we won't be looping. Let user space do the looping based on socket events.

That's an interesting finding! However, I don't like that we are blindly following write() or send() logic. Copy-pasting the ERESTART is definitely superfluous. I'm pretty sure that protocols shall never return that. More complex question is that why are we sure that EAGAIN confirms that all bytes sent were actually sent? Could be bug not in sendfile, but in write :) Let me dig more on that.

How is that possible that tcp_usr_send leads to tcp_connect and in_pcbbind? Shouldn't we get immediate ENOTCONN if we try to sendfile on a not connected socket?

We're running a modified patch of this in production where a counter is incremented, and we aren't seeing a counter increment.

Since the risk seems to be confined to the tcp_connect() call within tcp_usr_send, and vn_sendfile can never hit that because the destination address isn't supplied... I'm slowly coming to the realization that this patch is probably completely pointless, protecting against something that can't happen.

If there is an issue it's going to be in the sendto() path with a supplied destination address with a tcp socket.

Implicit connects for TCP seem like a bad idea in general.

Abandoning this due to insufficient proof of a problem.