Page MenuHomeFreeBSD

hastd: fix fd passing over socketpair
Needs ReviewPublic

Authored by xtronom_gmail.com on Tue, Jun 9, 4:23 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jun 18, 12:38 AM
Unknown Object (File)
Wed, Jun 17, 10:51 PM
Unknown Object (File)
Wed, Jun 17, 6:16 AM
Unknown Object (File)
Wed, Jun 17, 5:04 AM
Unknown Object (File)
Tue, Jun 16, 11:04 PM
Unknown Object (File)
Sun, Jun 14, 2:57 PM
Unknown Object (File)
Sat, Jun 13, 11:19 AM
Unknown Object (File)
Tue, Jun 9, 6:25 PM

Details

Reviewers
glebius
pjd
Summary

The connection migration path transfers a protocol name and a connected
socket descriptor over a local socketpair.

The protocol name is currently transmitted using send(2), while the
descriptor is transmitted using a separate sendmsg(2) call carrying
SCM_RIGHTS ancillary data. The receiver expects both pieces of information
to arrive together and uses a receive buffer sized for the maximum
protocol name length.

On FreeBSD 15 this can cause the receive side to block indefinitely,
preventing the migrated connection from being delivered to the worker
process. As a result, the primary node never sends the HAST protocol
header and the secondary node times out waiting for it.

Transmit the protocol name and descriptor together using a single
sendmsg(2) operation and receive them using a single recvmsg(2)
operation.

Test Plan

On two FreeBSD 15 systems configured as a HAST pair:

  1. Start hastd on both nodes.
  2. Set the resource role to secondary on one node.
  3. Set the resource role to primary on the other node.
  4. Verify that /dev/hast/<resource> is created on the primary.
  5. Verify that geom gate list and ggatel list no longer block on primary.
  6. Verify that the secondary no longer logs "Unable to receive header".
  7. Verify that hastctl status reports the resource connected and synchronized.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 73768
Build 70651: arc lint + arc unit

Event Timeline

I really can't understand how this worked before FreeBSD 15. If one side sends less than 128 bytes, but other side says recv(s, buf, 128, MSG_WAITALL), then the other side shall hang always. I don't remember that my changes to unix(4) in FreeBSD 15 had fixed ignored MSG_WAITALL. Was that some other data sent after that woke up the receiver?

Anyway, thanks a lot for diagnosing the problem. The fix looks correct, but not beautiful. I would refactor out those proto_common_*. IMHO, the proto_socketpair and proto_uds should just use the proto_descriptor_send() and proto_descriptor_recv() that are capable to send data and descriptor or just or just data. These two functions can be renamed to proto_unix_send and proto_unix_recv and the proto_common left for TCP.

Let's hear what Pawel thinks.

I really can't understand how this worked before FreeBSD 15. If one side sends less than 128 bytes, but other side says recv(s, buf, 128, MSG_WAITALL), then the other side shall hang always. I don't remember that my changes to unix(4) in FreeBSD 15 had fixed ignored MSG_WAITALL. Was that some other data sent after that woke up the receiver?

I wrote a small socketpair test matching the HAST sequence:

send("tcp\0", 4)
sendmsg(SCM_RIGHTS only, no iov)
recv(..., 127, MSG_WAITALL)
recvmsg(...)

On FreeBSD 13 the second sendmsg() wakes the MSG_WAITALL recv, which returns short with the 4-byte protocol name, and the following recvmsg() receives the fd.

parent: send('tcp\0') = 4 errno=0
parent: sendmsg(fd-only) = 0 errno=0
child: recv(MSG_WAITALL, 127) = 4 errno=0 buf='tcp'
child: recvmsg(fd-only) = 0 errno=0 flags=0x0
child: received fd=3

On FreeBSD 15 the MSG_WAITALL recv remains blocked.

parent: send('tcp\0') = 4 errno=0
parent: sendmsg(fd-only) = 0 errno=0

One could argue that either behavior is correct. FreeBSD 13 wakes the waiting recv(2) and returns a short read when the ancillary data arrives, while FreeBSD 15 continues waiting (no signal, recv() not interrupted) for the requested byte count.

However, the parent/worker communication channel is a SOCK_STREAM socketpair. The descriptor transfer is performed using sendmsg(2) with SCM_RIGHTS and no payload data.

The proposed change avoids depending on this interaction by transmitting the protocol name and descriptor as a single logical message.

Anyway, thanks a lot for diagnosing the problem. The fix looks correct, but not beautiful. I would refactor out those proto_common_*. IMHO, the proto_socketpair and proto_uds should just use the proto_descriptor_send() and proto_descriptor_recv() that are capable to send data and descriptor or just or just data. These two functions can be renamed to proto_unix_send and proto_unix_recv and the proto_common left for TCP.

The proto layer supports three transports and there is no clean way to use transport-specific helpers directly once the abstraction layer is in use. The proto_common_send()/proto_common_recv() helpers also carry several overloaded semantics:

Transport      Used for              Exact I/O   NULL direction       NULL shutdown   FD passing
tcp            replication           yes         not supported        unused          prohibited
uds            hastctl control       yes         not supported        unused          not used
socketpair     parent/worker IPC     yes         yes (after fork)     not possible    yes

The failing path is the socketpair parent/worker connection migration path. A cleaner refactor would likely separate exact-size stream I/O from UNIX descriptor-passing I/O and reduce the overloaded semantics in the common helpers.

However, doing so would require broader changes to the transport abstraction and its implementations. This revision keeps the change limited to the failing descriptor-passing path and avoids modifying the rest of the protocol layer.

Let's hear what Pawel thinks.

On FreeBSD 13 the second sendmsg() wakes the MSG_WAITALL recv, which returns short with the 4-byte protocol name, and the following recvmsg() receives the fd.

Hmm, looks like new unix(4) had fixed a bug. The bug was kinda documented in recv(2):

The MSG_WAITALL flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned.

The page doesn't explain what is data type, but we could guess that control message is referred to.

However SUS doesn't say that:

MSG_WAITALL On SOCK_STREAM sockets this requests that the function block until the full amount of data can be returned. The function may return the smaller amount of data if the socket is a message-based socket, if a signal is caught, if the connection is terminated, if MSG_PEEK was specified, or if an error is pending for the socket.

I don't think I should restore the old behavior, better stick to the behavior specified by the standard.

The proto layer supports three transports and there is no clean way to use transport-specific helpers directly once the abstraction layer is in use. The proto_common_send()/proto_common_recv() helpers also carry several overloaded semantics:
The failing path is the socketpair parent/worker connection migration path. A cleaner refactor would likely separate exact-size stream I/O from UNIX descriptor-passing I/O and reduce the overloaded semantics in the common helpers.
However, doing so would require broader changes to the transport abstraction and its implementations. This revision keeps the change limited to the failing descriptor-passing path and avoids modifying the rest of the protocol layer.

That's why I propose the remove this abstraction as it creates more trouble and obfuscation then benefit, IMHO. What's the point of abstraction if certain protocol calls are immediately short-circuited to a different function and the main sending (or receiving) loop is not entered at all?

Ok, let's hear from Pawel.

That's why I propose the remove this abstraction as it creates more trouble and obfuscation then benefit, IMHO. What's the point of abstraction if certain protocol calls are immediately short-circuited to a different function and the main sending (or receiving) loop is not entered at all?

Ok, let's hear from Pawel.

I agree with the direction. I'd like to hear Pawel's thoughts before reworking the abstraction, since he may have context on why it was structured this way originally.