Only bring down clone interfaces at shutdown
Needs Review · Public

Authored by cy on Dec 3 2020, 4:09 PM.

Details

Reviewers

jhb
emaste
markj
tuexen
jtl
rrs

Summary

r366857 resolved two PRs (158734, 109980) and a local issue (here) where WOL failed to enable because lagg0 wasn't destroyed before a physical interface was destroyed. However, running netif at shutdown introduced new problems: those reported by jhb@ and emaste@, and PR 251351 (dhclient running on a VLAN printing error messages).

Test Plan

This has been tested here.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 35167

Event Timeline

cy created this revision. Dec 3 2020, 4:09 PM

Herald added a subscriber: imp. Dec 3 2020, 4:09 PM

cy requested review of this revision. Dec 3 2020, 4:09 PM

Harbormaster completed remote builds in B35167: Diff 80265. Dec 3 2020, 4:09 PM

cy mentioned this in D27464: Fix hung TCP sessions on shutdown. Dec 4 2020, 2:50 PM

cy added reviewers: tuexen, jtl. Dec 4 2020, 3:17 PM

cy added a reviewer: rrs.

This does not fix the regression I am experiencing in my test setup. I am testing with a machine which uses a LAGG interface to communicate with the outside world. Shutting this interface down still makes my SSH sessions hang.

My earlier diagnosis was that this is caused by the following sequence:

1. init gets the signal to shut down.
2. init runs the shutdown scripts. The sshd shutdown script kills off the daemon, but not the forked processes that handle individual sessions. After rS366857, we shut down the network here. (And this change still shuts down the network as part of the shutdown scripts.)
3. Only after all shutdown scripts have run does init kill off the remaining processes. This is where the child sshd processes now get killed off. However, because the network is gone, the system has no way to close its TCP, SCTP, etc. sessions.

If my analysis is correct, further work is needed to allow shutting down interfaces without this regression. One path is what I proposed in D27464. Another is to shut down the interfaces after killing off the other processes (however, that seems like a relatively major change). A third is to have the kernel shut down any TCP/SCTP/etc. sessions when an interface going down would make the remote side of the session unreachable (however, that also seems like a relatively major change). There are probably other alternatives as well.
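The ordering problem in the sequence above can be illustrated with a toy model. This is only a sketch, not init or rc code: the function name `simulate_shutdown`, the session names, and the `CLOSED`/`HUNG` labels are all made up for illustration. It assumes that a session is closed cleanly only if the network is still up at the moment its process is killed.

```python
def simulate_shutdown(network_down_in_scripts):
    """Toy model of the shutdown ordering (hypothetical, not real init code).

    Returns each session's final state as the remote peer would see it.
    """
    network_up = True
    # Per-session sshd children survive the sshd rc script stopping the daemon.
    sessions = {"sshd-child-1": "ESTABLISHED", "sshd-child-2": "ESTABLISHED"}

    # Steps 1 and 2: init runs the shutdown scripts. If one of them takes
    # the network down, that happens before any child is killed.
    if network_down_in_scripts:
        network_up = False

    # Step 3: init kills the remaining processes. The peer only learns that
    # a session is closed if the network can still carry the teardown.
    for name in sessions:
        sessions[name] = "CLOSED" if network_up else "HUNG"

    return sessions
```

In this model, `simulate_shutdown(True)` leaves every session `HUNG`, while `simulate_shutdown(False)` leaves every session `CLOSED`, which matches the observation that SSH sessions hang once the network is taken down by the shutdown scripts.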

In D27459#613956, @jtl wrote:

This does not fix the regression I am experiencing in my test setup. I am testing with a machine which uses a LAGG interface to communicate with the outside world. Shutting this interface down still makes my SSH sessions hang.

My earlier diagnosis was that this is caused by the following sequence:

1. init gets the signal to shut down.
2. init runs the shutdown scripts. The sshd shutdown script kills off the daemon, but not the forked processes that handle individual sessions. After rS366857, we shut down the network here. (And this change still shuts down the network as part of the shutdown scripts.)
3. Only after all shutdown scripts have run does init kill off the remaining processes. This is where the child sshd processes now get killed off. However, because the network is gone, the system has no way to close its TCP, SCTP, etc. sessions.

If my analysis is correct, further work is needed to allow shutting down interfaces without this regression. One path is what I proposed in D27464. Another is to shut down the interfaces after killing off the other processes (however, that seems like a relatively major change). A third is to have the kernel shut down any TCP/SCTP/etc. sessions when an interface going down would make the remote side of the session unreachable (however, that also seems like a relatively major change). There are probably other alternatives as well.

I think the third option would probably be a bad idea. As it is today, you can do things like re-run dhclient and, if you get the same IP, existing connections stay open. This matters when drivers do dumb things like resetting for MTU or other changes, which results in a link down/up cycle that restarts dhclient. (I think the Intel drivers have done this in the past, if not still today.)

I wonder if there is a better way to handle the original PRs. For example, perhaps lagg and vlan interfaces should forward WOL requests down to all of their child interfaces. That will matter for the case where you want to use WOL to wake from suspend or low-power idle without shutting down the host.
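The forwarding idea can be sketched with a small model. This is not lagg(4) or vlan(4) code, and it makes no claim about how the kernel would implement it: the classes `PhysicalPort` and `Lagg`, the `forward_wol` switch, and the `set_wol` method are hypothetical names used only to show the intended behaviour, assuming a WOL request made on the aggregate is passed to every member port.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhysicalPort:
    """Hypothetical stand-in for a physical NIC that supports WOL."""
    name: str
    wol_enabled: bool = False


@dataclass
class Lagg:
    """Hypothetical stand-in for a cloned aggregate interface."""
    name: str
    ports: List[PhysicalPort] = field(default_factory=list)
    forward_wol: bool = False  # assumed switch, for comparison only

    def set_wol(self, enabled: bool) -> None:
        """Handle a WOL request made on the aggregate."""
        if not self.forward_wol:
            # The request stops at the aggregate and never reaches the ports.
            return
        # The request is passed down to every member port.
        for port in self.ports:
            port.wol_enabled = enabled
```

With `forward_wol=True`, calling `set_wol(True)` on the aggregate enables WOL on every member port; with the default, the ports are left untouched. Because nothing in this model depends on the order in which interfaces are destroyed, it also covers the suspend and low-power idle case mentioned above.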

Yes, your analysis is correct.

Alternatively, should we choose to revert it entirely, we could either:

Have the kernel tear down cloned interfaces before it tears down physical interfaces, or

Have cloned interfaces assume attributes such as WOL. Without this, the WOL flag set on any slave (physical) interface will be ignored.

IMO it's simpler to shut down sessions in the script than to have the kernel close sockets, to reorder the teardown of interfaces so that WOL is recognized, or to support WOL within lagg.

The initial problem is caused by lagg masking WOL.

In D27459#614027, @jhb wrote:

I think the third option would probably be a bad idea. As it is today, you can do things like re-run dhclient and, if you get the same IP, existing connections stay open. This matters when drivers do dumb things like resetting for MTU or other changes, which results in a link down/up cycle that restarts dhclient. (I think the Intel drivers have done this in the past, if not still today.)

I wonder if there is a better way to handle the original PRs. For example, perhaps lagg and vlan interfaces should forward WOL requests down to all of their child interfaces. That will matter for the case where you want to use WOL to wake from suspend or low-power idle without shutting down the host.

That's probably preferred. Should we abandon this, revert r366857, and teach lagg(4) and vlan(4) about WOL instead?

In D27459#614035, @cy wrote:

In D27459#614027, @jhb wrote:

I wonder if there is a better way to handle the original PRs. For example, perhaps lagg and vlan interfaces should forward WOL requests down to all of their child interfaces. That will matter for the case where you want to use WOL to wake from suspend or low-power idle without shutting down the host.

That's probably preferred. Should we abandon this, revert r366857, and teach lagg(4) and vlan(4) about WOL instead?