Page MenuHomeFreeBSD

dhclient(8) issues unneeded ioctl(SIOCSIFMTU) on every lease renew
ClosedPublic

Authored by sobomax on Dec 14 2018, 1:13 AM.

Details

Summary

We've observed some weird behavior with ena(4) interface being reset periodically on our r4.xlarge EC2 instances. Turns out our DHCP server supplies non-default MTU (9001), which causes dhclient(8) to issue ioctl(SIOCSIFMTU) on every renew. This is further aggravated by the fact that most if not all network drivers don't bother checking the current value and most of them are quite complex beasts these days doing all kind of weird and wonderful stuff when MTU being changed, such as shutting down and starting up kernel threads etc.

This patch fixes this problem by only issuing MTU update on initial configuration or if new value received from the DHCP server differs from a previous one.

Dec 11 05:51:13 builder kernel: ena0: device is going DOWN
Dec 11 05:51:14 builder kernel: ena0: device is going UP
Dec 11 05:51:14 builder kernel: ena0: queue 0 - cpu 0
Dec 11 05:51:14 builder kernel: ena0: queue 1 - cpu 1
Dec 11 05:51:14 builder kernel: ena0: queue 2 - cpu 2
Dec 11 05:51:14 builder kernel: ena0: queue 3 - cpu 3
Dec 11 06:21:13 builder kernel: ena0: device is going DOWN
Dec 11 06:21:14 builder kernel: ena0: device is going UP
Dec 11 06:21:14 builder kernel: ena0: queue 0 - cpu 0
Dec 11 06:21:14 builder kernel: ena0: queue 1 - cpu 1
Dec 11 06:21:14 builder kernel: ena0: queue 2 - cpu 2
Dec 11 06:21:14 builder kernel: ena0: queue 3 - cpu 3
Dec 11 06:51:13 builder kernel: ena0: device is going DOWN
Dec 11 06:51:14 builder kernel: ena0: device is going UP
Dec 11 06:51:14 builder kernel: ena0: queue 0 - cpu 0
Dec 11 06:51:14 builder kernel: ena0: queue 1 - cpu 1
Dec 11 06:51:14 builder kernel: ena0: queue 2 - cpu 2
Dec 11 06:51:14 builder kernel: ena0: queue 3 - cpu 3
Dec 11 07:21:14 builder kernel: ena0: device is going DOWN
Dec 11 07:21:14 builder kernel: ena0: device is going UP
Dec 11 07:21:14 builder kernel: ena0: queue 0 - cpu 0
Dec 11 07:21:14 builder kernel: ena0: queue 1 - cpu 1
Dec 11 07:21:14 builder kernel: ena0: queue 2 - cpu 2
Dec 11 07:21:14 builder kernel: ena0: queue 3 - cpu 3
Dec 11 07:51:13 builder kernel: ena0: device is going DOWN
Dec 11 07:51:14 builder kernel: ena0: device is going UP
Dec 11 07:51:14 builder kernel: ena0: queue 0 - cpu 0
Dec 11 07:51:14 builder kernel: ena0: queue 1 - cpu 1
Dec 11 07:51:14 builder kernel: ena0: queue 2 - cpu 2
Dec 11 07:51:14 builder kernel: ena0: queue 3 - cpu 3

With some extra debug added:

Dec 13 09:26:16 builder kernel: Trying to mount root from zfs:vol-000bbd3ff1dd89617 []...
Dec 13 09:26:16 builder kernel: random: unblocking device.
Dec 13 09:26:16 builder kernel: ena_ioctl: SIOCSIFFLAGS
Dec 13 09:26:16 builder kernel: ena0: device is going UP
Dec 13 09:26:16 builder kernel: ena0: queue 0 - cpu 0
Dec 13 09:26:16 builder kernel: ena0: queue 1 - cpu 1
Dec 13 09:26:16 builder kernel: ena0: queue 2 - cpu 2
Dec 13 09:26:16 builder kernel: ena0: queue 3 - cpu 3
Dec 13 09:26:16 builder kernel: ena_ioctl: SIOCGIFMEDIA
Dec 13 09:26:16 builder kernel: ena_ioctl: SIOCSIFFLAGS
Dec 13 09:26:16 builder kernel: ena_ioctl: SIOCADDMULTI
Dec 13 09:26:16 builder kernel: ena_ioctl(SIOCSIFMTU): Changing MTU from 1500 to 9001
Dec 13 09:26:16 builder kernel: ena0: device is going DOWN
Dec 13 09:26:16 builder kernel: ena0: device is going UP
Dec 13 09:26:16 builder kernel: ena0: queue 0 - cpu 0
Dec 13 09:26:16 builder kernel: ena0: queue 1 - cpu 1
Dec 13 09:26:16 builder kernel: ena0: queue 2 - cpu 2
Dec 13 09:26:16 builder kernel: ena0: queue 3 - cpu 3
Dec 13 09:26:16 builder kernel: ena_ioctl: SIOCDELMULTI
Dec 13 09:26:16 builder kernel: ena_ioctl: SIOCADDMULTI
Dec 13 09:26:16 builder kernel: ena_ioctl: SIOCGIFMEDIA
Dec 13 09:26:26 builder last message repeated 38 times
Dec 13 09:26:26 builder kernel: ena1: device is going UP
Dec 13 09:26:26 builder kernel: ena1: queue 0 - cpu 0
Dec 13 09:26:26 builder kernel: ena1: queue 1 - cpu 1
Dec 13 09:26:26 builder kernel: ena1: queue 2 - cpu 2
Dec 13 09:26:26 builder kernel: ena1: queue 3 - cpu 3
Dec 13 09:26:26 builder kernel: ena_ioctl: SIOCADDMULTI
Dec 13 09:26:26 builder kernel: ena_ioctl: SIOCGIFMEDIA
Dec 13 09:26:26 builder last message repeated 2 times
Dec 13 09:28:48 builder kernel: ena_ioctl: SIOCGIFMEDIA
Dec 13 09:28:48 builder last message repeated 5 times
Dec 13 09:56:13 builder kernel: ena_ioctl(SIOCSIFMTU): Changing MTU from 9001 to 9001
Dec 13 09:56:13 builder kernel: ena0: device is going DOWN
Dec 13 09:56:13 builder kernel: ena0: device is going UP
Dec 13 09:56:13 builder kernel: ena0: queue 0 - cpu 0
Dec 13 09:56:13 builder kernel: ena0: queue 1 - cpu 1

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

sobomax created this revision.Dec 14 2018, 1:13 AM
sobomax edited the summary of this revision. (Show Details)Dec 14 2018, 1:17 AM
cem added a comment.Dec 14 2018, 2:02 AM

The general idea seems sound to me. Ideally NIC drivers would also handle it gracefully, but this is nice in that it is close to the source (well, one common source). In fact, iflib handles this gracefully already:

3949         case SIOCSIFMTU:
3950                 CTX_LOCK(ctx);
3951                 if (ifr->ifr_mtu == if_getmtu(ifp)) {
3952                         CTX_UNLOCK(ctx);
3953                         break;
3954                 }

In fact, all interfaces are dispatched via ifp->if_ioctl anyway. Maybe it makes sense to hook that. And ifhwioctl()'s SIOCSIFMTU handler already is aware of old_mtu; it could skip calling if_ioctl if the new mtu was identical. That doesn't account for if_ioctl callers in in_control and in6_control; so, I get it, that gets messy.

This seems fine for now.

sbin/dhclient/dhclient.c
853 ↗(On Diff #51982)

could this be simplified to } else if (mtu != old_mtu) {? What is the benefit of the state check?

Hmm, there was https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229432 and corresponding commit https://svnweb.freebsd.org/base?view=revision&revision=336195 fixing the problem.

Was it insufficient or your tree does not have that fix?

Hmm, there was https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229432 and corresponding commit https://svnweb.freebsd.org/base?view=revision&revision=336195 fixing the problem.
Was it insufficient or your tree does not have that fix?

Yes, you are right. We were looking at the FreeBSD 11.2 code here, I have not noticed there is another change in a trunk to fix the same issue. :(

However, I still feel my approach is bit cleaner, as it does result in interface_set_mtu_priv() not being called at all unless it's really necessary. Not a critical difference, but saves us some extra effort to create/destroy socket and issuing SIOCGIFMTU upon every lease renew.

sobomax marked an inline comment as done.Dec 14 2018, 4:11 PM
sobomax added inline comments.
sbin/dhclient/dhclient.c
853 ↗(On Diff #51982)

No, the check is actually needed here. The dhclient(8) saves its lease info, so upon reboot (state == S_REBOOTING) you might still see the same old_mtu, while interface would be in its default state.

sobomax added a comment.EditedDec 14 2018, 6:08 PM

@eugen_grosbein.net actually upon thinking a little bit more about the difference between this proposed change and r336195 I came to a conclusion that those two are not necessarily mutually exclusive, but rather complementing each other. The problem with just r336195 alone is that it invalidates implicit expectations that dhclient should not try to touch interface at all after initial configuration, unless there are some changes on the DHCP side. As an administrator, I'd not expect dhclient to be actively policing interface parameters and try to enforce them, which would be the case for MTU now. Consider the following hypothetical scenario:

  1. System boots, obtains parameters from DHCP server and provision interface, including setting MTU to some value X. dhclient continues to run in the background.
  2. At some point administrator logins and changes MTU from value X prescribed by the DHCP to some other value Y of his liking.
  3. After some random time dhcpclient kicks in to renew the lease and switches MTU back from Y to X.

As far as I know, the dhclient does not do it for any other parameter. So unless there is some objections, I'd want to proceed with this change anyhow.

On the other hand, r336195 has a merit on its own, as it avoids unnecessary MTU change if the interface happens to be in the correct state at the startup already, which is a good thing considering how heavy this operation is for many drivers at the time being.

sobomax marked an inline comment as done.Dec 14 2018, 6:11 PM

Hmm, there was https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229432 and corresponding commit https://svnweb.freebsd.org/base?view=revision&revision=336195 fixing the problem.
Was it insufficient or your tree does not have that fix?

Yes, you are right. We were looking at the FreeBSD 11.2 code here, I have not noticed there is another change in a trunk to fix the same issue. :(

The change was merged to 11.2-STABLE branch too.

  1. At some point administrator logins and changes MTU from value X prescribed by the DHCP to some other value Y of his liking.

JFYI: the PR 229432 mentiones working way to ignore option 26 (MTU) supplied by DHCP server:

interface "em0" {

supersede interface-mtu 0;

}

sobomax added a comment.EditedDec 14 2018, 8:39 PM
  1. At some point administrator logins and changes MTU from value X prescribed by the DHCP to some other value Y of his liking.

JFYI: the PR 229432 mentiones working way to ignore option 26 (MTU) supplied by DHCP server:
interface "em0" {

supersede interface-mtu 0;

}

This is not quite what I am talking about here. We want the value returned by the DHCP server to have an effect when system boots, but we don't want dhclient to continuously restore the state back in case it has been changed by the administrator. My point is that dhclient does not do this periodic "check-up/fix-up" for any other part of configuration that it makes (e.g. ip address, netmask, default gateway, dns server), so there is no reason for an administrator to expect MTU to be in a different bucket. I believe "leave interface alone unless there are some changes on the DHCP side" is the proper approach.

cem added inline comments.Dec 14 2018, 8:49 PM
sbin/dhclient/dhclient.c
853 ↗(On Diff #51982)

I see, thanks.

That makes sense, thanks.

This revision is now accepted and ready to land.Dec 14 2018, 8:58 PM
cem accepted this revision.Dec 15 2018, 4:49 AM

This is not quite what I am talking about here. We want the value returned by the DHCP server to have an effect when system boots, but we don't want dhclient to continuously restore the state back in case it has been changed by the administrator. My point is that dhclient does not do this periodic "check-up/fix-up" for any other part of configuration that it makes (e.g. ip address, netmask, default gateway, dns server), so there is no reason for an administrator to expect MTU to be in a different bucket. I believe "leave interface alone unless there are some changes on the DHCP side" is the proper approach.

I don't think this bizarre scenario needs so much attention and extra logic. Either administrator knows better, or dhcpd does.

If admin knows better, he can have DHCP ignore advertised MTU with supersede interface-mtu 0. The MTU provided by DHCP will still be logged to /var/db/dhclient.leases.<iface> file in case he is curious what was provided to update his static configuration.

IMO, it is not unreasonable for an automatic network client configuration tool like dhclient or rtsold to override interface parameters that happen to have been manually set when the behavior can be disabled and isn't. It would be more surprising to me if they didn't do that. If you don't want network client auto-configuration, disable DHCP/rtadv/etc.

That said, the proposed patch looks great to me, and complementary to eugen's change in 12.x.

This revision was automatically updated to reflect the committed changes.
sobomax marked an inline comment as done.