Page MenuHomeFreeBSD

Teach lagg(4) to change MTU
AbandonedPublic

Authored by rpokala on Feb 28 2015, 3:32 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 3, 5:44 PM
Unknown Object (File)
Fri, Jan 3, 5:55 AM
Unknown Object (File)
Sat, Dec 28, 12:43 AM
Unknown Object (File)
Thu, Dec 26, 4:20 PM
Unknown Object (File)
Mon, Dec 23, 3:45 AM
Unknown Object (File)
Sun, Dec 22, 5:00 AM
Unknown Object (File)
Dec 4 2024, 8:28 AM
Unknown Object (File)
Dec 4 2024, 12:49 AM

Details

Summary

Teach lagg(4) to accept ioctl(SIOCSIFMTU) and change the MTU on all the component interfaces. If for any reason a component interface's MTU cannot be changed as requested, all component interfaces are returned to the original MTU.

Discussed on freebsd-hackers@:

https://lists.freebsd.org/pipermail/freebsd-hackers/2015-February/thread.html#46986

There is a LOR, but I don't know how to interpret it:

https://lists.freebsd.org/pipermail/freebsd-hackers/2015-February/047012.html

Putting rstone@ down as a reviewer, per his offer on the list.

Test Plan

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 1681
Build 1687: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/net/if_lagg.c
1811

I find the flow control here a bit confusing (my first read through, I thought that err2 could be used unitinialized). Given that you have a continue in the if block, I would find it clearer to not have an else here

In D1986#7, @rstone wrote:

RLOCK only gets a read lock. You want WLOCK to get a write lock to ensure serialization.

Also we can use another lock in the lagg_ioctl, that will prevent simultaneous MTU changing.

rpokala-panasas.com edited edge metadata.

Updating D1986: Teach lagg(4) to change MTU

Addressing review comments from rstone:

Fix style(9) violations
Replace LAGG_RLOCK() w/ LAGG_WLOCK()
Clarify logic for roll-back
Better comments throughout.

4079 addresses @rstone's comments. However, the LOR remains:

Mar  2 17:00:43 fbsd-X root: ifconfig lagg0 mtu 9000
Mar  2 17:00:43 fbsd-X kernel: lock order reversal:
Mar  2 17:00:43 fbsd-X kernel: 1st 0xfffff80004960c08 if_lagg rmlock (if_lagg rmlock) @ /usr/src/sys/modules/if_lagg/../../net/if_lagg.c:1779
Mar  2 17:00:43 fbsd-X kernel: 2nd 0xfffffe0000d50748 em1 (EM Core Lock) @ /usr/src/sys/dev/e1000/if_lem.c:1035
Mar  2 17:00:43 fbsd-X kernel: KDB: stack backtrace:
Mar  2 17:00:43 fbsd-X kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00003f86c0
Mar  2 17:00:43 fbsd-X kernel: witness_checkorder() at witness_checkorder+0xe45/frame 0xfffffe00003f8750
Mar  2 17:00:43 fbsd-X kernel: __mtx_lock_flags() at __mtx_lock_flags+0xa8/frame 0xfffffe00003f87a0
Mar  2 17:00:43 fbsd-X kernel: lem_ioctl() at lem_ioctl+0x3e8/frame 0xfffffe00003f87e0
Mar  2 17:00:43 fbsd-X kernel: lagg_ioctl() at lagg_ioctl+0xafd/frame 0xfffffe00003f88e0
Mar  2 17:00:43 fbsd-X kernel: ifioctl() at ifioctl+0x102e/frame 0xfffffe00003f89a0
Mar  2 17:00:43 fbsd-X kernel: kern_ioctl() at kern_ioctl+0x2c0/frame 0xfffffe00003f8a00
Mar  2 17:00:43 fbsd-X kernel: sys_ioctl() at sys_ioctl+0x153/frame 0xfffffe00003f8ae0
Mar  2 17:00:43 fbsd-X kernel: amd64_syscall() at amd64_syscall+0x27f/frame 0xfffffe00003f8bf0
Mar  2 17:00:43 fbsd-X kernel: Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe00003f8bf0
Mar  2 17:00:43 fbsd-X kernel: --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8011e1eea, rsp = 0x7fffffffe288, rbp = 0x7fffffffe2a0 ---

One of my colleagues dug into the first LOR - the one that happens when adding em1 to lagg0, which happens both with and without the patch. It looks like this is a known issue:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=194109

We (Panasas) tried reproducing the problem by applying the patch and enabled WITNESS in freebsd stable (10.1) kernel.
While testing with this patched kernel, LOR not observed on ixl and igb ports.

Based on the LOR logs mentioned earlier, the issue seems to be observed on device specific driver like em (e1000).
We tried doing code analysis on em driver and could not find anything obvious. Our platform do not use em driver anymore.

The issue seems to be on device specific driver and not the lagg driver changes that we introduced.
So, we kindly request you to accept these changes to mainline.

We (Panasas) tried reproducing the problem by applying the patch and enabled WITNESS in freebsd stable (10.1) kernel.
While testing with this patched kernel, LOR not observed on ixl and igb ports.

Based on the LOR logs mentioned earlier, the issue seems to be observed on device specific driver like em (e1000).
We tried doing code analysis on em driver and could not find anything obvious. Our platform do not use em driver anymore.

The issue seems to be on device specific driver and not the lagg driver changes that we introduced.
So, we kindly request you to accept these changes to mainline.

lem_ioctl() at lem_ioctl+0x3e8/frame 0xfffffe00003f87e0

This actually means sys/dev/e1000/if_lem.c ... I'll take a look at the LOR and see if its fixable.

@sbruno, A gentle remainder on em driver's LOR issue. Can you please share your findings.

It is true that this LOR is driver-specific but calling SIOCSIFMTU after acquiring a lock in lagg ioctl is not always safe. Change of lladdr suffers from the same situation and it was solved by using an asynchronous task queue to update addresses on each port. What do you think about piggybacking an MTU change to the queue by extending struct lagg_llq to a more generic one which makes it possible to handle per-port properties?

After digging into lagg internals on updating lladdrs on lagg ports, I'd also vote for extenging llq to deal with MTU changes for underlying interfaces

Thanks @hrs and @melifaro for the suggestions.

I am working on the code changes to handle the MTU asynchronously. I will update the tested patch for review by this week.

Updated patch from LN, addressing comments from hrs@, melifaro@, and internal Panasas review.

Some general style nits and a question re-loss of the mtu set error.

sys/net/if_lagg.c
706

style(9) bool use of pointer type.

731

style(9) four space additional indent only should be used, more below.

769

Looks like this attempts to change more that would have been done above before the error, is that intended?

783

Looks like the error gets lost, although printed, is there no way we can avoid this?

812

style(9) bracing around return.

818

style(9) init of vars in declaration should be avoided.

Moving to down to where its first needed can avoid setting it at all.

837

style(9) bool use of pointer type.

843

style(9) bool use of pointer type.

852

consider: "it might have been updated."

861

style(9) bool use of pointer type.

sys/net/if_lagg.c
728

What if we have multiple events queued on tasq? e.g mtu AND mac change

763

Not that easy, unfortunately.
At this moment original ioctl returned 0, so other things/events were fired:
rtsock notification, IPv6 nd mtg update, route table mtg update.. see net/if.c for more details.
Trying to silently revert part of it without dealing with other things is not a good idea.

sys/net/if_lagg.c
753

Please separate a llq loop from a handler for per-port configuration. A llq traversal should be required only once in lagg_port_ops() if the handlers process a single lagg_llq entry.

837

Is this (llq == NULL), not (llq != NULL)?

840

Why is cleanup required here? This removes all of tasks not limited to MTU change.

861

free(NULL) does nothing. Checking if NULL or not is useless.

872

This traversal and freeing an entry after processing it should be done in lagg_port_ops().

sys/net/if_lagg.h
220

Please add "llq_" prefix to the members.

221

Is there any reason to have ifr as a pointer? malloc is generally expensive in kernel, and overhead of struct ifreq is acceptable for me even if every llq has one. I feel this complicates the error handling at least.

rpokala marked 8 inline comments as done.

Address review comments from smh@, melifaro@, and hrs@.

I'm proxying for LN, who asserts that all the comments have been addressed. So, marking them done.

sys/net/if_lagg.c
769

For the interfaces after the one that failed (and therefore which still have the old MTU), this is a noop.

812

I rather hate that idiom, but fine. :-P

sys/kern/kern_condvar.c
466 ↗(On Diff #11984)

style(9) there should be blank line at the beginning of methods that don't have any vars.

468 ↗(On Diff #11984)

No need for else here

sys/net/if_lagg.c
106

style(9) should be 4 space indented, more below

sys/net/if_lagg.h
233

not sure but looks like there are alignment issues here?

sys/kern/kern_condvar.c
465 ↗(On Diff #11984)

ASSERT that the lock is held?

sys/net/if_lagg.c
714

Not sure if its possible but if we have no llq_entries then err will be uninitialised on return.

737

style(9) blank line after vars.

760

static void should be on a separate line, few more below.

Also not keen on the name for this method may be something like lagg_llq_free_entries?

793

No need for == TRUE on bool.

800

I assume this is checked after cv_has_waiters as that indicates an mtu change is in progress hence there's no easy way to confirm what it will become?

Its a shame ioctl doesn't currently have EBUSY as a possible return error as that may be simpler.

rpokala marked an inline comment as done.

Address smh@'s comments from 2016-01-11.

smh@ - I just uploaded an updated patch from LN, which addressed your latest comments.

Also still some older style(9) comments still not addressed.

sys/net/if_lagg.c
801

This looks a bit messy given you only need LAGG_WUNLOCK(sc); I would do something like (also eliminates done var too).

Alternatively just use goto out, which while it sets busy to false that should be fine.

if (SLIST_EMPTY(&sc->sc_ports)) {
        LAGG_WUNLOCK(sc);
        return (EIO);
} else if (sc->sc_mtu_ctxt.busy) {
        LAGG_WUNLOCK(sc);
        return (EBUSY);
} else if (ifp->if_mtu == ifr->ifr_mtu) {
        LAGG_WUNLOCK(sc);
        return (0);
}

sc->sc_mtu_ctxt.busy = TRUE;
sys/net/if_lagg.h
215

Any reason for the blank line here?

Hi Steven Hartland,
Thanks for your review.
I have attached the revised diffs here, as I cannot update them myself.
Ravi, who owns this review, works on a different timezone, and I have
copied him here.

Looking forward to hear your comments on the same.

Thanks,
LN

@smh: "Also still some older style(9) comments still not addressed."

Other than the blank line you mentioned, what other style(9) issues are there?

rpokala added inline comments.
sys/net/if_lagg.h
233

I think it's an artifact of Phabricator; it's aligned correctly in the file.

@smh: "Also still some older style(9) comments still not addressed."

Other than the blank line you mentioned, what other style(9) issues are there?

There where a few previous comments not marked as done, which still appeared valid. I think however this was due to diff upload so hopefully just the one I've re-highlighted.

sys/net/if_lagg.c
106

This still seems incorrect (tab intended)

rpokala added inline comments.
sys/net/if_lagg.c
106

Ah, thanks.

I did it the way it is because other stuff in the file is done that way. For example, see the prototypes for lagg_port_output(), lagg_setflag(), lagg_rr_input(), lagg_fail_input(), lagg_lb_input(), lagg_bcast_input(), lagg_lacp_input().

If you still want me to change this, I can, but I'd rather go with the in-file precedent.

sys/net/if_lagg.c
106

If that's the case then all good :)

sys/net/if_lagg.c
106

Excellent. Can you mark this approved for the record, so I can commit?

smh added a reviewer: smh.

Happy with this from a style perspective, however I haven't tested and this area is not one of my strong points. So for commit I'd say I've only reviewed (in part).

My other concern it that the actions performed on failure may well not result in the system being in the same state before the update was attempted, which could cause unexpected issues.

This revision is now accepted and ready to land.Jan 16 2016, 2:18 AM

Please allow me one more day to review the changes.

Thanks Steve for your comments.
I would appreciate if you can explain more on "... actions performed on failure may well not result in the system being in the same state before the update was attempted, which could cause unexpected issues."

The changes introduce MTU changes on lagg, addresses concerns of inconsistent/partial rollback on failures by synchronizing calls to wait for response from async taskq. And additionally, when operations are in progress, newer calls are rejected with busy status. Is there any flow/window you have seen where the state is not restored?

The operation of MAC change on lagg is left as it was earlier. And these are the only 2 operations currently defined on lagg port.

Sorry for the delay. Will try to review today/tomorrow.

Sorry for the delay. Will try to review today/tomorrow.

Ping?

melifaro added a reviewer: melifaro.

Sorry for taking that long.

First off, I like the idea w/ synchronous completion - it is really a good way to go.
However, I have some comments/suggestions on the implementation.

Changing MTU is not different from changing MAC or changing capabilities - each needs to be run async, on the list of interfaces, commands can fail, you might want rollback and completion signal. This should be done in generic way.

So, I'd suggest the following:

  1. use single queue item per task queue "task"
  2. embed all interface pointers w/ their ioctl handlers (and maybe other flags there) (maybe using some standard e.g. non mtu-specific function) into that item
  3. embed/init mutex/condvar there
  4. embed callback/free routine pointers

So, the process can look like the following:

  • lag_change_mtu() allocates new structure, stores there all current interfaces, along with lagg_llq_action_mtu() and initialised lagg_signal.
  • do mtx lock, put-to-task, cv_wait()
  • then, llq_action is called by generic lagg dequeue code, it handles mtu changes and signals back to cv
  • initial ioctl wakes up, reclaims memory/do other cleanup and returns the result
  • (there could be a "destroy" handler which simply signals cv w/ error on enqueued items, when lagg destroy is requested)

Doing it that way allows to have more than 1 item on the taskq and permits extending this approach to other code.

This revision now requires changes to proceed.Jan 26 2016, 7:04 AM

This turned out to be way more complicated than anyone expected, the people who were working on it for Panasas are no longer with the company, the changes have bitrotted due to six months of neglect from me, and we came up with a reasonable workaround in the scripting layer. For all those reasons, I'm abandoning this diff.