epair(4): disable per-IF fallback queuing and draining
Abandoned · Public

Authored by bz on Mar 11 2020, 9:02 PM.

Details

Reviewers

Group Reviewers

manpages

Summary

Over the last few years people have reported "hangs" of epairs
that do not recover once the hardware queue has overflowed.

It turns out there are multiple problems both with the epair
code and its interactions with the netisr framework.
Until these are fixed, do not compile in the "drain"
framework anymore.
This comes at the penalty of possibly dropping more packets
sooner again, as we only have the per-CPU netisr queue for
all interfaces and no per-interface "fallback" queue anymore.
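
The trade-off can be illustrated with a minimal userspace sketch. This is not the if_epair.c or netisr code: `struct bq`, `bq_put()`, `offer()` and the queue sizes are hypothetical names and values chosen for illustration only. It models a shared bounded queue standing in for the per-CPU netisr queue and an optional per-interface fallback queue, and counts how many packets of a burst are dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounded queue: tracks occupancy only, not packet data. */
struct bq {
	int len;	/* packets currently queued */
	int max;	/* capacity */
};

/* Try to queue one packet; returns 1 on success, 0 if the queue is full. */
static int
bq_put(struct bq *q)
{
	assert(q != NULL);
	if (q->len >= q->max)
		return (0);
	q->len++;
	return (1);
}

/*
 * Offer a burst of n packets to the shared queue.  When the shared queue
 * is full, spill into the fallback queue if one is given (non-NULL).
 * Returns the number of packets dropped.
 */
static int
offer(struct bq *shared, struct bq *fallback, int n)
{
	int dropped = 0;

	for (int i = 0; i < n; i++) {
		if (bq_put(shared))
			continue;
		if (fallback != NULL && bq_put(fallback))
			continue;
		dropped++;
	}
	return (dropped);
}
```

In this model, a burst of 10 packets offered to a shared queue of 4 with a fallback queue of 3 drops 3 packets, while the same burst without a fallback queue drops 6. The sketch deliberately leaves out draining, which is the part the summary identifies as problematic.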

While touching the code, also update the epair(4) man page and
add tuning notes.

PR: 227100

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

No Lint Coverage

Unit

No Test Coverage

Build Status

Buildable 29887
Build 27707: arc lint + arc unit

Event Timeline

bz created this revision. Mar 11 2020, 9:02 PM

Herald added a reviewer: manpages. Mar 11 2020, 9:02 PM

Herald added subscribers: melifaro, ae, imp.

Harbormaster completed remote builds in B29887: Diff 69404. Mar 11 2020, 9:02 PM

I ran into this panic running the pf (forward:v4) test:

panic: epair_clone_destroy: ifp=0xfffff80051b6a800 scb->refcount!=1: 3
cpuid = 7
time = 1584005959
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00aff5a630
vpanic() at vpanic+0x182/frame 0xfffffe00aff5a680
panic() at panic+0x43/frame 0xfffffe00aff5a6e0
epair_clone_destroy() at epair_clone_destroy+0x1c1/frame 0xfffffe00aff5a730
if_clone_destroyif() at if_clone_destroyif+0x175/frame 0xfffffe00aff5a780
if_clone_destroy() at if_clone_destroy+0x1f5/frame 0xfffffe00aff5a7d0
ifioctl() at ifioctl+0x371/frame 0xfffffe00aff5a8a0
kern_ioctl() at kern_ioctl+0x27b/frame 0xfffffe00aff5a900
sys_ioctl() at sys_ioctl+0x12f/frame 0xfffffe00aff5a9d0
amd64_syscall() at amd64_syscall+0x803/frame 0xfffffe00aff5aaf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00aff5aaf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80048573a, rsp = 0x7fffffffe228, rbp = 0x7fffffffe240 ---
KDB: enter: panic
[ thread pid 1420 tid 100582 ]
Stopped at      kdb_enter+0x37: movq    $0,0x10928e6(%rip)

That's with net.link.epair.netisr_maxqlen=2 because that used to provoke the error quickly.

sys/net/if_epair.c
474	Should we be releasing references?
611	Should we not be releasing references here as well?

olevole_olevole.ru added a subscriber: olevole_olevole.ru. Apr 8 2020, 8:20 PM

I started on the inevitable attempt to rewrite epair(4) two weeks ago (I should say it's UBR work) and to rid it of netisr.
I do have a prototype which seems to be working; I still need to figure out how to scale things up to hundreds of epairs or a hundred or more CPU threads.