The main goal of this patch is to eliminate the rwlock from the fast path in the BPF code. By fast path I mean inbound packet processing, when the `bpf_mtap` function is invoked by the kernel to catch mbufs matching some filter.
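For context, a minimal sketch of what the pre-patch fast path does (abbreviated, not verbatim kernel code: the real `catchpacket()` takes more arguments and the loop also handles descriptor state and timestamps):

```c
/*
 * Abbreviated sketch of the pre-patch fast path: every packet
 * read-locks the per-interface rwlock before walking the descriptor
 * list, so all NIC queues bounce the same lock cache line on every
 * packet.
 */
void
bpf_mtap(struct bpf_if *bp, struct mbuf *m)
{
    struct bpf_d *d;
    u_int pktlen, slen;

    pktlen = m->m_pkthdr.len;
    BPFIF_RLOCK(bp);            /* the contended rwlock */
    LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
        /* buflen == 0 tells bpf_filter() this is an mbuf chain */
        slen = bpf_filter(d->bd_rfilter, (u_char *)m, pktlen, 0);
        if (slen != 0) {
            BPFD_LOCK(d);
            catchpacket(d, (u_char *)m, pktlen, slen);
            BPFD_UNLOCK(d);
        }
    }
    BPFIF_RUNLOCK(bp);
}
```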
The problem with the rwlock shows up on multi-core systems where a multi-queue NIC handles packet rates of several Mpps.
For example: a 28-core Xeon, a Mellanox 100G mlx5en NIC, 6 Mpps of inbound traffic. Running `tcpdump -npi mce3 host 198.18.0.1` immediately produces packet drops of around 500 kpps-1.5 Mpps, depending on the traffic, due to rwlock contention. We have hit this problem several times on production systems when an engineer tried to analyze network traffic with tcpdump. This is how it looks:
```
# netstat -hw1 -I mce3
            input          mce3           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      5.9M     0      0       339M          0     0          0     0
      6.0M     0      0       345M          0     0          0     0
      6.0M     0      0       345M          0     0          0     0
      6.0M     0      0       345M          0     0          0     0
      6.0M     0      0       345M          0     0          0     0
      6.0M     0      0       341M          0     0          0     0
      5.8M     0      0       330M          0     0          0     0
      5.7M     0      0       325M          0     0          0     0
      5.7M     0      0       328M          0     0          0     0
      4.1M     0   1.9M       236M          0     0          0     0
      4.0M     0   1.9M       226M          0     0          0     0
      4.1M     0   1.9M       234M          0     0          0     0
      3.9M     0   1.9M       225M          0     0          0     0
      4.1M     0   1.9M       234M          0     0          0     0
      4.1M     0   1.9M       234M          0     0          0     0
      4.1M     0   1.9M       234M          0     0          0     0
      4.1M     0   1.9M       234M          0     0          0     0
      5.9M     0   107k       339M          0     0          0     0
      6.0M     0      0       345M          0     0          0     0
^C
```
What was changed:
* all lists were converted to CK_LIST, so the fast path can traverse them without a lock (see the first sketch after this list)
* the `bif_lock` and `bif_flags` fields were removed from `struct bpf_if`, and `bif_refcnt` and `epoch_ctx` fields were added. Now, when an interface is registered with `bpfattach`, its `ifnet` pointer is referenced by the `bpf_if` structure, and this reference is held until the `bpf_if` structure is detached and freed.
* each `bpf_d` descriptor references its `bpf_if` structure while it is attached. In conjunction with epoch(9), this prevents access to freed `bpf_if`, `ifnet`, and `bpf_d` structures.
* `bpf_freelist` and the `ifnet_departure` event handler are no longer needed. BPF interfaces can be linked to and unlinked from `bpf_iflist` only while the global `BPF_LOCK` is held. When an interface is unlinked, it is freed via `epoch_call()` once it is safe to do so (see the second sketch after this list).
* a new `struct bpf_epoch_buffer` is introduced to hold BPF filter programs. It allows changing a BPF program without taking the interface lock: a new buffer is allocated for the new program, and the old one is then freed using `epoch_call()` (see the third sketch after this list).
* since `BPF_LOCK()` is an `sx(9)` lock, we can avoid extra locking in `bpf_zero_counters()`, `bpf_getdltlist()`, and `bpfstats_fill_xbpf()`.
* also, some functions were merged into others to avoid extra locking and relocking: `bpf_attachd()` now calls `reset_d()`; `bpf_upgraded()` was folded into `bpf_setf()`; `bpf_detachd_locked()` can invoke `bpf_wakeup()` when a `bpf_if` is detached.
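To make the effect on the fast path concrete, here is a minimal sketch of the post-patch reader side, assuming the `NET_EPOCH_ENTER`/`NET_EPOCH_EXIT` convenience macros and the same abbreviated `catchpacket()` as in the sketch above; the real code differs in detail:

```c
/*
 * Sketch of the lock-free fast path: readers run inside an epoch(9)
 * section, so CK_LIST_FOREACH can safely race with a concurrent
 * unlink; an unlinked bpf_d stays valid until the deferred epoch
 * callback runs, and bif_refcnt keeps the bpf_if itself alive.
 */
void
bpf_mtap(struct bpf_if *bp, struct mbuf *m)
{
    struct epoch_tracker et;
    struct bpf_d *d;
    u_int pktlen, slen;

    pktlen = m->m_pkthdr.len;
    NET_EPOCH_ENTER(et);            /* no shared lock taken */
    CK_LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
        slen = bpf_filter(d->bd_rfilter, (u_char *)m, pktlen, 0);
        if (slen != 0) {
            BPFD_LOCK(d);           /* per-descriptor lock only */
            catchpacket(d, (u_char *)m, pktlen, slen);
            BPFD_UNLOCK(d);
        }
    }
    NET_EPOCH_EXIT(et);
}
```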
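Teardown then follows the usual epoch pattern: unlink under `BPF_LOCK`, drop the reference, and defer the actual free until no epoch reader can still hold the pointer. A hedged sketch; `bpfif_rele()` and `bpfif_free()` are illustrative names, and the `epoch_call()` argument order follows the epoch(9) KPI as of this patch:

```c
static void
bpfif_free(epoch_context_t ctx)
{
    struct bpf_if *bp;

    bp = __containerof(ctx, struct bpf_if, epoch_ctx);
    if_rele(bp->bif_ifp);   /* drop the ifnet reference taken in bpfattach() */
    free(bp, M_BPF);
}

static void
bpfif_rele(struct bpf_if *bp)
{

    if (!refcount_release(&bp->bif_refcnt))
        return;
    /* Last reference gone: free once all current epoch readers exit. */
    epoch_call(net_epoch_preempt, &bp->epoch_ctx, bpfif_free);
}
```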
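Finally, the filter-change path is the classic epoch/RCU publish-then-defer-free pattern. A sketch with an illustrative layout for `struct bpf_epoch_buffer` and hypothetical `bd_rfilter_buf`/`bpf_set_program` names; the real structure and field names may differ:

```c
struct bpf_epoch_buffer {
    void                    *buffer;    /* the BPF program itself */
    struct epoch_context    epoch_ctx;
};

static void
bpf_buffer_free(epoch_context_t ctx)
{
    struct bpf_epoch_buffer *b;

    b = __containerof(ctx, struct bpf_epoch_buffer, epoch_ctx);
    free(b, M_BPF);
}

/*
 * Publish a new filter program without blocking bpf_mtap() readers:
 * readers inside the epoch section see either the old or the new
 * buffer, and the old one is reclaimed only after they all exit.
 */
static void
bpf_set_program(struct bpf_d *d, struct bpf_epoch_buffer *nbuf)
{
    struct bpf_epoch_buffer *old;

    BPFD_LOCK(d);
    old = d->bd_rfilter_buf;            /* illustrative field name */
    /* Release store: buffer contents become visible before the pointer. */
    atomic_store_rel_ptr((volatile uintptr_t *)&d->bd_rfilter_buf,
        (uintptr_t)nbuf);
    BPFD_UNLOCK(d);
    if (old != NULL)
        epoch_call(net_epoch_preempt, &old->epoch_ctx, bpf_buffer_free);
}
```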