The main goal of this patch is to eliminate the rwlock from the fast path in the BPF code. By fast path I mean inbound packet processing, when the bpf_mtap() function is invoked by the kernel to catch mbufs matching some filter.
The problem with the rwlock can be observed on multi-core systems, where a multi-queue NIC handles packet rates of several Mpps.
Just as an example: a 28-core Xeon, a Mellanox 100G mlx5en NIC, 6 Mpps of inbound traffic. Executing tcpdump -npi mce3 host 198.18.0.1 immediately produces packet drops of about 500 kpps to 1.5 Mpps, depending on the traffic, due to rwlock contention. We have faced this problem several times on production systems, when an engineer tried to analyze some network traffic with tcpdump. This is how it looks:
# netstat -hw1 -I mce3
            input          mce3           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      5.9M     0     0       339M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       341M          0     0          0     0
      5.8M     0     0       330M          0     0          0     0
      5.7M     0     0       325M          0     0          0     0
      5.7M     0     0       328M          0     0          0     0
      4.1M     0  1.9M       236M          0     0          0     0
      4.0M     0  1.9M       226M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      3.9M     0  1.9M       225M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      5.9M     0  107k       339M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
^C
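The drops come from every receive core taking the same per-interface rwlock for each packet in the fast path. A heavily simplified sketch of the pre-patch pattern (not the exact code, but it shows where the contention lives):

    /* Pre-patch bpf_mtap(), heavily simplified sketch. */
    void
    bpf_mtap(struct bpf_if *bp, struct mbuf *m)
    {
            struct bpf_d *d;

            BPFIF_RLOCK(bp);        /* rw_rlock() on bp->bif_lock */
            LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
                    /* ... run the filter, maybe copy the packet ... */
            }
            BPFIF_RUNLOCK(bp);
    }

Even read acquisitions bounce the lock cache line between all receiving cores, and a writer (tcpdump attaching, detaching or changing a filter) stalls them all at once.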
What was changed:
- all lists were changed to use CK_LIST, so the fast path can traverse them without taking a lock (see the first sketch after this list)
- the bif_lock and bif_flags fields were removed from struct bpf_if; bif_refcnt and epoch_ctx fields were added. Now, when an interface registers using bpfattach(), its pointer is referenced by the bpf_if structure, and we hold this reference until the bpf_if structure is detached and freed.
- each bpf_d descriptor references its bpf_if structure when it is attached. In conjunction with epoch(9), this prevents access to freed bpf_if, ifnet and bpf_d structures.
- bpf_freelist and the ifnet_departure event handler are no longer needed. BPF interfaces can be linked to and unlinked from bpf_iflist only while the global BPF_LOCK is held. When an interface is unlinked, it is freed using epoch_call() once it is safe to do so (see the second sketch after this list).
- a new struct bpf_program_buffer is introduced to keep BPF filter programs. It allows changing a BPF program without taking the interface lock: a new buffer is allocated for the new program, and the old pointer is then freed using epoch_call() (see the third sketch after this list).
- since BPF_LOCK() is an sx(9) lock, we can avoid extra locking in bpf_zero_counters(), bpf_getdltlist() and bpfstats_fill_xbpf().
- also, some functions were merged into others to avoid extra locking/relocking: bpf_attachd() now calls reset_d(); bpf_upgraded() was moved into bpf_setf(); bpf_detachd_locked() can invoke bpf_wakeup() when a bpf_if is detached.
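
The first sketch illustrates the new fast path with CK_LIST under epoch(9); the NET_EPOCH_* macro names follow the current net epoch API and may be spelled differently in the patch itself:

    /* Post-patch bpf_mtap(), heavily simplified sketch. */
    void
    bpf_mtap(struct bpf_if *bp, struct mbuf *m)
    {
            struct epoch_tracker et;
            struct bpf_d *d;

            NET_EPOCH_ENTER(et);
            CK_LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
                    /*
                     * No lock is taken here: epoch(9) guarantees that
                     * d, bp and the filter program stay valid until we
                     * leave the epoch section, even if a writer
                     * unlinks them concurrently.
                     */
                    /* ... run the filter, maybe copy the packet ... */
            }
            NET_EPOCH_EXIT(et);
    }

Writers still serialize on the global BPF_LOCK, but readers on the packet path no longer touch any shared lock.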
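The second sketch shows the interface lifetime handling. The helper names bpfif_ref()/bpfif_rele() and the callback name are illustrative, and the epoch_call() argument order follows the epoch(9) API of that time:

    /* Sketch of bpf_if reference counting with deferred free. */
    static void
    bpfif_free(epoch_context_t ctx)
    {
            struct bpf_if *bp;

            bp = __containerof(ctx, struct bpf_if, epoch_ctx);
            free(bp, M_BPF);
    }

    static void
    bpfif_ref(struct bpf_if *bp)
    {
            refcount_acquire(&bp->bif_refcnt);
    }

    static void
    bpfif_rele(struct bpf_if *bp)
    {
            if (!refcount_release(&bp->bif_refcnt))
                    return;
            /* Last reference dropped: free once all epoch readers drain. */
            epoch_call(net_epoch_preempt, &bp->epoch_ctx, bpfif_free);
    }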
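The third sketch shows the allocate-publish-defer-free pattern used for filter changes. The structure layout, field and helper names here are illustrative:

    /* Sketch of replacing a filter program without the interface lock. */
    struct bpf_program_buffer {
            struct epoch_context    epoch_ctx;
            void                    *buffer[];      /* program follows */
    };

    static void
    bpf_program_buffer_free(epoch_context_t ctx)
    {
            struct bpf_program_buffer *ptr;

            ptr = __containerof(ctx, struct bpf_program_buffer, epoch_ctx);
            free(ptr, M_BPF);
    }

    static void
    bpf_set_filter(struct bpf_d *d, struct bpf_program_buffer *new)
    {
            struct bpf_program_buffer *old;

            old = d->bd_filter_buffer;      /* illustrative field name */
            /* Publish the new program; epoch readers see old or new. */
            atomic_store_rel_ptr(
                (volatile uintptr_t *)&d->bd_filter_buffer,
                (uintptr_t)new);
            if (old != NULL)
                    epoch_call(net_epoch_preempt, &old->epoch_ctx,
                        bpf_program_buffer_free);
    }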