
Eliminate rwlock from fast path processing in BPF code
Closed, Public

Authored by ae on May 10 2019, 8:52 AM.
Details

Summary

The main goal of this patch is to eliminate the rwlock from the fast path in the BPF code. By fast path I mean inbound packet processing, when the kernel invokes bpf_mtap() to catch mbufs matching some filter.
The problem with the rwlock can be observed on multi-core systems where a multi-queue NIC handles packet rates of several Mpps.
For example: a 28-core Xeon with a Mellanox 100G mlx5en NIC and 6 Mpps of inbound traffic. Running tcpdump -npi mce3 host 198.18.0.1 immediately produces packet drops of around 500 kpps to 1.5 Mpps, depending on the traffic, due to rwlock contention; a sketch of the contended code path follows the netstat output below. We have hit this problem several times on production systems when an engineer tried to analyze network traffic with tcpdump. This is how it looks:

# netstat -hw1 -I mce3
            input           mce3           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      5.9M     0     0       339M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       341M          0     0          0     0
      5.8M     0     0       330M          0     0          0     0
      5.7M     0     0       325M          0     0          0     0
      5.7M     0     0       328M          0     0          0     0
      4.1M     0  1.9M       236M          0     0          0     0
      4.0M     0  1.9M       226M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      3.9M     0  1.9M       225M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      5.9M     0  107k       339M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
^C
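
The drops begin the moment a reader attaches, because every packet has to take the per-interface rwlock. A simplified sketch of the contended, pre-patch code path (illustrative, not the verbatim source) looks roughly like this:

    void
    bpf_mtap(struct bpf_if *bp, struct mbuf *m)
    {
            struct bpf_d *d;
            u_int pktlen, slen;

            pktlen = m_length(m, NULL);
            BPFIF_RLOCK(bp);                /* rw_rlock() on bp->bif_lock */
            LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
                    slen = bpf_filter(d->bd_rfilter, (u_char *)m, pktlen, 0);
                    if (slen != 0) {
                            BPFD_LOCK(d);
                            /* copy the matching bytes into d's buffers */
                            BPFD_UNLOCK(d);
                    }
            }
            BPFIF_RUNLOCK(bp);
    }

Even in read mode, acquiring the rwlock bounces its cache line between the cores servicing the NIC queues, which is what produces the idrops shown above.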

What was changed:

  • all lists were converted to CK_LIST
  • the bif_lock and bif_flags fields were removed from struct bpf_if; bif_refcnt and epoch_ctx fields were added. Now, when an interface registers via bpfattach(), its ifnet pointer is referenced by the bpf_if structure, and this reference is held until the bpf_if structure is detached and freed.
  • each bpf_d descriptor references its bpf_if structure while it is attached. In conjunction with epoch(9) this prevents access to freed bpf_if, ifnet and bpf_d structures.
  • bpf_freelist and the ifnet_departure event are no longer needed. BPF interfaces can be linked to and unlinked from bpf_iflist only while the global BPF_LOCK is held. When an interface is unlinked, it is freed via epoch_call() once it is safe to do so.
  • a new struct bpf_program_buffer was introduced to hold BPF filter programs. It allows changing a BPF program without taking the interface lock: a new buffer is allocated for the new program, and the old pointer is then freed using epoch_call().
  • since BPF_LOCK() is an sx(9) lock, extra locking can be avoided in bpf_zero_counters(), bpf_getdltlist() and bpfstats_fill_xbpf().
  • some functions were also merged into others to avoid extra locking/relocking: bpf_attachd() now calls reset_d(); bpf_upgraded() was moved into bpf_setf(); bpf_detachd_locked() can invoke bpf_wakeup() when a bpf_if is detached. A sketch of the resulting fast path and filter replacement follows this list.
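
A minimal sketch of the new fast path and of the lock-free filter replacement, assuming the generic epoch(9)/CK_LIST pattern (the macro and field names here are illustrative rather than the committed code verbatim):

    /*
     * Per-packet processing now runs inside an epoch section and walks
     * a CK_LIST, so no per-interface lock is taken on the fast path.
     */
    void
    bpf_mtap(struct bpf_if *bp, struct mbuf *m)
    {
            struct epoch_tracker et;
            struct bpf_d *d;
            u_int pktlen, slen;

            pktlen = m_length(m, NULL);
            NET_EPOCH_ENTER(et);
            CK_LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
                    slen = bpf_filter(d->bd_rfilter, (u_char *)m, pktlen, 0);
                    if (slen != 0) {
                            BPFD_LOCK(d);
                            /* copy the matching bytes into d's buffers */
                            BPFD_UNLOCK(d);
                    }
            }
            NET_EPOCH_EXIT(et);
    }

    /*
     * A filter program lives in its own bpf_program_buffer.  bpf_setf()
     * installs a newly allocated buffer with a pointer swap and passes
     * the old one to epoch_call() with a callback like this, so readers
     * still inside the epoch never see freed memory.
     */
    static void
    bpf_program_buffer_free(epoch_context_t ctx)
    {
            struct bpf_program_buffer *ptr;

            ptr = __containerof(ctx, struct bpf_program_buffer, epoch_ctx);
            free(ptr, M_BPF);
    }

The key property is that writers (attach, detach, filter changes) still serialize on the global BPF_LOCK, while readers in the packet path only pay the cost of entering and leaving the epoch.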
Test Plan

For the same traffic flow, running tcpdump -npi mce3 host 198.18.0.1 does not produce any packet drops.

 # netstat -hw1 -I mce3
            input           mce3           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      5.9M     0     0       338M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
^C

Of course, this does not mean that attaching a BPF filter is free. But at least the patch reduces the overhead and raises the packet rate up to which doing so is safe.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 24184
Build 23031: arc lint + arc unit

Event Timeline

There is at least one problem that becomes easily reproducible with this patch. With the default optimize_writers=0, bpf_mtap() can catch several packets in the window between bpf_setif() and bpf_setf(), because an empty filter means "accept all" to bpf_filter(). Maybe it is time to remove the optimize_writers variable and make this behavior the default? I.e., link a new bpf_d into the writers-only list by default and re-link it into the readers list when the application sets a filter. Or should the variable's default value be changed to 1?
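
For reference, the "empty filter accepts everything" behavior comes from the check at the top of bpf_filter(); a sketch of that check, with the interpreter body elided and a placeholder return standing in for it:

    u_int
    bpf_filter(const struct bpf_insn *pc, u_char *p, u_int wirelen, u_int buflen)
    {
            if (pc == NULL)
                    return ((u_int)-1);     /* no filter: accept the whole packet */
            /* ... otherwise run the BPF program over the packet ... */
            return (0);                     /* placeholder for the elided interpreter */
    }

So a descriptor that is bound to an interface but has not yet been given a program behaves like an accept-all filter, which is exactly the window described above.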

ae edited the summary of this revision.

move bpf_upgraded() into bpf_setf() to reduce BPF_LOCK() flipping

ae edited the test plan for this revision.
sys/net/bpf.c:119–125

Would it be possible to have a slightly different name for the struct? Say, bpf_program_buffer? At least to me, 'buffer' by default suggests the actual userland<->kernel buffers inside the descriptor.

sys/net/bpf.c:315

why don't we typecast to struct bpf_epoch_buffer *?

sys/net/bpf.c:689

Worth calling out that the descriptor is reset here?

sys/net/bpf.c:1213

Would it be possible to take a refcount on d so we don't contend on bp with other readers/writers?

sys/net/bpf.c:2012

s/jusr/just/? :-)

ae edited the test plan for this revision.
  • s/bpf_epoch_buffer/bpf_program_buffer/g
  • update some comments
  • add copyright line
  • add refcount to bpf_d and use it in bpfwrite (see the sketch after this list)
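
One way to read the refcount item above is the standard refcount(9) pattern sketched here; the field and helper names (bd_refcnt, bpfd_free()) are illustrative assumptions, not necessarily those of the committed change:

    /* Hypothetical helpers: pin a descriptor across bpfwrite(). */
    static void
    bpfd_ref(struct bpf_d *d)
    {
            refcount_acquire(&d->bd_refcnt);        /* assumed field name */
    }

    static void
    bpfd_rele(struct bpf_d *d)
    {
            if (refcount_release(&d->bd_refcnt))
                    bpfd_free(d);                   /* hypothetical reclaim helper */
    }

bpfwrite() would take a reference before it can sleep and drop it on the way out, so the descriptor cannot be reclaimed underneath the writer.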
ae marked 5 inline comments as done. May 12 2019, 5:25 PM
ae edited the summary of this revision.
This revision is now accepted and ready to land. May 12 2019, 5:40 PM
This revision was automatically updated to reflect the committed changes.

Let's measure the benefit of this patch on an 8-core Atom with a 10G Chelsio NIC, during a forwarding benchmark with tcpdump running:

x r347519: inet4 smallest size packets-per-second forwarded with tcpdump running
+ r347519 and D20224: inet4 smallest size packets-per-second forwarded with tcpdump running
+--------------------------------------------------------------------------+
|   xx              x                                          +  +   +  ++|
||___M__A______|                                                           |
|                                                                |___AM___||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       3682162       3816554       3684703     3711296.2     58889.489
+   5     4179733.5       4274492       4240164     4234216.7     39505.425
Difference at 95.0% confidence
        522920 +/- 73130.8
        14.09% +/- 2.16578%
        (Student's t, pooled s = 50143)