Page MenuHomeFreeBSD

Eliminate rwlock from fast path processing in BPF code
ClosedPublic

Authored by ae on May 10 2019, 8:52 AM.

Details

Summary

The main goal of this patch is eliminating rwlock from fast path in BPF code. By fast path I mean inbound packets processing, when bpf_mtap function is invoked by the kernel to catch mbufs matching to some filter.
The problem with rwlock can be observed on multi-core systems, where multi-queues NIC handles packets rate about several Mpps.
Just for example: 28 cores Xeon, Mellanox 100G mlx5en, 6Mpps inbound traffic. Executing of tcpdump -npi mce3 host 198.18.0.1 immediately produces packets drop around 500kpps-1.5Mpps depending to traffic due to rwlock contention. Several times we faced with this problem on production systems, when an engineer tried to analyze some network traffic with tcpdump. This how it looks like:

# netstat -hw1 -I mce3
            input           mce3           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      5.9M     0     0       339M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       341M          0     0          0     0
      5.8M     0     0       330M          0     0          0     0
      5.7M     0     0       325M          0     0          0     0
      5.7M     0     0       328M          0     0          0     0
      4.1M     0  1.9M       236M          0     0          0     0
      4.0M     0  1.9M       226M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      3.9M     0  1.9M       225M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      4.1M     0  1.9M       234M          0     0          0     0
      5.9M     0  107k       339M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
^C

What was changed:

  • all lists were changed to use CK_LIST
  • from struct bpf_if removed bif_lock and bif_flags fields, added bif_refcnt and epoch_ctx. Now when an interface registers using bpfattach, its pointer is referenced by bif_if structure. And we hold this reference until bif_if structure will be detached and freed.
  • each bpf_d descriptor references bpf_if structure when it is attached. In conjunction with epoch(9) this prevents accessing to freed bpf_if, ifnet and bpf_d structures.
  • bpf_freelist and ifnet_departure event are no longer needed. BPF interfaces can be linked and unlinked from bpf_iflist only when global BPF_LOCK is held. When interface is unlinked, it will be freed() using epoch_call() when this will be safe.
  • new struct bpf_program_buffer is introduced to keep BPF filter programs. It is used to allow change BPF program without interface lock. New buffer is allocated for new program, and then old pointer is freed using epoch_call().
  • since BPF_LOCK() is sx lock, we can avoid extra locking in bpf_zero_counters(), bpf_getdltlist(), bpfstats_fill_xbpf().
  • also some functions were merged into another to avoid extra locking/relocking: bpf_attachd() now calls reset_d(); bpf_upgraded() moved into bpf_setf(); bpf_detachd_locked() can invoke bpf_wakeup() when bpf_if is detached.
Test Plan

For the same traffic flow executing tcpdump -npi mce3 host 198.18.0.1 does not produce any packets drop.

 # netstat -hw1 -I mce3
            input           mce3           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      5.9M     0     0       338M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
      6.0M     0     0       345M          0     0          0     0
^C

Of course this does not mean that adding BPF filter is costless. But at least the patch reduces overhead and grows the limit when such task is safe.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ae created this revision.May 10 2019, 8:52 AM
ae added a comment.EditedMay 11 2019, 8:58 AM

There is at least one problem that with this patch becomes easy reproducible. With default optimize_writers=0 bpt_mtap() can catch several packets in the time between bpf_setif() and bpf_setf(), because empty filter means "accept all" by bpf_filter(). Maybe it is time to remove optimize_writers variable and use this behavior by default? I.e. by default link new bpf_if into writers only list, and re-link it into readers list when application sets filter? Or change its value to be 1 by default?

ae updated this revision to Diff 57309.May 11 2019, 9:31 AM
ae edited the summary of this revision. (Show Details)

move bpf_updated() into bpf_setf() to reduce BPF_LOCK() flipping

Harbormaster completed remote builds in B24199: Diff 57309.
ae set the repository for this revision to rS FreeBSD src repository.
ae edited the summary of this revision. (Show Details)May 12 2019, 9:50 AM
ae edited the test plan for this revision. (Show Details)
melifaro added inline comments.May 12 2019, 4:19 PM
sys/net/bpf.c
119 ↗(On Diff #57309)

Would it be possible to have slightly different name for the struct? Say, bpf_program_buffer ? At least to me, 'buffer' by default links to the actual userland<>kernel buffers inside the descriptor.

315 ↗(On Diff #57309)

why don't we typecast to struct bpf_epoch_buffer *?

689 ↗(On Diff #57309)

Worth calling out that the descriptor is reset here?

1181 ↗(On Diff #57309)

Would it be possible to refcount on d so we don't contend bp for other readers/writers?

1979 ↗(On Diff #57309)

s/jusr/just/? :-)

ae updated this revision to Diff 57340.May 12 2019, 5:23 PM
ae edited the test plan for this revision. (Show Details)
  • s/bpf_epoch_buffer/bpf_program_buffer/g
  • update some comments
  • add copyright line
  • add refcount to bpf_d and use it in bpfwrite
ae marked 5 inline comments as done.May 12 2019, 5:25 PM
ae edited the summary of this revision. (Show Details)
melifaro accepted this revision as: melifaro.May 12 2019, 5:40 PM
This revision is now accepted and ready to land.May 12 2019, 5:40 PM
This revision was automatically updated to reflect the committed changes.

Let's try the benefit of this patch on a 8-core Atom with a 10G Chelsio, during a forwarding bench with tcdpump running:

x r347519: inet4 smallest size packets-per-second forwarded with tcpdump running
+ r347519 and D20224: inet4 smallest size packets-per-second forwarded with tcpdump running
+--------------------------------------------------------------------------+
|   xx              x                                          +  +   +  ++|
||___M__A______|                                                           |
|                                                                |___AM___||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       3682162       3816554       3684703     3711296.2     58889.489
+   5     4179733.5       4274492       4240164     4234216.7     39505.425
Difference at 95.0% confidence
        522920 +/- 73130.8
        14.09% +/- 2.16578%
        (Student's t, pooled s = 50143)