
ktls: start a thread to keep the 16k ktls buffer zone populated
Accepted · Public

Authored by gallatin on Wed, Jul 21, 1:54 PM.

Details

Reviewers
markj
jhb
Summary

Ktls recently received an optimization where we allocate 16k physically contiguous crypto destination buffers. This provides a large (more than 5%) reduction in CPU use in our workload. However, after several days of uptime, the performance benefit disappears because we have frequent allocation failures from the ktls buffer zone.

It turns out that when load drops off, the ktls buffer zone is trimmed, and some 16k buffers are freed back to the OS. When load picks back up again, re-allocating those 16k buffers fails after some number of days of uptime because physical memory has become fragmented. The allocations fail because they are intentionally done with M_NORECLAIM, so as to avoid pausing the ktls crypto worker threads while the VM system defragments memory.

To work around this, this change starts one thread per VM domain to allocate ktls buffers with M_WAITOK and without M_NORECLAIM, as we don't care if this thread is paused while memory is defragmented. The thread then frees the buffers back into the ktls buffer zone, thus allowing future allocations to succeed.

Note that waking up the thread is intentionally racy, but neither of the races really matters. In the worst case, we either get a spurious wakeup, or we wait up to 1 second for the next rate-limited allocation failure to wake the thread.

This patch has been in use at Netflix on a handful of servers, and seems to fix the issue.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

gallatin created this revision.
gallatin edited the summary of this revision.

Would uma_zone_reserve take care of this? The patch as proposed should not be necessary.

In D31260#703987, @mjg wrote:

Would uma_zone_reserve take care of this? The patch as proposed should not be necessary.

That was my first thought, but after consideration, I did not want to reserve this memory. Deciding how much to reserve is hard; it is workload- and scenario-dependent and would be a nightmare to tune. For something with tiny items (e.g., pbufs), reserving more than enough is fine, but for this, we have upwards of 12GB used for ktls buffers on our faster machines. I don't want to have to reserve more than that generally, or try to tune for specific scenarios.


uma_zone_reserve() won't work here - it applies to kegs, but this is a cache zone which imports directly from the physical memory allocator. Generally, I think we want to make some effort to defragment memory when allocation failures occur.

sys/kern/uipc_ktls.c
432

I think we shouldn't bother creating threads for domains for which VM_DOMAIN_EMPTY(domain) is true.

446

Extra whitespace.

2007

I would also suggest using atomic_load_int() here and atomic_store_int() in the thread.

2213
2216

Extra whitespace.

2217

Perhaps make this conditional on bootverbose.

2237

I'd suggest rewriting this so that ktls_max_alloc is loaded once, with atomic_load_int(). Otherwise there's nothing preventing the compiler from issuing multiple loads of ktls_max_alloc even though the code is trying to make sure it gets loaded only once.

2241

I think a comment explaining what's going on here would be worthwhile.

2242

Do you have a local change to ktls_buffer_import() which has it try to reclaim if M_WAITOK is specified?

2246

One downside of this approach is that the items only get freed to the current CPU. I don't have a better alternative to suggest though. It might be possible to synchronize with ktls_buffer_import() such that it checks a per-domain pool if vm_page_alloc_contig() fails, and the thread tries to populate the pool on demand.

sys/kern/uipc_ktls.c
2242

I forgot how M_NORECLAIM was passed. This question doesn't make sense, please ignore.

gallatin marked 5 inline comments as done.

Addressed Mark's feedback.

sys/kern/uipc_ktls.c
433

I would be tempted to use a separate loop for these threads instead of doing them inside the CPU loop. I'm not sure if we have an easy way to enumerate domains (presumably there is a way), so something like:

CPU_FOREACH(i) {
    ...
}

if (ktls_sw_buffer_cache) {
   for (domain = 0; domain < MAXMEMDOM; domain++) {
       if (VM_DOMAIN_EMPTY(domain))
           continue;
       error = kproc_kthread_add(ktls_alloc_thread, ...);
       ...
   }
}

This avoids the need to check alloc_td.td for NULL, for example, and I think it also makes it more obvious to the reader that we are creating a single thread per domain, vs. the other loop creating a thread per CPU.

2008–2010
2244

Maybe s/at runtime/normally/ in these two sentences, or even be more specific and say "when encrypting" or some such. I'd also perhaps adjust the ordering a bit to be something like:

/*
 * When encrypting TLS records, ktls buffers are allocated with M_NOWAIT and
 * M_NORECLAIM to avoid overhead in the ktls worker threads.  If ktls worker
 * threads block, a backlog of TLS buffers can develop leading to surges of
 * traffic and potential NIC output drops.  Over time, however, memory can
 * become fragmented and ktls buffer allocations fail.
 *
 * This thread allocates ktls buffers with M_WAITOK and without
 * M_NORECLAIM to force memory defragmentation and ensure there is an
 * adequate pool of ktls buffers available for the ktls worker threads.
 */
This revision is now accepted and ready to land.
Fri, Jul 30, 8:52 PM