
ktls: start a thread to keep the 16k ktls buffer zone populated
Closed, Public

Authored by gallatin on Jul 21 2021, 1:54 PM.

Details

Summary

Ktls recently received an optimization where we allocate 16k physically contiguous crypto destination buffers. This provides a large (more than 5%) reduction in CPU use in our workload. However, after several days of uptime, the performance benefit disappears because we have frequent allocation failures from the ktls buffer zone.

It turns out that when load drops off, the ktls buffer zone is trimmed, and some 16k buffers are freed back to the OS. When load picks back up again, re-allocating those 16k buffers fails after some number of days of uptime because physical memory has become fragmented. The allocations fail because they are intentionally done with M_NOWAIT and M_NORECLAIM, so as to avoid pausing the ktls crypto work threads while the VM system defragments memory.
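
To illustrate the failure mode, here is a minimal sketch of the worker-path allocation, assuming the flag plumbing of vm_page_alloc_contig(9); the real ktls_buffer_import() differs in its details:

/*
 * Sketch only: with VM_ALLOC_NORECLAIM, vm_page_alloc_contig() fails
 * fast on fragmented memory instead of reclaiming/defragmenting.
 */
m = vm_page_alloc_contig(NULL, 0,
    VM_ALLOC_NOOBJ | VM_ALLOC_NOWAIT | VM_ALLOC_NORECLAIM | VM_ALLOC_WIRED,
    atop(16 * 1024), 0, ~(vm_paddr_t)0, PAGE_SIZE, 0, VM_MEMATTR_DEFAULT);
if (m == NULL)
    return (0);    /* import fails; the caller sees an allocation failure */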

To work around this, this change starts one thread per VM domain to allocate ktls buffers with M_WAITOK and without M_NORECLAIM, as we don't care if this thread is paused while memory is defragmented. The thread then frees the buffers back into the ktls buffer zone, allowing future allocations to succeed.

Note that waking up the thread is intentionally racy, but neither race really matters. In the worst case, we get a spurious wakeup, or we wait up to 1 second until the next rate-limited allocation failure wakes the thread.
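
In rough outline, the per-domain thread described above amounts to the following sketch; the identifiers (ktls_alloc_thread, the kd wakeup channel, the batch size of 16) are illustrative assumptions rather than the committed code:

/*
 * Illustrative sketch of the per-domain allocator thread.
 */
static void
ktls_alloc_thread(void *ctx)
{
    struct ktls_domain_info *kd = ctx;  /* hypothetical per-domain state */
    void *bufs[16];                     /* arbitrary batch size */
    u_int i;

    for (;;) {
        /*
         * Sleep until a worker thread's rate-limited allocation
         * failure triggers a wakeup(kd).  The wakeup is racy, but a
         * missed one only delays us until the next failure.
         */
        tsleep(kd, PZERO, "ktls_alloc", 0);

        /*
         * M_WAITOK without M_NORECLAIM allows the VM system to pause
         * this thread while it defragments physical memory.
         */
        for (i = 0; i < nitems(bufs); i++)
            bufs[i] = uma_zalloc(ktls_buffer_zone, M_WAITOK);

        /* Release the buffers back to the zone for the workers. */
        for (i = 0; i < nitems(bufs); i++)
            uma_zfree(ktls_buffer_zone, bufs[i]);
    }
}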

This patch has been in use at Netflix on a handful of servers, and seems to fix the issue.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

gallatin created this revision.
gallatin edited the summary of this revision. (Show Details)

Would uma_zone_reserve take care of this? The patch as proposed should not be necessary.

In D31260#703987, @mjg wrote:

Would uma_zone_reserve take care of this? The patch as proposed should not be necessary.

That was my first thought, but after consideration, I did not want to reserve this memory. Deciding how much to reserve is hard, workload- and scenario-dependent, and would be a nightmare to tune. For something with tiny items (e.g., pbufs), reserving more than enough is fine, but here we have upwards of 12GB used for ktls buffers on our faster machines. I don't want to have to reserve more than that generally, or try to tune for specific scenarios.

uma_zone_reserve() won't work here - it applies to kegs, but this is a cache zone which imports directly from the physical memory allocator. Generally, I think we want to make some effort to defragment memory when allocation failures occur.
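
(For context, a cache zone is created with uma_zcache_create(9), supplying import/release callbacks that go straight to the physical allocator; a rough sketch, with the exact size and flag arguments assumed:)

ktls_buffer_zone = uma_zcache_create("ktls_buffers",
    roundup2(ktls_maxlen, PAGE_SIZE), NULL, NULL, NULL, NULL,
    ktls_buffer_import, ktls_buffer_release, NULL,
    UMA_ZONE_FIRSTTOUCH);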

sys/kern/uipc_ktls.c
433

I think we shouldn't bother creating threads for domains for which VM_DOMAIN_EMPTY(domain) is true.

447

Extra whitespace.

2017

I would also suggest using atomic_load_int() here and atomic_store_int() in the thread.

2225
2228

Extra whitespace.

2229

Perhaps make this conditional on bootverbose.

2249

I'd suggest rewriting this so that ktls_max_alloc is loaded once, with atomic_load_int(). Otherwise there's nothing preventing the compiler from issuing multiple loads of ktls_max_alloc even though the code is trying to make sure it gets loaded only once.
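
That is, something like the following sketch (the loop body is elided):

u_int i, max_alloc;

/* Snapshot once; a plain read of the global could legally be re-issued. */
max_alloc = atomic_load_int(&ktls_max_alloc);
for (i = 0; i < max_alloc; i++) {
    ...
}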

2253

I think a comment explaining what's going on here would be worthwhile.

2254

Do you have a local change to ktls_buffer_import() which has it try to reclaim if M_WAITOK is specified?

2258

One downside of this approach is that the items only get freed to the current CPU. I don't have a better alternative to suggest though. It might be possible to synchronize with ktls_buffer_import() such that it checks a per-domain pool if vm_page_alloc_contig() fails, and the thread tries to populate the pool on demand.

sys/kern/uipc_ktls.c
2254

I forgot how M_NORECLAIM was passed. This question doesn't make sense, please ignore.

gallatin marked 5 inline comments as done.

Addressed Mark's feedback.

sys/kern/uipc_ktls.c
433

I would be tempted to use a separate loop for these threads instead of doing them inside the CPU loop. I'm not sure if we have an easy way to enumerate domains (presumably there is a way), so something like:

CPU_FOREACH(i) {
    ...
}

if (ktls_sw_buffer_cache) {
    for (domain = 0; domain < MAXMEMDOM; domain++) {
        if (VM_DOMAIN_EMPTY(domain))
            continue;
        error = kproc_kthread_add(ktls_alloc_thread, ...);
        ...
    }
}

This avoids the need for checking alloc_td.td for NULL, for example, but I think it's also a bit more obvious to the reader that we are creating a single thread per domain vs. the other loop creating a thread per CPU.
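
(On enumerating domains: the VM layer exports vm_ndomains, so the loop could be bounded by it instead of MAXMEMDOM; a sketch:)

for (domain = 0; domain < vm_ndomains; domain++) {
    if (VM_DOMAIN_EMPTY(domain))
        continue;
    ...
}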

2017–2019
2255

Maybe s/at runtime/normally/ in these two sentences, or even be more specific and say "when encrypting" or some such. I'd also perhaps adjust the ordering a bit to be something like:

/*
 * When encrypting TLS records, ktls buffers are allocated with M_NOWAIT and
 * M_NORECLAIM to avoid overhead in the ktls worker threads.  If ktls worker
 * threads block, a backlog of TLS buffers can develop, leading to surges of
 * traffic and potential NIC output drops.  Over time, however, memory can
 * become fragmented and ktls buffer allocations fail.
 *
 * This thread allocates ktls buffers with M_WAITOK and without
 * M_NORECLAIM to force memory defragmentation and ensure there is an
 * adequate pool of ktls buffers available for the ktls worker threads.
 */
This revision is now accepted and ready to land. Jul 30 2021, 8:52 PM
gallatin marked 2 inline comments as done.
  • Incorporated jhb's feedback
  • fixed a typo in a comment
  • changed the alloc thread's name so that, after truncation, it appears distinct from the worker threads in 'ps axH'
This revision now requires review to proceed. Aug 3 2021, 10:02 PM
jhb added inline comments.
sys/kern/uipc_ktls.c
444

I'm not sure (Mark might know) whether this also needs a check like:

if (CPU_EMPTY(&cpuset_domain[domain]))
    continue;
447

So using "KTLS_alloc" here instead of "KTLS" while still using &ktls_proc for both is a bit odd. Probably this should just be "KTLS", or if you want a completely separate kproc, then use a separate struct proc (ktls_alloc_proc or the like).

This revision is now accepted and ready to land. Aug 4 2021, 10:39 PM
sys/kern/uipc_ktls.c
444

Hmm, I think that's possible. At least, I don't think there's any reason you can't have a PXM with memory but no CPUs. So I agree that we should have such a check.

Address new feedback from jhb:

  • put proc name back to KTLS
  • check for domains with no CPUs
This revision now requires review to proceed. Aug 5 2021, 2:13 PM
This revision was not accepted when it landed; it landed in state Needs Review. Aug 5 2021, 2:23 PM
This revision was automatically updated to reflect the committed changes.