
ktls: start a thread to keep the 16k ktls buffer zone populated
Closed, Public

Authored by gallatin on Jul 21 2021, 1:54 PM.

Details

Summary

Ktls recently received an optimization where we allocate 16k physically contiguous crypto destination buffers. This provides a large (more than 5%) reduction in CPU use in our workload. However, after several days of uptime, the performance benefit disappears because we have frequent allocation failures from the ktls buffer zone.

It turns out that when load drops off, the ktls buffer zone is trimmed, and some 16k buffers are freed back to the OS. When load picks back up again, re-allocating those 16k buffers fails after some number of days of uptime because physical memory has become fragmented. The allocations fail because they are intentionally done with M_NOWAIT and M_NORECLAIM, so as to avoid pausing the ktls crypto work threads while the VM system defragments memory.
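
To illustrate the failure mode, here is a minimal sketch of the worker-path allocation, assuming the flag plumbing of vm_page_alloc_contig(9); the real ktls_buffer_import() differs in its details:

/*
 * Sketch only: with VM_ALLOC_NORECLAIM, vm_page_alloc_contig() fails
 * fast on fragmented memory instead of reclaiming/defragmenting.
 */
m = vm_page_alloc_contig(NULL, 0,
    VM_ALLOC_NOOBJ | VM_ALLOC_NOWAIT | VM_ALLOC_NORECLAIM | VM_ALLOC_WIRED,
    atop(16 * 1024), 0, ~(vm_paddr_t)0, PAGE_SIZE, 0, VM_MEMATTR_DEFAULT);
if (m == NULL)
    return (0);    /* import fails; the caller sees an allocation failure */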

To work around this, this change starts one thread per VM domain to allocate ktls buffers with M_WAITOK and without M_NORECLAIM, as we don't care if this thread is paused while memory is defragmented. The thread then frees the buffers back into the ktls buffer zone, allowing future allocations to succeed.

Note that waking up the thread is intentionally racy, but neither race really matters. In the worst case, we get a spurious wakeup, or we wait up to 1 second until the next rate-limited allocation failure wakes the thread.
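
In rough outline, the per-domain thread described above amounts to the following sketch; the identifiers (ktls_alloc_thread, the kd wakeup channel, the batch size of 16) are illustrative assumptions rather than the committed code:

/*
 * Illustrative sketch of the per-domain allocator thread.
 */
static void
ktls_alloc_thread(void *ctx)
{
    struct ktls_domain_info *kd = ctx;  /* hypothetical per-domain state */
    void *bufs[16];                     /* arbitrary batch size */
    u_int i;

    for (;;) {
        /*
         * Sleep until a worker thread's rate-limited allocation
         * failure triggers a wakeup(kd).  The wakeup is racy, but a
         * missed one only delays us until the next failure.
         */
        tsleep(kd, PZERO, "ktls_alloc", 0);

        /*
         * M_WAITOK without M_NORECLAIM allows the VM system to pause
         * this thread while it defragments physical memory.
         */
        for (i = 0; i < nitems(bufs); i++)
            bufs[i] = uma_zalloc(ktls_buffer_zone, M_WAITOK);

        /* Release the buffers back to the zone for the workers. */
        for (i = 0; i < nitems(bufs); i++)
            uma_zfree(ktls_buffer_zone, bufs[i]);
    }
}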

This patch has been in use at Netflix on a handful of servers, and seems to fix the issue.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

gallatin created this revision.
gallatin edited the summary of this revision. (Show Details)

Would uma_zone_reserve take care of this? The patch as proposed should not be necessary.

In D31260#703987, @mjg wrote:

Would uma_zone_reserve take care of this? The patch as proposed should not be necessary.

That was my first thought, but after consideration, I did not want to reserve this memory. Deciding how much to reserve is hard, workload- and scenario-dependent, and would be a nightmare to tune. For something with tiny items (e.g., pbufs), reserving more than enough is fine, but here we have upwards of 12GB used for ktls buffers on our faster machines. I don't want to have to reserve more than that generally, or try to tune for specific scenarios.

uma_zone_reserve() won't work here - it applies to kegs, but this is a cache zone which imports directly from the physical memory allocator. Generally, I think we want to make some effort to defragment memory when allocation failures occur.
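
(For context, a cache zone is created with uma_zcache_create(9), supplying import/release callbacks that go straight to the physical allocator; a rough sketch, with the exact size and flag arguments assumed:)

ktls_buffer_zone = uma_zcache_create("ktls_buffers",
    roundup2(ktls_maxlen, PAGE_SIZE), NULL, NULL, NULL, NULL,
    ktls_buffer_import, ktls_buffer_release, NULL,
    UMA_ZONE_FIRSTTOUCH);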

sys/kern/uipc_ktls.c
433

I think we shouldn't bother creating threads for domains for which VM_DOMAIN_EMPTY(domain) is true.

447

Extra whitespace.

2017

I would also suggest using atomic_load_int() here and atomic_store_int() in the thread.

2225
2228

Extra whitespace.

2229

Perhaps make this conditional on bootverbose.

2249

I'd suggest rewriting this so that ktls_max_alloc is loaded once, with atomic_load_int(). Otherwise there's nothing preventing the compiler from issuing multiple loads of ktls_max_alloc even though the code is trying to make sure it gets loaded only once.
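
That is, something like the following sketch (the loop body is elided):

u_int i, max_alloc;

/* Snapshot once; a plain read of the global could legally be re-issued. */
max_alloc = atomic_load_int(&ktls_max_alloc);
for (i = 0; i < max_alloc; i++) {
    ...
}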

2253

I think a comment explaining what's going on here would be worthwhile.

2254

Do you have a local change to ktls_buffer_import() which has it try to reclaim if M_WAITOK is specified?

2258

One downside of this approach is that the items only get freed to the current CPU. I don't have a better alternative to suggest though. It might be possible to synchronize with ktls_buffer_import() such that it checks a per-domain pool if vm_page_alloc_contig() fails, and the thread tries to populate the pool on demand.

sys/kern/uipc_ktls.c
2254

I forgot how M_NORECLAIM was passed. This question doesn't make sense, please ignore.

gallatin marked 5 inline comments as done.

Addressed Mark's feedback.

sys/kern/uipc_ktls.c
433

I would be tempted to use a separate loop for these threads instead of doing them inside the CPU loop. I'm not sure if we have an easy way to enumerate domains (presumably there is a way), so something like:

CPU_FOREACH(i) {
    ...
}

if (ktls_sw_buffer_cache) {
    for (domain = 0; domain < MAXMEMDOM; domain++) {
        if (VM_DOMAIN_EMPTY(domain))
            continue;
        error = kproc_kthread_add(ktls_alloc_thread, ...);
        ...
    }
}

This avoids the need for checking alloc_td.td for NULL, for example, but I think it's also a bit more obvious to the reader that we are creating a single thread per domain vs. the other loop creating a thread per CPU.
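
(On enumerating domains: the VM layer exports vm_ndomains, so the loop could be bounded by it instead of MAXMEMDOM; a sketch:)

for (domain = 0; domain < vm_ndomains; domain++) {
    if (VM_DOMAIN_EMPTY(domain))
        continue;
    ...
}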

2017–2019
2255

Maybe s/at runtime/normally/ in these two sentences, or even be more specific and say "when encrypting" or some such. I'd also perhaps adjust the ordering a bit to be something like:

/*
 * When encrypting TLS records, ktls buffers are allocated with M_NOWAIT and
 * M_NORECLAIM to avoid overhead in the ktls worker threads.  If ktls worker
 * threads block, a backlog of TLS buffers can develop, leading to surges of
 * traffic and potential NIC output drops.  Over time, however, memory can
 * become fragmented and ktls buffer allocations fail.
 *
 * This thread allocates ktls buffers with M_WAITOK and without
 * M_NORECLAIM to force memory defragmentation and ensure there is an
 * adequate pool of ktls buffers available for the ktls worker threads.
 */
This revision is now accepted and ready to land. Jul 30 2021, 8:52 PM
gallatin marked 2 inline comments as done.
  • Incorporated jhb's feedback
  • fixed a typo in a comment
  • changed the alloc thread's name so that, after truncation, it appears distinct from the worker threads in 'ps axH'
This revision now requires review to proceed. Aug 3 2021, 10:02 PM
jhb added inline comments.
sys/kern/uipc_ktls.c
444

I'm not sure (Mark might know) whether this also needs a check like:

if (CPU_EMPTY(&cpuset_domain[domain]))
    continue;
447

So using "KTLS_alloc" here instead of "KTLS" while still using &ktls_proc for both is a bit odd. Probably this should just be "KTLS", or if you want a completely separate kproc, then use a separate struct proc (ktls_alloc_proc or the like).

This revision is now accepted and ready to land. Aug 4 2021, 10:39 PM
sys/kern/uipc_ktls.c
444

Hmm, I think that's possible. At least, I don't think there's any reason you can't have a PXM with memory but no CPUs. So I agree that we should have such a check.

Address new feedback from jhb:

  • put proc name back to KTLS
  • check for domains with no CPUs
This revision now requires review to proceed. Aug 5 2021, 2:13 PM
This revision was not accepted when it landed; it landed in state Needs Review. Aug 5 2021, 2:23 PM
This revision was automatically updated to reflect the committed changes.