
mlx5: Preallocate ktls tags asynchronously
ClosedPublic

Authored by gallatin on Tue, Nov 4, 12:02 AM.

Details

Summary

Pre-allocating a large number of mlx5 ktls tags can greatly increase the boot time when done synchronously. At Netflix, we've seen watchdog resets due to slow boot times that I tracked to TLS tag preallocation using @cperciva's boot timestamping / flamegraphs.

Change tag preallocation to happen asynchronously when an interface is brought up:

  • a new mlx5-tls-prealloc_wq workqueue is allocated when preallocation is desired, and started when an interface is opened
  • the bulk of the prealloc code remains the same, except the allocations are now M_NOWAIT. M_NOWAIT is needed because the preallocation is done asynchronously, and tag allocation is not instant, so we could race with a real TLS session trying to allocate a tag. Note that in this case we take allocation failure as a sign that we were unable to obtain the entire zone due to there being other consumers (see the sketch below). This was suggested by @markj as a way to keep things simple, after discussing why uma_zone_get_cur() didn't immediately report a fully allocated zone. If this turns out to be problematic, we could use uma_zone_set_maxaction() to stop pre-allocations (also suggested by Mark).
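
A minimal sketch of the shape this takes (the function name, the tags array, and the max argument are illustrative, not the actual patch):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    /*
     * Illustrative prefill: allocate with M_NOWAIT until the zone limit
     * (or a competing consumer) stops us, then release everything back
     * so the zone's caches stay populated.
     */
    static void
    tls_prealloc_task(uma_zone_t zone, void **tags, int max)
    {
            int i;

            for (i = 0; i < max; i++) {
                    /*
                     * Failure means the zone is at its limit, possibly
                     * because real TLS sessions already hold some tags;
                     * that is the signal to stop.
                     */
                    tags[i] = uma_zalloc(zone, M_NOWAIT);
                    if (tags[i] == NULL)
                            break;
            }
            /* Release the tags unconditionally; the zone caches them. */
            while (--i >= 0)
                    uma_zfree(zone, tags[i]);
    }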
Test Plan

Boot a machine repeatedly with hw.mlx5.tls_max_tags=524288 and real TLS users (nginx workers), and verify using vmstat -z that the zone is fully populated.

Diff Detail

Repository
rG FreeBSD src repository
Lint: Not Applicable
Unit Tests: Not Applicable

Event Timeline

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
106

I think this deserves a comment explaining that we rely on the zone limit to stop the loop.

sys/dev/mlx5/mlx5_en/mlx5_en_main.c
3339

Do you also want to do this work if these capabilities are toggled via mlx5e_ioctl()?

  • added comment regarding allocation failure behavior, as suggested by markj
gallatin added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_main.c
3339

No, not really. That's overkill for our scenario.

This revision is now accepted and ready to land. (Tue, Nov 4, 2:00 PM)
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
108

The 'and release the tags ...' part of the comment is slightly misleading. We release the tags unconditionally, not only in case of premature zone depletion.

115

Would it make sense to only iterate up to i there, to avoid pointless calls to uma_zfree() in case of early prefill loop termination?

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
108

Not entirely true. If we do an M_WAITOK allocation and it blocks because the zone is full (at its max), we can wait an unbounded amount of time for other tag consumers to release what they are holding, while never releasing anything ourselves. Similarly, if we loop on M_WAITOK without something like a uma_zone_set_maxaction() callback that sets a flag, we can loop potentially forever and never release anything. I can try to reword slightly if you feel strongly, but I'm not sure it will make it clearer...
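
For contrast, a hypothetical M_WAITOK variant of the prefill loop (using the same illustrative names as the sketch in the Summary; not code from the patch):

    /*
     * If the zone is at its limit, each uma_zalloc() here sleeps until
     * some other consumer frees a tag; since this loop never frees what
     * it already holds, it can wait an unbounded amount of time.
     */
    for (i = 0; i < max; i++)
            tags[i] = uma_zalloc(zone, M_WAITOK);   /* may sleep forever */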

115

I considered this, but I checked uma_zfree_arg() and on a standard, non-debug kernel, the first instruction is a check for a NULL pointer. So I'd prefer to keep it simple rather than increase complexity for a very minor benefit.

kib added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
108

I actually do not understand your reply. I am saying that we always release what was allocated back to the zone, regardless of when we finished allocating. But the comment placed there gives the (mistaken) impression that we do it only when uma_zalloc() failed.

Ignore it if you want.

115

There is no complexity, the loop would be written as

for (j = 0; j < i; j++)
    uma_zfree(priv->tls.zone, tags[j]);

and that is it. Otherwise the thread might do 100K calls + checks, which is IMO suboptimal.

But again, ignore me if you prefer so.

  • tried to address kib's concerns
This revision now requires review to proceed. (Wed, Nov 5, 12:40 PM)
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
108

I tried to re-word the comment to make it more clear. If you have suggestions on wording, I'm open to them.

115

Let's compromise. I added a check for tags[i] == NULL to terminate the loop.

When reading the code, doing what you suggest adds a bit more mental overhead for my brain (because I tend to want to verify that the second loop is actually correct), which I'd like to avoid. I think checking for a null tag is simple enough, but using a different loop variable increases mental complexity more than it's worth (at least for me).
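
The compromise would look roughly like this (illustrative names again; the point is that the prefill loop leaves a NULL in the slot where allocation failed, so the free loop can stop there):

    for (i = 0; i < max; i++) {
            if (tags[i] == NULL)    /* prefill stopped here */
                    break;
            uma_zfree(zone, tags[i]);
    }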

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
114

I don't think this compiler barrier serves any purpose, BTW.

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
114

You might be correct, but it's a pre-existing mistake with little overhead, so I'd prefer to leave it in place and not mix its removal into this change.

kib added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
114

I think the thought was the same: in a hosted environment, a too-smart compiler could already optimize p = malloc(100); free(p); into nothing. uma_zalloc() has additional side effects, so such an optimization would be wrong here, but the membar makes this explicit.
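
A sketch of the pattern under discussion (__compiler_membar() is the real primitive from sys/cdefs.h; the surrounding lines are illustrative):

    /*
     * In a hosted environment the compiler may elide a matched
     * allocate/free pair outright. uma_zalloc() has side effects the
     * compiler cannot see through, so the elision would already be
     * invalid here, but the barrier states that intent explicitly.
     */
    tag = uma_zalloc(zone, M_NOWAIT);
    __compiler_membar();            /* no elision/reordering across this */
    uma_zfree(zone, tag);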

This revision is now accepted and ready to land. (Wed, Nov 5, 11:07 PM)
This revision was automatically updated to reflect the committed changes.