
mlx5: Preallocate ktls tags asynchronously
ClosedPublic

Authored by gallatin on Tue, Nov 4, 12:02 AM.

Details

Summary

Pre-allocating a large number of mlx5 ktls tags can greatly increase the boot time when done synchronously. At Netflix, we've seen watchdog resets due to slow boot times that I tracked to TLS tag preallocation using @cperciva's boot timestamping / flamegraphs.

Change tag preallocation to happen asynchronously when an interface is brought up:

  • a new mlx5-tls-prealloc_wq workqueue is allocated when preallocation is desired, and started when an interface is opened
  • the bulk of the prealloc code remains the same, except the allocations are now M_NOWAIT. M_NOWAIT is needed because the preallocation is done asynchronously, and tag allocation is not instant, so we could race with a real TLS session trying to allocate a tag. Note that in this case we take allocation failure as a sign that we were unable to obtain the entire zone due to there being other consumers (see the sketch below). This was suggested by @markj as a way to keep things simple, after discussing why uma_zone_get_cur() didn't immediately report a fully allocated zone. If this turns out to be problematic, we could use uma_zone_set_maxaction() to stop pre-allocations (also suggested by Mark).
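
A minimal sketch of the shape this takes (the function name, the tags array, and the max argument are illustrative, not the actual patch):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    /*
     * Illustrative prefill: allocate with M_NOWAIT until the zone limit
     * (or a competing consumer) stops us, then release everything back
     * so the zone's caches stay populated.
     */
    static void
    tls_prealloc_task(uma_zone_t zone, void **tags, int max)
    {
            int i;

            for (i = 0; i < max; i++) {
                    /*
                     * Failure means the zone is at its limit, possibly
                     * because real TLS sessions already hold some tags;
                     * that is the signal to stop.
                     */
                    tags[i] = uma_zalloc(zone, M_NOWAIT);
                    if (tags[i] == NULL)
                            break;
            }
            /* Release the tags unconditionally; the zone caches them. */
            while (--i >= 0)
                    uma_zfree(zone, tags[i]);
    }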
Test Plan

Boot a machine repeatedly with hw.mlx5.tls_max_tags=524288 and real TLS users (nginx workers), and verify using vmstat -z that the zone is fully populated.

Diff Detail

Repository
rG FreeBSD src repository
Lint: Not Applicable
Unit Tests: Not Applicable

Event Timeline

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
106

I think this deserves a comment explaining that we rely on the zone limit to stop the loop.

sys/dev/mlx5/mlx5_en/mlx5_en_main.c
3339

Do you also want to do this work if these capabilities are toggled via mlx5e_ioctl()?

  • added comment regarding allocation failure behavior, as suggested by markj
gallatin added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_main.c
3339

No, not really. That's overkill for our scenario.

This revision is now accepted and ready to land. (Tue, Nov 4, 2:00 PM)
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
108

The 'and release the tags ...' part of the comment is slightly misleading. We release the tags unconditionally, not only in case of premature zone depletion.

115

Would it make sense to only iterate up to i there, to avoid pointless calls to uma_zfree() in case of early prefill loop termination?

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
108

Not entirely true. If we do an M_WAITOK allocation and it blocks because the zone is full (at its max), we can wait an unbounded amount of time for other tag consumers to release what they are holding, while never releasing anything ourselves. Similarly, if we loop on M_WAITOK without something like a uma_zone_set_maxaction() callback that sets a flag, we can loop potentially forever and never release anything. I can try to reword slightly if you feel strongly, but I'm not sure it will make it clearer...
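
For contrast, a hypothetical M_WAITOK variant of the prefill loop (using the same illustrative names as the sketch in the Summary; not code from the patch):

    /*
     * If the zone is at its limit, each uma_zalloc() here sleeps until
     * some other consumer frees a tag; since this loop never frees what
     * it already holds, it can wait an unbounded amount of time.
     */
    for (i = 0; i < max; i++)
            tags[i] = uma_zalloc(zone, M_WAITOK);   /* may sleep forever */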

115

I considered this, but I checked uma_zfree_arg() and on a standard, non-debug kernel, the first instruction is a check for a NULL pointer. So I'd prefer to keep it simple rather than increase complexity for a very minor benefit.

kib added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
108

I actually do not understand your reply. I am saying that we always release what was allocated back to the zone, regardless of when we finished allocating. But the comment placed there gives the (mistaken) impression that we do it only when uma_zalloc() failed.

Ignore it if you want.

115

There is no complexity, the loop would be written as

for (j = 0; j < i; j++)
    uma_zfree(priv->tls.zone, tags[j]);

and that is it. Otherwise the thread might do 100K calls + checks, which is IMO suboptimal.

But again, ignore me if you prefer so.

  • tried to address kib's concerns
This revision now requires review to proceed. (Wed, Nov 5, 12:40 PM)
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
108

I tried to re-word the comment to make it more clear. If you have suggestions on wording, I'm open to them.

115

Let's compromise. I added a check for tags[i] == NULL to terminate the loop.

When reading the code, doing what you suggest adds a bit more mental overhead for my brain (because I tend to want to verify that the second loop is actually correct), which I'd like to avoid. I think checking for a null tag is simple enough, but using a different loop variable increases mental complexity more than it's worth (at least for me).
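
The compromise would look roughly like this (illustrative names again; the point is that the prefill loop leaves a NULL in the slot where allocation failed, so the free loop can stop there):

    for (i = 0; i < max; i++) {
            if (tags[i] == NULL)    /* prefill stopped here */
                    break;
            uma_zfree(zone, tags[i]);
    }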

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
114

I don't think this compiler barrier serves any purpose, BTW.

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
114

You might be correct, but it's a pre-existing mistake with little overhead, so I'd prefer to leave it in place and not mix its removal into this change.

kib added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
114

I think the thought was the same: in a hosted environment, a too-smart compiler could already optimize p = malloc(100); free(p); into nothing. uma_zalloc() has additional side effects, so such an optimization would be wrong here, but the membar makes this explicit.
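
A sketch of the pattern under discussion (__compiler_membar() is the real primitive from sys/cdefs.h; the surrounding lines are illustrative):

    /*
     * In a hosted environment the compiler may elide a matched
     * allocate/free pair outright. uma_zalloc() has side effects the
     * compiler cannot see through, so the elision would already be
     * invalid here, but the barrier states that intent explicitly.
     */
    tag = uma_zalloc(zone, M_NOWAIT);
    __compiler_membar();            /* no elision/reordering across this */
    uma_zfree(zone, tag);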

This revision is now accepted and ready to land. (Wed, Nov 5, 11:07 PM)
This revision was automatically updated to reflect the committed changes.