
Break out IPv4 reassembly locking into per-hash-bucket locks.
ClosedPublic

Authored by adrian on Mar 18 2015, 8:55 PM.

Details

Summary

The current IPv4 reassembly code uses a single lock to protect
all of the hash table buckets. This causes massive lock contention
on multi-core servers when a lot of reassembly occurs.

Reassembly tends to happen more often in LAN environments with
UDP-heavy traffic, e.g. memcached.
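
For context, a minimal sketch of the general shape of the change, using hypothetical names (ipq_mtx, ipq_lock_init) and ignoring the VNET plumbing; the real diff pads each lock to a cache line and differs in detail:

/*
 * Sketch only, not the committed patch: replace the single global
 * reassembly mutex with one mutex per hash bucket, so fragments that
 * hash to different buckets no longer contend on the same lock.
 */
#define	IPREASS_NHASH	64

static TAILQ_HEAD(ipqhead, ipq) ipq[IPREASS_NHASH];	/* per-bucket queues */
static struct mtx ipq_mtx[IPREASS_NHASH];		/* per-bucket locks */

#define	IPQ_LOCK(i)	mtx_lock(&ipq_mtx[(i)])
#define	IPQ_UNLOCK(i)	mtx_unlock(&ipq_mtx[(i)])

static void
ipq_lock_init(void)
{
	int i;

	for (i = 0; i < IPREASS_NHASH; i++)
		mtx_init(&ipq_mtx[i], "ipqlock", NULL, MTX_DEF);
}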

Test Plan

Tested with local UDP traffic, with / without fragments

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

adrian retitled this revision from to Break out IPv4 reassembly locking into per-hash-bucket locks..
adrian updated this object.
adrian edited the test plan for this revision. (Show Details)
adrian added a reviewer: network.
hiren added a reviewer: hiren.
hiren added a subscriber: hiren.

"something" in the name of the function ip_reass_purge_something() seems...not clean but I don't have a better suggestion. :-)
Looks okay to me otherwise.

This revision is now accepted and ready to land. Mar 18 2015, 11:54 PM

Change "something" to "element" and then the name will be clearer. The locking itself looks OK from first glance.

See inlined comments, please.

sys/netinet/ip_input.c
174 ↗(On Diff #4279)

I'd suggest just including the mutex in the hashrow header instead of using padalign. We are going to hit the cacheline with the hashrow header anyway when we lock the corresponding mutex.
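
To illustrate this suggestion, a hypothetical layout (not committed code) that pairs each hash row head with its mutex, instead of keeping a separate padaligned lock array as in the sketch above:

/*
 * Sketch of glebius' suggestion: keep the bucket head and its mutex
 * in one structure, so taking the lock touches the same cacheline as
 * the list head it protects.
 */
struct ipqbucket {
	struct ipqhead	head;	/* TAILQ of struct ipq, as before */
	struct mtx	lock;	/* protects only this bucket */
};

static struct ipqbucket	ipq_buckets[IPREASS_NHASH];

#define	IPQ_LOCK(i)	mtx_lock(&ipq_buckets[(i)].lock)
#define	IPQ_UNLOCK(i)	mtx_unlock(&ipq_buckets[(i)].lock)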

426 ↗(On Diff #4279)

Why do you need IPQ_LOCK_ALL? Why can't you go through the hash rows one by one: lock one, drain it, unlock it, move on to the next?

By the way, the mutex initialization lacks MTX_DUPOK, but the current code clearly requires it. Definitely not tested with WITNESS :)
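
For reference, a rough sketch of the lock/drain/unlock pattern described above, reusing the hypothetical ipq_buckets[] layout from the earlier sketch and the existing ip_freef() helper; with only one bucket lock held at a time, MTX_DUPOK is no longer needed:

/*
 * Sketch only: drain every reassembly queue one bucket at a time,
 * never holding more than one bucket lock, instead of IPQ_LOCK_ALL.
 */
static void
ipq_drain_sketch(void)
{
	struct ipq *fp;
	int i;

	for (i = 0; i < IPREASS_NHASH; i++) {
		IPQ_LOCK(i);
		while ((fp = TAILQ_FIRST(&ipq_buckets[i].head)) != NULL)
			ip_freef(&ipq_buckets[i].head, fp);
		IPQ_UNLOCK(i);
	}
}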

1241 ↗(On Diff #4279)

So we always drain from the bucket next to the current one? Can this be exploited? For example, an attacker knows that a certain peer of a victim host always sends fragments. The attacker calculates the peer's hash slot, then generates traffic that overflows the previous slot.

My suggestion is to simply use a 'static int i' inside ip_reass_purge_something() and always start from the current value of i. It is okay for access to it to be racy. ip_reass_purge_something() should only try-lock hash slots, skipping the locked ones.
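
A sketch of that idea (hypothetical code, again assuming the ipq_buckets[] layout above; the name follows the suggested rename of ip_reass_purge_something() to ip_reass_purge_element()):

/*
 * Sketch only: free one fragment queue to relieve memory pressure.
 * Start from a deliberately racy static index rather than "current
 * bucket + 1", and only trylock buckets so busy ones are skipped
 * rather than blocked on.
 */
static void
ip_reass_purge_element(void)
{
	static int start;	/* racy by design */
	struct ipq *fp;
	int i, idx;

	for (i = 0; i < IPREASS_NHASH; i++) {
		idx = (start + i) % IPREASS_NHASH;
		if (mtx_trylock(&ipq_buckets[idx].lock) == 0)
			continue;	/* bucket is busy, skip it */
		fp = TAILQ_LAST(&ipq_buckets[idx].head, ipqhead);
		if (fp != NULL)
			ip_freef(&ipq_buckets[idx].head, fp);
		mtx_unlock(&ipq_buckets[idx].lock);
		if (fp != NULL) {
			start = idx + 1;
			return;
		}
	}
}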

Probably outside the scope of this patch, but it would be nice to have IP fragment reassembly as a library, used by both ip_input() and pf(4), without duplicating functionality.

... yup, I think the next thing should be to migrate this into a library and have it be reusable / unit testable.

But that'll come next.

.. and yes, it's running on HEAD + WITNESS, but it didn't panic.

The reason I haven't gone and completely unrolled things is to keep the diff minimal, but yes, I should really just lock/drain/unlock.

adrian edited edge metadata.

Address / clarify things!

This revision now requires review to proceed. Mar 20 2015, 4:47 PM

Hi Gleb:

  • It now locks buckets individually inside drain-all, rather than lock-all then drain-all.
  • It only initialises the locks once: the IP init path runs the bottom half of the function only for the default vnet, not for /all/ vnets.

Thanks.

Do you agree with my comments about embedding the mutexes into the hash slots? And about the ip_reass_purge_element() logic? Do you plan to implement that?

sys/netinet/ip_input.c
1352 ↗(On Diff #4302)

The function name should probably be ip_drain_vnet().

I'll look at updating the locks to be per hash bucket slot in a follow-up commit. I'll have to pull apart the initialiser and create a specific struct for it.

If it is possible, doing everything properly in one commit is preferred.

What are the benefits of having the locks be per-vnet right now, rather than just per-bucket?

I'm not suggesting that. I suggest embedding the locks into the buckets instead of making an array of locks.

I tested it out with a handful of vnets; there wasn't any performance benefit.

I'd like to not bloat things out for people running vnets on the smaller embedded platforms.

You probably misunderstand me; my suggestion reduces code rather than bloating it. And my suggestion is entirely orthogonal to vnets. I suggest embedding the locks into the buckets, not making them per-vnet.

Right, but the buckets are per-vnet, so if I embed them into buckets, it'll end up being per-vnet anyway.

Yes, that would be a side effect. But do we optimize for people who run numerous jails on platforms with small memory? If we optimize for the common case, keeping the mutex and the data it protects together is the right way.

adrian updated this revision to Diff 4732.

Closed by commit rS281239 (authored by @adrian).