During SYN floods, fallback exclusively to SYN cookies for a small period
Closed | Public

Authored by jtl on Sep 13 2019, 7:09 PM.

Details

Summary

This contains 3 proposed commits:

  1. Remove the unused sch parameter to syncache_respond().
  2. Access the syncache secret directly from the V_tcp_syncache variable, rather than indirectly through the backpointer to the tcp_syncache structure stored in the hashtable bucket. This also removes the requirement in syncookie_generate() and syncookie_lookup() that the syncache hashtable bucket be locked.
  3. Add new functionality to switch to using cookies exclusively when we are under attack. This code uses an overflow of a SYN cache hash bucket as a heuristic to detect an attack. When an attack is detected, the code falls back to using only SYN cookies for 15 seconds. If the attack continues, the fallback time is increased exponentially until it reaches a maximum (16 minutes). When an attack is detected, the code logs a message so the user can decide whether any action is necessary.
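
The fallback schedule in the third item can be illustrated with a small userspace sketch. This is not the code from the diff: the helper name pause_secs() and the two constants are hypothetical, and the doubling factor is an assumption based on the 15, 30, 60 second progression described in the review discussion.

```c
/*
 * Userspace sketch of the fallback schedule, not the kernel code.
 * PAUSE_BASE_SECS and PAUSE_MAX_SECS are hypothetical names for the
 * 15-second and 16-minute values given in the summary.
 */
#define PAUSE_BASE_SECS	15
#define PAUSE_MAX_SECS	(16 * 60)

/*
 * Return the pause length, in seconds, after `backoff` consecutive
 * extensions of the same attack (0 means the first detection).
 * Assumes the pause doubles on each extension until it hits the cap.
 */
static int
pause_secs(int backoff)
{
	int secs = PAUSE_BASE_SECS;

	while (backoff-- > 0 && secs < PAUSE_MAX_SECS)
		secs *= 2;
	/* Clamp in case the base does not divide the cap evenly. */
	if (secs > PAUSE_MAX_SECS)
		secs = PAUSE_MAX_SECS;
	return (secs);
}
```

Under these assumptions the schedule runs 15, 30, 60, 120, 240, 480, 960 seconds, so the 16-minute maximum is reached on the sixth extension and every later extension stays at 960 seconds.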

Test Plan

Tested with a 6 Mpps SYN flood. Before the change, CPU was at ~33%, packet loss was quite high, and user-space transfers stopped. After the change, CPU was at ~30%, packet loss was roughly 50% lower, and user-space transfers continued.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

sys/netinet/tcp_syncache.c
2403 ↗(On Diff #62057)

This is kind of tricky. It seems like we'll only detect an extension of the attack and advance the backoff if we refill the syn cache (and hence try to pause) in the same second as the callout was scheduled to run. It seems like under an attack, a callout might run "late" and unpause after the deadline, leading to us never advancing the backoff. Have you considered adding some slop? Or am I missing something?

sys/netinet/tcp_syncache.c
2403 ↗(On Diff #62057)

Let me describe how the code is supposed to work. You can then let me know if I've somehow managed to write the code incorrectly.

When we detect an overflow, V_tcp_syncache.pause_until is set to time_uptime + a pause time. Once we reach V_tcp_syncache.pause_until, the callout will run and clear the paused flag. If we detect another overflow before V_tcp_syncache.pause_until + the last pause time we used, we will consider it an extension of the same attack.

To give a practical example, assume we detect an overflow at time_uptime = 60. V_tcp_syncache.pause_until will be set to 75. At time_uptime = 75, the callout will run and clear the paused flag. If we detect a new overflow at time_uptime = 80, we will consider it part of the same attack (because 80 is less than 75 + 15) and set V_tcp_syncache.pause_until to 110 (80 + 30). On the other hand, if we detect a new overflow at time_uptime = 95, we will consider it a new attack (because 95 is more than 75 + 15) and set V_tcp_syncache.pause_until to 110 (95 + 15).

For another example, assume we detect an overflow at time_uptime = 60. We pause until time_uptime = 75. At time_uptime = 80, we detect another overflow. We will pause until time_uptime = 110 (80 + 30). At time_uptime = 115, we detect another overflow. We will pause until time_uptime = 175 (115 + 60). If we detect another overflow at time_uptime = 225, we will consider it part of the same attack (because 225 is less than 175 + 60).
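
The rule in these examples can be written down as a short userspace sketch. It is an illustration under stated assumptions rather than the code in the diff: struct pause_state, pause_on_overflow(), and the constants are hypothetical names, the pause is assumed to double on each extension, and the comparison is assumed to be strict (the examples do not say what happens exactly at pause_until + the last pause time). The callout that clears the paused flag is left out.

```c
/* Hypothetical names for the 15-second base and 16-minute cap. */
#define PAUSE_BASE_SECS	15
#define PAUSE_MAX_SECS	(16 * 60)

/* Hypothetical stand-in for the pause fields kept in V_tcp_syncache. */
struct pause_state {
	long pause_until;	/* uptime at which the current pause ends */
	long last_pause;	/* length of the last pause; 0 if none yet */
};

/*
 * Handle a bucket overflow detected at uptime `now` (seconds).
 * An overflow before pause_until + last_pause extends the same attack
 * and doubles the pause (assumed factor); otherwise it starts a new
 * attack at the base pause. Returns the pause length chosen.
 */
static long
pause_on_overflow(struct pause_state *ps, long now)
{
	long secs;

	if (ps->last_pause != 0 && now < ps->pause_until + ps->last_pause) {
		secs = ps->last_pause * 2;
		if (secs > PAUSE_MAX_SECS)
			secs = PAUSE_MAX_SECS;
	} else {
		secs = PAUSE_BASE_SECS;
	}
	ps->last_pause = secs;
	ps->pause_until = now + secs;
	return (secs);
}
```

Starting from a zeroed state, overflows at 60, 80, and 115 give pause_until values of 75, 110, and 175, matching the second example. An overflow at 225 is then treated as the same attack because 225 is less than 175 + 60.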

Does this clear things up? Or, do you think the code works differently than this?

gallatin added inline comments.
sys/netinet/tcp_syncache.c
2403 ↗(On Diff #62057)

Yes, that makes sense.

rrs added inline comments.
sys/netinet/tcp_syncache.c
307 ↗(On Diff #62057)

Any particular reason you did not use callout_init_mtx and just hook the mutex to the callout?

This revision is now accepted and ready to land. Sep 17 2019, 12:55 PM
sys/netinet/tcp_syncache.c
307 ↗(On Diff #62057)

The syncache_pause() function holds this mutex longer than (and uses it to synchronize more than) scheduling the callout. I assumed that associating the mutex with the callout using callout_init_mtx would cause an unnecessary recursive acquisition of the mutex when scheduling the callout; however, I now see that is not the case. I can change this to use callout_init_mtx.

Switch to using callout_init_mtx to let the callout system acquire the pause lock.

This revision now requires review to proceed. Sep 18 2019, 12:00 PM
jtl marked 3 inline comments as done. Sep 18 2019, 12:01 PM
This revision is now accepted and ready to land. Sep 20 2019, 1:15 PM