Page MenuHomeFreeBSD

pf: if a new RDR state connect be created, modulate src port
ClosedPublic

Authored by markj on Mar 24 2024, 3:57 AM.
Tags
None
Referenced Files
F103098388: D44488.diff
Wed, Nov 20, 11:12 PM
Unknown Object (File)
Thu, Nov 14, 9:44 PM
Unknown Object (File)
Thu, Nov 14, 6:47 PM
Unknown Object (File)
Thu, Nov 14, 3:28 PM
Unknown Object (File)
Wed, Nov 13, 7:46 AM
Unknown Object (File)
Wed, Nov 13, 4:24 AM
Unknown Object (File)
Sun, Nov 10, 4:06 PM
Unknown Object (File)
Wed, Nov 6, 5:45 PM

Details

Summary

If multiple inbound connections through a RDR rule cannot be accomodated
due to conflicting src/dst ip/port combination, modulate the src port in
the state entry to allow the connection rather than dropping it.

Co-authored-by: Allan Jude <allanjude@freebsd.org>
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.

Test Plan
all tcp 203.0.113.50:8000 (198.51.100.1:7777) <- 198.51.100.50:5454       FIN_WAIT_2:ESTABLISHED
all tcp 203.0.113.50:8000 (198.51.100.2:7777) <- 198.51.100.50:5710 (198.51.100.50:5454)       FIN_WAIT_2:ESTABLISHED
all tcp 203.0.113.50:8000 (198.51.100.3:7777) <- 198.51.100.50:5966 (198.51.100.50:5454)       FIN_WAIT_2:ESTABLISHED

The first connection is RDR'd normally. The second and 3rd have the src port on the 'far' side of the RDR rewritten (with state entries to route it back correctly)

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 58337
Build 55225: arc lint + arc unit

Event Timeline

This also really needs a test case.

sys/netpfil/pf/pf.c
5028 ↗(On Diff #136142)

Declarations go at the top of the block.

5049 ↗(On Diff #136142)

There's a pf SDT namespace, so that really wants to be in there.

5053 ↗(On Diff #136142)

That could roll over into port 0.

It probably also needs a bailout condition to avoid spinning forever if there are no more available ports.
See PR 233867 for a similar bug.

allanjude edited the test plan for this revision. (Show Details)

Add test case

tests/sys/netpfil/pf/rdr.sh
194

Why not just nc?

199

That really needs to be compiled by the makefile, not on every test run.

See tests/sys/net's randsleep for an example.

allanjude marked an inline comment as done.

Address feedback

sys/netpfil/pf/pf.c
5053 ↗(On Diff #136142)

Is there a specific threshold that makes sense to give up at? Or just when we would overflow and loop back to port 0?

sys/netpfil/pf/pf.c
5053 ↗(On Diff #136142)

I was inclined to say once we've looped though all 65K ports, because then we know we can't possibly handle this state, but that's 65K hashes and lock / unlocks and that's going to be very, very expensive.

So for we may want to limit it much more than that. Given that we only try once now just trying the next say 32 ports may be enough to fix most of the occurrences.

This also suggests two test cases to add: using port 65535, so we have to roll over to 0 (and test that we do not use port 0), as well as a scenario where there are no ports available in whatever range of tries we settle on.

What real-life situation is this fix for? If this is for connections coming from behind a 3rd party SNAT, where SNAT reuses source ports faster than pf expires states, then maybe tuning pf timeouts would be enough. Or we could allow pf states to transition from TCPS_FIN_WAIT_2 back to TCPS_SYN_SENT, basically implementing SO_REUSEPORT for pf.

What real-life situation is this fix for? If this is for connections coming from behind a 3rd party SNAT, where SNAT reuses source ports faster than pf expires states, then maybe tuning pf timeouts would be enough. Or we could allow pf states to transition from TCPS_FIN_WAIT_2 back to TCPS_SYN_SENT, basically implementing SO_REUSEPORT for pf.

It is for behind a load balancer, where the initial connection comes in one public IP #1, but then all future connections come in on public IP #2, but the external source port from the incoming IP stays the same.
So when PF NAT's it to the web server behind the PF RDR rules, the incoming source ports conflict.

markj added inline comments.
sys/netpfil/pf/pf.c
5050 ↗(On Diff #136146)

Can we get away with doing this on demand, i.e., try to insert a new state without checking and try with a different port if pf_state_insert() returns EEXIST?

(I'll try to answer my own question, but posing it here in case someone has a quick reason to say "no".)

sys/netpfil/pf/pf.c
5050 ↗(On Diff #136146)

I think that ought to work.

There's a log entry that will potentially fire in that case, but we can either fix that or live with it, given that collisions ought to be rare anyway.

It's probably even the best approach, because trying to find a free port here and then running pf_state_insert() is potentially racy.

I've spent some time looking into the underlying problem, and I'm not sure that this approach (quietly rewriting the src port) is correct. The pf.conf manual page says:

In addition to modifying the address, some translation rules may modify
source or destination ports for tcp(4) or udp(4) connections; implicitly
in the case of nat rules and explicitly in the case of rdr rules.  Port
numbers are never translated with a binat rule.

but with this change we are implicitly rewriting the source port. What explicit mechanism is the man page alluding to here?

As an aside, I also found that if a nat rule matches an outbound packet, but we fail to allocate a src port for it (e.g., because static-port is configured and a collision occurs, or pf_get_sport() fails to find a free port), then pf will pass the packet out without performing any translation. That is, we end up passing out unmodified packets from the internal network. This happens in the "NAT proxy port allocation failed" path in pf_get_translation(); I'd really expect pf to drop the packet instead. Is this a bug?

but with this change we are implicitly rewriting the source port. What explicit mechanism is the man page alluding to here?

That'd be things like rdr on $ext_if proto tcp from any to any port 80 -> 127.0.0.1 port 8080, where we redirect incoming connections on port 80 to localhost port 8080.
When we're doing explicit rewrites it's always going to be the destination port, of course.

As an aside, I also found that if a nat rule matches an outbound packet, but we fail to allocate a src port for it (e.g., because static-port is configured and a collision occurs, or pf_get_sport() fails to find a free port), then pf will pass the packet out without performing any translation. That is, we end up passing out unmodified packets from the internal network. This happens in the "NAT proxy port allocation failed" path in pf_get_translation(); I'd really expect pf to drop the packet instead. Is this a bug?

Yeah, I'd say that's a bug. I suspect it mostly stems from confusion between 'there is no translation required' and 'something went wrong', because both return NULL.

I've spent some time looking into the underlying problem, and I'm not sure that this approach (quietly rewriting the src port) is correct. The pf.conf manual page says:

In addition to modifying the address, some translation rules may modify
source or destination ports for tcp(4) or udp(4) connections; implicitly
in the case of nat rules and explicitly in the case of rdr rules.  Port
numbers are never translated with a binat rule.

but with this change we are implicitly rewriting the source port. What explicit mechanism is the man page alluding to here?

There may be a subtlety here, the static-port flag is mostly talking about 'NAT' rules, and the purpose of the static-port flag is to keep the source-port used by internal machines (on the inside) when translated to the public IP (outside).
In the case this patch is addressing, we're modulating the source-port of an inbound connection, and we are only changing it 'inside' our network, because the only other choice is to drop the packet.

In D44488#1031217, @kp wrote:

but with this change we are implicitly rewriting the source port. What explicit mechanism is the man page alluding to here?

That'd be things like rdr on $ext_if proto tcp from any to any port 80 -> 127.0.0.1 port 8080, where we redirect incoming connections on port 80 to localhost port 8080.
When we're doing explicit rewrites it's always going to be the destination port, of course.

Ah, I had read that as meaning that it was possible for rdr rules to rewrite source ports, but upon a second look, it seems not.

There may be a subtlety here, the static-port flag is mostly talking about 'NAT' rules, and the purpose of the static-port flag is to keep the source-port used by internal machines (on the inside) when translated to the public IP (outside).
In the case this patch is addressing, we're modulating the source-port of an inbound connection, and we are only changing it 'inside' our network, because the only other choice is to drop the packet.

Right, I guess what I'm getting at is that the rdr rule syntax probably ought to provide some control over how source port rewriting is done. For instance, one might want to provide a range of source ports to use, or to disable rewriting entirely (ala static-port for nat rules).

markj added a reviewer: allanjude.
  • Handle source port translation in pf_get_translation(), so that it applies only to RDR rules.
  • Make the test case more reliable; the helper program is now much smaller and doesn't depend on the internals of the test.
In D44488#1031217, @kp wrote:

Yeah, I'd say that's a bug. I suspect it mostly stems from confusion between 'there is no translation required' and 'something went wrong', because both return NULL.

I posted https://reviews.freebsd.org/D45672 to fix this, though I still need to write a test case.

sys/netpfil/pf/pf_lb.c
774

Ideally, this range could be passed from userspace, but we'd need to extend the rdr rule syntax to do so.

I suspect we should also honour static-port on rdr rules to mean, "don't do this modulation." That is, we rewrite the source port if necessary by default, but you can ask pf not to do that, just as with nat rules. Does that seem reasonable?

sys/netpfil/pf/pf_lb.c
774

Ideally, this range could be passed from userspace, but we'd need to extend the rdr rule syntax to do so.

Yeah, or a sysctl or something, but I'm not sure it's worth it for a very marginal use case.

I suspect we should also honour static-port on rdr rules to mean, "don't do this modulation." That is, we rewrite the source port if necessary by default, but you can ask pf not to do that, just as with nat rules. Does that seem reasonable?

It does, yes. If I'm reading pfctl's parse.y correctly we need to check for r->rpool.proxy_port[0] == 0 && r->rpool.proxy_port[1] == 0

794

That seems like left over debug code?
Oh, no, that's the 'we found a new port' handing code. I'm not fond of this construct. Can we have a bool found' or something to make this a bit more obvious?

I'm also a bit unclear on why we assign to *nportp in the normal case but to (*nkp)->port[0] in the conflict case. Oh, right, we normally only change the destination port, but in the conflict case we modify the source port. That wants a comment explaining that, because there's zero chance I'll remember any of this the next time I have to look at this code.

tests/sys/netpfil/pf/rdr.sh
205

What's the point in saving that pid?

  • Add some comments.
  • Update pf.conf.5.
sys/netpfil/pf/pf_lb.c
774

Oh hmm, that proxy_port check won't work here since those fields refer to the desired destination port (range) for redirect rules, whereas I want rdr ... static-port to mean, "don't rewrite the src port". Perhaps it would be best to leave this behaviour unconditional.

794

I ended up changing this in D45672, is that sufficient?

I added some comments.

kp added inline comments.
sys/netpfil/pf/pf_lb.c
794

Yeah, that's better.

This revision is now accepted and ready to land.Jun 25 2024, 12:16 PM

Beating a closed ticket here, but in case anyone is reading this - wouldn't there be a way to DoS this?
If you know anything at all about the target network, it would seem that you could hammer the load balanced address 15k times to create states for all the source ports in the range, and then hit one of the other addresses that have rdr rules that translate to the same address. Each such hit would cause the loop to run 15k times, for as long as the initial states remain valid. Or am I completely misreading this?

Being the one who raised this issue in the first place I should probably have chimed in earlier. Sorry 'bout that.

Beating a closed ticket here, but in case anyone is reading this - wouldn't there be a way to DoS this?
If you know anything at all about the target network, it would seem that you could hammer the load balanced address 15k times to create states for all the source ports in the range, and then hit one of the other addresses that have rdr rules that translate to the same address. Each such hit would cause the loop to run 15k times, for as long as the initial states remain valid. Or am I completely misreading this?

Yes, that's true. The loop is effectively the same as the one used by NAT rules to assign a source port, but that would typically apply to outbound connections and so is less concerning. I'm not sure if 15k iterations is expensive enough for this to be a fatal problem, but it could well be.

Probably the easiest mitigation is to give up after a short number of attempts, on the basis that these kinds of state conflicts are rare in practice. I had been working on an extension to this patch to ignore conflicts if the conflicting state is in a FIN_WAIT_2 or TIME_WAIT state, as pf_state_key_attach() will simply reuse the state in that case, but wasn't sure if that was really necessary. If that seems desirable, I'll finish it.

Yes, that's true. The loop is effectively the same as the one used by NAT rules to assign a source port, but that would typically apply to outbound connections and so is less concerning. I'm not sure if 15k iterations is expensive enough for this to be a fatal problem, but it could well be.

Probably not for random drive-by, but we've seen (D)DoS attack vectors requiring a high degree of knowledge about our particular network (not to mention motivation well beyond trophy-hunting). If *I* could think of it...

Probably the easiest mitigation is to give up after a short number of attempts, on the basis that these kinds of state conflicts are rare in practice. I had been working on an extension to this patch to ignore conflicts if the conflicting state is in a FIN_WAIT_2 or TIME_WAIT state, as pf_state_key_attach() will simply reuse the state in that case, but wasn't sure if that was really necessary. If that seems desirable, I'll finish it.

Sounds like a good idea. And perhaps a configurable loop count limiter? We were seeing this on a pretty large percentage of connections from a specific (and deeply troubled..) source due to the way the client worked: Connections to the LB and to the RDR target handled by the same process, using the same starting port number, increasing more or less in sync.