Details

Reviewers

kp
allanjude

Commits

rG08d0ebe560a8: pf: Let rdr rules modify the src port if doing so would avoid a conflict
rG9897a66923a3: pf: Let rdr rules modify the src port if doing so would avoid a conflict

Summary

If multiple inbound connections through an RDR rule cannot be accommodated
due to conflicting src/dst ip/port combinations, modulate the src port in
the state entry to allow the connection rather than dropping it.

Co-authored-by: Allan Jude <allanjude@freebsd.org>
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.

Test Plan

all tcp 203.0.113.50:8000 (198.51.100.1:7777) <- 198.51.100.50:5454       FIN_WAIT_2:ESTABLISHED
all tcp 203.0.113.50:8000 (198.51.100.2:7777) <- 198.51.100.50:5710 (198.51.100.50:5454)       FIN_WAIT_2:ESTABLISHED
all tcp 203.0.113.50:8000 (198.51.100.3:7777) <- 198.51.100.50:5966 (198.51.100.50:5454)       FIN_WAIT_2:ESTABLISHED

The first connection is RDR'd normally. The second and third have the src port on the 'far' side of the RDR rewritten (with state entries to route replies back correctly).

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 58337
Build 55225: arc lint + arc unit

Event Timeline

allanjude created this revision. Mar 24 2024, 3:57 AM

Herald added subscribers: vegeta_tuxpowered.net, glebius, melifaro and 2 others. Mar 24 2024, 3:57 AM

allanjude requested review of this revision. Mar 24 2024, 3:57 AM

Harbormaster completed remote builds in B56759: Diff 136142. Mar 24 2024, 3:58 AM

This also really needs a test case.

sys/netpfil/pf/pf.c
5028 ↗	(On Diff #136142)	Declarations go at the top of the block.
5049 ↗	(On Diff #136142)	There's a pf SDT namespace, so that really wants to be in there.
5053 ↗	(On Diff #136142)	That could roll over into port 0. It probably also needs a bailout condition to avoid spinning forever if there are no more available ports. See PR 233867 for a similar bug.

allanjude edited the test plan for this revision. Mar 24 2024, 6:55 AM

Add test case

Harbormaster completed remote builds in B56761: Diff 136145. Mar 24 2024, 6:57 AM

kp added inline comments. Mar 24 2024, 7:01 AM

tests/sys/netpfil/pf/rdr.sh
194	Why not just nc?
199	That really needs to be compiled by the makefile, not on every test run. See tests/sys/net's randsleep for an example.

Address feedback

sys/netpfil/pf/pf.c
5053 ↗	(On Diff #136142)	Is there a specific threshold that makes sense to give up at? Or just when we would overflow and loop back to port 0?

Harbormaster completed remote builds in B56762: Diff 136146. Mar 24 2024, 7:34 AM

kp added inline comments. Mar 24 2024, 7:46 AM

sys/netpfil/pf/pf.c
5053 ↗	(On Diff #136142)	I was inclined to say once we've looped through all 65K ports, because then we know we can't possibly handle this state, but that's 65K hashes and lock/unlock pairs, and that's going to be very, very expensive. So for now we may want to limit it much more than that. Given that we only try once now, just trying the next, say, 32 ports may be enough to fix most of the occurrences. This also suggests two test cases to add: using port 65535, so we have to roll over to 0 (and test that we do not use port 0), as well as a scenario where there are no ports available in whatever range of tries we settle on.
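
The bounded search suggested above can be sketched in a few lines. This is only a toy model in Python, not the pf code: `MAX_TRIES`, `candidate_ports()` and `pick_port()` are hypothetical names, and the limit of 32 is simply the figure floated in the comment. It shows the two properties asked for: the search wraps from 65535 to 1 without ever producing port 0, and it gives up after a fixed number of attempts.

```python
MAX_TRIES = 32  # hypothetical limit, taken from the "next, say, 32 ports" suggestion


def candidate_ports(start, tries=MAX_TRIES):
    """Yield up to `tries` ports following `start`, wrapping 65535 -> 1 (never 0)."""
    port = start
    for _ in range(tries):
        port = port + 1 if port < 65535 else 1
        yield port


def pick_port(start, in_use, tries=MAX_TRIES):
    """Return the first free candidate, or None if the bounded search fails."""
    for port in candidate_ports(start, tries):
        if port not in in_use:
            return port
    return None  # bail out instead of spinning through all 65K ports
```

The two test cases proposed in the comment map onto this directly: start the search at 65535 and check that 0 is never returned, and fill every port in the search window and check that the function reports failure.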

What real-life situation is this fix for? If this is for connections coming from behind a 3rd party SNAT, where SNAT reuses source ports faster than pf expires states, then maybe tuning pf timeouts would be enough. Or we could allow pf states to transition from TCPS_FIN_WAIT_2 back to TCPS_SYN_SENT, basically implementing SO_REUSEPORT for pf.

In D44488#1014674, @vegeta_tuxpowered.net wrote:

What real-life situation is this fix for? If this is for connections coming from behind a 3rd party SNAT, where SNAT reuses source ports faster than pf expires states, then maybe tuning pf timeouts would be enough. Or we could allow pf states to transition from TCPS_FIN_WAIT_2 back to TCPS_SYN_SENT, basically implementing SO_REUSEPORT for pf.

It is for a setup behind a load balancer, where the initial connection comes in on public IP #1, but then all future connections come in on public IP #2, while the external source port from the incoming IP stays the same.
So when PF NATs it to the web server behind the PF RDR rules, the incoming source ports conflict.
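
A minimal sketch of why those connections collide, assuming (as a simplification) that the state on the internal side is identified only by protocol, client address and port, and the redirected destination. `StateKey` and `rdr()` are hypothetical names for illustration, not pf structures; the addresses are the ones from the test plan.

```python
from typing import NamedTuple, Tuple


class StateKey(NamedTuple):
    proto: str
    src: Tuple[str, int]  # client address and port, unchanged by the rdr rule
    dst: Tuple[str, int]  # destination after redirection


def rdr(src, redirect_to, proto="tcp"):
    """Build the internal-side key for a redirected connection."""
    return StateKey(proto, src, redirect_to)


client = ("198.51.100.50", 5454)
backend = ("203.0.113.50", 8000)

# The same client address and port arrives on two different public
# addresses; both rdr rules point at the same backend, so the original
# destination no longer appears in the internal-side key.
key_via_ip1 = rdr(client, backend)
key_via_ip2 = rdr(client, backend)

states = {key_via_ip1}
conflict = key_via_ip2 in states  # the second connection has nowhere to go
```

Changing the source port in the second key is the only field left to vary on the internal side, which is what the patch does.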

marius.h_lden.org added a subscriber: marius.h_lden.org. Mar 29 2024, 10:49 PM

markj added a subscriber: markj. Apr 12 2024, 6:26 PM

markj added inline comments.

sys/netpfil/pf/pf.c
5050 ↗	(On Diff #136146)	Can we get away with doing this on demand, i.e., try to insert a new state without checking and try with a different port if pf_state_insert() returns EEXIST? (I'll try to answer my own question, but posing it here in case someone has a quick reason to say "no".)

kp added inline comments. Apr 13 2024, 10:41 AM

sys/netpfil/pf/pf.c
5050 ↗	(On Diff #136146)	I think that ought to work. There's a log entry that will potentially fire in that case, but we can either fix that or live with it, given that collisions ought to be rare anyway. It's probably even the best approach, because trying to find a free port here and then running pf_state_insert() is potentially racy.
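
The "insert first, retry on EEXIST" idea can be modelled as below. This is a hedged sketch with a toy table, not `pf_state_insert()` itself: `StateTable`, `insert_with_retry()` and the `max_tries` limit are all hypothetical. The point it illustrates is that the insert is the only check, so there is no window between deciding a port is free and claiming it.

```python
import errno


class StateTable:
    """Toy stand-in for the state table; insert() is the only way to claim a key."""

    def __init__(self):
        self._keys = set()

    def insert(self, key):
        # Check and claim in one step. In the kernel this would happen
        # under the relevant lock; the toy model is single-threaded.
        if key in self._keys:
            return errno.EEXIST
        self._keys.add(key)
        return 0


def insert_with_retry(table, src_ip, src_port, dst, max_tries=32):
    """Try the original port first, then successive ports, until an insert succeeds."""
    port = src_port
    for _ in range(max_tries + 1):
        if table.insert((src_ip, port, dst)) == 0:
            return port
        port = port + 1 if port < 65535 else 1  # never try port 0
    return None
```

In the common case of no collision this costs a single insert, which is the attraction of doing the work on demand.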

I've spent some time looking into the underlying problem, and I'm not sure that this approach (quietly rewriting the src port) is correct. The pf.conf manual page says:

In addition to modifying the address, some translation rules may modify
source or destination ports for tcp(4) or udp(4) connections; implicitly
in the case of nat rules and explicitly in the case of rdr rules.  Port
numbers are never translated with a binat rule.

but with this change we are implicitly rewriting the source port. What explicit mechanism is the man page alluding to here?

As an aside, I also found that if a nat rule matches an outbound packet, but we fail to allocate a src port for it (e.g., because static-port is configured and a collision occurs, or pf_get_sport() fails to find a free port), then pf will pass the packet out without performing any translation. That is, we end up passing out unmodified packets from the internal network. This happens in the "NAT proxy port allocation failed" path in pf_get_translation(); I'd really expect pf to drop the packet instead. Is this a bug?

In D44488#1031012, @markj wrote:

but with this change we are implicitly rewriting the source port. What explicit mechanism is the man page alluding to here?

That'd be things like rdr on $ext_if proto tcp from any to any port 80 -> 127.0.0.1 port 8080, where we redirect incoming connections on port 80 to localhost port 8080.
When we're doing explicit rewrites it's always going to be the destination port, of course.

As an aside, I also found that if a nat rule matches an outbound packet, but we fail to allocate a src port for it (e.g., because static-port is configured and a collision occurs, or pf_get_sport() fails to find a free port), then pf will pass the packet out without performing any translation. That is, we end up passing out unmodified packets from the internal network. This happens in the "NAT proxy port allocation failed" path in pf_get_translation(); I'd really expect pf to drop the packet instead. Is this a bug?

Yeah, I'd say that's a bug. I suspect it mostly stems from confusion between 'there is no translation required' and 'something went wrong', because both return NULL.
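
One way to picture the ambiguity: if the lookup reports three distinct outcomes instead of overloading a single NULL, the caller can pass untranslated packets only when no rule matched. The sketch below is purely illustrative; `Translation`, `translate()` and `verdict()` are hypothetical names and do not describe how pf_get_translation() is actually structured.

```python
from enum import Enum, auto


class Translation(Enum):
    NO_MATCH = auto()  # no translation rule applies; pass the packet as-is
    OK = auto()        # a rule matched and a port was allocated
    FAILED = auto()    # a rule matched but no port could be allocated


def translate(rule_matches, free_port):
    """Return (outcome, port) so that failure is distinguishable from no match."""
    if not rule_matches:
        return Translation.NO_MATCH, None
    if free_port is None:
        return Translation.FAILED, None
    return Translation.OK, free_port


def verdict(outcome):
    """Drop only when a translation was required but could not be done."""
    return "drop" if outcome is Translation.FAILED else "pass"
```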

In D44488#1031012, @markj wrote:
I've spent some time looking into the underlying problem, and I'm not sure that this approach (quietly rewriting the src port) is correct. The pf.conf manual page says:
In addition to modifying the address, some translation rules may modify
source or destination ports for tcp(4) or udp(4) connections; implicitly
in the case of nat rules and explicitly in the case of rdr rules.  Port
numbers are never translated with a binat rule.
but with this change we are implicitly rewriting the source port. What explicit mechanism is the man page alluding to here?

There may be a subtlety here: the static-port flag is mostly talking about 'NAT' rules, and the purpose of the static-port flag is to keep the source-port used by internal machines (on the inside) when translated to the public IP (outside).
In the case this patch is addressing, we're modulating the source-port of an inbound connection, and we are only changing it 'inside' our network, because the only other choice is to drop the packet.

In D44488#1031217, @kp wrote:

In D44488#1031012, @markj wrote:

but with this change we are implicitly rewriting the source port. What explicit mechanism is the man page alluding to here?

That'd be things like rdr on $ext_if proto tcp from any to any port 80 -> 127.0.0.1 port 8080, where we redirect incoming connections on port 80 to localhost port 8080.
When we're doing explicit rewrites it's always going to be the destination port, of course.

Ah, I had read that as meaning that it was possible for rdr rules to rewrite source ports, but upon a second look, it seems not.

There may be a subtlety here: the static-port flag is mostly talking about 'NAT' rules, and the purpose of the static-port flag is to keep the source-port used by internal machines (on the inside) when translated to the public IP (outside).
In the case this patch is addressing, we're modulating the source-port of an inbound connection, and we are only changing it 'inside' our network, because the only other choice is to drop the packet.

Right, I guess what I'm getting at is that the rdr rule syntax probably ought to provide some control over how source port rewriting is done. For instance, one might want to provide a range of source ports to use, or to disable rewriting entirely (ala static-port for nat rules).

markj commandeered this revision. Jun 21 2024, 3:04 PM

markj added a reviewer: allanjude.

Handle source port translation in pf_get_translation(), so that it applies only to RDR rules.
Make the test case more reliable; the helper program is now much smaller and doesn't depend on the internals of the test.

Harbormaster completed remote builds in B58280: Diff 140070. Jun 21 2024, 3:06 PM

In D44488#1031217, @kp wrote:

Yeah, I'd say that's a bug. I suspect it mostly stems from confusion between 'there is no translation required' and 'something went wrong', because both return NULL.

I posted https://reviews.freebsd.org/D45672 to fix this, though I still need to write a test case.

sys/netpfil/pf/pf_lb.c
774	Ideally, this range could be passed from userspace, but we'd need to extend the rdr rule syntax to do so. I suspect we should also honour `static-port` on rdr rules to mean, "don't do this modulation." That is, we rewrite the source port if necessary by default, but you can ask pf not to do that, just as with nat rules. Does that seem reasonable?

kp added inline comments. Jun 24 2024, 2:39 PM

sys/netpfil/pf/pf_lb.c
774	"Ideally, this range could be passed from userspace, but we'd need to extend the rdr rule syntax to do so." Yeah, or a sysctl or something, but I'm not sure it's worth it for a very marginal use case. "I suspect we should also honour static-port on rdr rules to mean, 'don't do this modulation.' That is, we rewrite the source port if necessary by default, but you can ask pf not to do that, just as with nat rules. Does that seem reasonable?" It does, yes. If I'm reading pfctl's parse.y correctly we need to check for `r->rpool.proxy_port[0] == 0 && r->rpool.proxy_port[1] == 0`.
794	That seems like leftover debug code? Oh, no, that's the 'we found a new port' handling code. I'm not fond of this construct. Can we have a `bool found` or something to make this a bit more obvious? I'm also a bit unclear on why we assign to nportp in the normal case but to (nkp)->port[0] in the conflict case. Oh, right, we normally only change the destination port, but in the conflict case we modify the source port. That wants a comment explaining it, because there's zero chance I'll remember any of this the next time I have to look at this code.
tests/sys/netpfil/pf/rdr.sh
205	What's the point in saving that pid?

Add some comments.
Update pf.conf.5.

Harbormaster completed remote builds in B58337: Diff 140198. Jun 24 2024, 8:55 PM

markj added inline comments. Jun 24 2024, 8:55 PM

sys/netpfil/pf/pf_lb.c
774	Oh hmm, that `proxy_port` check won't work here since those fields refer to the desired destination port (range) for redirect rules, whereas I want `rdr ... static-port` to mean, "don't rewrite the src port". Perhaps it would be best to leave this behaviour unconditional.
794	I ended up changing this in D45672, is that sufficient? I added some comments.

kp accepted this revision. Jun 25 2024, 12:16 PM

kp added inline comments.

sys/netpfil/pf/pf_lb.c
794	Yeah, that's better.

This revision is now accepted and ready to land. Jun 25 2024, 12:16 PM

allanjude accepted this revision. Jul 9 2024, 1:16 PM

Closed by commit rG9897a66923a3: pf: Let rdr rules modify the src port if doing so would avoid a conflict (authored by markj). Aug 19 2024, 2:38 PM

This revision was automatically updated to reflect the committed changes.

markj added a commit: rG9897a66923a3: pf: Let rdr rules modify the src port if doing so would avoid a conflict.

Beating a closed ticket here, but in case anyone is reading this - wouldn't there be a way to DoS this?
If you know anything at all about the target network, it would seem that you could hammer the load balanced address 15k times to create states for all the source ports in the range, and then hit one of the other addresses that have rdr rules that translate to the same address. Each such hit would cause the loop to run 15k times, for as long as the initial states remain valid. Or am I completely misreading this?

Being the one who raised this issue in the first place I should probably have chimed in earlier. Sorry 'bout that.

In D44488#1056431, @ltning-freebsd_anduin.net wrote:

Beating a closed ticket here, but in case anyone is reading this - wouldn't there be a way to DoS this?
If you know anything at all about the target network, it would seem that you could hammer the load balanced address 15k times to create states for all the source ports in the range, and then hit one of the other addresses that have rdr rules that translate to the same address. Each such hit would cause the loop to run 15k times, for as long as the initial states remain valid. Or am I completely misreading this?

Yes, that's true. The loop is effectively the same as the one used by NAT rules to assign a source port, but that would typically apply to outbound connections and so is less concerning. I'm not sure if 15k iterations is expensive enough for this to be a fatal problem, but it could well be.

Probably the easiest mitigation is to give up after a short number of attempts, on the basis that these kinds of state conflicts are rare in practice. I had been working on an extension to this patch to ignore conflicts if the conflicting state is in a FIN_WAIT_2 or TIME_WAIT state, as pf_state_key_attach() will simply reuse the state in that case, but wasn't sure if that was really necessary. If that seems desirable, I'll finish it.
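
The extension described here (treating a conflicting state that is already closing as no conflict at all) might look roughly like the following. It is a sketch under simplifying assumptions: each state is reduced to a single label, whereas pf tracks one per peer, and `REUSABLE`, `blocks_new_state()` and `needs_new_port()` are hypothetical names.

```python
# States that, per the comment above, would simply be reused on key attach.
REUSABLE = {"FIN_WAIT_2", "TIME_WAIT"}


def blocks_new_state(existing_state):
    """A conflicting key only blocks a new connection if its state is not reusable."""
    return existing_state is not None and existing_state not in REUSABLE


def needs_new_port(states, key):
    """Decide whether the source port search has to run for this key."""
    return blocks_new_state(states.get(key))
```

Skipping the search in those cases would also shrink the window an attacker can exploit, since only live states would force the loop to run.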

Yes, that's true. The loop is effectively the same as the one used by NAT rules to assign a source port, but that would typically apply to outbound connections and so is less concerning. I'm not sure if 15k iterations is expensive enough for this to be a fatal problem, but it could well be.

Probably not for random drive-by, but we've seen (D)DoS attack vectors requiring a high degree of knowledge about our particular network (not to mention motivation well beyond trophy-hunting). If *I* could think of it...

Probably the easiest mitigation is to give up after a short number of attempts, on the basis that these kinds of state conflicts are rare in practice. I had been working on an extension to this patch to ignore conflicts if the conflicting state is in a FIN_WAIT_2 or TIME_WAIT state, as pf_state_key_attach() will simply reuse the state in that case, but wasn't sure if that was really necessary. If that seems desirable, I'll finish it.

Sounds like a good idea. And perhaps a configurable loop count limiter? We were seeing this on a pretty large percentage of connections from a specific (and deeply troubled..) source due to the way the client worked: Connections to the LB and to the RDR target handled by the same process, using the same starting port number, increasing more or less in sync.

markj added a commit: rG08d0ebe560a8: pf: Let rdr rules modify the src port if doing so would avoid a conflict. Nov 20 2024, 9:41 PM

pf: if a new RDR state cannot be created, modulate src port
Closed, Public

Diff 140198

share/man/man5/pf.conf.5

sys/netpfil/pf/pf_lb.c

tests/sys/netpfil/pf/Makefile

tests/sys/netpfil/pf/rdr-srcport.py

tests/sys/netpfil/pf/rdr.sh
