Page MenuHomeFreeBSD

Allow ng_nat to be attached to a ethernet interface directly via ng_ether(4) and such
ClosedPublic

Authored by sobomax on Dec 12 2018, 10:47 PM.
Referenced Files
Unknown Object (File)
Wed, Jan 1, 11:50 AM
Unknown Object (File)
Tue, Dec 31, 12:14 PM
Unknown Object (File)
Mon, Dec 30, 12:10 PM
Unknown Object (File)
Sun, Dec 29, 12:06 PM
Unknown Object (File)
Sat, Dec 28, 11:37 AM
Unknown Object (File)
Fri, Dec 27, 12:25 PM
Unknown Object (File)
Dec 14 2024, 11:17 PM
Unknown Object (File)
Nov 26 2024, 12:15 PM

Details

Summary

Allow ng_nat to be attached to a ethernet interface directly via ng_ether(4) or the likes. Add new control message types: setdlt and getdlt to switch from default DLT_RAW (no encapsulation) to DLT_EN10MB (ethernet).

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Make sure we have enough data for ethernet header.

Looks good with one exception: additional plain panic(). Can it be replaced with KASSERT?

Looks good with one exception: additional plain panic(). Can it be replaced with KASSERT?

IDK, there is no way this option to be set to anything but DLT_RAW or DLT_EN10MB in the course of normal operation of the node. So it would require some form of memory corruption to actually happen. IDK, panic(9) seems an appropriate action in that case. There are other panic(9) call in the code in similar situations.

Looks good with one exception: additional plain panic(). Can it be replaced with KASSERT?

IDK, there is no way this option to be set to anything but DLT_RAW or DLT_EN10MB in the course of normal operation of the node. So it would require some form of memory corruption to actually happen. IDK, panic(9) seems an appropriate action in that case. There are other panic(9) call in the code in similar situations.

Yes, it requires corruption of private node memory. As owner of multiple routers mass servicing thousands of customers using multiple NETGRAPH nodes I can assure you that panic is not appropriatie action. Appropriate action is some form of block for traffic flow trough the node in question (with logging) leaving other nodes working.

share/man/man4/ng_nat.4
272 ↗(On Diff #51938)

New line at end of a sentence.

287 ↗(On Diff #51938)

One more new line.

I would like to see a generic code in netgraph that marks hooks with DLT. So, no special messages to be needed, nodes will autosense what's connected to them. However, this is just a wish not a blocker for this change.

This revision is now accepted and ready to land.Dec 13 2018, 6:55 PM

Looks good with one exception: additional plain panic(). Can it be replaced with KASSERT?

IDK, there is no way this option to be set to anything but DLT_RAW or DLT_EN10MB in the course of normal operation of the node. So it would require some form of memory corruption to actually happen. IDK, panic(9) seems an appropriate action in that case. There are other panic(9) call in the code in similar situations.

Yes, it requires corruption of private node memory. As owner of multiple routers mass servicing thousands of customers using multiple NETGRAPH nodes I can assure you that panic is not appropriatie action. Appropriate action is some form of block for traffic flow trough the node in question (with logging) leaving other nodes working.

Well, that's where I respectively disagree. As an owner of hundreds of FreeBSD systems servicing many millions of customers I think that rebooting the system immediately after any slight kernel heap/stack memory corruption is detected is not just appropriate but the only sane action available. Shutting down particular netgraph node and hope for the best would just leave the service down indefinitely with no hope for any sorts of automatic recovery. On top of which, longer the system stays in this state the bigger there is a chance for it to self-inflict some long-lasting persistent data corruption.

Moreover, arguably sooner you panic the easier it would be to narrow down code which has stomped upon you data structures. It many case it would be just function being called or some of the functions in the call chain.

Yes, it requires corruption of private node memory. As owner of multiple routers mass servicing thousands of customers using multiple NETGRAPH nodes I can assure you that panic is not appropriatie action. Appropriate action is some form of block for traffic flow trough the node in question (with logging) leaving other nodes working.

Well, that's where I respectively disagree. As an owner of hundreds of FreeBSD systems servicing many millions of customers I think that rebooting the system immediately after any slight kernel heap/stack memory corruption is detected is not just appropriate but the only sane action available. Shutting down particular netgraph node and hope for the best would just leave the service down indefinitely with no hope for any sorts of automatic recovery.

Node destruction and recreation is easily implemented automatic recovery. We obviously have different usage patterns. I use net/mpd5 that can use ng_nat connecting it to the session that can be destroyed and re-created by any of "client" or "server" recovering from corruption and not interfering with other sessions. Other usage patterns can react on such event to re-create nodes too.

On top of which, longer the system stays in this state the bigger there is a chance for it to self-inflict some long-lasting persistent data corruption.

This heavily depend on particular case. For example, our netgraph(3) userland library has unfixed race in NgSendMsg/NgSendAsciiMsg functions that can corrupt useland/kernel socket data flow protocol for threaded application that can lead to garbage in the data at userland side due to desync and mis-interpretation of data received over AF_NETGRAPH socket.

I agree there may be usage patterns that benefit from instant panic but disagree on that this is universal rule.

Yes, it requires corruption of private node memory. As owner of multiple routers mass servicing thousands of customers using multiple NETGRAPH nodes I can assure you that panic is not appropriatie action. Appropriate action is some form of block for traffic flow trough the node in question (with logging) leaving other nodes working.

Well, that's where I respectively disagree. As an owner of hundreds of FreeBSD systems servicing many millions of customers I think that rebooting the system immediately after any slight kernel heap/stack memory corruption is detected is not just appropriate but the only sane action available. Shutting down particular netgraph node and hope for the best would just leave the service down indefinitely with no hope for any sorts of automatic recovery.

Node destruction and recreation is easily implemented automatic recovery. We obviously have different usage patterns. I use net/mpd5 that can use ng_nat connecting it to the session that can be destroyed and re-created by any of "client" or "server" recovering from corruption and not interfering with other sessions. Other usage patterns can react on such event to re-create nodes too.

In our case the node sits in the default network path attached directly to the interface. So if it's internal configuration goes bad and it shuts itself down the box gotta be cut out of the network permanently. And since it's EC2 instance it has no console either. So the only recourse to recover is to reboot instance manually.

On top of which, longer the system stays in this state the bigger there is a chance for it to self-inflict some long-lasting persistent data corruption.

This heavily depend on particular case. For example, our netgraph(3) userland library has unfixed race in NgSendMsg/NgSendAsciiMsg functions that can corrupt useland/kernel socket data flow protocol for threaded application that can lead to garbage in the data at userland side due to desync and mis-interpretation of data received over AF_NETGRAPH socket.

Well, this does not apply to this particular check. Input coming from userland is validated in different code path *before* in-kernel configuration is modified. Thus, the only way for this field to go bad is for a kernel code to go haywire, and this is exactly the condition that panic(9) is supposed to "handle".

This revision was automatically updated to reflect the committed changes.