Page MenuHomeFreeBSD

Allow ng_nat to be attached to a ethernet interface directly via ng_ether(4) and such
ClosedPublic

Authored by sobomax on Dec 12 2018, 10:47 PM.

Details

Summary

Allow ng_nat to be attached to a ethernet interface directly via ng_ether(4) or the likes. Add new control message types: setdlt and getdlt to switch from default DLT_RAW (no encapsulation) to DLT_EN10MB (ethernet).

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 21525

Event Timeline

sobomax created this revision.Dec 12 2018, 10:47 PM
sobomax updated this revision to Diff 51938.Dec 12 2018, 10:57 PM

Make sure we have enough data for ethernet header.

Looks good with one exception: additional plain panic(). Can it be replaced with KASSERT?

Looks good with one exception: additional plain panic(). Can it be replaced with KASSERT?

IDK, there is no way this option to be set to anything but DLT_RAW or DLT_EN10MB in the course of normal operation of the node. So it would require some form of memory corruption to actually happen. IDK, panic(9) seems an appropriate action in that case. There are other panic(9) call in the code in similar situations.

Looks good with one exception: additional plain panic(). Can it be replaced with KASSERT?

IDK, there is no way this option to be set to anything but DLT_RAW or DLT_EN10MB in the course of normal operation of the node. So it would require some form of memory corruption to actually happen. IDK, panic(9) seems an appropriate action in that case. There are other panic(9) call in the code in similar situations.

Yes, it requires corruption of private node memory. As owner of multiple routers mass servicing thousands of customers using multiple NETGRAPH nodes I can assure you that panic is not appropriatie action. Appropriate action is some form of block for traffic flow trough the node in question (with logging) leaving other nodes working.

glebius added inline comments.Dec 13 2018, 6:54 PM
share/man/man4/ng_nat.4
272

New line at end of a sentence.

287

One more new line.

glebius accepted this revision.Dec 13 2018, 6:55 PM

I would like to see a generic code in netgraph that marks hooks with DLT. So, no special messages to be needed, nodes will autosense what's connected to them. However, this is just a wish not a blocker for this change.

This revision is now accepted and ready to land.Dec 13 2018, 6:55 PM

Looks good with one exception: additional plain panic(). Can it be replaced with KASSERT?

IDK, there is no way this option to be set to anything but DLT_RAW or DLT_EN10MB in the course of normal operation of the node. So it would require some form of memory corruption to actually happen. IDK, panic(9) seems an appropriate action in that case. There are other panic(9) call in the code in similar situations.

Yes, it requires corruption of private node memory. As owner of multiple routers mass servicing thousands of customers using multiple NETGRAPH nodes I can assure you that panic is not appropriatie action. Appropriate action is some form of block for traffic flow trough the node in question (with logging) leaving other nodes working.

Well, that's where I respectively disagree. As an owner of hundreds of FreeBSD systems servicing many millions of customers I think that rebooting the system immediately after any slight kernel heap/stack memory corruption is detected is not just appropriate but the only sane action available. Shutting down particular netgraph node and hope for the best would just leave the service down indefinitely with no hope for any sorts of automatic recovery. On top of which, longer the system stays in this state the bigger there is a chance for it to self-inflict some long-lasting persistent data corruption.

Moreover, arguably sooner you panic the easier it would be to narrow down code which has stomped upon you data structures. It many case it would be just function being called or some of the functions in the call chain.

Yes, it requires corruption of private node memory. As owner of multiple routers mass servicing thousands of customers using multiple NETGRAPH nodes I can assure you that panic is not appropriatie action. Appropriate action is some form of block for traffic flow trough the node in question (with logging) leaving other nodes working.

Well, that's where I respectively disagree. As an owner of hundreds of FreeBSD systems servicing many millions of customers I think that rebooting the system immediately after any slight kernel heap/stack memory corruption is detected is not just appropriate but the only sane action available. Shutting down particular netgraph node and hope for the best would just leave the service down indefinitely with no hope for any sorts of automatic recovery.

Node destruction and recreation is easily implemented automatic recovery. We obviously have different usage patterns. I use net/mpd5 that can use ng_nat connecting it to the session that can be destroyed and re-created by any of "client" or "server" recovering from corruption and not interfering with other sessions. Other usage patterns can react on such event to re-create nodes too.

On top of which, longer the system stays in this state the bigger there is a chance for it to self-inflict some long-lasting persistent data corruption.

This heavily depend on particular case. For example, our netgraph(3) userland library has unfixed race in NgSendMsg/NgSendAsciiMsg functions that can corrupt useland/kernel socket data flow protocol for threaded application that can lead to garbage in the data at userland side due to desync and mis-interpretation of data received over AF_NETGRAPH socket.

I agree there may be usage patterns that benefit from instant panic but disagree on that this is universal rule.

Yes, it requires corruption of private node memory. As owner of multiple routers mass servicing thousands of customers using multiple NETGRAPH nodes I can assure you that panic is not appropriatie action. Appropriate action is some form of block for traffic flow trough the node in question (with logging) leaving other nodes working.

Well, that's where I respectively disagree. As an owner of hundreds of FreeBSD systems servicing many millions of customers I think that rebooting the system immediately after any slight kernel heap/stack memory corruption is detected is not just appropriate but the only sane action available. Shutting down particular netgraph node and hope for the best would just leave the service down indefinitely with no hope for any sorts of automatic recovery.

Node destruction and recreation is easily implemented automatic recovery. We obviously have different usage patterns. I use net/mpd5 that can use ng_nat connecting it to the session that can be destroyed and re-created by any of "client" or "server" recovering from corruption and not interfering with other sessions. Other usage patterns can react on such event to re-create nodes too.

In our case the node sits in the default network path attached directly to the interface. So if it's internal configuration goes bad and it shuts itself down the box gotta be cut out of the network permanently. And since it's EC2 instance it has no console either. So the only recourse to recover is to reboot instance manually.

On top of which, longer the system stays in this state the bigger there is a chance for it to self-inflict some long-lasting persistent data corruption.

This heavily depend on particular case. For example, our netgraph(3) userland library has unfixed race in NgSendMsg/NgSendAsciiMsg functions that can corrupt useland/kernel socket data flow protocol for threaded application that can lead to garbage in the data at userland side due to desync and mis-interpretation of data received over AF_NETGRAPH socket.

Well, this does not apply to this particular check. Input coming from userland is validated in different code path *before* in-kernel configuration is modified. Thus, the only way for this field to go bad is for a kernel code to go haywire, and this is exactly the condition that panic(9) is supposed to "handle".

This revision was automatically updated to reflect the committed changes.