Page MenuHomeFreeBSD

If reading the routing table fails, retry up to 10 times
AcceptedPublic

Authored by allanjude on May 19 2018, 3:25 PM.

Details

Summary

To read the routing table, we call sysctl() with a NULL pointer to get the size of the result (needed), then allocate a buffer that large, and then call sysctl() for real.

However, if you are operating a large router, like a BGP full feed, your routing table is changing constantly, and the routing table may grow by one or more entries between the estimate, and the actual read.
Currently, this results in the confusing error message:

# netstat -rnf inet
Routing tables
netstat: sysctl: net.route.0.2.dump.0: Cannot allocate memory

On a router with 12GB of free memory. Memory allocation did not fail, rather the buffer we passed to sysctl() was insufficient because some additional routes were added in the mean time.

This patch will expand needed by 25% over the most recent estimate and try again, up to 10 times.

During convergence, when 100s of thousands of routes are being added rapidly, this can be the difference between being able to read the routing table, and not.

This patch was inspired by one Hiroki offered on the mailing list in 2014.
https://lists.freebsd.org/pipermail/freebsd-current/2014-February/thread.html#48412
That version of the patch only grew the buffer by 4kb, however the routing table in this case is over 180 MB, and the churn may be more than the ~16 additional routes that would fit in an extra 4kb of memory.

Test Plan

Tested on a live router with 3 full BGP feeds

# netstat -rnf inet|wc -l
netstat: sysctl: net.route.0.2.dump.0: routing table grew beyond estimated buffer size (188334440), retrying with larger buffer...
  692454

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 16700
Build 16594: arc lint + arc unit

Event Timeline

allanjude created this revision.May 19 2018, 3:25 PM
allanjude edited the summary of this revision. (Show Details)May 19 2018, 3:26 PM
allanjude edited the test plan for this revision. (Show Details)
allanjude edited the summary of this revision. (Show Details)
allanjude edited the test plan for this revision. (Show Details)
allanjude updated this revision to Diff 42737.May 19 2018, 4:19 PM

After discussion with rgrimes, round needed up to the nearest full PAGE_SIZE

This should eliminate the need to retry in the common case where the number of routes has increased by only a single digit

Make the expansion facter larger, to try to deal with the 'BGP feed flap' case

allanjude edited the summary of this revision. (Show Details)May 19 2018, 4:20 PM
melifaro added inline comments.
usr.bin/netstat/route.c
317

What is the reason of using 1.25 and not, say, 2?

I see this problem often on BPG routers. I usually use the birdc as a better and more reliabe interface to the routing information, though on occasion you do have to sanity check with what the kernel has.
Also birdc seems to be able to give me a routing table 10x faster than netstat -rn and without this issue.

usr.bin/netstat/route.c
255–256

I am not a fan of mixing initialized and un initialized variables in the same line.

rgrimes added inline comments.May 19 2018, 4:53 PM
usr.bin/netstat/route.c
311

I am concerned about this high retry count, netstat -rn on a full view BGP router is a non trivial process, repeating it 10 times in a tight loop may lead to additional problems.

Can we lower this retry count to 4, and add a short delay() inside the loop giving the route table a chance to settle?

I think the round up to page is already going to handle the stable state, and it is just during bgp session start up that the problem is most likely. You certainly do not want to be banging on the CPU when it is already pretty busy doing a full view route flap convergence!

While the intention is good, I'm curious why someone would want to use "netstat -rn" these days for BGPv4 full view having about 700 thousands prefixes?

netstat gives text view of the table requiring great deal of CPU cycles to format it first and then to parse again.

usr.bin/netstat/route.c
255–256

Also, variables should be sorted by name.

allanjude updated this revision to Diff 42739.May 19 2018, 6:35 PM
allanjude marked 3 inline comments as done.

Address feedback from rgrimes@, melifaro@, and everyone else who commented

  • Cleanup style of declarations
  • Lower retry count from 10 to 4
  • Add a linearly increasing sleep between retries
  • Change the growth factor from 1.25 to 2x
allanjude marked an inline comment as done.May 19 2018, 6:40 PM
rgrimes accepted this revision.May 23 2018, 7:38 AM

I am ok with this for now, but I think the values 100000uS and 2x are going to need some "experience" tuning.

This revision is now accepted and ready to land.May 23 2018, 7:38 AM
olivier accepted this revision.Nov 28 2019, 12:24 AM

Still not committed ?

I have come to the state that if your running into this problem you probably should not be using netstat -rn to look at route tables but rather use the proper tool that talks with your routing daemon, adding a "bang on the kernel" repeat inside of netstat is probably a poor solution.

I have come to the state that if your running into this problem you probably should not be using netstat -rn to look at route tables but rather use the proper tool that talks with your routing daemon, adding a "bang on the kernel" repeat inside of netstat is probably a poor solution.

Well, I'm not sure.
First of all, there _are_ cases where you have to fetch the data, regardless of its size - troubleshooting can be one of those.
Second, netstat is the only standard base system tool we have to inspect the routing table.
Thirdly, we're going pretty good towards providing the structured output for all utilities and net stat may actually be the only way to provide that - nor bird or quagga/FRR are able to do so.

Saying that the only base tool we have to read the routing table shouldn't be used doesn't look really positive.
Currently the kernel<>user land interaction protocol is far from being optimal, but _that_ is the problem we should be targetting.
Meanwhile I don't see a reason why we shouldn't just fix the tool to do its job properly.
Yes, it adds some CPU overhead, yes it may delay the convergence - but let's the customer decide whether he's ok with this or not.

melifaro accepted this revision.Dec 2 2019, 6:32 PM

I have come to the state that if your running into this problem you probably should not be using netstat -rn to look at route tables but rather use the proper tool that talks with your routing daemon, adding a "bang on the kernel" repeat inside of netstat is probably a poor solution.

Well, I'm not sure.
First of all, there _are_ cases where you have to fetch the data, regardless of its size - troubleshooting can be one of those.

And when trouble shooting netstat -rn general works ok, it gets the error only while the route table is growing rapidly and is very easy for the user to retry the command, SHOULD they be using it.

Second, netstat is the only standard base system tool we have to inspect the routing table.

And a base system without FRR/bird/Quagga, etc all is very very unlikley to ever encounter this issue. You need route growth rates that are high to trigger it.

Thirdly, we're going pretty good towards providing the structured output for all utilities and net stat may actually be the only way to provide that - nor bird or quagga/FRR are able to do so.

Again, yoiu do not really want to be running netstat -rn on a system that has this issue.

Saying that the only base tool we have to read the routing table shouldn't be used doesn't look really positive.

The conditions under which it should not be used are fairly retrictive, it is not as if this is a global stamp of "Oh, its broke, dont use that

Currently the kernel<>user land interaction protocol is far from being optimal, but _that_ is the problem we should be targetting.

Certainly, come up with a fix for that interface that a) does not have this problem, and b) does not kill at least 1 core while it runs

Meanwhile I don't see a reason why we shouldn't just fix the tool to do its job properly.

I do not see this as a fix, this is a "hammer the box tell it gives me results" solution.

Yes, it adds some CPU overhead, yes it may delay the convergence - but let's the customer decide whether he's ok with this or not

Some CPU? I think you need to look closer at what this does when BGP is trying to converge on 500k+ routes during a flap

Offering a solution. Have the kernel track the max size of the routing table and return that value on sysctl(null) rather than the current size. This should only fail during initial table growth on a large table router, and should not require a retry except on rare occasion.

I have come to the state that if your running into this problem you probably should not be using netstat -rn to look at route tables but rather use the proper tool that talks with your routing daemon, adding a "bang on the kernel" repeat inside of netstat is probably a poor solution.

Well, I'm not sure.
First of all, there _are_ cases where you have to fetch the data, regardless of its size - troubleshooting can be one of those.

And when trouble shooting netstat -rn general works ok, it gets the error only while the route table is growing rapidly and is very easy for the user to retry the command, SHOULD they be using it.

Please consider scripts that call netstat as a use case.

Thirdly, we're going pretty good towards providing the structured output for all utilities and net stat may actually be the only way to provide that - nor bird or quagga/FRR are able to do so.

Again, yoiu do not really want to be running netstat -rn on a system that has this issue.

JFYI: routing daemons performs routing table sync on regular basis, as async mechanism does not actually guarantee message delivery. They do this on regular intervals, for example bird default is 60 seconds. Please take a look at the bird code to handle this (IIRC frr and openbgpd does the same): krt-sock.c

Saying that the only base tool we have to read the routing table shouldn't be used doesn't look really positive.

The conditions under which it should not be used are fairly retrictive, it is not as if this is a global stamp of "Oh, its broke, dont use that

Sorry, I'm afraid I disagree. We should make the tool work instead of telling users not to use it.

Currently the kernel<>user land interaction protocol is far from being optimal, but _that_ is the problem we should be targetting.

Certainly, come up with a fix for that interface that a) does not have this problem, and b) does not kill at least 1 core while it runs

Meanwhile I don't see a reason why we shouldn't just fix the tool to do its job properly.

I do not see this as a fix, this is a "hammer the box tell it gives me results" solution.

Yes, it adds some CPU overhead, yes it may delay the convergence - but let's the customer decide whether he's ok with this or not

Some CPU? I think you need to look closer at what this does when BGP is trying to converge on 500k+ routes during a flap

Let me try to describe a bigger picture as I see it.

Fetching large dynamically-changing datastructure from kernel to the userland has always been a challenge, it is not unique for the routing tables.
Traditionally people solve this by either using "stop-the-world" approach, creating momental snapshot at the cost of locking the entire structure, or trying to split dump into smaller pieces.

*BSD routing API traditionally was using the former approach, coupled with rtsock serving as async communication channel. Netlink is example of the latter approach, where dump is requested and can be read message-by-message.

Generally, my opinion is that the right way of solving it would be introducing netlink support. It would unblock quite a lot of cases, including this one. I'm working with mmokhi@ to split existing GSOC task into one exclusively targeting netlink support.

However, until we have it landed one way or anothrt, there is not a lot of things we can change in the existing sysctl()-based API, as all routing software depends on it and the protocol extensibility is pretty bad.

The existing interface dictates an access model: as the buffer size is not known in advance, caller has to retry. There is no way of avoiding it - even if kernel keeps historical data, there is no guarantee that by the time caller passes the buffer to kernel the actual size won't increase.

Here specifically we talk about 2 issues:

  1. slowness of a single dump run

and

  1. netstat failing due to suboptimal algorithm used.

Let's try to look into in a greater details.
For the first one, rtsock protocol is overly chatty, consuming ~240 bytes per prefix. With that, full-view generation takes ~140 mbytes of data that needs to be generated under radix read lock. This indeed makes all writers unhappy and consumes decent amount of resources. Protocol itself is pretty rigid and it would be hard to cut report size even in half. Addressing it would require creation of an additional interface and adopting it in major routing daemons, making them further diverge from their generic *BSD implementations. I would rather invest time in the netlink support, as there have been quite a lot of efforts to improve its performance, the last one is the explicit nexthops introduction.

  1. I fully agree, that wrong implementation would indeed hammer kernel without any usable results. However, the exponential buffer size extension approach, used as a retry mechanism, would just give 2x overhead in the worst case (assuming 2x buffer size multiplier), or 3x overhead for 1.5 multiplier. I think this is quite a good tradeoff, allowing both good user experience by making the tool reliable and not putting a lot of additional (unbounded) pressure to the kernel.

What do you think?