This is inspired by the counter(9) per-CPU (PCPU) implementation. @ae removed cached route support for tunnel interfaces in 0b9f5f8a and f325335ca. The original implementation did not work with concurrent I/O because the cached route was not properly synchronized, and it had to acquire an exclusive lock since ip_output() / ip6_output() may update the cached route.
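The per-CPU idea can be illustrated with a small userland sketch. This is not the code in this change: NSLOTS, struct route_entry, slow_lookup() and cached_nexthop() are hypothetical names, and the CPU index is passed in explicitly, whereas kernel code would have to stay on its CPU while it uses the slot. Each slot is only ever touched by its own CPU, so updating the cached entry needs no exclusive lock.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 4                /* hypothetical: stands in for the CPU count */

struct route_entry {            /* hypothetical stand-in for a cached route */
	uint32_t dst;           /* destination the entry was resolved for */
	uint32_t nexthop;       /* next hop resolved for dst */
	int      valid;         /* non-zero once the slot has been filled */
};

struct pcpu_route_cache {
	struct route_entry slot[NSLOTS];    /* one private slot per CPU */
};

static unsigned slow_lookup_calls;          /* counts slow-path lookups */

/* Hypothetical slow path: stands in for a full routing table lookup. */
static uint32_t
slow_lookup(uint32_t dst)
{
	slow_lookup_calls++;
	return (dst + 1);
}

/*
 * Return the next hop for dst using only the calling CPU's slot.
 * On a miss the slot is refilled in place; no other CPU reads or
 * writes this slot, so no lock is taken.
 */
static uint32_t
cached_nexthop(struct pcpu_route_cache *c, unsigned cpu, uint32_t dst)
{
	struct route_entry *re;

	assert(cpu < NSLOTS);
	re = &c->slot[cpu];
	if (!re->valid || re->dst != dst) {
		re->nexthop = slow_lookup(dst);
		re->dst = dst;
		re->valid = 1;
	}
	return (re->nexthop);
}
```

In this sketch a repeated lookup for the same destination on the same slot skips slow_lookup(), while the first lookup on another slot takes the slow path once and fills its own entry.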
A simple gif(4) over disc(4) test on an N5105 box shows an increase of about 3% to 17% in forwarding rate and a decrease of about 5% to 41% in CPU time, varying with the size of the routing table.