Page MenuHomeFreeBSD

inpcb rtentry/l2 prepend caching
AbandonedPublic

Authored by kmacy on Dec 11 2015, 2:47 AM.
Referenced Files
Unknown Object (File)
Mar 20 2024, 2:57 PM
Unknown Object (File)
Mar 9 2024, 3:37 PM
Unknown Object (File)
Feb 15 2024, 10:40 PM
Unknown Object (File)
Jan 18 2024, 7:13 PM
Unknown Object (File)
Jan 18 2024, 4:26 AM
Unknown Object (File)
Dec 26 2023, 1:54 AM
Unknown Object (File)
Dec 20 2023, 2:38 AM
Unknown Object (File)
Nov 27 2023, 8:14 PM

Details

Summary

D4364 updated to work against D4102

  • cache rtentry in inpcb to skip route lookups on every packet
  • cache prepend (L2 header info) in inpcb to avoid L2 lookups on every packet
  • stale info is avoided by keeping a generation a counter and discarding cached fields when cached generation and global generation no longer match
  • rtentry caching is also a prerequisite for subnet based connection policies (ECN, congestion control protocol, alternate TCP stacks, etc)
Test Plan

Netflix beating

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kmacy retitled this revision from to inpcb rtentry/l2 prepend caching.
kmacy updated this object.
kmacy edited the test plan for this revision. (Show Details)
kmacy added reviewers: rrs, melifaro, karels, gnn.
kmacy set the repository for this revision to rS FreeBSD src repository - subversion.
kmacy added a project: transport.
kmacy added subscribers: benno, transport.

Thanks for updating the patch!
Looks OK, several small comments inline

sys/net/if_ethersubr.c
316

Not that I'm against it, but this might break w/ some HA gateways using gratuitous arp as switchover mechanism. Probably not a problem, just a note.

sys/netinet/in_pcb.c
541

It would be great not to retrieve/address rnh directly since it should't be visible to external consumers.
Having something like rt_tables_get_revision(int fibnum, int family) api would be much better.

553

I'm not sure if it is really possible to have NULL rt_ifp. rt_getifa_fib() would either return some ifa or return an error on addition. And, generally, the only reason not to have ougoing interface, are blackhole/reject routes, but we have always required to insert such routes with any (typically, loopback) interface. So, I'd just remove this check.

592

There is MRT support for IPv6, it would be great not to break it :)

Some in-line comments from a quick pass.

sys/netinet/in_pcb.c
540

It looks like this does not match the style of this file. The file seems to enclose if/else statements like this in appropriate #ifdefs for INET and INET6. As far as I can tell, the code should compile fine; however, the current code will probably produce unnecessary instructions if either INET or INET6 are not defined.

589

Again, it looks like this doesn't match the style of the file, which encloses the actual check inside the ifdef. See, for example, lines 478-489.

As far as I can tell, the code should compile fine as is; however, this may produce unnecessary instructions if INET or INET6 are not defined.

sys/netinet/ip_output.c
245

Should nortfree be initialized to 0 in the true case?

sys/net/if_ethersubr.c
316

I think this needs to be addressed by making sure rnh_gen is incremented when llentry updates occur.

sys/netinet/in_pcb.c
540

It makes it that much uglier handling the else. However, I'll do it for the sake of consistency. I don't think FreeBSD will ever use a compiler without rudimentary dead code elimination. Although, I guess it might generate an extra branch for the always taken else path.

541

Is there such an API in place? Otherwise I'll happily submit a patch when it comes in to being. Blocking on a hypothetical API doesn't make a whole lot of sense tome.

553

I'll turn it in to an MPASS in that case.

592

This patch pre-dates said support. Thanks for the heads up.

kmacy edited edge metadata.

Updated to (I hope) address all the comments and compile against the most recent HEAD. This has been used in ISLN's internal incast testing and I'm hoping to get LL and NFLX to test.

Could Mike point out what his patch provides that this doesn't - apart from the lookup on bound UDP? This provides L2 caching which is important.

kmacy removed a reviewer: transport.
kmacy added a reviewer: hiren.
kmacy removed a subscriber: transport.

About L3 caching:

I think my approach, including a "struct route" in the in_pcb, is both simpler and more correct.

I dislike building a "struct route" on the stack and then moving fields to and from that structure from the pcb. I think it is wrong to compare the route using the in_pcb destination address, as that may have changed (e.g. in the UDP case). The struct route contains the previously-looked-up destination, which is necessary.

When adding a route, it is not necessary to change the rt_gen for every fib, just for the one that was added. It is not necessary to change the gen count for a change or deletion, as those are already handled by the routing code. I don't think it is necessary to use atomic operations, as all that is necessary is that increments happen, not that they increment by a specific amount.

About L2 caching: this approach is very wrong. Caching a pointer and length to the L2 header allows no way to validate or invalidate the cache. What happens when the L2 entry times out and is deleted? I believe the approach in my previous patch is correct, and the struct route should be reverted to contain an lle pointer. I reverted that change to put it in a separate review, as I thought it might be more controversial.

In D4490#111105, @mike-karels.net wrote:

About L3 caching:

I think my approach, including a "struct route" in the in_pcb, is both simpler and more correct.

I dislike building a "struct route" on the stack and then moving fields to and from that structure from the pcb. I think it is wrong to compare the route using the in_pcb destination address, as that may have changed (e.g. in the UDP case). The struct route contains the previously-looked-up destination, which is necessary.

Which is why I'm not supporting UDP at the moment. Bound sockets are the exception and modifying the inpcb requires acquiring a wlock which is awkward to acquire in the context of ip_output. If I were to jump through the necessary hoops to support unbound UDP sockets, that would be a problem.

It might be cleaner to have the struct route in the in_pcb. However, it seems like potentially just more bloat. I do concede that it makes it's use simpler. Nonetheless, passing the inpcb's route down the stack is incorrect moving forward as really it's an implementation flaw that the inpcb lock needs to be held across lengthy essentially stateless operations such as ip_output. So even if we had a struct route in the inpcb, we'd nonetheless want to copy it's contents to the stack.

When adding a route, it is not necessary to change the rt_gen for every fib, just for the one that was added. It is not necessary to change the gen count for a change or deletion, as those are already handled by the routing code. I don't think it is necessary to use atomic operations, as all that is necessary is that increments happen, not that they increment by a specific amount.

If you don't use atomics sometimes the rt_gen will be stale and lookups during that window will hold a potentially incorrect rtentry.

Except in the lle code where I don't know which fib is the correct one, where do I change the rt_gen for "every fib"? You're probably right about deletion - I'll need to think about that. Where do I increment it for change? I only see it in the context of add and delete in route.c

About L2 caching: this approach is very wrong. Caching a pointer and length to the L2 header allows no way to validate or invalidate the cache. What happens when the L2 entry times out and is deleted?

I invalidate all cached context. See if_llatbl.c.

I believe the approach in my previous patch is correct, and the struct route should be reverted to contain an lle pointer. I reverted that change to put it in a separate review, as I thought it might be more controversial.

This is not correct. The cached llentry still has the same lifetime issues in terms of needing to be re-validated. Regenerating the precomputed header when there is a change to the L2 hash tables is an adequate solution.

I'll bring it up on the transport call to see how others feel about it.

I'll respond to other parts soon.  But I want to correct this:

About L2 caching: this approach is very wrong. Caching a pointer and length to the L2 header allows no way to validate or invalidate the cache. What happens when the L2 entry times out and is deleted?

I invalidate all cached context. See if_llatbl.c.

This is not effective. Consider the following:

Thread A checks the routing generation number in ip_output. Thread B then increments the generation count of all the routing tables (the bigger hammer approach!) after the generation is checked, then changes or deletes the L2 header. Only later, the L2 code code is reached by thread A, and it leads to incorrect results or a crash.

In D4490#111292, @mike-karels.net wrote:
I'll respond to other parts soon.  But I want to correct this:

About L2 caching: this approach is very wrong. Caching a pointer and length to the L2 header allows no way to validate or invalidate the cache. What happens when the L2 entry times out and is deleted?

I invalidate all cached context. See if_llatbl.c.

This is not effective. Consider the following:

Thread A checks the routing generation number in ip_output. Thread B then increments the generation count of all the routing tables (the bigger hammer approach!)

Not as big as looking up the L2 information for every single packet, or probably for that matter, regenerating the header every packet.

after the generation is checked, then changes or deletes the L2 header.

Only later, the L2 code code is reached by thread A, and it leads to incorrect results or a crash.

Please read the code again.

A) The precomputed header is a private copy allocated during resolution in ether_output. The previously cached version will be freed when the precomputed header is assigned to inp_prepend. The cached version will be freed if ever there is an rt_gen mismatch. Ideally yes, we'd have a separate generation counter for it. Perhaps on the ifnet. Even the "bigger hammer approach" is superior to not caching because the number of lookups saved is still hundreds of thousands to millions on an active server.

B) The incorrect results will be for the duration of a single packet. The same race applies to the rtentry when we check the generation count. The interface could very well change between the check and its use. The race is still even present when we look up the rtentry on _every_ packet.

With or without caching reliability is not guaranteed on IP.

OK, I missed the malloc/copy in if_ethersubr.c. So there isn't a problem with pointers to freed memory, but the L2 code still has the issue as the routing code with copying the pointers to/from the route structure.

It seems to me that this approach has a lot of extra code to avoid having a struct route in the inpcb. Also, too much of the additional code is in tcp_output.c, which is definitely a layering issue. (Not to mention including ethernet.h in tcp_output.c!) With the code in ip_output.c where it belongs, it is much easier to support UDP (see more below).

It might be cleaner to have the struct route in the in_pcb. However,
it seems like potentially just more bloat. I do concede that it
makes it's use simpler. Nonetheless, passing the inpcb's route down
the stack is incorrect moving forward as really it's an implementation
flaw that the inpcb lock needs to be held across lengthy essentially
stateless operations such as ip_output. So even if we had a struct
route in the inpcb, we'd nonetheless want to copy it's contents to
the stack.

I disagree that the route should be copied to the stack. TCP already holds a write lock on the inpcb, which also locks the tcpcb. This is needed, but is not a problem because simultaneous operations on a socket are rare and almost always in correct. This has nothing to do with passing down the inpcb. And with the code
ip_output.c where it belongs, it is easier to support UDP as well. Bound UDP sockets are not that rare; most anything that sends more than a few packets uses connected sockets.

Not as big as looking up the L2 information for every single packet, or
probably for that matter, regenerating the header every packet.

I agree that the L2 information should be cached as much as possible. As noted earlier, I removed it from D4306 to separate it from the L3 cache, and to discuss those issues separately. If the reviewers prefer, I'll put back it into D4306.

A minor point:

If you don't use atomics sometimes the rt_gen will be stale and
lookups during that window will hold a potentially incorrect rtentry.

I think the window is much smaller for that issue than for the longer time between the generation counter check and the use of the L2 info. However, I could be convinced to use atomics.