Page MenuHomeFreeBSD

tcp: LRO epoch issue as well a potential pullup issue.

Authored by rrs on Thu, Jul 15, 11:17 AM.


Group Reviewers

There are some drivers that call LRO without having been in
an epoch. This of course can cause a panic due to the assumption
that we are in an epoch. Also as long as we are here cleaning
up a few things, there is no validation that we don't need
a pullup. In theory drivers should always have the IP+TCP
headers in the first mbuf, but there is no assurance that
this is true, let's not assume it and instead validate it.
Also in investigating this the rack_bbr_common, which as
a pullup in it for one of its paths does a mfree in a pullup
failure, thats plain wrong.

Test Plan

Make sure that LRO still works and also get the reporter of the
epoch issue to test his specific circumstances with the patch
to validate that it fixes the epoch issue.

Diff Detail

Lint Skipped
Unit Tests Skipped

Event Timeline

rrs requested review of this revision.Thu, Jul 15, 11:17 AM

Please use __predict_false() where possible.


Please use __predict_false(needs_epoch) instead of needs_epoch.

See latest patch I sent...


See above.


See above.


See above.


See above.


This code will never trigger, because the tcp_lro_parser() already checks that the full payload, including the TCP header and options, fit into m_len. If it doesn't it returns NULL, and so the packet is not LRO'ed.

/* We expect a contiguous header [eh, ip, tcp]. */
pa = tcp_lro_parser(m, &po, &pi, true);
if (__predict_false(pa == NULL))
        return (TCP_LRO_NOT_SUPPORTED);

Why is this even needed?

The way I'm reading this, almost all drivers mark themselves as network drivers (INTR_TYPE_NET) when registering interrupts. When that happens, ithread_loop() does the NET_EPOCH_ENTER() for them, around calls to their interrupt handler. It is only "special snowflake" drivers that do not use this mechanism and mark themselves as IFF_KNOWSEPOCH. So I'd say this should be replaced by a NET_EPOCH_ASSERT()

There is a similar check in ether_input which Gleb added. I've added him to the review for his comment.

A couple of bits..

Hans great to hear that the pullup is not needed earlier in the parser.. I can pull that code out then.

Michael, yes I can put things at predict false

Drew, the bits on epoch is an actual reported bug (Michael can chime in on the number.. NFS or some such I guess).

And something in this is broken yet so there will be more changes.. if I have LRO going through the latest
rack code things break still debugging


So I have now been looking at the parser, can you explain to me how
it is checking this?

The first thing I see that tcp_lro_parser() does is call the low level parser, but
that only gets m->m_data and walks through the pointer, it does not validate
that m_len is long enough to do what its doing.

Now I agree that most drivers will not be sending these short mbufs
in, but how is just walking the headers validating that a driver did
not violate the thing?

Maybe I am missing something can you please point out to me how this
is validated??




When packets enter the tcp_lro_parser() I've factored out the generic parts into the tcp_lro_low_level_parser() so that it can be used for both inner and outer headers.

Before the tcp_lro_low_level_parser() function returns non-NULL it updates parser->total_hdr_len, which include all TCP options, if any.

Then before tcp_lro_parser() returns, it checks total_hdr_len agains m_len:

if (data_ptr == NULL || po->total_hdr_len > m->m_len)
       return (NULL);

If the m_len is not big enough to contain all inner and outer headers, including TCP options, then it fails passing the packet into the LRO engine basically.



Yes I see that this prevents us from having a packet that fails that particular test...

But what happens if I send in (from some errant driver) a m_pkthdr with the E-net and IP header but not
the TCP header.. yes a driver should not do that.. but you make no provisions in the parsing code. All
you are going to do is romp on through whatever was in the mbuf before and dereference the data..

For example if you see its IPPROT_TCP then you just set parser->tcp = ptr and then dereference
that pointer (to get option length as well as the ports). This is not going to go well if a driver
sends in a non-contiguous block.. you *may* catch it, but you may not.

We probably need to send in the m_len to the low-level parser and make sure it does not
exceed that somehow.


Hi Randall,

The tcp_lro_parser() function returns NULL if there is only ETH+IP headers, if you look carefully.

The tcp_lro_parser() function by default, only returns non-NULL, if the inner header is TCP and all of it fits into the first mbuf in the chain.



This is broken if the caller send you ETH+IP+TCP and the TCP_Options in the next mbuf, you
will deref the options and read garbage.

If the caller passes you ETH+IP and the TCP + Options is in the next mbuf
then you will deref whatever happens to be in the mbuf.

You thus end up with unknown results and problems.

Michael and I have chatted on the transport sync. I am going to remove
this review, and he is going to separate out the EPOCH review on its own.

Then we need to address the fact that the parser code is possibly
wandering through garbage in a separate review. We either need
to protect against this i.e. using pullup or pass in the m_len and track
how much is left.. or fix it with pullups..


This is broken if the caller send you ETH+IP+TCP and the TCP_Options in the next mbuf, you
will deref the options and read garbage.

Hi Randall,

The current LRO code will just forward such packets AS-IS to the TCP stack, because the tcp_lro_parser() will return NULL then!

mbuf chains like you describe are _not_ touched by the current LRO code.


This is a driver bug. The hyperv driver appears to be using a taskqueue to inject packets into the network stack, rather than a normal interrupt handler. As such, I feel that it is the hyperv driver's responsibility to take the network epoch, and not force 99% of drivers which do the right thing to take the epoch twice.

I'd really like to simplify all input paths, not make them more complex with unneeded tests and atomic operations.