Page MenuHomeFreeBSD

Teach lagg(4) to not use flowid if hash type is M_HASHTYPE_OPAQUE
AbandonedPublic

Authored by hiren on Mar 19 2015, 5:46 PM.

Details

Summary

flowid_shift helps in load sharing but some nics like igb(4), flowid is just cpu/msix id and the values would generally be between 0 to 7 or some small value like that. Right-shift '>>' 16 causes the resulting value to be 0 affecting the load balancing adversely.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

hiren retitled this revision from to Change lagg(4) flowid_shift default value from 16 to 0.
hiren updated this object.
hiren edited the test plan for this revision. (Show Details)
hiren added a reviewer: scottl.

igb h/w doesn't provide full flowid by default currently and that's why we set the hash type to M_HASHTYPE_OPAQUE
Teach lagg(4) code to not use flowid if it's of type opaque. This is better fix that changing flowid shift to 0 as it'd break good drivers providing a useful flowid like cxgbe.

hiren retitled this revision from Change lagg(4) flowid_shift default value from 16 to 0 to Teach lagg(4) to not use flowid if hash type is M_HASHTYPE_OPAQUE.Mar 21 2015, 1:29 AM

I am not sure if this is the right fix.

Before this change, if I set flowid_shift to 0, igb(4) can use
p = m->m_pkthdr.flowid >> sc->flowid_shift;

and doesn't have to do the calculation itself via m_ether_tcpip_hash(). With this change, it has to do that. I don't think that's efficient.

I'll think about this more and get back.

Adding navdeep as he also has some concerns.

So, here is the code block in question:

2053         if ((sc->sc_opts & LAGG_OPT_USE_FLOWID) &&
2054             M_HASHTYPE_GET(m) != M_HASHTYPE_NONE)
2055                 p = m->m_pkthdr.flowid >> sc->flowid_shift;
2056         else
2057                 p = m_ether_tcpip_hash(sc->sc_flags, m, lb->lb_key);
2058         p %= sc->sc_count;
2059         lp = lb->lb_ports[p];

Here, sc_count is number of ports. All we need to get out of line 2055 is what port to send data to along with 2058. It is not deciding queue within the port like usual flowid mappings do.

i.e. We are using flowid to loadbalance traffic within ports and not within queues, as far as I can tell.

I am not clear how flowid_shift actually helps in doing that.

@scottl can you please educate me here? What was the exact problem that you saw that you fixed with r260070?

Thanks in advance.

You are correct, 2055-2058 select the port. However, the cxgber driver uses a similar algorithm to select the queue within the selected port. Cxgbe uses the lower bits of the flowid, so if lagg also uses the lower bits, you'd wind up having the same bits used twice. That, in turn can cause aliasing; a predictable bit pattern will always go to each selected port, limiting the number of pseudo-random bits left for the driver to use for its selection.

Think about it this way. If you have 2 interfaces in lagg, a valid flowid, and you have a flowid_shift = 0, then bit 0 will always be used to select the interface. If bit0 equals 0 then the packet will go to the first interface, if bit0 equals 1 then the packet will go to the second interface. That means that the first interface will always see a flowid with bit0 == 0. If that interface has 2 queues and uses the same algorithm with the flowid for queue selection, then that it'll always look at bit0 of the flowid, which will always be 0. Thus it'll always select queue0. A similar aliasing will happen for the second interface. If the interface has 4 queues, then bit0 will still always be 0, and only bit1 will be pseudo-random, so it'll only ever use queue0 or queue2, never queue1 or queue3.

Having a large flowid_shift helps guarantee that different bits will be used for interface selection than for queue selection. I really think that the drivers should be advertising how many bits of the flowid are useful for the upper layers to use, so that they can then safely reserve bits of it for their own use. The M_HASHTYPE_FOO flags probably don't convert enough information. You also don't want to always use M_HASHTYPE_OPAQUE because that forces lagg to do a someone costly hash generation on each packet, something that you don't want to do on a high speed link.

Thanks a lot @scottl for the explanation. This is exactly what we are seeing with igb also when turn shift off.

IMHO, the correct way is to make igb(4) expose true rss hash given by the card. Currently that is hidden under "ifdef RSS". I am not sure what is the reason for it.
igb_rxeof() has:

5142 #ifdef  RSS
5143                         /* XXX set flowtype once this works right */
5144                         rxr->fmp->m_pkthdr.flowid =
5145                             le32toh(cur->wb.lower.hi_dword.rss);

Is it possible to bring this outside RSS? This seems to be fixing the problem we are seeing based on my limited testing.

I'll bring this up on -net.

hiren requested a review of this revision.Apr 28 2015, 7:44 PM

I think the current diff looks okay. Can someone review it?

hiren added a reviewer: hiren.

approved by @scottl on irc.

This revision is now accepted and ready to land.Apr 29 2015, 5:19 PM

Just realized what @np told on irc. That all drivers set M_HASHTYPE_OPAQUE in non-RSS case right now which is the default.

This change is not correct in that regard.

I started down this route because igb(4) was not exposing full 32-bit flowid in non-RSS case. I fixed that with https://svnweb.freebsd.org/changeset/base/281838

Now, I don't think change mentioned in this review is needed for lagg(4) to do the right thing. The proposed change would actually hurt.

I guess I'll close this review.