flowid_shift helps in load sharing but some nics like igb(4), flowid is just cpu/msix id and the values would generally be between 0 to 7 or some small value like that. Right-shift '>>' 16 causes the resulting value to be 0 affecting the load balancing adversely.
igb h/w doesn't provide full flowid by default currently and that's why we set the hash type to M_HASHTYPE_OPAQUE
Teach lagg(4) code to not use flowid if it's of type opaque. This is better fix that changing flowid shift to 0 as it'd break good drivers providing a useful flowid like cxgbe.
I am not sure if this is the right fix.
Before this change, if I set flowid_shift to 0, igb(4) can use
p = m->m_pkthdr.flowid >> sc->flowid_shift;
and doesn't have to do the calculation itself via m_ether_tcpip_hash(). With this change, it has to do that. I don't think that's efficient.
I'll think about this more and get back.
Adding navdeep as he also has some concerns.
So, here is the code block in question:
2053 if ((sc->sc_opts & LAGG_OPT_USE_FLOWID) && 2054 M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) 2055 p = m->m_pkthdr.flowid >> sc->flowid_shift; 2056 else 2057 p = m_ether_tcpip_hash(sc->sc_flags, m, lb->lb_key); 2058 p %= sc->sc_count; 2059 lp = lb->lb_ports[p];
Here, sc_count is number of ports. All we need to get out of line 2055 is what port to send data to along with 2058. It is not deciding queue within the port like usual flowid mappings do.
i.e. We are using flowid to loadbalance traffic within ports and not within queues, as far as I can tell.
I am not clear how flowid_shift actually helps in doing that.
@scottl can you please educate me here? What was the exact problem that you saw that you fixed with r260070?
Thanks in advance.
You are correct, 2055-2058 select the port. However, the cxgber driver uses a similar algorithm to select the queue within the selected port. Cxgbe uses the lower bits of the flowid, so if lagg also uses the lower bits, you'd wind up having the same bits used twice. That, in turn can cause aliasing; a predictable bit pattern will always go to each selected port, limiting the number of pseudo-random bits left for the driver to use for its selection.
Think about it this way. If you have 2 interfaces in lagg, a valid flowid, and you have a flowid_shift = 0, then bit 0 will always be used to select the interface. If bit0 equals 0 then the packet will go to the first interface, if bit0 equals 1 then the packet will go to the second interface. That means that the first interface will always see a flowid with bit0 == 0. If that interface has 2 queues and uses the same algorithm with the flowid for queue selection, then that it'll always look at bit0 of the flowid, which will always be 0. Thus it'll always select queue0. A similar aliasing will happen for the second interface. If the interface has 4 queues, then bit0 will still always be 0, and only bit1 will be pseudo-random, so it'll only ever use queue0 or queue2, never queue1 or queue3.
Having a large flowid_shift helps guarantee that different bits will be used for interface selection than for queue selection. I really think that the drivers should be advertising how many bits of the flowid are useful for the upper layers to use, so that they can then safely reserve bits of it for their own use. The M_HASHTYPE_FOO flags probably don't convert enough information. You also don't want to always use M_HASHTYPE_OPAQUE because that forces lagg to do a someone costly hash generation on each packet, something that you don't want to do on a high speed link.
Thanks a lot @scottl for the explanation. This is exactly what we are seeing with igb also when turn shift off.
IMHO, the correct way is to make igb(4) expose true rss hash given by the card. Currently that is hidden under "ifdef RSS". I am not sure what is the reason for it.
5142 #ifdef RSS 5143 /* XXX set flowtype once this works right */ 5144 rxr->fmp->m_pkthdr.flowid = 5145 le32toh(cur->wb.lower.hi_dword.rss);
Is it possible to bring this outside RSS? This seems to be fixing the problem we are seeing based on my limited testing.
I'll bring this up on -net.
Just realized what @np told on irc. That all drivers set M_HASHTYPE_OPAQUE in non-RSS case right now which is the default.
This change is not correct in that regard.
I started down this route because igb(4) was not exposing full 32-bit flowid in non-RSS case. I fixed that with https://svnweb.freebsd.org/changeset/base/281838
Now, I don't think change mentioned in this review is needed for lagg(4) to do the right thing. The proposed change would actually hurt.
I guess I'll close this review.