Page MenuHomeFreeBSD

Add gve, the driver for Google Virtual NIC (gVNIC)

Authored by on Apr 28 2023, 4:42 PM.
Referenced Files
Unknown Object (File)
Tue, Sep 26, 9:11 AM
Unknown Object (File)
Tue, Sep 26, 9:11 AM
Unknown Object (File)
Tue, Sep 26, 9:10 AM
Unknown Object (File)
Tue, Sep 26, 9:10 AM
Unknown Object (File)
Thu, Sep 7, 11:22 AM
Unknown Object (File)
Thu, Sep 7, 11:18 AM
Unknown Object (File)
Thu, Sep 7, 11:18 AM
Unknown Object (File)
Thu, Sep 7, 11:17 AM



gVNIC is a virtual network interface designed specifically for
Google Compute Engine (GCE). It is required to support per-VM Tier_1
networking performance, and for using certain VM shapes on GCE.

The NIC supports TSO, Rx and Tx checksum offloads, and RSS.
It does not currently do hardware LRO, and thus the software-LRO
in the host is used instead. It also supports jumbo frames.

For each queue, the driver negotiates a set of pages with the NIC to
serve as a fixed bounce buffer, this precludes the use of iflib.

Diff Detail

rG FreeBSD src repository
Lint Not Applicable
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Just use device_printf() here like everywhere else?


We need to include netinet/tcp_lro.h here for the use of struct lro_ctrl in gve.h.


Please use m_gethdr() in new code.


Please use m_get() in new code.


device_print() needs a trailing newline.


This isn't necessarily an IPv4 packet.

Some minor nits


Nit: hide this under bootverbose; the message is useful for debugging but not very useful for regular boots (the rx queue allocation should always succeed), and omitting these would help reducing boot time.


Nit: we don't usually explicitly print out this in network drivers. Maybe it can be removed or hidden under bootverbose?


Maybe hide this under bootverbose?


Maybe hide this under if (bootverbose)?


This (operation being successful) is usually a bootverbose message.


This should be hidden under bootverbose because it always succeeds.


This can be hidden under bootverbose because it's an operation that normally succeeds.


This (and the link is down below) is typically bootverbose message.


This is usually a bootverbose message.


(nit: M_WAITOK will always give a non-NULL result)


(nit: M_WAITOK will always give a non-NULL result).


Nit: this is more useful for debugging but may be too verbose for regular boot, therefore it's advisable to hide it under bootverbose (doing so would improve boot time). marked 41 inline comments as done.
  • Fix an ipv6 Tx bug
  • Remove all uses of linuxkpi
  • Stop checking for failure of malloc when WAITOK is supplied
  • Hide all the happy-path logging behind bootverbose
  • Make few style corrections and typo fixes
  • Introduce a couple of new sysctl counters

The last one is unrelated to review comments and is just a straggler that did not make the first diff.

Could you please elaborate a bit more? How exactly does iflib make this hard to implement?

It looked to me that iflib assumes that the NIC can read packets from and write packets to arbitrary locations. In this driver, all Tx packets need to be necessarily copied into a pre-negotiated set of pages, the Tx mbufs produced by the upper layers cannot be DMA-ed as-is to be seen by the NIC. In Rx too the location where the NIC can write is pre-determined, so there is really no "posting buffers" phase. So a lot of the goodies in iflib relating to buffer management do not apply to this driver, and it didn't look to me that there was a way to ask iflib to allow the driver to keep control of buffer management -- but even if there was a way to do that, it felt too neither-here-nor-there to pursue.


"link is down" message originates from the NIC and is not expected to happen if things are working, so keeping it, moved the "link is up" to bootverbose,


Indeed, I kept it to make it easy to port to 13.1 without having to fine tune the code


bytes can be legitimately zero when filling beyond the first descriptor:

payload_nfrags = gve_tx_alloc_fifo(&tx->fifo, pkt_len - first_seg_len,

first_seg_len might equal pkt_len


It actually does not support VLAN, I changed the code to reflect that.


Fixed the bug


The error here is inability to queue the mbuf onto the drbr. I assumed that if the driver just returned ENOBUFS, the networking stack above will later try to transmit the mbuf again later. What entrenched this belief of mine further is that I don't observe any mbuf leaks when this happens to a significant number of mbufs; and more conclusively: I don't see any other ethernet driver free an mbuf on a drbr_enqueue failure. Please let me know if I'm mistaken.

Also, freeing the mbuf here causes a panic, further indicating that something above the driver still holds on to the mbuf, and was caught unaware when the driver freed the mbuf from under it.


What global state is protected by this lock? It looks like it really ought to be a per-interface lock.


No need for backslashes here.


If there is more than one interface, this will overwrite an initialized lock.


I don't think you need these defines, you can use __packed and _Static_assert from cdefs.h.


Can this be a blocking (i.e., M_WAITOK) allocation? In general, allocations done at driver attach time can block.


So assert that eh->ether_type != ETHERTYPE_VLAN.

I don't think you can simply declare VLAN-tagged packets as not supported. What happens if someone configures a vlan interface in front of a gve interface? I can't see anything that prevents if_vlan from attaching; it is happy so long as it's an ethernet interface with IFF_BROADCAST and IFF_MULTICAST set, which is the case here.

It looks like the driver should at minimum drop all non-IPv4/v6 packets, and document this constraint in the man page.


We can't reasonably assume that ether_type != ETHERTYPE_IPV6 implies that ether_type == ETHERTYPE_IPV4. If that assumption is critical to the code, then it should be asserted.


Sorry, you're right. I had mixed up drbr_enqueue() with ifmp_ring_enqueue().

130 marked 10 inline comments as done.
  • Do not assume ipv4 if not ipv6
  • Explicitly declare lack of support for VLAN
  • Replace custom defines of Static_assert with native calls
  • Switch to a per-interface lock

C macro bodies should be parenthesized to avoid operator precedence bugs.


Could you explain this mechanism? What ensures that the page_info reference remains live? Consider the scenario where a received packet is placed in a userspace socket buffer and isn't processed immediately, and the interface is brought down before the mbuf is freed.

How does the driver know it can safely reuse this memory, given that there don't seem to be any barriers here? It looks like we should be accessing can_flip with acquire/release semantics.


Do we need this macro? It's used once.


Below you copy the packet into this mbuf without checking that it actually fits. A buffer returned by m_getcl() is 2KB (MCLBYTES) in size, but according to the review description the driver supports jumbo frames.

Can we hit this path with a jumbo frame? If not, please add KASSERT()s to that effect.


What's the purpose of having nested if-statements here? All of these conditions can be checked in one predicate.


The indentation of continuing lines is inconsistent. Please stick with one style, ideally the one prescribed by the style(9) man page, though I think most of the rest of the driver uses Linux style for that. marked 7 inline comments as done.
  • Account for interface leaving before all Rx mbufs are freed
  • Assert that all packet fragments fit in a cluster mbuf
  • Have a consistent continuation-line style

It is a bug, the ext_arg1 pointer would indeed be pointing to driver-private freed memory if the interface is detached before all in-flight Rx mbufs are consumed.

I have now fixed it by having the driver and the mbuf-freeing stack above it be independently able to free the bounce-buffer page: if the driver is detached after all inflight mbufs are consumed, then it would be the entity that finally frees the page, and if instead it detaches before some mbufs are consumed, then the pages pointed to by those mbufs are only truly freed when mfree acts on them.

This is achieved by the following changes:

  • Instead of allocating the bounce buffer pages with bus_dmamem_alloc, vm_page_alloc_noobj is used. The driver then has to do kva_alloc+pmap_qenter. The page is alloced with an initial wire count of 1.
  • Before surrendering a newborn Rx mbuf to the stack above, the driver bumps up the wire count by using vm_page_wire, which, since, works on pages not belonging to an object.
  • The MEXTADD callback still exists, but it no longer works on driver-private memory and instead on the underlying vm_page_t.
  • Both the callback and the driver's detach-time free routine lower the wire count with vm_page_unwire_noq which returns true only to the last reference holder, who would then end up doing kva_free+pmap_qremove+vm_page_free.
  • The driver reads the ref count and if it finds it to be 1, it concludes that nothing above it is using the page, and thus avoids a copy. I could not find any race conditions in the driver reading the ref count un-atomically.

Although this adds a bit of complexity, it helps preserve driver performance and keeps its copy rate (and by extension, cpu util) very low.


The 9k-sized Rx packets arrive as a chain of MCLBYTES-sized fragments.

Thanks for fixing the indentation style.

I think the driver's in decent shape, the inline comments are minor. If you haven't tried testing the driver with GENERIC-KASAN and GENERIC-KMSAN configurations, I'd suggest it. I've never tried either of them in GCE so you might run into some unrelated problems there, especially with KMSAN; if so, please let us know.


Extra newline.


You might stick a __predict_false annotation here. This ought to be very rare.


Thank you for the explanation. I think this is ok, though I made some comments elsewhere.


I think this scheme is fine from a correctness perspective.

I can imagine scenarios where the median rx mbuf lifetime is slightly longer than the time it takes to go through the rx ring once, in which case this page flipping optimization might fail most of the time. This could happen when the kernel needs to decrypt the contents of each received mbuf. I suppose the counters should at least make this kind of problem easy to identify, or is it not a concern in practice?


This should use atomic_load_int(), to signal that this is performing an unlocked access:

int ref_count = atomic_load_int(&page_info->page->ref_count);


I'm not certain, but now that the receive buffers are not allocated through bus_dma, KMSAN doesn't have any way to know that the received buffers are initialized, so it might raise false positives. You might need to call kmsan_mark_mbuf() before handing it to upper layers.


git apply complains about a whitespace error on this line.


Still using the old indentation style here.

This revision is now accepted and ready to land.Jun 1 2023, 3:54 PM

I have a bunch of silly style things that my automated script catches. Please consider fixing them.

You might want to run src/tools/build/ youreself for this. You can ignore it being picky about 80 column violations, but the >120 really should be attended to (there's just one)
There were 22 errors (which I've flagged examples of above) and 84 warnings (many of which are overly picky warnings about 80 columns and most aren't easily dealt with because the rule of having the entire string on one line for messages trumps the length rule because we've found grepping for messages more important than going over line length a little bit).



This line is over 90 characters, which is starting to be too long. It's under the hard limit in so it's OK if fixing it would make things too ugly or shorten the comment too much.


Block comments are

 * comment

Where /* and */ are on lines of their own. There's several examples, but I'm just flagging this one.

tools/build/ catches these as a 'warning' (but see overall comment about some of the warnings)


This line is way too long. Consider breaking it up like the lines just before it.


stlye(9) heavily suggests () around the return values. Since there's only a few that need fixing, please consider it here and below.
There's 9 more of these I've not flagged.


Extra space here between return and (


FreeBSD style is 'gve_priv *priv' with a space after the * but none between the * and variable name.


ditto... Not calling out each individual one here, but there's 4 more after this


Need a space here...
But maybe just do
#define min(x, y) MIN(x, y)
since MIN is defined in sys/param.h in FreeBSD


need a space after 'for' here marked 17 inline comments as done.
  • Use __predict_false in the MEXTADD callback
  • Read page ref count with atomic_load_int
  • A bunch of style corrections
  • Switch to MIN from sys/params.h
This revision now requires review to proceed.Jun 1 2023, 11:38 PM

Sorry, couldn't find a way to make it non ugly and also be under 80


At least in the testing I did, I wasn't able to cause a situation in which the driver copies most of the time. Such conditions might very well exist but may not be too commonplace.


Understood, I will try to test this.

I'll commit this version (with minor changes to remove it from i386 for now, and connect the manual page to build) and follow up with additional changes.

This revision is now accepted and ready to land.Jun 2 2023, 9:35 PM