Page MenuHomeFreeBSD

Add gve, the driver for Google Virtual NIC (gVNIC)
ClosedPublic

Authored by shailend_google.com on Apr 28 2023, 4:42 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Sep 5, 11:21 AM
Unknown Object (File)
Thu, Sep 5, 10:07 AM
Unknown Object (File)
Mon, Aug 26, 2:53 PM
Unknown Object (File)
Wed, Aug 21, 6:59 AM
Unknown Object (File)
Sun, Aug 18, 3:21 PM
Unknown Object (File)
Thu, Aug 15, 9:56 PM
Unknown Object (File)
Thu, Aug 15, 9:04 AM
Unknown Object (File)
Thu, Aug 15, 9:04 AM

Details

Summary

gVNIC is a virtual network interface designed specifically for
Google Compute Engine (GCE). It is required to support per-VM Tier_1
networking performance, and for using certain VM shapes on GCE.

The NIC supports TSO, Rx and Tx checksum offloads, and RSS.
It does not currently do hardware LRO, and thus the software-LRO
in the host is used instead. It also supports jumbo frames.

For each queue, the driver negotiates a set of pages with the NIC to
serve as a fixed bounce buffer, this precludes the use of iflib.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/dev/gve/gve_adminq.c
108

Just use device_printf() here like everywhere else?

sys/dev/gve/gve_plat.h
68

We need to include netinet/tcp_lro.h here for the use of struct lro_ctrl in gve.h.

sys/dev/gve/gve_rx.c
426

Please use m_gethdr() in new code.

434

Please use m_get() in new code.

sys/dev/gve/gve_tx.c
115

device_print() needs a trailing newline.

620

This isn't necessarily an IPv4 packet.

Some minor nits

sys/dev/gve/gve_adminq.c
250

Nit: hide this under bootverbose; the message is useful for debugging but not very useful for regular boots (the rx queue allocation should always succeed), and omitting these would help reducing boot time.

295
323

Nit: we don't usually explicitly print out this in network drivers. Maybe it can be removed or hidden under bootverbose?

374–375

Maybe hide this under bootverbose?

568

Maybe hide this under if (bootverbose)?

sys/dev/gve/gve_main.c
195

This (operation being successful) is usually a bootverbose message.

386

This should be hidden under bootverbose because it always succeeds.

493

This can be hidden under bootverbose because it's an operation that normally succeeds.

526
670

This (and the link is down below) is typically bootverbose message.

793

This is usually a bootverbose message.

sys/dev/gve/gve_tx.c
124–127

(nit: M_WAITOK will always give a non-NULL result)

133–136

(nit: M_WAITOK will always give a non-NULL result).

sys/dev/gve/gve_utils.c
239

Nit: this is more useful for debugging but may be too verbose for regular boot, therefore it's advisable to hide it under bootverbose (doing so would improve boot time).

shailend_google.com marked 41 inline comments as done.
  • Fix an ipv6 Tx bug
  • Remove all uses of linuxkpi
  • Stop checking for failure of malloc when WAITOK is supplied
  • Hide all the happy-path logging behind bootverbose
  • Make few style corrections and typo fixes
  • Introduce a couple of new sysctl counters

The last one is unrelated to review comments and is just a straggler that did not make the first diff.

Could you please elaborate a bit more? How exactly does iflib make this hard to implement?

It looked to me that iflib assumes that the NIC can read packets from and write packets to arbitrary locations. In this driver, all Tx packets need to be necessarily copied into a pre-negotiated set of pages, the Tx mbufs produced by the upper layers cannot be DMA-ed as-is to be seen by the NIC. In Rx too the location where the NIC can write is pre-determined, so there is really no "posting buffers" phase. So a lot of the goodies in iflib relating to buffer management do not apply to this driver, and it didn't look to me that there was a way to ask iflib to allow the driver to keep control of buffer management -- but even if there was a way to do that, it felt too neither-here-nor-there to pursue.

sys/dev/gve/gve_main.c
670

"link is down" message originates from the NIC and is not expected to happen if things are working, so keeping it, moved the "link is up" to bootverbose,

842

Indeed, I kept it to make it easy to port to 13.1 without having to fine tune the code

sys/dev/gve/gve_tx.c
533

bytes can be legitimately zero when filling beyond the first descriptor:

payload_nfrags = gve_tx_alloc_fifo(&tx->fifo, pkt_len - first_seg_len,
                                 &info->iov[payload_iov]);

first_seg_len might equal pkt_len

609

It actually does not support VLAN, I changed the code to reflect that.

620

Fixed the bug

761

The error here is inability to queue the mbuf onto the drbr. I assumed that if the driver just returned ENOBUFS, the networking stack above will later try to transmit the mbuf again later. What entrenched this belief of mine further is that I don't observe any mbuf leaks when this happens to a significant number of mbufs; and more conclusively: I don't see any other ethernet driver free an mbuf on a drbr_enqueue failure. Please let me know if I'm mistaken.

Also, freeing the mbuf here causes a panic, further indicating that something above the driver still holds on to the mbuf, and was caught unaware when the driver freed the mbuf from under it.

sys/dev/gve/gve.h
134

What global state is protected by this lock? It looks like it really ought to be a per-interface lock.

sys/dev/gve/gve_main.c
380

No need for backslashes here.

739

If there is more than one interface, this will overwrite an initialized lock.

sys/dev/gve/gve_plat.h
100

I don't think you need these defines, you can use __packed and _Static_assert from cdefs.h.

sys/dev/gve/gve_tx.c
175

Can this be a blocking (i.e., M_WAITOK) allocation? In general, allocations done at driver attach time can block.

355
613

So assert that eh->ether_type != ETHERTYPE_VLAN.

I don't think you can simply declare VLAN-tagged packets as not supported. What happens if someone configures a vlan interface in front of a gve interface? I can't see anything that prevents if_vlan from attaching; it is happy so long as it's an ethernet interface with IFF_BROADCAST and IFF_MULTICAST set, which is the case here.

It looks like the driver should at minimum drop all non-IPv4/v6 packets, and document this constraint in the man page.

620

We can't reasonably assume that ether_type != ETHERTYPE_IPV6 implies that ether_type == ETHERTYPE_IPV4. If that assumption is critical to the code, then it should be asserted.

761

Sorry, you're right. I had mixed up drbr_enqueue() with ifmp_ring_enqueue().

sys/dev/gve/gve_utils.c
130
shailend_google.com marked 10 inline comments as done.
  • Do not assume ipv4 if not ipv6
  • Explicitly declare lack of support for VLAN
  • Replace custom defines of Static_assert with native calls
  • Switch to a per-interface lock
sys/dev/gve/gve_main.c
310
sys/dev/gve/gve_plat.h
94

C macro bodies should be parenthesized to avoid operator precedence bugs.

sys/dev/gve/gve_rx.c
374

Could you explain this mechanism? What ensures that the page_info reference remains live? Consider the scenario where a received packet is placed in a userspace socket buffer and isn't processed immediately, and the interface is brought down before the mbuf is freed.

How does the driver know it can safely reuse this memory, given that there don't seem to be any barriers here? It looks like we should be accessing can_flip with acquire/release semantics.

385

Do we need this macro? It's used once.

416

Below you copy the packet into this mbuf without checking that it actually fits. A buffer returned by m_getcl() is 2KB (MCLBYTES) in size, but according to the review description the driver supports jumbo frames.

Can we hit this path with a jumbo frame? If not, please add KASSERT()s to that effect.

544

What's the purpose of having nested if-statements here? All of these conditions can be checked in one predicate.

sys/dev/gve/gve_tx.c
677

The indentation of continuing lines is inconsistent. Please stick with one style, ideally the one prescribed by the style(9) man page, though I think most of the rest of the driver uses Linux style for that.

shailend_google.com marked 7 inline comments as done.
  • Account for interface leaving before all Rx mbufs are freed
  • Assert that all packet fragments fit in a cluster mbuf
  • Have a consistent continuation-line style
sys/dev/gve/gve_rx.c
374

It is a bug, the ext_arg1 pointer would indeed be pointing to driver-private freed memory if the interface is detached before all in-flight Rx mbufs are consumed.

I have now fixed it by having the driver and the mbuf-freeing stack above it be independently able to free the bounce-buffer page: if the driver is detached after all inflight mbufs are consumed, then it would be the entity that finally frees the page, and if instead it detaches before some mbufs are consumed, then the pages pointed to by those mbufs are only truly freed when mfree acts on them.

This is achieved by the following changes:

  • Instead of allocating the bounce buffer pages with bus_dmamem_alloc, vm_page_alloc_noobj is used. The driver then has to do kva_alloc+pmap_qenter. The page is alloced with an initial wire count of 1.
  • Before surrendering a newborn Rx mbuf to the stack above, the driver bumps up the wire count by using vm_page_wire, which, since https://reviews.freebsd.org/D26173, works on pages not belonging to an object.
  • The MEXTADD callback still exists, but it no longer works on driver-private memory and instead on the underlying vm_page_t.
  • Both the callback and the driver's detach-time free routine lower the wire count with vm_page_unwire_noq which returns true only to the last reference holder, who would then end up doing kva_free+pmap_qremove+vm_page_free.
  • The driver reads the ref count and if it finds it to be 1, it concludes that nothing above it is using the page, and thus avoids a copy. I could not find any race conditions in the driver reading the ref count un-atomically.

Although this adds a bit of complexity, it helps preserve driver performance and keeps its copy rate (and by extension, cpu util) very low.

416

The 9k-sized Rx packets arrive as a chain of MCLBYTES-sized fragments.

Thanks for fixing the indentation style.

I think the driver's in decent shape, the inline comments are minor. If you haven't tried testing the driver with GENERIC-KASAN and GENERIC-KMSAN configurations, I'd suggest it. I've never tried either of them in GCE so you might run into some unrelated problems there, especially with KMSAN; if so, please let us know.

share/man/man4/gve.4
164
sys/dev/gve/gve_rx.c
78

Extra newline.

374

You might stick a __predict_false annotation here. This ought to be very rare.

374

Thank you for the explanation. I think this is ok, though I made some comments elsewhere.

426

I think this scheme is fine from a correctness perspective.

I can imagine scenarios where the median rx mbuf lifetime is slightly longer than the time it takes to go through the rx ring once, in which case this page flipping optimization might fail most of the time. This could happen when the kernel needs to decrypt the contents of each received mbuf. I suppose the counters should at least make this kind of problem easy to identify, or is it not a concern in practice?

428

This should use atomic_load_int(), to signal that this is performing an unlocked access:

int ref_count = atomic_load_int(&page_info->page->ref_count);

469

I'm not certain, but now that the receive buffers are not allocated through bus_dma, KMSAN doesn't have any way to know that the received buffers are initialized, so it might raise false positives. You might need to call kmsan_mark_mbuf() before handing it to upper layers.

sys/dev/gve/gve_tx.c
408

git apply complains about a whitespace error on this line.

sys/dev/gve/gve_utils.c
258

Still using the old indentation style here.

This revision is now accepted and ready to land.Jun 1 2023, 3:54 PM

I have a bunch of silly style things that my automated script catches. Please consider fixing them.

You might want to run src/tools/build/checkstyle9.pl youreself for this. You can ignore it being picky about 80 column violations, but the >120 really should be attended to (there's just one)
There were 22 errors (which I've flagged examples of above) and 84 warnings (many of which are overly picky warnings about 80 columns and most aren't easily dealt with because the rule of having the entire string on one line for messages trumps the length rule because we've found grepping for messages more important than going over line length a little bit).

Thanks!

sys/dev/gve/gve.h
236

This line is over 90 characters, which is starting to be too long. It's under the hard limit in checkstyle9.pl so it's OK if fixing it would make things too ugly or shorten the comment too much.

243

Block comments are

/*
 * comment
 */

Where /* and */ are on lines of their own. There's several examples, but I'm just flagging this one.

tools/build/checkstyle9.pl catches these as a 'warning' (but see overall comment about some of the warnings)

sys/dev/gve/gve_adminq.c
53

This line is way too long. Consider breaking it up like the lines just before it.

154

stlye(9) heavily suggests () around the return values. Since there's only a few that need fixing, please consider it here and below.
There's 9 more of these I've not flagged.

591

Extra space here between return and (

sys/dev/gve/gve_main.c
446

FreeBSD style is 'gve_priv *priv' with a space after the * but none between the * and variable name.

455

ditto... Not calling out each individual one here, but there's 4 more after this

sys/dev/gve/gve_plat.h
90

Need a space here...
But maybe just do
#define min(x, y) MIN(x, y)
since MIN is defined in sys/param.h in FreeBSD

sys/dev/gve/gve_tx.c
790

need a space after 'for' here

shailend_google.com marked 17 inline comments as done.
  • Use __predict_false in the MEXTADD callback
  • Read page ref count with atomic_load_int
  • A bunch of style corrections
  • Switch to MIN from sys/params.h
This revision now requires review to proceed.Jun 1 2023, 11:38 PM
sys/dev/gve/gve.h
236

Sorry, couldn't find a way to make it non ugly and also be under 80

sys/dev/gve/gve_rx.c
426

At least in the testing I did, I wasn't able to cause a situation in which the driver copies most of the time. Such conditions might very well exist but may not be too commonplace.

469

Understood, I will try to test this.

I'll commit this version (with minor changes to remove it from i386 for now, and connect the manual page to build) and follow up with additional changes.

This revision is now accepted and ready to land.Jun 2 2023, 9:35 PM