
create a sysctl so that the maximum I/O size on the NFS server can be increased to 1Mbyte
Closed, Public

Authored by rmacklem on Jun 20 2021, 4:47 AM.

Details

Summary

Since MAXPHYS now allows the FreeBSD NFS client
to do 1Mbyte I/O operations, add a sysctl so that the
maximum NFS server I/O size can be set up to 1Mbyte.
The Linux NFS client can also do 1Mbyte I/O operations.

The default of 128Kbytes for the maximum I/O size has
not been changed for two reasons:

  • kern.ipc.maxsockbuf must be increased to support 1Mbyte I/O
  • The limited benchmarking I can do actually shows a drop in I/O rate when the I/O size is above 256Kbytes. (See Test Plan).

I do believe that using a 1Mbyte I/O size can improve performance
but I do not have the hardware needed to benchmark this.
One example case might be a WAN with a large bandwidth X delay,
which needs several Mbytes of data to be in transit to fill the network
pipe. (1Gbps X 40msec --> 4Mbytes --> 1Mbyte * 4 readaheads)
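The bandwidth X delay estimate above can be sketched as a small calculation. This is only an illustration, not part of the patch: the function names are hypothetical, and it uses decimal units (1Gbps = 10^9 bits/sec). With those units, 1Gbps X 40msec comes out at 5,000,000 bytes, in the same range as the rough 4Mbytes figure, so about five 1Mbyte I/Os need to be in flight.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product in bytes: (bits/sec / 8) * (msec / 1000). */
static uint64_t
bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_msec)
{
	return (bits_per_sec / 8 * rtt_msec / 1000);
}

/* Number of I/Os of io_size bytes needed to cover bdp bytes, rounded up. */
static uint64_t
ios_in_flight(uint64_t bdp, uint64_t io_size)
{
	assert(io_size > 0);
	return ((bdp + io_size - 1) / io_size);
}
```

For the same link, ios_in_flight() with a 128Kbyte I/O size gives 39, which suggests why a larger I/O size might help when the readahead count is limited.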

By adding this sysctl, hopefully others can test/benchmark 1Mbyte I/O.

Test Plan

Works at wire speed for 1Gbps networking, if the server's
cache is primed (no need for server disk I/O) for any I/O
size from 128Kbytes->1Mbyte.

Using a local mount via lo0:

I/O size (Kbytes)   Read rate (Mbytes/sec)
128                 620
256                 670
1024                520

I do not know why performance degrades above 256Kbytes,
but it is not a very realistic test (client/server on same machine).

A variety of sizes were tested. When kern.ipc.maxsockbuf is too
small for the requested I/O size, the sysctl fails with a message
indicating how large kern.ipc.maxsockbuf needs to be.
(The client had kern.ipc.maxsockbuf increased to several
Mbytes and the vfs.maxbcachebuf tunable set to 1Mbyte
for testing.)

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

rmacklem created this revision.

Add printf()s indicating why setting vfs.nfsd.srvmaxio failed.

Add locking using NFSD_LOCK()/NFSD_UNLOCK() to
the sysctl function to ensure that the nfsd threads are
not running when vfs.nfsd.srvmaxio is updated.

I've pulled this into my FreeBSD Stable13; I've verified that a client (ubuntu 18x) negotiates wsize/rsize = 131072 by default and that by adding vfs.nfsd.srvmaxio=1048576 to /etc/sysctl.conf, the same client negotiates wsize/rsize = 1048576. I have also done some cursory IO testing.

I had to make this change as it wasn't in the above patch:
--- fs/nfs/nfs_commonkrpc.c.orig
+++ fs/nfs/nfs_commonkrpc.c
@@ -101,8 +101,8 @@
extern int nfsrv_lease;

SVCPOOL *nfscbd_pool;
+int nfs_bufpackets = 4;
static int nfsrv_gsscallbackson = 0;
-static int nfs_bufpackets = 4;
static int nfs_reconnects;
static int nfs3_jukebox_delay = 10;
static int nfs_skip_wcc_data_onerr = 1;

It sure would be nice to see this pulled into current and merged down to Stable13, but I don't have a vote. How about it @asomers or @markj ?

In the commit message, should "The default of 128Mbytes for the maximum I/O size" be "The default of 128Kbytes for the maximum I/O size"?

sys/fs/nfsserver/nfs_nfsdport.c
241

s/mnimal/minimal/

245

Why do you multiply by MSIZE? Both newsrvmaxio and maxsockbuf are in units of bytes, right? And why do you add MCLBYTES twice?

Made inline comment on calculation.

And, yes, it should be 128Kbytes, not 128Mbytes in the
Summary.

I will fix the typo in the comment.

sys/fs/nfsserver/nfs_nfsdport.c
245

This calculation is the inverse of:

sb_max_adj = (u_quad_t)sb_max * MCLBYTES / (MSIZE + MCLBYTES);

(line#599 of sys/kern/uipc_sockbuf.)

My understanding of C (and what happens for a trivial test program) is the
addition "MCLBYTES + MSIZE" happens before the "*=".
I can change this to "(MCLBYTES + MSIZE)" to make it clearer, but
the result doesn't change (at least for FreeBSD's clang).

I add "MCLBYTES - 1" so that the divide by MCLBYTES is rounded up. (Normally a no-op,
since MSIZE is normally a power of 2, so the divide by MCLBYTES is exact,
but I threw it in just in case MSIZE changes to some odd value someday.)
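As a rough userspace sketch of the steps described above (not the code under review), the forward sb_max_adj formula and its rounded-up inverse can be written as follows. The EX_ constants are assumed typical values standing in for MSIZE and MCLBYTES, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed typical values; the real MSIZE/MCLBYTES come from the kernel headers. */
#define EX_MSIZE	256
#define EX_MCLBYTES	2048

/* Forward direction: the sb_max_adj formula quoted above. */
static uint64_t
ex_sb_max_adj(uint64_t sb_max)
{
	return (sb_max * EX_MCLBYTES / (EX_MSIZE + EX_MCLBYTES));
}

/* Inverse: smallest sb_max whose adjusted value is at least want. */
static uint64_t
ex_sb_max_needed(uint64_t want)
{
	uint64_t tval = want;

	/* Guard against overflow in the multiplication below. */
	assert(want <= UINT64_MAX / (EX_MCLBYTES + EX_MSIZE) - 1);
	/* The right-hand side is evaluated first: tval * (MCLBYTES + MSIZE). */
	tval *= EX_MCLBYTES + EX_MSIZE;
	/* Adding MCLBYTES - 1 rounds the following divide up. */
	tval += EX_MCLBYTES - 1;
	tval /= EX_MCLBYTES;
	return (tval);
}
```

With these assumed constants, a wanted value of 1048576 needs an sb_max of 1179648, and feeding that back through ex_sb_max_adj() returns 1048576.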

Just fyi Dave, the change that makes nfs_bufpackets global was
just MFC'd to stable/13. It was in a commit already applied
to "main".

Btw Dave, did you happen to measure performance for 128Kbytes vs 1Mbyte
for your Linux client? (My Linux client is on i386 hardware with 100Mbps networking,
so I obviously can't see any performance difference, just wire speed.)

Ok, I get it. Your calculation is correct. One other thing: when I first read this review, I thought that the restrictions on adjusting newsrvmaxio were so strict that you may as well just make it be a tunable instead of a sysctl. But then I realized that kern.ipc.maxsockbuf can be changed at runtime, and newsrvmaxio depends on that. Is that why you went to the trouble of supporting runtime changes for newsrvmaxio?

This revision is now accepted and ready to land. Jul 15 2021, 3:07 AM

Yes, since kern.ipc.maxsockbuf can be set in sysctl.conf, I wanted
vfs.nfsd.srvmaxio to be set that way as well.
It can also be changed without rebooting the server, by stopping
and restarting the nfsd.

I restricted it so that it can only be set when the nfsd threads are
not running and can only be increased, so that extant mounts wouldn't be broken.
(Unfortunately NFSv3 clients don't know when a server reboots
and extant NFSv4 clients don't re-negotiate maximum I/O size
after a server reboot.)
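The restrictions described in this thread can be modelled in userspace as a minimal sketch. It is not the kernel sysctl handler: the function name, parameters and errno choices are assumptions for illustration, and treating an unchanged value as acceptable is also an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Upper bound of 1Mbyte, per the summary. */
#define EX_SRVMAXIO_LIMIT	(1024 * 1024)

/*
 * Hypothetical model of the checks: returns 0 and updates *curp on
 * success, or an errno value (the specific errno values are assumed).
 */
static int
ex_set_srvmaxio(uint32_t *curp, uint32_t newval, bool nfsd_running,
    uint64_t sb_max, uint64_t sb_needed)
{
	/* Only settable while the nfsd threads are not running. */
	if (nfsd_running)
		return (EBUSY);
	/* Never above 1Mbyte and never decreased, so extant mounts keep working. */
	if (newval > EX_SRVMAXIO_LIMIT || newval < *curp)
		return (EINVAL);
	/* kern.ipc.maxsockbuf (sb_max) must be large enough for the new size. */
	if (sb_max < sb_needed)
		return (EINVAL);
	*curp = newval;
	return (0);
}
```

Here sb_needed is passed in by the caller rather than computed, since the exact formula lives in the patch itself.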

Btw, if you think the calculation would be more readable, I
could write it as one statement. I did the steps because I
thought it would be more readable.
--> At the least, I'll try and improve the comment to show
that it is the inverse of the sb_max_adj calculation.

@rmacklem: yes, my FreeBSD13 is a little stale. I did some crude I/O benchmarking; the graphs are very spiky, but it appears that I got around 200-250 MB/sec with the 128k and 250-300++ MB/sec with 1m. Multiple writers all from the same client over a 1G link.

Change the comment that describes the kern.ipc.maxsockbuf
suggested value, in order to clarify what the calculation is.

This revision now requires review to proceed. Jul 16 2021, 2:01 AM

sys/fs/nfsserver/nfs_nfsdport.c
241

minimal is no longer in the comment.

This revision is now accepted and ready to land. Jul 16 2021, 2:07 AM