Page MenuHomeFreeBSD

make max buffer cache block size and max NFS I/O size tunable
ClosedPublic

Authored by rmacklem on May 30 2017, 12:21 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 12, 9:17 PM
Unknown Object (File)
Thu, Jan 9, 1:23 PM
Unknown Object (File)
Sat, Jan 4, 2:36 AM
Unknown Object (File)
Dec 17 2024, 3:57 AM
Unknown Object (File)
Dec 15 2024, 2:14 AM
Unknown Object (File)
Nov 8 2024, 4:49 PM
Unknown Object (File)
Nov 8 2024, 1:07 AM
Unknown Object (File)
Nov 6 2024, 7:52 PM
Subscribers

Details

Summary

The maximum size of an I/O RPC for the NFS client is defined by the maximum size for a buffer
cache block. Without this patch, this is a compile time option called MAXBCACHEBUF.

This patch changes it to be a tunable.

It also sets BKVASIZE to 1/2 the tuned value. (Is this the appropriate setting?)

Test Plan

I have only tested an NFS mount for 128K at this point, since that is the largest size
supported by the FreeBSD NFS server. cperciva@ may soon test 1M against the AmazonEFS service.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

allanjude added inline comments.
kern/vfs_bio.c
251

should this be a SYSCTL_PROC instead? It could call maxbcachebuf_adjust or similar and enforce the power of 2 requirement etc, and then it could possibly even be runtime tunable?

Well, I have no idea what it would take to make this tunable while the system is running,
and I'm not even willing to try to do so;-)
Several things get sized based on maxbcachebuf and bkvasize during booting and why
they are configured the way they are is a mystery to me.

I probably didn't use SYSCTL_PROC() because I didn't realize that could be used for
the RDTUN case. If SYSCTL_PROC() can be safely used for RDTUN, then let me know
and I can redo the patch.

Thanks for the comments, rick

I don't really grok this part of the code well enough to review whether this is a good idea or not, just some small comments.

kern/vfs_bio.c
251

For RDTUN there's no need for SYSCTL_PROC, that's only if it is to be RWTUN.

1006

I would keep this as a KASSERT(). It should be true due to the logic in the adjust function, but doesn't hurt to assert it as well.

BKVASIZE is used to tune the of clean and buffer maps. I believe it is required to adjust vm_init.c to use tunable instead of the constant, otherwise, things become quite inconsistent. You probably did not noticed this because UFS in default mode have ~90% of its buffers unmapped.

fs/nfs/nfs_commonkrpc.c
99

This is inappropriate way to get access to the variable. Put the declaration into sys/buf.h or sys/param.h.

kern/vfs_bio.c
2924

This line effectively uses BKVASIZE as well, I think it must be converted. And the other use in geteblk() should be converted as well.

Thanks for the comments Kostik. (I had actually caught the cases that I had missed w.r.t. BKVASIZE/BKVAMASK.)
I am testing an updated patch.

However, I am not sure what should be done with BKVASIZE?
(I read the comments in param.h, but they don't seem to give me much insight into what it
should be set to.)
I have changed the patch to set it to maxbcachebuf /4 instead of maxbachebuf /2, but I'll
admit that is only because the default is 1/4th of MAXBCACHEBUF.

Somehow, it seems that it should be changed, but I'm not even sure that it shouldn't just
be left at 16K. (The code in vm_init() seems to set the kva used to nbufs * BKVASIZE, so
it does seem that should be increased when maxbcachebuf is increased?)

Anyone have more insight into what bkvasize should be set to?

Thanks for the comments Kostik. (I had actually caught the cases that I had missed w.r.t. BKVASIZE/BKVAMASK.)
I am testing an updated patch.

However, I am not sure what should be done with BKVASIZE?
(I read the comments in param.h, but they don't seem to give me much insight into what it
should be set to.)
I have changed the patch to set it to maxbcachebuf /4 instead of maxbachebuf /2, but I'll
admit that is only because the default is 1/4th of MAXBCACHEBUF.

Somehow, it seems that it should be changed, but I'm not even sure that it shouldn't just
be left at 16K. (The code in vm_init() seems to set the kva used to nbufs * BKVASIZE, so
it does seem that should be increased when maxbcachebuf is increased?)

Anyone have more insight into what bkvasize should be set to?

I think that with the current UFS default block size of 32K, BKVASIZE should be doubled and set to 32K as well. It is unmapped buffers that mostly make the buffer submap fragmentation non-observable and thus the old value is kept around without much consequences. IMHO you should bump it to 32K and probably keep as is.

Also I suggest you to write to bde, he usually does not look into phabricator.

This version of the patch applies fixes as suggested by jhb@ and kib@.
It sets bkvasize = maxbcachebuf / 4.

However, since most NFS client buffers are maximum size by default,
I think that bufspace and kvabufspace should be about the same.
If this is correct, then bkvasize should be set to maxbcachebuf.
Maybe something like:
#ifdef amd64

bkvasize = maxbcachebuf;

#else

bkvasize = maxbcachebuf / 4;

#endif

Thanks Kostik.

I didn't read your comment before typing the last one.

I will email bde@. If it is like my previous discussion with him on this,
he'll suggest that NFS should only do 16K I/Os, but I'll see what he says.
(AmazonEFS recommends 1M.)

Since both kib@ and bde@ suggested that BKVASIZE did not need to be
changed for larger values of maxbcachecbuf, the tuning of BKVASIZE
has been removed from this patch.

I observed a small (1%) reduction in performance for the test case I
ran when BKVASIZE was left at 16K, but I don't think this indicates a
significant performance issue.
If future testing indicates that BKVASIZE needs to be tuned, an additional
patch can be generated at that time.

Both kib@ and bde@ suggested that BKVASIZE could be increased to
32K, since UFS now uses larger block sizes, but that can be a separate commit,
probably best left to those that know UFS and the buffer cache.

My comments are cosmetic.

kern/vfs_bio.c
871

I suggest to put this printf under bootverbose. Might be, it does not need to be printed at all. If somebody fiddles with the tunable, it must be capable of checking the sysctl value.

1033

Print the value which triggered the assert. perhaps.

This revision is now accepted and ready to land.Jun 3 2017, 3:04 PM

I applied this patch, set

vfs.maxbcachebuf=1048576
kern.ipc.maxsockbuf=4194304

(the former because that's the recommended I/O size for Amazon EFS; the latter because it was necessary in order to make the NFS mount not fail with "No buffer space available"). After a few minutes of buildworld:

Fatal trap 12: page fault while in kernel mode
cpuid = 18; apic id = 20
fault virtual address	= 0xfffffe0c58de0738
fault code		= supervisor write data, page not present
instruction pointer	= 0x20:0xffffffff80afff68
stack pointer	        = 0x28:0xfffffe0f56b031a0
frame pointer	        = 0x28:0xfffffe0f56b03200
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 1234 (ld)
trap number		= 12
panic: page fault
cpuid = 18
time = 1496482447
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0f56b02d80
vpanic() at vpanic+0x19c/frame 0xfffffe0f56b02e00
panic() at panic+0x43/frame 0xfffffe0f56b02e60
trap_fatal() at trap_fatal+0x322/frame 0xfffffe0f56b02eb0
trap_pfault() at trap_pfault+0x62/frame 0xfffffe0f56b02f10
trap() at trap+0x2ad/frame 0xfffffe0f56b030d0
calltrap() at calltrap+0x8/frame 0xfffffe0f56b030d0
--- trap 0xc, rip = 0xffffffff80afff68, rsp = 0xfffffe0f56b031a0, rbp = 0xfffffe0f56b03200 ---
allocbuf() at allocbuf+0x378/frame 0xfffffe0f56b03200
getblk() at getblk+0x48f/frame 0xfffffe0f56b03290
nfs_getcacheblk() at nfs_getcacheblk+0x49/frame 0xfffffe0f56b032e0
ncl_write() at ncl_write+0xd63/frame 0xfffffe0f56b034d0
VOP_WRITE_APV() at VOP_WRITE_APV+0x168/frame 0xfffffe0f56b035f0
vn_write() at vn_write+0x257/frame 0xfffffe0f56b03670
vn_io_fault1() at vn_io_fault1+0x1b0/frame 0xfffffe0f56b037c0
vn_io_fault() at vn_io_fault+0x197/frame 0xfffffe0f56b03840
dofilewrite() at dofilewrite+0xa7/frame 0xfffffe0f56b03890
kern_writev() at kern_writev+0x68/frame 0xfffffe0f56b038e0
sys_write() at sys_write+0x86/frame 0xfffffe0f56b03930
amd64_syscall() at amd64_syscall+0x589/frame 0xfffffe0f56b03ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0f56b03ab0
--- syscall (4, FreeBSD ELF64, sys_write), rip = 0x4cb28a, rsp = 0x7fffffffdac8, rbp = 0x7fffffffdaf0 ---

Should I be adjusting other sysctls as well, or is this a bug?

I applied this patch, set

vfs.maxbcachebuf=1048576
kern.ipc.maxsockbuf=4194304
Should I be adjusting other sysctls as well, or is this a bug?

MAXBCACHEBUF cannot be greater than MAXPHYS.

OK, so with a GENERIC kernel this patch only takes us from a maximum 64kB request size to a maximum 128kB request size?

Is there a reason why MAXPHYS hasn't been increased since... err... January 1998?

I'll make the changes kib@ suggested and add a sanity check for maxbcachebuf to make
sure it is <= MAXPHYS. (I didn't know about this restriction. Since the FreeBSD server only
does 128K, I hadn't tested 1M.)

I'll let others debate if/when MAXPHYS can be increased, since I don't know anything about it.

Thanks for the comments/testing, rick

rmacklem edited edge metadata.

This patch includes the two changes suggested by kib@ and it also
limits maxbcachebuf to MAXPHYS, since that is necessary to avoid crashes.

This revision now requires review to proceed.Jun 4 2017, 1:31 PM
kib added inline comments.
kern/vfs_bio.c
3517

Is there a reason not to use KASSERT() for this check as well ?

This revision is now accepted and ready to land.Jun 5 2017, 10:24 AM

FWIW, testing with Amazon EFS: A straightforward dd throughput test shows a doubling in performance when moving from 64 kB I/Os to 128 kB I/Os -- from 15 MB/s to 30 MB/s. But 'make buildworld' benchmarking doesn't show any significant difference in performance; it looks like while larger I/Os are happening, they're not common enough in buildworld to matter. (In fact, EFS shows ~5M "metadata" operations per hour, but only ~500k writes/hour and ~50k reads/hour; is this an expected distribution?)

In fact, EFS shows ~5M "metadata" operations per hour, but only ~500k writes/hour and ~50k reads/hour; is this an expected distribution?

Match this with the rpc counts from client nfsstat(8), I think it would be clearer. I am completely unsurprised by the writes vs reads count for the builworld, I think bde demostrated similar numbers. this is just a consequence of working buffer cache and the fact that we commit on close.

5M metadata, if my guess is correct, should be mostly component lookups, but as I asked above, confirm this by the client statistics.

Yep, and lots of Getattr's would be my guess, but "nfsstat -E -c" should tell you.
When I've looked (quite a while ago), I've seen lots of small writes generated
by the loader. (There is the "nocto" option, that sometimes improves software
build times.)

Yes, that's it -- lots of getattrs. I wonder if EFS was optimized for "open a file and use it over an extended duration" usage patterns rather than buildworld's reading and writing of lots of separate files.

Yes, that's it -- lots of getattrs. I wonder if EFS was optimized for "open a file and use it over an extended duration" usage patterns rather than buildworld's reading and writing of lots of separate files.

It is not reads and writes which hit the metadata RPCs but open/close calls. As Rick suggested, try nocto mount option.

Since you mentioned Opens (maybe not intending to refer to an NFSv4 Open operation), you'll probably see
lots of these too.
For a kernel build, I see 500K of them.
I tried a patch that did an Open as the last component for a VOP_LOOKUP(). It worked, but only reduced the
Open RPC count by about 20K. The rest appear to be re-Opens where the name cache avoids the VOP_LOOKUP().

  • Delegations will reduce the # of Opens, but I recall AmazonEFS doesn't support them.
  • I have thought about delaying Close operations until the Open appears to be "inactive" for a few seconds, but haven't coded this yet. (Currently, the Open is Closed in VOP_INACTIVE().)
    • The coding of this is a little more interesting than it sounds, but I think it can be done fairly easily by having the client hold a ref cnt. on the vnode for an Open and then vrele() it when the Open hasn't been used for a few seconds. Coding this to try it is on my "to do" list.
      • It will result in the client having more Open structures...

Maybe this should be a thread on an email list? (It seems a little off topic here?)

This revision was automatically updated to reflect the committed changes.