The kernel RPC code, which we inherited in part from Sun and NetBSD, contains a hard-coded limit that prevents more than 45 MB of RPC requests from being outstanding at any one time. Modern machines can handle far more than this. Because WRITE RPCs are large and all other RPCs are small, this limit particularly impacts write performance, but once the throttle is engaged -- a single fast client can cause this -- all RPCs from all clients are left to rot in the incoming socket buffers, causing intolerable performance even for small requests like LOOKUP or GETATTR. (This means that when a server is throttled, even mounting the filesystem often fails.) The original code had several integer overflow bugs which were particularly apparent once the old 45 MB limit was removed.
Unit Tests Skipped
This simple change actually failed even worse than the original code. I suspect signedness problems surfaced by the switch from gcc 4.2 to clang (we just upgraded from 9.3 to 10.1 on our NFS srevers), and I'm preparing a new diff that tries to deal with that.
In the previous diff, arithmetic overflow resulted in incorrect values being computed. This version changes all of the data types to be longs rather than ints, as is more appropriate for 64-bit machines. (I haven't compile-tested on an ILP32 machine to see if it even still builds -- I suspect not, and I'll need to figure out a way to work around the lack of atomic 64-bit ops on 32-bit architectures.)
One important change, which I think is actually the important part, makes the type of "sz" in svc_run_internal() be signed.
clang gives a tautological-compare warning here on LP64 platforms (nmbclusters is an int). What's the test I really should be doing?
Previously, sz was a size_t, which is unsigned long, and -sz (still unsigned long) was passed to a signed int formal parameter, resulting in overflow and undefined behavior. I suspect gcc 4.2 did the "obvious" thing and clang generates different code here that doesn't.
Here's how GCC implemented it in my 9.3 kernel:
0xffffffff807bd50e <+1454>: mov %r13d,%esi 0xffffffff807bd511 <+1457>: mov %r14,%rdi 0xffffffff807bd514 <+1460>: neg %esi 0xffffffff807bd516 <+1462>: callq 0xffffffff807bc820 <svc_change_space_used>
It actually performs the negation in the parameter type (int) rather than the value type (size_t). Of course any situation in which the result would be different involves undefined behavior so GCC was not in error here (the code was still wrong).
Thinking that from the behavior I'm seeing, this should perhaps be hoisted up into the loop, so that the space-in-use is immediately credited when the request competes. We don't batch this adjustment on insert, so I don't think it should be batched on dequeue either. Thoughts?
It looks ok to me. I might have used int64_t instead of long, so that they're
64bits on all arches, but I don't think it will matter, since nmbclusters won't
be that large for 32bit arches, I think?
I don't think we can expect to have a 64-bit atomic_fetchadd on ILP32 arches, whereas long (by definition) will be available there, and (perhaps with the exception of i386+PAE) I don't think nmbclusters can ever be large enough for it to matter on those arches and still have a functioning system. Next revision is going to remove the belt-and-suspenders overflow check.
Remove the redundant overflow check and expand the comment to make it clear what's going on when we compute the request buffer limits. This should be the last iteration of this patch. This version isn't running on a production server yet; hopefully that will happen on Monday.
This actually hurts on a busy server, because the request-processing loop can livelock until the throttling limit has been reached. On one server with heavy write activity, I saw the request_space_used increasing steadily until it hit the limit; then the throttle would engage, the request-processing loops would all run to completion, and request_space_used would go back to zero. Lather, rinse, repeat, ad infinitum. This doesn't happen when traffic is light because the request-processing loop hits the end of the queue before more work can be appended to it.