Page MenuHomeFreeBSD

Remove antique hard-coded limit and integer overflow bugs in kernel RPC request throttling code
ClosedPublic

Authored by wollman on Mar 28 2015, 8:02 PM.

Details

Summary

The kernel RPC code, which we inherited in part from Sun and NetBSD, contains a hard-coded limit that prevents more than 45 MB of RPC requests from being outstanding at any one time. Modern machines can handle far more than this. Because WRITE RPCs are large and all other RPCs are small, this limit particularly impacts write performance, but once the throttle is engaged -- a single fast client can cause this -- all RPCs from all clients are left to rot in the incoming socket buffers, causing intolerable performance even for small requests like LOOKUP or GETATTR. (This means that when a server is throttled, even mounting the filesystem often fails.) The original code had several integer overflow bugs which were particularly apparent once the old 45 MB limit was removed.

Test Plan

Will shortly be running this code on a production fileserver.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

wollman updated this revision to Diff 4481.Mar 28 2015, 8:02 PM
wollman retitled this revision from to Remove antique hard-coded limit in kernel RPC request throttling code.
wollman updated this object.
wollman edited the test plan for this revision. (Show Details)
wollman added reviewers: network, rmacklem, dfr, mav.
wollman set the repository for this revision to rS FreeBSD src repository.
wollman updated this object.
dfr accepted this revision.Mar 28 2015, 8:29 PM
dfr edited edge metadata.

Sounds good to me.

This revision is now accepted and ready to land.Mar 28 2015, 8:29 PM

This simple change actually failed even worse than the original code. I suspect signedness problems surfaced by the switch from gcc 4.2 to clang (we just upgraded from 9.3 to 10.1 on our NFS srevers), and I'm preparing a new diff that tries to deal with that.

wollman updated this revision to Diff 4482.Mar 28 2015, 8:44 PM
wollman edited edge metadata.

In the previous diff, arithmetic overflow resulted in incorrect values being computed. This version changes all of the data types to be longs rather than ints, as is more appropriate for 64-bit machines. (I haven't compile-tested on an ILP32 machine to see if it even still builds -- I suspect not, and I'll need to figure out a way to work around the lack of atomic 64-bit ops on 32-bit architectures.)

One important change, which I think is actually the important part, makes the type of "sz" in svc_run_internal() be signed.

This revision now requires review to proceed.Mar 28 2015, 8:44 PM
wollman added inline comments.Mar 28 2015, 8:49 PM
sys/rpc/svc.c
120 ↗(On Diff #4482)

clang gives a tautological-compare warning here on LP64 platforms (nmbclusters is an int). What's the test I really should be doing?

1261 ↗(On Diff #4482)

Previously, sz was a size_t, which is unsigned long, and -sz (still unsigned long) was passed to a signed int formal parameter, resulting in overflow and undefined behavior. I suspect gcc 4.2 did the "obvious" thing and clang generates different code here that doesn't.

wollman added inline comments.Mar 28 2015, 9:11 PM
sys/rpc/svc.c
1261 ↗(On Diff #4482)

Here's how GCC implemented it in my 9.3 kernel:

0xffffffff807bd50e <+1454>:  mov    %r13d,%esi
0xffffffff807bd511 <+1457>:  mov    %r14,%rdi
0xffffffff807bd514 <+1460>:  neg    %esi
0xffffffff807bd516 <+1462>:  callq  0xffffffff807bc820 <svc_change_space_used>

It actually performs the negation in the parameter type (int) rather than the value type (size_t). Of course any situation in which the result would be different involves undefined behavior so GCC was not in error here (the code was still wrong).

wollman edited the test plan for this revision. (Show Details)Mar 28 2015, 9:13 PM
wollman edited edge metadata.
wollman added inline comments.Mar 28 2015, 9:23 PM
sys/rpc/svc.c
1261 ↗(On Diff #4482)

Thinking that from the behavior I'm seeing, this should perhaps be hoisted up into the loop, so that the space-in-use is immediately credited when the request competes. We don't batch this adjustment on insert, so I don't think it should be batched on dequeue either. Thoughts?

wollman updated this revision to Diff 4489.Mar 28 2015, 9:50 PM

This version eliminates the variable "sz" in svc_run_internal() entirely by not batching updates to the available request space.

wollman retitled this revision from Remove antique hard-coded limit in kernel RPC request throttling code to Remove antique hard-coded limit and integer overflow bugs in kernel RPC request throttling code.Mar 28 2015, 9:51 PM
wollman updated this object.
rmacklem accepted this revision.Mar 29 2015, 3:12 AM
rmacklem edited edge metadata.

It looks ok to me. I might have used int64_t instead of long, so that they're
64bits on all arches, but I don't think it will matter, since nmbclusters won't
be that large for 32bit arches, I think?

This revision is now accepted and ready to land.Mar 29 2015, 3:12 AM
In D2165#20, @rmacklem wrote:

It looks ok to me. I might have used int64_t instead of long, so that they're
64bits on all arches, but I don't think it will matter, since nmbclusters won't
be that large for 32bit arches, I think?

I don't think we can expect to have a 64-bit atomic_fetchadd on ILP32 arches, whereas long (by definition) will be available there, and (perhaps with the exception of i386+PAE) I don't think nmbclusters can ever be large enough for it to matter on those arches and still have a functioning system. Next revision is going to remove the belt-and-suspenders overflow check.

wollman updated this revision to Diff 4492.Mar 29 2015, 6:00 AM
wollman edited the test plan for this revision. (Show Details)
wollman edited edge metadata.
wollman set the repository for this revision to rS FreeBSD src repository.

Remove the redundant overflow check and expand the comment to make it clear what's going on when we compute the request buffer limits. This should be the last iteration of this patch. This version isn't running on a production server yet; hopefully that will happen on Monday.

This revision now requires review to proceed.Mar 29 2015, 6:00 AM
rmacklem accepted this revision.Mar 30 2015, 12:12 AM
rmacklem edited edge metadata.

Yep, this looks ok to me.

This revision is now accepted and ready to land.Mar 30 2015, 12:12 AM

NB: This diff is relative to 10.1. Context lines have changed in 11-current so the diff won't apply directly.

wollman updated this revision to Diff 4505.Mar 30 2015, 2:38 AM
wollman edited edge metadata.

Previous patch had a use-after-free bug (that should have been obvious). This version reintroduces the "sz" variable to prevent this.

This revision now requires review to proceed.Mar 30 2015, 2:38 AM
rmacklem accepted this revision.Mar 30 2015, 3:02 AM
rmacklem edited edge metadata.
This revision is now accepted and ready to land.Mar 30 2015, 3:02 AM

This code is running on a production fileserver now. Waiting for the all-arches compile test before I'm ready to commit.

mav accepted this revision.Mar 31 2015, 9:22 AM
mav edited edge metadata.

It looks good to me.

svc_change_space_used() call out of the loop was kind of optimization when modifying variable shared by many threads, but that may be not so important.

In D2165#39, @mav wrote:

svc_change_space_used() call out of the loop was kind of optimization when modifying variable shared by many threads, but that may be not so important.

This actually hurts on a busy server, because the request-processing loop can livelock until the throttling limit has been reached. On one server with heavy write activity, I saw the request_space_used increasing steadily until it hit the limit; then the throttle would engage, the request-processing loops would all run to completion, and request_space_used would go back to zero. Lather, rinse, repeat, ad infinitum. This doesn't happen when traffic is light because the request-processing loop hits the end of the queue before more work can be appended to it.

wollman closed this revision.Apr 1 2015, 12:46 AM
wollman updated this revision to Diff 4559.

Closed by commit rS280930 (authored by @wollman).