Page MenuHomeFreeBSD

nfsclient: Protest loudly when GETATTR responses are invalid
ClosedPublic

Authored by cem on Aug 5 2015, 12:23 AM.
Tags
None
Referenced Files
F103575033: D3304.id7688.diff
Tue, Nov 26, 4:37 PM
Unknown Object (File)
Thu, Nov 7, 12:32 PM
Unknown Object (File)
Oct 2 2024, 5:32 AM
Unknown Object (File)
Sep 28 2024, 9:31 AM
Unknown Object (File)
Sep 28 2024, 2:09 AM
Unknown Object (File)
Sep 25 2024, 6:30 AM
Unknown Object (File)
Sep 25 2024, 1:13 AM
Unknown Object (File)
Sep 24 2024, 12:34 PM
Subscribers

Details

Summary

Certain WAN "accelerators" attempt to cache NFS GETATTR traffic, but actually
corrupt it (e.g., responding to requests with attributes for totally different
files). As one might guess, this leads to file corruption from the client's
perspective.

Warn very verbosely when this is detected to help users track down this sort of
problem. Linux' NFS client has a similar warning.

Adds a sysctl/tunable (vfs.nfs.fileid_maxwarnings) to configure the
quantity of warnings; default to 10. (Zero disables; -1 is unlimited.)

Adds a failpoint to aid in validating the warning / behavior with a
non-broken server. Use something like:

sysctl 'debug.fail_point.nfscl_force_fileid_warning=10%return(1)'
Test Plan

The error value could potentially be ESTALE or EBADRPC instead of EIDRM. If you
have strong feelings, let me know.

Example kfail_point usage:

$ sysctl 'debug.fail_point.nfscl_force_fileid_warning=10%return(1)'
$ find ./my/nfs/mount >/dev/null 2>/dev/null &
$ dmesg | tail -n20
...
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0, got 0x783134. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0x7602cc, got 0x7602cc. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0, got 0x763320. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0, got 0x802336. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0x783134, got 0x783134. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0, got 0x1544. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0, got 0x1544. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0x783134, got 0x783134. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0, got 0x226e0. (Are you using a broken WAN accelerator?)
newnfs: server '192.168.0.187' error: fileid changed. fsid 0:0: expected fileid 0, got 0xce2545. (Are you using a broken WAN accelerator?)
newnfs: Logged 10 times about fileid corruption; going quiet to avoid spamming logs excessively. (Limit is: 10).
$ sudo sysctl 'debug.fail_point.nfscl_force_fileid_warning=off'

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

cem retitled this revision from to nfsclient: Protest loudly when GETATTR responses are invalid.
cem updated this object.
cem edited the test plan for this revision. (Show Details)
cem added reviewers: rmacklem, pfg, markj.
markj edited edge metadata.
markj added inline comments.
sys/fs/nfsclient/nfs_clport.c
398 ↗(On Diff #7659)

Should this be rate-limited? See ppsratecheck().

This revision is now accepted and ready to land.Aug 5 2015, 12:51 AM
sys/fs/nfsclient/nfs_clport.c
398 ↗(On Diff #7659)

It should probably be limited — I'm just not sure *rate*-limiting makes sense here.

We could log the first N times, and then log that we're going to stop warning about it.

I'm imagining N on the order of 10.

ngie added inline comments.
sys/fs/nfsclient/nfs_clport.c
398 ↗(On Diff #7659)

10 would be a good default, but I'd make it a RW sysctl too so the administrator can adjust the throttle count before it goes silent.

400 ↗(On Diff #7659)

I'd leave the WAN accelerator comment out here. I'd say something more like "; check your network link".

sys/fs/nfsclient/nfs_clport.c
398 ↗(On Diff #7659)

That seems like overkill. But I'm imagining maybe someone wants to disable the warning entirely, and it's one sysctl either way.

400 ↗(On Diff #7659)

No, this only happens with broken servers (old NetApp filers from ~2006 did this, apparently) or broken middleboxes. "network link" isn't really helpful.

Middlebox is slightly more generic than WAN accelerator and a (?)relatively common term, that could work. But, I'm inclined to leave it as-is. Most people will not hit this, and when they do, they probably want to know about it.

cem edited edge metadata.

Add logspam limiting. Default to 10; tunable/sysctlable.

Add kfail_point for validation.

This revision now requires review to proceed.Aug 5 2015, 6:43 PM
cem marked 2 inline comments as done.Aug 5 2015, 6:44 PM
cem edited the test plan for this revision. (Show Details)
cem edited edge metadata.
rmacklem edited edge metadata.

Basically, this looks ok to me. I'd like to see a comment explaining
when/how this will occur is as much detail as you know. I'd like to
avoid email questions along the lines of "how do I fix this on FreeBSD?".
I'd also like to see "BROKEN NFS SERVER OR MIDDLEWARE" in the
log entry (and, yes, in capitals), so that it is clear that the client
isn't the culprit. (ditto above)
However, I've accepted the revision, so do whatever you feel is
appropriate.
I'm guessing you declared "ncl_fileid_nwarnings" volatile, but
didn't use an atomic to increment it, since you didn't care if it
ended up "out by one", but didn't want it to get messed up a lot
by caching in various cpus? (If this is correct, it seems fine to me.)

I also have to mention that I've never heard of this before and
I found it hard to believe that any shipped products would actually
be this broken, but...I have to believe you.;-)

This revision is now accepted and ready to land.Aug 5 2015, 9:53 PM

Basically, this looks ok to me. I'd like to see a comment explaining
when/how this will occur is as much detail as you know.

Ok, will add that (in the code, right?).

I'd like to
avoid email questions along the lines of "how do I fix this on FreeBSD?".
I'd also like to see "BROKEN NFS SERVER OR MIDDLEWARE" in the
log entry (and, yes, in capitals), so that it is clear that the client
isn't the culprit. (ditto above)

Ok.

However, I've accepted the revision, so do whatever you feel is
appropriate.
I'm guessing you declared "ncl_fileid_nwarnings" volatile, but
didn't use an atomic to increment it, since you didn't care if it
ended up "out by one", but didn't want it to get messed up a lot
by caching in various cpus? (If this is correct, it seems fine to me.)

Yeah. It doesn't matter if we log an extra 100% of nwarnings times. Order of magnitude best-effort is my only concern with it.

I also have to mention that I've never heard of this before and
I found it hard to believe that any shipped products would actually
be this broken, but...I have to believe you.;-)

Yeah. It's kind of astoundingly bad. It's a WAN appliance by one of the top networking vendors... you've heard of them.

cem edited edge metadata.
cem updated this object.

Add the comment and all-caps warnings

This revision now requires review to proceed.Aug 5 2015, 10:16 PM
This revision was automatically updated to reflect the committed changes.
head/sys/fs/nfsclient/nfs_clport.c
362

Should this be static/set to 0 by default?

Also, should this be boolean instead of int?

372

Uh... you're not actually returning if off is != 0...

440–441

I usually do #ifdef INVARIANTS with something like this to not affect non-debug kernels, but there might be an option for compiling the nfs client with debug support that this should be keyed off of.

455–456

Who's "We"?

It might be better to remove the actor in this comment -- otherwise it's going to cause confusion about who the affected parties are, i.e.

"This was observed reliably when doing a parallelized buildworld with clang"

head/sys/fs/nfsclient/nfs_clport.c
362

No, it is initialized below.

In the words of bde,

Initializations in declarations are specifically forbidden in style(9)

372

This is wholly intentional.