
Fix the NFSv4.1 client for recovery from NFS4ERR_BAD_SESSION server failures
ClosedPublic

Authored by rmacklem on Dec 10 2016, 2:37 AM.

Details

Summary

Testing by cperciva@ of the NFSv4.1 client against an Amazon EFS server found
several problems during recovery from NFS4ERR_BAD_SESSION failures. Normally
NFS4ERR_BAD_SESSION failures are rare for an NFSv4.1 server, but this service
fails frequently in this way.
Briefly, the problems fixed are:

  • If more than 32 processes were attempting to do RPCs at the time of failure, some could be stuck forever waiting for a session slot on the failed session.
  • If the reply to an RPC that was successful on the old session just before it failed was processed after the new session was created, it bogusly updated the new session with the slot used by the old session, corrupting it.
  • Non-state handling RPCs (ones not using ClientIDs or StateIDs) would fail when they got NFS4ERR_BAD_SESSION instead of retrying the RPC with a new session.
  • Handling of the session list was racy and could have failed if the pointer was used just when a new session was being added to the list. This patch protects all uses of this TAILQ list with the NFSLOCKMNT() mutex.
  • RPCs that use ClientIDs/StateIDs no longer initiate recovery, since the code in the RPC handler (newnfs_request()) initiates recovery whenever an NFS4ERR_BAD_SESSION is received.
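The session-list locking and the stale-reply problem described in the list above can be illustrated with a small userland sketch. This is not the code from the patch: the sketch_* names, the slot count, and the use of a pthread mutex in place of NFSLOCKMNT() are all assumptions made for illustration. It shows one plausible approach, in which every access to the session list happens under one mutex and a reply is applied only if its session ID matches the current session.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/queue.h>

#define SKETCH_SESSIONID_LEN 16	/* NFSv4.1 session IDs are 16 bytes */
#define SKETCH_NSLOTS 32	/* hypothetical slot count for this sketch */

struct sketch_session {
	TAILQ_ENTRY(sketch_session) link;
	uint8_t id[SKETCH_SESSIONID_LEN];
	uint32_t slot_seq[SKETCH_NSLOTS];	/* last sequence# seen per slot */
	int defunct;				/* set once a newer session exists */
};

TAILQ_HEAD(sketch_sesshead, sketch_session);

struct sketch_mount {
	pthread_mutex_t lock;			/* stands in for NFSLOCKMNT() */
	struct sketch_sesshead sessions;	/* list head = current session */
};

void
sketch_mount_init(struct sketch_mount *mp)
{
	pthread_mutex_init(&mp->lock, NULL);
	TAILQ_INIT(&mp->sessions);
}

/* Make 'sep' the current session; the previous head becomes defunct. */
void
sketch_install_session(struct sketch_mount *mp, struct sketch_session *sep)
{
	struct sketch_session *old;

	pthread_mutex_lock(&mp->lock);
	old = TAILQ_FIRST(&mp->sessions);
	if (old != NULL)
		old->defunct = 1;
	TAILQ_INSERT_HEAD(&mp->sessions, sep, link);
	pthread_mutex_unlock(&mp->lock);
}

/*
 * Record a reply's slot sequence number.  Returns 1 if the update was
 * applied, 0 if it was ignored because the reply belongs to a session
 * other than the current one (for example, the old failed session).
 */
int
sketch_reply_done(struct sketch_mount *mp, const uint8_t *sessid,
    unsigned slot, uint32_t seq)
{
	struct sketch_session *cur;
	int applied = 0;

	assert(slot < SKETCH_NSLOTS);
	pthread_mutex_lock(&mp->lock);
	cur = TAILQ_FIRST(&mp->sessions);
	if (cur != NULL &&
	    memcmp(cur->id, sessid, SKETCH_SESSIONID_LEN) == 0) {
		cur->slot_seq[slot] = seq;
		applied = 1;
	}
	pthread_mutex_unlock(&mp->lock);
	return (applied);
}
```

In this sketch, a late reply carrying the old session's ID is dropped rather than written into the new session's slot table, and no caller ever reads the list head without holding the lock.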
Test Plan

cperciva@ has been doing extensive testing on several patches leading up to this one.
I am doing tests via simulated failures (manual reboots of the NFSv4.1 server) with the
FreeBSD and Linux servers for both NFSv4.0 and NFSv4.1.

Event Timeline

rmacklem retitled this revision to Fix the NFSv4.1 client for recovery from NFS4ERR_BAD_SESSION server failures.
rmacklem updated this object.
rmacklem edited the test plan for this revision.
rmacklem added a reviewer: cperciva.

Added a small fix so that mkdir won't fail when it gets an NFS4ERR_BAD_SESSION and will
instead loop to get a new session.
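The "loop to get a new session" behaviour can be sketched roughly as follows. This is an illustration only, not the code in the patch: the sketch_* names, the callback signatures, the retry bound, and the fake server used to exercise the loop are all hypothetical. The error value 10052 is the one RFC 5661 assigns to NFS4ERR_BADSESSION.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NFS4ERR_BADSESSION 10052	/* value from RFC 5661 */
#define SKETCH_MAX_RETRIES 5		/* arbitrary bound for this sketch */

typedef int (*sketch_rpc_fn)(void *arg);	/* performs one RPC attempt */
typedef int (*sketch_recover_fn)(void *arg);	/* sets up a new session */

/*
 * Retry an RPC whenever it fails with a bad-session error, creating a
 * new session before each retry.  Any other result is returned as-is.
 */
int
sketch_rpc_with_retry(sketch_rpc_fn do_rpc, sketch_recover_fn new_session,
    void *arg)
{
	int error, tries;

	assert(do_rpc != NULL && new_session != NULL);
	for (tries = 0; tries <= SKETCH_MAX_RETRIES; tries++) {
		error = do_rpc(arg);
		if (error != SKETCH_NFS4ERR_BADSESSION)
			return (error);
		error = new_session(arg);
		if (error != 0)
			return (error);
	}
	return (SKETCH_NFS4ERR_BADSESSION);
}

/* A fake server that rejects the first few attempts (for demonstration). */
struct sketch_fake_server {
	int bad_replies_left;
	int sessions_created;
};

int
sketch_fake_mkdir(void *arg)
{
	struct sketch_fake_server *srv = arg;

	if (srv->bad_replies_left > 0) {
		srv->bad_replies_left--;
		return (SKETCH_NFS4ERR_BADSESSION);
	}
	return (0);
}

int
sketch_fake_new_session(void *arg)
{
	struct sketch_fake_server *srv = arg;

	srv->sessions_created++;
	return (0);
}
```

Calling sketch_rpc_with_retry(sketch_fake_mkdir, sketch_fake_new_session, &srv) against a fake server that fails a couple of times succeeds once a fresh session is in place. The bound on retries is a choice made for the sketch so that it always terminates.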

The patch I was testing eliminated a variety of hangs, panics, I/O errors, and suspicious error messages, without introducing any new problems; but it wasn't exactly the same as this patch here. Not sure if the differences are significant...

fs/nfs/nfs_commonsubs.c
830

I committed the 'fileid > 32bits' printf changes in r308708, so I'm not quite sure what they're doing here.

fs/nfsclient/nfs_clstate.c
2500–2516

This bit is also different from the patch I tested; again, I'm not sure about the significance.

fs/nfsclient/nfs_clvfsops.c
1323

This bit wasn't in the patch I tested. Did it slip in by accident? I have no idea what it does.

I have no idea how to do inline comments, but responding to cperciva@'s three comments:
#1 and #3 are code already in head. (I mistakenly did the diff against old code without it.)
#2 is for Data Servers (pNFS only), which cperciva@ isn't using.
--> So, for the purposes of cperciva@'s testing, it doesn't matter.

I'll try and update the patch so that #1 and #3 aren't there.

rmacklem edited edge metadata.

Redid the diff so that no code already in head is in it. (Basically cperciva@'s 1st and 3rd items.)

This revision was automatically updated to reflect the committed changes.