Testing done by cperciva@ of the NFSv4.1 client against an Amazon EFS server found
several problems during recovery from NFS4ERR_BAD_SESSION failures. Normally
NFS4ERR_BAD_SESSION failures are a rare occurrence for an NFSv4.1 server, but
this service fails frequently in this way.
Briefly, the problems fixed are:
- If more than 32 processes were attempting to do RPCs at the time of failure, some could be stuck forever waiting for a session slot on the failed session.
- If the reply to an RPC that succeeded on the old session just before it failed was processed after the new session was created, the reply incorrectly updated the new session with the slot used by the old session, corrupting the new session.
- Non-state handling RPCs (ones not using ClientIDs or StateIDs) would fail when they got NFS4ERR_BAD_SESSION instead of retrying the RPC with a new session.
- Handling of the session list was racy and could have failed if the pointer was used just when a new session was being added to the list. This patch protects all use of this TAILQ_LIST with the NFSLOCKMNT() mutex.
- RPCs that use ClientIDs/StateIDs no longer initiate recovery, since the code in the RPC handler (newnfs_request()) initiates recovery whenever an NFS4ERR_BAD_SESSION is received.
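
To illustrate the stale-reply problem in the second item, here is a minimal user-space sketch of one way such a guard could look. It is not the code in this patch: struct session, session_update_slot(), SESSIONID_LEN and NSLOTS are hypothetical names invented for the example, and the slot count of 32 is an assumption. The idea is only that a reply carrying the old session's ID must not touch the slot table of the session that replaced it.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SESSIONID_LEN	16	/* NFSv4.1 session IDs are 16 bytes */
#define NSLOTS		32	/* assumed slot count, for illustration only */

/* Hypothetical, simplified stand-in for a client session. */
struct session {
	uint8_t		id[SESSIONID_LEN];
	uint32_t	slotseq[NSLOTS];	/* last sequence# seen per slot */
};

/*
 * Record a reply's slot sequence number only if the reply belongs to
 * the session that is current now.  Returns true if the slot table was
 * updated, false if the reply was ignored.
 */
static bool
session_update_slot(struct session *cur, const uint8_t *reply_sessid,
    uint32_t slot, uint32_t seq)
{

	if (slot >= NSLOTS)
		return (false);
	if (memcmp(cur->id, reply_sessid, SESSIONID_LEN) != 0)
		return (false);	/* stale reply from the replaced session */
	cur->slotseq[slot] = seq;
	return (true);
}
```

In the kernel, a check like this would also have to run while holding whatever lock protects the session, so that the comparison and the update are atomic with respect to session replacement.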