
add a "nconnect" mount option to the NFS client
Closed, Public

Authored by rmacklem on Jul 1 2021, 1:34 AM.

Details

Summary

Linux has had an "nconnect" NFS mount option for some time.
It specifies that N (up to 16) TCP connections are to be created for a mount.

A discussion on freebsd-net@ indicated that this could improve client<-->server
network bandwidth, if either the client or the server has one of the following:

  • multiple network ports aggregated together with lagg/lacp.
  • a fast NIC that is using multiple queues

It does result in using more IP port numbers and might increase the peak load a single client places on the server.
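
For example, a mount that spreads its I/O over four TCP connections would look something like this (hypothetical server name and paths; the other options select an NFSv4.1 mount, which nconnect ends up requiring, as discussed later in this review):

  # mount -t nfs -o nfsv4,minorversion=1,nconnect=4 nfsserver:/export /mnt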

Test Plan

Testing of mount attempts with various valid and invalid values
for nconnect. For valid mounts, look at a packet capture in
Wireshark to verify that the N TCP connections are being used
round robin.

Also test network partitioning, to verify that the N TCP connections
are re-established after the network partition heals.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

rmacklem created this revision.

This comment was embedded in a recent email message on linux-nfs@vger.kernel.org.
Although other posters felt more evidence was needed to determine this,
it at least suggests that separating the large RPC messages from the small ones may
improve performance under certain circumstances.

The original issue described was how a process doing heavy read/write on the
client could slow another process trying to do heavy metadata
operations (like walking the filesystem). Using a different mount to
the same multi-homed server seems to help a lot (probably because of
the independent slot table).

This version of the patch sends all RPCs that use small RPC messages
on the original TCP connection and sends the RPCs (Read/Readdir/Write)
with large messages on the additional TCP connections in a round robin
fashion.
nconnect=2 - separates Read/Readdir/Write onto a separate TCP connection
from the rest of the RPCs. Won't help w.r.t. aggregate network
bandwidth, but will avoid small RPCs being delayed by large
RPC messages.

nconnect=3 or more - separates Read/Readdir/Write onto the N-1 additional
TCP connections, allowing use of the aggregate network bandwidth
provided by lagg/lacp aggregated NICs or a fast NIC using
multiple queues. (In both cases, traffic for a TCP connection is
normally pinned to a NIC or a queue for a NIC.)
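
Roughly, the dispatch just described amounts to the following (a sketch only, not the literal diff, using this version's field names; procnum is the RPC procedure number passed to newnfs_request() and the surrounding code is omitted):

  /*
   * Large-message RPCs go to the additional connections, round robin;
   * everything else stays on the mount's original connection.
   * nm_nconnect is the total connection count (0 if the option was not
   * given) and nm_aconn[] holds the krpc handles for the extra connections.
   */
  if (nmp->nm_nconnect > 0 &&
      (procnum == NFSPROC_READ || procnum == NFSPROC_READDIR ||
       procnum == NFSPROC_WRITE)) {
          nextconn = nmp->nm_nextnconn;
          nmp->nm_nextnconn = (nextconn + 1) % (nmp->nm_nconnect - 1);
          clipp = &nmp->nm_aconn[nextconn];
  }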

I considered making "separate Read/Readdir/Write from RPCs with small messages"
a separate mount option, but felt that adding one new mount option was more than
enough for now.

This patch depends on the property that the NFS
server sends the RPC reply on the same TCP connection
as the one the request was received on. Although the
FreeBSD server always does this, it is only required
for NFSv4.1/4.2 (RFC 5661).

The other reason for not supporting "nconnect" for
NFSv3 and NFSv4.0 is that they use the DRC (duplicate
request cache) on servers, and "nconnect" might break
the DRC, depending upon its implementation.

As such, I've added checks to limit "nconnect" to
NFSv4.1/4.2 mounts only.
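
A sketch of the kind of check this implies (placeholder variable names; the error handling is paraphrased, not copied from the diff, and the 1..16 range reflects the final behavior settled later in this review):

  /*
   * Sketch: "nconnect" must be in range and is only accepted for
   * NFSv4.1/4.2 mounts.  nconnect_arg and is_nfsv41 are placeholders
   * for what the option parser collects.
   */
  if (nconnect_arg < 1 || nconnect_arg > 16) {
          vfs_mount_error(mp, "illegal nconnect value");
          error = EINVAL;
          goto out;
  }
  if (nconnect_arg > 1 && !is_nfsv41) {
          vfs_mount_error(mp, "nconnect requires NFSv4.1/4.2");
          error = EINVAL;
          goto out;
  }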

markj added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
885

Why READDIR but not READDIRPLUS?

sys/fs/nfsclient/nfs_clvfsops.c
1204

Am I right that a mount -u which tries to modify nconnect is effectively ignored? Should it be an explicit error? Or is it worth trying to handle that?

Add NFSPROC_READDIRPLUS to the list of RPCs
with large messages as suggested by markj@.

rmacklem added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
885

Good catch, thanks.

sys/fs/nfsclient/nfs_clvfsops.c
1204

The tradition for "mount -u" for the NFS client is to silently ignore options
that cannot be updated.
See lines 1234-1249 for a list of option flags that get ignored. There are also
a bunch of others, like "minorversion" and "nconnect" (as you noted).

This tradition has been around for a long time, predating the new NFS client.
(Although the list is much shorter there, it exists in FreeBSD 4. I didn't look further back.)

markj added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
652

It looks like we are assuming here that nm_nextnconn won't change between here and the change below, but I can't see any synchronization to guarantee it. In particular, isn't it possible for another thread to perform an RPC and advance nm_nextnconn after we call newnfs_connect() but before we issue the RPC? If that happens immediately after the mount is created or after a disconnect, nmp->nm_aconn[nmp->nm_nextnconn] can be NULL.
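
To spell out the interleaving I'm worried about (an illustration only, with made-up local names, not code from the diff):

  idx = nmp->nm_nextnconn;        /* unlocked read of the shared index          */
  /* ... newnfs_connect() runs here and may sleep ...                           */
  /*
   * Meanwhile another thread issues an RPC and advances nm_nextnconn.  Right
   * after the mount is created, or after a disconnect, the slot the advanced
   * index now points at has not been set up yet, so:
   */
  client = nmp->nm_aconn[nmp->nm_nextnconn];      /* can be NULL here           */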

sys/fs/nfsclient/nfs_clvfsops.c
1204

I see, thanks for the explanation.

sys/fs/nfsclient/nfsmount.h
87

IMO it would be a bit simpler to make this a count of the additional connections, nm_naconn. Right now it's either 0 if there are no additional connections, or it's the total number of connections. This means that the code tests nm_nconnect > 0 to determine whether the mount's configured to make more than one connection, and it also means that we have to subtract 1 in most places to get the number we actually want.
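
Something like this, roughly (a sketch of the suggested shape, not the committed layout; CLIENT is the krpc client handle type and NCONN_MAX is a stand-in for whatever bound the patch uses, 15 additional connections for an nconnect of 16):

  /* sys/fs/nfsclient/nfsmount.h -- rough shape of the suggestion */
  int     nm_naconn;              /* # of *additional* connections, 0 = none  */
  int     nm_nextnconn;           /* next additional connection, round robin  */
  CLIENT  *nm_aconn[NCONN_MAX];   /* krpc handles, created lazily             */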

rmacklem marked an inline comment as done.

If the last catch was good, this one is a great catch.
I'm a little embarrassed that I didn't think to make
manipulation of nm_nextnconn MP-safe.

Anyhow, this patch fixes this. Setting of nmp->nm_aconn[]
was already MP-safe, since setting "*clipp" is protected by
the nr_mtx mutex.

I've added comments to nfsmount.h to indicate which mutex
is used for the new nm_XXX fields this patch adds.

rmacklem added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
652

Great catch. The code now sets a local variable "nextconn" and
updates nm_nextnconn with the NFS mount point mutex held.
The setting of &nmp->nm_aconn[nextconn] is handled safely
in newnfs_connect() by locking nr_mtx and then releasing the
"client" if it is already set because multiple threads raced to set it non-NULL.

nm_aconn[] is not actually a TCP connection, but a krpc "client"
structure that tells the krpc to (re)create a connection when an
RPC message is sent.

It never goes back to NULL. During umount() the client is released,
but at that point the code is single threaded and guaranteed to be
releasing the NFS mount structure. As such, once non-NULL, it should
always be safe to use.
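
In code, the selection now looks roughly like this (a sketch with the surrounding context omitted; NFSLOCKMNT()/NFSUNLOCKMNT() are the existing mount-mutex macros):

  /* Advance the round-robin index under the mount mutex. */
  NFSLOCKMNT(nmp);
  nextconn = nmp->nm_nextnconn;
  nmp->nm_nextnconn = (nextconn + 1) % (nmp->nm_nconnect - 1);
  NFSUNLOCKMNT(nmp);
  clipp = &nmp->nm_aconn[nextconn];
  /*
   * newnfs_connect() then fills in *clipp while holding nr_mtx; if two
   * threads race to create the krpc client, the loser releases its handle
   * and uses the one already set, so *clipp never reverts to NULL once set.
   */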

sys/fs/nfsclient/nfsmount.h
87

The Linux "nconnect" is the total number of connections.
I now noticed that it allows "nconnect=1", although it is a no-op,
so I'll change the patch to allow "nconnect=1".

I'm not sure whether setting nm_nconnect to the "nconnect" mount option
value minus 1 would make the code more readable or more confusing?

I can change nm_nconnect to be "nconnect" - 1 if you think that is
more readable.

rmacklem marked an inline comment as done.

As suggested by markj@, this patch redefines the fields in
struct nfsmount as additional connections, to somewhat simplify
the code.
nm_nconnect renamed to nm_aconnect
nm_nextnconn renamed to nm_nextaconn

Also allow "nconnect=1" as a no-op (default) to make the
mount option more Linux compatible. (This is the only
semantic change for this version of the patch.)

rmacklem added inline comments.
sys/fs/nfsclient/nfsmount.h
87

nm_nconnect is now nm_aconnect for "additional connections"
and nm_nextnconn is now nm_nextaconn.

markj added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
658

I suspect this locking could become a bottleneck if many threads are issuing I/O at a high rate. Though, I see that for read and write RPCs we already unconditionally lock the mount to load nm_maxfilesize and nm_wsize, so maybe it is not worth worrying about for now.

One observation is that nm_aconnect is effectively fixed at mount time, so we could perhaps avoid the mnt lock and advance nm_nextaconn with atomic_fetchadd_int(). This doesn't solve the scalability problem but might give some marginal benefit if the mount lock is already heavily contended. It may also be reasonable to use a separate iterator for each CPU.

sys/fs/nfsclient/nfsmount.h
87

Thanks, that's what I was trying to get at. I think handling mount -o nconnect=1 is the right thing to do as well.

This revision is now accepted and ready to land. Jul 7 2021, 1:51 PM

I had used NFSLOCKMNT(nmp) instead of atomic_fetchadd_int() because I wanted
to keep nm_nextaconn below nm_aconnect and couldn't do the "%" via
an atomic.

But, of course, there is no harm in allowing nm_nextaconn to just keep
being incremented and wrap to 0 when it overflows, doing the remainder (%)
on the local value after atomic_fetchadd_int().
When nm_aconnect does not divide the counter's range evenly (e.g. an odd
value), the round robin ordering won't be maintained across the wrap to 0,
but that doesn't matter, since round robin ordering isn't strictly required.
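
Concretely, that amounts to something like this (a sketch using the post-rename field names; atomic_fetchadd_int(9) returns the value the counter had before the add):

  /*
   * Let nm_nextaconn grow and wrap freely; only the local copy is
   * reduced modulo the number of additional connections.
   */
  nextconn = atomic_fetchadd_int(&nmp->nm_nextaconn, 1) % nmp->nm_aconnect;
  clipp = &nmp->nm_aconn[nextconn];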

This variant of the patch uses atomic_fetchadd_int() instead of NFSLOCKMNT(),
as suggested by markj@.

This revision now requires review to proceed. Jul 7 2021, 8:21 PM
rmacklem added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
658

Lock contention for NFSLOCKMNT() is unlikely to be an issue, since RPCs will always
take some time (a fraction of 1msec at the minimum).

Also, contention on the NFSCLSTATELOCK() is far more likely to be an issue.
Unfortunately, most NFSv4 servers demand that the open/lock/delegation stateid be in
each Read/Write request, and the stateid must be found by searching linked lists
with the NFSCLSTATELOCK() held. (The RFCs specify "special stateids", but few non-FreeBSD
NFSv4 servers allow use of them for Read/Write. Also, there is no well-defined way for
an NFSv4 client to determine if the server allows them. I am planning to work on this someday,
possibly trying a special stateid and then switching over to using an open/lock/delegation stateid
if the I/O operation fails on the server.)

This revision is now accepted and ready to land. Jul 8 2021, 12:52 PM
This revision was automatically updated to reflect the committed changes.
rmacklem marked an inline comment as done.