Page MenuHomeFreeBSD

add a "nconnect" mount option to the NFS client
ClosedPublic

Authored by rmacklem on Jul 1 2021, 1:34 AM.
Tags
None
Referenced Files
F107204258: D30970.id91594.diff
Sat, Jan 11, 2:50 PM
F107203882: D30970.id91991.diff
Sat, Jan 11, 2:49 PM
Unknown Object (File)
Sat, Jan 11, 1:38 PM
Unknown Object (File)
Sat, Jan 11, 11:39 AM
Unknown Object (File)
Thu, Jan 2, 8:41 PM
Unknown Object (File)
Fri, Dec 27, 9:47 AM
Unknown Object (File)
Wed, Dec 25, 10:38 PM
Unknown Object (File)
Wed, Dec 25, 10:20 PM
Subscribers
None

Details

Summary

Linux has had an "nconnect" NFS mount option for some time.
It specifies that N (up to 16) TCP connections are to created for a mount.

A discussion on freebsd-net@ indicated that this could improve client<-->server
network bandwidth, if either the client or server have one of the following:

  • multiple network ports aggregated to-gether with lagg/lacp.
  • a fast NIC that is using multiple queues

It does result in using more IP port#s and might increase server peek load for a client.

Test Plan

Testing of mount attempts with various valid and invalid values
for nconnect. For valid mounts, look at a packet capture in
wireshark to see that the N TCP connections are being used,
round robin.

Also test network partitioning, to see that the N TCP connections
are re-established after the network partition fails.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

rmacklem created this revision.

This comment was imbedded in a recent email message on linu-nfs@vger.kernel.org.
Although other posters felt more evidence was needed to determine this,
it at least suggests that separating the large RPC messages from the small ones may
improve performance under certain circumstances.

The original issue described was how a high read/write process on the
client could slow another process trying to do heavy metadata
operations (like walking the filesystem). Using a different mount to
the same multi-homed server seems to help a lot (probably because of
the independent slot table).

This version of the patch sends all RPCs that use small RPC messages
on the original TCP connection and sends the RPCs (Read/Readdir/Write)
with large messages on the additional TCP connections in a round robin
fashion.
nconnect=2 - separates Read/Readdir/Write onto a separate TCP connection

from the rest of the RPCs. Won't help w.r.t. aggregate network
bandwidth, but will avoid small RPCs being delayed by large
RPC messages.

nconnect=3 or more - separates Read/Readdir/Write onto the N-1 additional

TCP connections, allowing use of the aggregate network bandwidth
provided by lagg/lacp aggregated NICs or a fast NIC using
multiple queues. (In both cases, traffic for a TCP connection is
normally pinned to a NIC or a queue for a NIC.)

I considered making "separate Read/Readdir/Write from RPCs with small messages
a separate mount option, but felt that adding one new mount option was more than
enough for now.

This patch depends on the property that the NFS
server sends the RPC reply on the same TCP connection
as the one the request was received on. Although the
FreeBSD server always does this, it is only required
for NFSv4.1/4.2 (RFC 5661)

The other reason for not supporting "nconnect" for
NFSv3 and NFSv4.0 is that they use the DRC on
servers and "nconnect" might break the DRC,
depending upon its implementation.

As such, I've added checks to limit "nconnect" to
NFSv4.1/4.2 mounts only.

sys/fs/nfs/nfs_commonkrpc.c
885

Why READDIR but not READDIRPLUS?

sys/fs/nfsclient/nfs_clvfsops.c
1204

Am I right that a mount -u which tries to modify nconnect is effectively ignored? Should it be an explicit error? Or is it worth trying to handle that?

Add NFSPROC_READDIRPLUS to the list of RPCs
with large messages as suggested by markj@.

rmacklem added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
885

Good catch, thanks.

sys/fs/nfsclient/nfs_clvfsops.c
1204

The tradition for "mount -u" for the NFS client is to silently ignore options
that cannot be updated.
See line#1234-1249 for a list of option flags that get ignored. There are also
a bunch of others, like "minorversion" and "nconnect" (as you noted).

This tradition has been around for a long time, predating the new NFS client.
(Although the above list is much shorter, it exists in FreeBSD4. I didn't look further back.)

sys/fs/nfs/nfs_commonkrpc.c
638

It looks like we are assuming here that nm_nextnconn won't change between here and the change below, but I can't see any synchronization to guarantee it. In particular, isn't it possible for another thread to perform an RPC and advance nm_nextnconn after we call newnfs_connect() but before we issue the RPC? If that happens immediately after the mount is created or after a disconnect, nmp->nm_aconn[nmp->m_nextnconn] can be NULL.

sys/fs/nfsclient/nfs_clvfsops.c
1204

I see, thanks for the explanation.

sys/fs/nfsclient/nfsmount.h
87

IMO it would be a bit simpler to make this a count of the additional connections, nm_naconn. Right now it's either 0 if there are no additional connections, or it's the total number of connections. This means that the code tests nm_nconnect > 0 to determine whether the mount's configured to make more than one connection, and it also means that we have to subtract 1 in most places to get the number we actually want.

rmacklem marked an inline comment as done.

If the last catch was good, this one is a great catch.
I'm a little embarrassed that I didn't think to make
manipulation of nm_nextnconn MP-safe.

Anyhow, this patch fixes this. Setting of nmp->nm_aconn[]
was already MP-safe, since setting "*clipp" is protected by
the nr_mtx mutex.

I've added comments to nfsmount.h to indicate which mutex
is used for the new nm_XXX fields this patch adds.

rmacklem added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
638

Great catch. The code now sets a local variable "nextconn" and
updates nm_nextnconn with the NFS mount point mutex held.
The setting of &nmp->nm_aconn[nextconn] is handled safely
in newnfs_connect() by locking nr_mtx and then releasing the
"client" if it already set and multiple threads have raced to set it non-NULL.

nm_aconn[] is not actually a TCP connection, but a krpc "client"
structure that tells the krpc to (re)create a connection when an
RPC message is set.

It never goes back to NULL. During umount() the client is released,
but at that point in the code, it is single threaded and guaranteed
to release the NFS mount structure. As such, only non-NULL, it should
always be safe to use.

sys/fs/nfsclient/nfsmount.h
87

The Linux "nconnect" is the total number of connections.
I now noticed that it allows "nconnect=1", although it is a no-op,
so I'll change the patch to allow "nconnect=1".

I'm not sure if setting nm_nconnect to the "nconnect" mount option
-1 would make the code more readable or more confusing?

I can change nm_nconnect to be "nconnect" - 1, if you think it is
more readable?

rmacklem marked an inline comment as done.

As suggested by markj@, this patch redefines the fields in
struct nfsmount as additional connections, to somewhat simplify
the code.
nm_nconnect renamed to nm_aconnect
nm_nextnconn renamed to nm_nextaconn

Also allow "nconnect=1" as a no-op (default) to make the
mount option more Linux compatible. (This is the only
semantic change for this version of the patch.)

rmacklem added inline comments.
sys/fs/nfsclient/nfsmount.h
87

nm_nconnect is now nm_aconnect for "additional connections
and nm_nextnconn is now nm_nextaconn.

markj added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
644

I suspect this locking could become a bottleneck if many threads are issuing I/O at a high rate. Though, I see that for read and write RPCs we already unconditionally lock the mount to load nm_maxfilesize and nm_wsize, so maybe it is not worth worrying about for now.

One observation is that nm_aconnect is effectively fixed at mount time, so we could perhaps avoid the mnt lock and advance nm_nextaconn with atomic_fetchadd_int(). This doesn't solve the scalability problem but might give some marginal benefit if the mount lock is already heavily contended. It may also be reasonable to use a separate iterator for each CPU.

sys/fs/nfsclient/nfsmount.h
87

Thanks, that's what I was trying to get at. I think handling mount -o nconnect=1 is the right thing to do as well.

This revision is now accepted and ready to land.Jul 7 2021, 1:51 PM

I had used NFSLOCKMNT(nmp) instead of atomic..() because I wanted
to maintain nm_nextaconn at < nm_aconnect and couldn't do "%" via
an atomic.

But, of course, there is no harm in allowing nm_nextaconn to just keep
being incremented and wrap to 0 when it overflows, doing the remainder (%)
on the local value after atomic_fetchadd_int().
When nm_aconnect is odd, the round robin ordering won't be maintained
when it overflows->0, but why would we care, since round robin ordering
isn't strictly required.

This variant of the patch uses atomic_fetchadd_int() instead of NFSLOCKMNT(),
as suggested by markj@.

This revision now requires review to proceed.Jul 7 2021, 8:21 PM
rmacklem added inline comments.
sys/fs/nfs/nfs_commonkrpc.c
644

Lock contention for NFSLOCKMNT() is unlikely to be an issue, since RPCs will always
take some time (a fraction of 1msec at the minimum).

Also, contention on the NFSCLSTATELOCK() is far more likely to be an issue.
Unfortunately, most NFSv4 servers demand the open/lock/delegation stateid be in
each Read/Write request and the stateid must be found by searching linked lists
with the NFSCLSTATE() lock held. (The RFCs specify "special stateids", but few non-FreeBSD
NFSv4 servers allow use of them for Read/Write. Also, there is no well defined way for
an NFSv4 client to determine if the server allows them. I am planning to work on this someday,
possibly trying a special stateid and then switching over to using an open/lock/delegation stateid
if the I/O operation fails on the server.)

This revision is now accepted and ready to land.Jul 8 2021, 12:52 PM
This revision was automatically updated to reflect the committed changes.
rmacklem marked an inline comment as done.