Page MenuHomeFreeBSD

Add iostat-like NFS server statistics
ClosedPublic

Authored by rmacklem on Jan 23 2015, 11:15 PM.
Tags
None
Referenced Files
F101479577: D1626.id19099.diff
Wed, Oct 30, 7:44 AM
Unknown Object (File)
Thu, Oct 17, 2:21 AM
Unknown Object (File)
Tue, Oct 15, 2:31 AM
Unknown Object (File)
Mon, Oct 14, 2:56 PM
Unknown Object (File)
Sat, Oct 12, 5:05 PM
Unknown Object (File)
Fri, Oct 11, 8:58 AM
Unknown Object (File)
Fri, Oct 11, 8:34 AM
Unknown Object (File)
Thu, Oct 10, 12:53 AM

Details

Summary

This change is forward-ported from SpectraBSD, which is stable/10. The changes were written by Ken Merry <ken@FreeBSD.org>.

Here are snippets from the original commit logs:

These statistics are similar to to what iostat(8) and ctlstat(8)            
report.  In fact, some of the code has been borrowed from ctlstat           
and libdevstat.                                                             

The statistics reported include kilobytes per transfer, transfers           
per second, megabytes per second, and I/O latency for reads,                
writes and all transactions.  We also report the current queue              
length, and busy percentage.                                                
                                                                            
This is an improvement on the current NFS statistics, which are             
only operation counts.  The better long term solution for this              
would be statistics collection at the VFS level, and integration            
with the devstat(9) subsystem so that we could have a common set of         
statistics routines that would work for disk-type devices and               
filesystems.                                                                

nfsport.h:      Add a start/done count, bytes transferred per            
                operation, number of operations, operation               
                duration, and busy time.                                 
                                                                            
nfs_nfsdport.c: Record bytes transferred.  Note that this is not            
                locked, which is a problem, and is not coincident           
                with the entire operation getting recorded.                 
nfs_nfsdsocket.c:                                                           
                Add two new routines, nfsrvd_statstart() and                
                nfsrvd_statend(), that record statistics for each           
                NFS operation type.                                         
                                                                            
                Even though we have a provision to record bytes in          
                nfsrvd_statend(), we don't because at this level,           
                we haven't decoded enough of the request to know            
                how long the operation is.                                  
                                                                            
nfsstat/Makefile:                                                           
                nfsstat now depends on libdevstat.                          
                                                                            
nfsstat.1:      Document the -d option to enable the new statistics         
                reporting.                                                  
                                                                            
nfsstat.c:      Implement a new statistics reporting mode that is           
                similar to what iostat(8) does.                             
                                                                            
                This mode is only active when -d and an interval is         
                specified.                                                  
                                                                            
                Change exp_sideways() to print out the new                  
                statistics (when asked), and printhdr to print              
                the new header.                                          
                                                                         
                compute_new_stats() and compute_totals and the           
                display code in exp_sideways() heavily borrow            
                from ctlstat and borrow some from devstat.               

                With higher NFS throughput, the existing -d dispay frequently            
                wrapped past 80 columns due to the size of the MB/sec and tps            
                numbers.                                                                 
                                                                         
                To compensate for this, remove individual read and write latency         
                from the default display for the -d option.  (The total latency is       
                still displayed by default.)                                             
                                                                         
                Add a third stat type, STAT_TYPE_COMMIT.

                Read and write latency can now be added back in with the -W option.      
                In addition, we now display commits per second and commit latency        
                with the -d -W option.                                                   

                Make printhdr() look at the widemode flag when it        
                is printing out the header for the newStats mode,        
                and adjust the field headings to include or exclude      
                fields depending on whether we are in wide mode.         
                                                                         
                In exp_sidewaysintpr(), calculate commit latency         
                and print it out in wide mode.  Only print out           
                read and write latency if we are in wide mode.
Test Plan

Update kernel/world. Run I/O tests from NFS export. Run "nfsstat -d -w 1" to see iostat-style output.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

will retitled this revision from to Add iostat-like NFS server statistics .
will updated this object.
will edited the test plan for this revision. (Show Details)
will added a reviewer: rmacklem.
will set the repository for this revision to rS FreeBSD src repository - subversion.

Basically, it looks ok to me, except for the fact that it grows
the ext_nfsstat structure, which is not handled by nfsstat(1).
(See below.)

I have no idea if server sysadmins will find
these stats useful, but hopefully users will comment on this.
(Since they can't comment using this tool, maybe a post to freebsd-fs@
with example stats output for comments might be useful?)

All nfsstat(1) does is nfssvc(NFSSVC_GETSTATS, &newnfsstats) and
the syscall does a copyout() of the data. As such, when newnfsstats
is grown, the kernel will copyout() a larger amount of data, overwriting
something in an old nfsstat(1) binary.

I think the new stuff should go in a separate structure and then code
to get that should be added to nfsstat(1). I think there is a library that
can handle this fairly easily, but I've lost track of what it is called.;-)
(The NFSSVC_xxx stuff was done when the code was meant to be
portable to different BSDen. I don't consider this important any more.)

At the very least, a test of an unpatched nfsstat binary running on a
patched kernel needs to be done.

rick

In D1626#5, @rmacklem wrote:

Basically, it looks ok to me, except for the fact that it grows
the ext_nfsstat structure, which is not handled by nfsstat(1).
(See below.)
I have no idea if server sysadmins will find these stats useful, but hopefully users will comment on this.
(Since they can't comment using this tool, maybe a post to freebsd-fs@ with example stats output for comments might be useful?)

All nfsstat(1) does is nfssvc(NFSSVC_GETSTATS, &newnfsstats) and
the syscall does a copyout() of the data. As such, when newnfsstats
is grown, the kernel will copyout() a larger amount of data, overwriting
something in an old nfsstat(1) binary.

You're right, this would almost certainly break old binaries. That would be bad.

I think the new stuff should go in a separate structure and then code
to get that should be added to nfsstat(1). I think there is a library that
can handle this fairly easily, but I've lost track of what it is called.;-)
(The NFSSVC_xxx stuff was done when the code was meant to be portable to different BSDen. I don't consider this important any more.)

Doing it with devstat(9) is an option. Given that it is designed for disk-type devices with a limited number of operation types (read, write, free), the way I would probably do that is to have the devstat peripheral device name be "nfs" and then the unit number correspond to the NFS RPC number, so we could track operation counts and bytes read/written for each type of operation. The client (nfsstat in this case) would need to understand the naming/numbering scheme, and allow the user to optionally select which stats he wants to see.

The drawback to that approach is that it is rather kludgy, not that clean from an interface standpoint, and only gives you global NFS statistics. It won't give you stats on a per export level, or for any other filesystems. And it would be cumbersome to expand elsewhere. And the existing devstat tools won't really work well with that approach.

Another way to do it would be with custom Dtrace probes. That might give a reasonable way to get things out of the kernel, but getting nicely formatted output may be more of a challenge there. It also wouldn't allow much re-use of our current statistics infrastructure.

I think the best way to do it would be an extension to devstat(9) that would add a new VFS-specific stats structure. It would be added for each filesystem mount, and would track operations/bytes for all VFS calls. Then we could have a more generic fsstat utility that could re-use some of the devstat infrastructure, and give more useful filesystem statistics.

Although I think that's the best way to do it, I've got a lot of other things in my queue to upstream (lots of code that is already written) that are a higher priority for me. So I don't think I'll get to it any time soon.

At the very least, a test of an unpatched nfsstat binary running on a
patched kernel needs to be done.

rick

Thanks for looking at this!

Ken

Speaking as a user, I would really love to see some form of this functionality happen.

Note that Linux has fairly detailed NFS client side stats. They track info such as bytes transmitted/received and request latency per-RPC (so not just read, write, other, but read, write, getatter, access, etc.). After having seen that in person, I would love to have that level of detail. This definitely seems like a giant step close to that, but I would encourage you to go for adding those broken out stats on the client side if you are already hacking on the client. It is not unusual in certain workloads to see different response times for READ vs WRITE vs GETATTR/ACCESS vs namespace operations like LOOKUP, READDIR, CREATE, etc. (For example, if everything is slow then the server's NFS request queue might be overloaded (or it is is full of requests that are otherwise taking a long time). OTOH, if GETATTR/ACCESS complete quickly but READ/WRITE are slow then you might be disk bound on the server. Also, namespace operations can be slow due to lock contention even if other operations are not slow, etc.

rmacklem edited reviewers, added: will; removed: rmacklem.

I have a diff that includes these changes plus some other updates for
the nfsstat structure.
I think I have to do this to be able to update the revision.

rmacklem set the repository for this revision to rS FreeBSD src repository - subversion.

This patch updates the nfsstat structure to add the iostat-like changes.
It also:

  • all stats are uint64_t
  • a vers field at the beginning allows further revisions of the structure
  • tables are now large enough for NFSv4.2
  • it has compatibility code for the old nfsstat command

I did not add a new style of nfsstat output that includes the NFSv4.1
specific ops. I will do that as a separate future revision of nfsstat.c.

I've been meaning to do something like this over the past year (since I made my last comment) but haven't had the time. But after attending a few conferences I became convinced that what I really wanted was to capture exact timing for all RPCs and have a usermode program to analyze the data via something like ALQ. In particular, I want to measure median and tail latency (which is going to be different for different operations) and also queue-residence time in the various queues a request moves through. This may be too expensive for some (many?) applications, so I don't expect to see that sort of thing here.

Long ago (and I mean long, long ago as in 4BSD era) I had a circular log with entries
for each RPC (start/end timestamps and I don't recall what else). Then I had a userland
program that would enable the log and capture it for a while, etc.

I found it useful, but I recall that with the low clock resolution at the time, the timings
were of limited use.

Doing something like this could indeed be useful, but I don't think it would be a part
of the nfsstat structure.
Since it took me a year and a half to get around to doing this little change, maybe you
should come up with the patch?

Thanks for taking this on! I would like to see the byte totals updated. As it is in the patch, they aren't used, and so the throughput numbers from nfsstat won't produce anything.

Ken, maybe you or Will could update the patch w.r.t. byte totals and
amything else.
I thought I had everything in the previous patch in it, but if I missed
something, sorry...
You guys would have a better idea of what you expect to be generated
by the code than I would, so please make the changes.
There is no rush, since it will be months before -head is open for new commits.

rmacklem edited edge metadata.
rmacklem removed rS FreeBSD src repository - subversion as the repository for this revision.

The patch was missing the changes to nfs_nfsdport.c. I had the changes in
my test kernel, but they were missing here. This update fixes that.

Thanks to ken@ for spotting that they were missing.

wblock added inline comments.
usr.bin/nfsstat/nfsstat.1
65 ↗(On Diff #18524)

Use the serial/Harvard/Oxford comma:
s/ and/, and/

70 ↗(On Diff #18524)

Serial comma again:
s/ and/, and/

72 ↗(On Diff #18524)

"May" usually means "is permitted to", but here appears to mean "might". And it's not really the busy percentage, it's the calculated or shown busy percentage. So maybe

The busy percentage shown can exceed 100 at times.
74 ↗(On Diff #18524)

Sorry, don't understand this explanation. Why does this cause displayed busy percentages to go over 100%?

ken edited edge metadata.

Looks good, thank you for doing this! Your version is much cleaner than my original version. I have tested this out on head and 'nfsstat -d 1' and 'nfsstat -W -d 1' are working as intended.

There is one change needed to get nfsstat to build, see the comment inline.

usr.bin/nfsstat/nfsstat.c
132 ↗(On Diff #18524)

This needs to be declared static ("static struct stattypes...") to fix a build warning/error with clang at least.

This revision is now accepted and ready to land.Aug 2 2016, 3:14 PM

Trying to add my inline comment again.

Inline comment reply doesn't seem to work, so here is my answer to Warren's question on how the busy percentage is calculated:

This is because of the way busy percentages are calculated. The busy time is calculated by adding the elapsed time between the last time an operation started or finished, and the current time. If there is only one operation outstanding, the "busy time" time will get updated with the total time of that operation.

That means that the difference in the busy time between the two measurement intervals (often 1 second) will increase by more than the measurement interval.

Here is an example from 'nfsstat -W -d 1':

[======== Read ========] [======== Write =======] [Commit ] [=========== Total ============]
KB/t tps MB/s ms KB/t tps MB/s ms tps ms KB/t tps MB/s ms ql %b
0.00 0 0.00 0.00 64.00 1790 111.90 0.17 0 0.00 64.00 1790 111.90 0.17 0 30
0.00 0 0.00 0.00 64.00 1790 111.90 0.17 0 0.00 64.00 1790 111.90 0.17 0 31
0.00 0 0.00 0.00 64.03 1790 111.93 0.16 0 0.00 64.03 1790 111.93 0.16 1 29
0.00 0 0.00 0.00 63.97 1791 111.86 0.17 0 0.00 63.97 1791 111.86 0.17 0 30
0.00 0 0.00 0.00 64.04 1790 111.91 0.17 0 0.00 64.04 1790 111.91 0.17 1 31
0.00 0 0.00 0.00 63.96 1631 101.85 0.16 0 0.00 63.96 1631 101.85 0.16 1 26
0.00 0 0.00 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 1 0
0.00 0 0.00 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 1 0
0.00 0 0.00 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 1 0
0.00 0 0.00 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 1 0
0.00 0 0.00 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 1 0
0.00 0 0.00 0.00 0.00 0 0.00 0.00 1 5739.50 0.00 1 0.00 5739.50 0 540
0.00 0 0.00 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0

There was a long running commit operation that finished, and shows 540% busy. Because we had a 1 second measurement interval in this case, it means that percentage came from dividing 5.4 seconds by 1 second.

I've copied these lines from ken@'s comment into the man page and it seems to read ok to me.
I added these lines after ...can exceed 100%.:

This is because of the way busy percentages are calculated. The busy time is calculated by adding the elapsed time between the last time an operation started or finished, and the current time. If there is only one operation outstanding, the "busy time" time will get updated with the total time of that operation.
That means that the difference in the busy time between the two measurement intervals (often 1 second) will increase by more than the measurement interval.

that replace:
That is not a bug, but is rather caused when a long running operation completes.

Btw, I didn't see the nfsstat.c build warning, but added the "static" anyhow. I did need to add
#include <stdint.h>
for i386 builds.

usr.bin/nfsstat/nfsstat.1
65 ↗(On Diff #18524)

I have made this change to my copy of the patch. I don't know how to change it here.

70 ↗(On Diff #18524)

I have made this change to me copy of the patch. I don't know how to
update it here...

72 ↗(On Diff #18524)

I've applied this change to my copy of the patch. I won't bother to say I don't know
how to change this one.;-)

74 ↗(On Diff #18524)

Hopefully will@ or ken@ can explain this further. I just cloned their patch.

I've copied these lines from ken@'s comment into the man page and it seems to read ok to me.
I added these lines after ...can exceed 100%.:

This is because of the way busy percentages are calculated. The busy time is calculated by adding the elapsed time between the last time an operation started or finished, and the current time. If there is only one operation outstanding, the "busy time" time will get updated with the total time of that operation.
That means that the difference in the busy time between the two measurement intervals (often 1 second) will increase by more than the measurement interval.

Thank you, it reads okay here.

The way to get Phabricator to update what it shows is to update the diff. Of course, then the comments are often on the wrong lines. But we take what we can get.

rmacklem edited edge metadata.

This version of the patch includes the changes to the man page recommended by wblock@.
It also includes space in the nfsstatsv1 structure for the NFSv4.2 operations, based on the
most recent internet draft (this may change, but hopefully not).

If no one suggests changes to this, I plan on committing it to head in about a week.

This revision now requires review to proceed.Aug 6 2016, 11:30 PM
ken edited edge metadata.

Looks good!

This revision is now accepted and ready to land.Aug 8 2016, 8:32 PM