Page MenuHomeFreeBSD

Augment systat(1) -swap to display large swap space processes
Needs ReviewPublic

Authored by ota_j.email.ne.jp on Apr 13 2021, 9:41 PM.

Details

Reviewers
kib
markj
imp
mckusick
bcr
Group Reviewers
manpages
Summary

This change updates the systat(1) -swap display to use libprocstat to obtain and display per-process swap space usage infomation following its existing swap devise/file statistics. It also incorporates the disk I/O information from the -vmstat display.

The new screen looks like below with 'systat -swap':

/0 /1 /2 /3 /4 /5 /6 /7 /8 /9 /10
Load Average |

Device/Path Size Used |0% /10 /20 /30 /40 / 60\ 70\ 80\ 90\ 100|
ada0s1b 2048M 2034M XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
zvol/sys/tempora 1024M 1015M XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
zvol/sys/swap 1024M 1014M XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Total 4096M 4063M XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Pid Username Command Swap/Total Per-Process Per-System
24153 hiro seamonkey 98M / 1G 7% 2%
23677 hiro xfce4-pane 28M / 81M 34% XXX 0%
23629 hiro xfce4-sess 25M / 118M 21% XX 0%
23681 hiro xfdesktop 20M / 58M 34% XXX 0%
23678 hiro thunar 15M / 43M 36% XXX 0%
23658 hiro at-spi-bus 14M / 23M 63% XXXXXX 0%
23660 hiro gvfsd 12M / 21M 56% XXXXX 0%

Disks ada0 ada1 ada2 cd0 pass0 pass1 pass2 pass3
KB/t 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
tps 0 0 0 0 1 0 0 0
MB/s 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
%busy 0 0 0 0 0 0 0 0

Test Plan

This patch has been tested by the submitter, Yoshihiro Ota.

Further feedback is welcomed from the reviewers.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 42082
Build 38970: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I made the following toy https://gist.github.com/1f0de166c34d822b6aa7a05110ede8be
Use it as shadow_show <pid> to dump all swap-like entries from the map of the given process, and all the shadow chains of the objects below it. Sample output is

pid 9215
0x00000000000235000-0x00000000000236000
  0x0fffff802f1ee8528 1 0
0x0000000080025b000-0x0000000080027e000
  0x0fffff80321f4c210 34 0
0x00000000800349000-0x0000000080034b000
  0x0fffff805e4bf9000 2 0
0x0000000080056f000-0x00000000800795000
  0x0fffff805e460c840 15 0
0x000000008007af000-0x00000000813f1b000
  0x0fffff805e4b92528 87396 0
0x00000000813f1c000-0x0000000081402c000
  0x0fffff805e4b92528 87396 0
0x0000000081402e000-0x000000008141ac000
  0x0fffff805e4b92528 87396 0
0x000000008141ae000-0x00000000814461000
  0x0fffff805e4b92528 87396 0
0x00000000814462000-0x0000000081446a000
  0x0fffff805e4b92528 87396 0
0x0000000081446c000-0x000000008144b7000
  0x0fffff805e4b92528 87396 0
0x000000008144b8000-0x0000000081453f000
  0x0fffff805e4b92528 87396 0
0x00000000814541000-0x00000000814553000
  0x0fffff805e4b92528 87396 0
0x00000000814555000-0x0000000081482f000
  0x0fffff805e4b92528 87396 0
0x00000000814831000-0x000000008148a3000
  0x0fffff805e4b92528 87396 0
0x000000008148a5000-0x0000000081496e000
  0x0fffff805e4b92528 87396 0
0x00000000814970000-0x00000000814b1c000
  0x0fffff805e4b92528 87396 0
0x00000000814b1d000-0x00000000814b1e000
  0x0fffff805e4b92528 87396 0
0x00000000814b1f000-0x00000000814b5b000
  0x0fffff805e4b92528 87396 0
0x00000000814b5c000-0x00000000814c0a000
  0x0fffff805e4b92528 87396 0
0x00000000814c0b000-0x00000000814ca6000
  0x0fffff805e4b92528 87396 0
0x00000000814ca7000-0x0000000081502e000
  0x0fffff805e4b92528 87396 0
0x00000000815033000-0x00000000815612000
  0x0fffff805e4b92528 87396 0
0x00000000815616000-0x00000000815688000
  0x0fffff805e4b92528 87396 0
0x00000000815689000-0x00000000815776000
  0x0fffff805e4b92528 87396 0
0x00000000815777000-0x000000008157b3000
  0x0fffff805e4b92528 87396 0
0x000000008157b5000-0x0000000081587d000
  0x0fffff805e4b92528 87396 0
0x0000000081587e000-0x0000000081598f000
  0x0fffff805e4b92528 87396 0
0x00000000815990000-0x000000008160d7000
  0x0fffff805e4b92528 87396 0
0x000000008160d8000-0x000000008161fd000
  0x0fffff805e4b92528 87396 0
0x000000008161fe000-0x00000000816287000
  0x0fffff805e4b92528 87396 0
0x00000000816289000-0x0000000081689e000
  0x0fffff805e4b92528 87396 0
0x000007ffffff5f000-0x000007ffffff9f000
  0x0fffff805e466d630 1 0
0x000007ffffff9f000-0x000007ffffffdf000
  0x0fffff805e4a6ee70 1 0
0x000007ffffffdf000-0x000007ffffffff000
  0x0fffff804bed92210 0 0
Total 87450 0

The number near each object handle, and total, are the number of pages in memory and swap. As you see, it handles de-duplication of the objects, in the example above there are a lot of mappings of the same object (no idea why the ninja process need this).

Try this program on the java process that gives a trouble to you.

usr.bin/systat/proc.c
36

sys/param.h already includes sys/types.h

55

Don't call it pages, it is either 'swapped_pages' or 'swapped'.

140

Why is max static?

153

&& should be on the previous line

155

Why do you skip objects with kvo_swapped == 0? They do participate in the shadow chain.

  • Avoid double counting and address review comments.

Thanks for the program. Indeed, I saw lots of duplicates from the java program.

usr.bin/systat/proc.c
140

This tracks the max realloc size.
I renamed it to maxnobj in the latest version.

204

After adding pid check, the numbers look more reasonable - e.g. no process with more swap than its allocated.

usr.bin/systat/systat.h
86–97

I moved these from devs.h.
I guess we don't want to include these in the review or commit.

I mostly looked at style.

usr.bin/systat/proc.c
4

Add email?

28

$FreeBSD$ in copyright was only used for header.

After git conversion, it does not make much sense to add the id to new files, I suggest to remove __FBSDID() below as well.

69

COMMLEN + 1

125

return (0);

126

return (*aa > bb->kvo_me ? -1: 1);

134

contig line should be indented +4 spaces
return ();

171

BTW why realloc(3) and not reallocf(3)?

183

... number of swap pages ...

184

Newline after uint32_t

214

return (pages);

245

space before '('

311

const struct proc_usage *aa = *((const struct proc_usage **)a);

314

return (aa->pages > bb->pages ? -1: 1);

usr.bin/systat/swap.c
6

I do not think that this changes deserve a copyright line, at least according to our typical policy of 20% changes.

usr.bin/systat/proc.c
5

Also by convention we've been dropping All rights reserved. as it is not necessary, feel free to take it out absent a reason you want it there (as an example, a company contribution guideline might specify it).

ota_j.email.ne.jp marked an inline comment as done.

Style fixes based on review comments and recstored Kirk's systat.1 updates.

ota_j.email.ne.jp added inline comments.
usr.bin/systat/proc.c
28

Do I keep sys/cdfs.h here?

171

I've never known reallocf().
It calls die() which calls exit() if allocation fails, anyway.

usr.bin/systat/swap.c
6

Is it 20% changes per commit or over a year?
I'm just curious.

usr.bin/systat/systat.h
36

I think I can drop devstat.h as well from the changeset.

ota_j.email.ne.jp marked 3 inline comments as done.

Remove #include <devstat.h>

usr.bin/systat/proc.c
28

No need to keep it. sys/param.h includes sys/types.h which already includes sys/cdefs.h

121

space before :

243

I do not quite follow. Assuming there is no sysctl errors, on the first call you return 0, and on the second and consequent calls, 1. Is this intended?

Also, why not follow consistent style, and do just return (prstat != NULL); (or whatever, ==) in the first statement in the function?

277

Excess ()

310

space before :

usr.bin/systat/swap.c
6

Total. I did not looked at the blame output.

If you are the author of more than 1/5 of this file, then it is fine, of course.

181

I would wrote this as count = kvnsw == 1 ? 2 : 3;

183

Excess (), two times

usr.bin/systat/sysput.c
48

This does not look like a tab before return.

But, can you use e.g. atop() and ptoa() instead of hand-rolled functions?

106

Change function definition to ANSI.
BTW as I see it is not needed if you start using ptoa()

usr.bin/systat/systat.1
296

..., the total use of swap space in bytes, ...

usr.bin/systat/systat.h
36

Style, sys/ includes before userspace includes.

77

I am not sure what meaning does this comment provide.

Thank you for reviews.

I happened to extract ports.tar.xz onto tmpfs and that caused long CPU spike and high memory usage.
systat took nearly 700MB.
I suppose it was procstat_getvmmap() took long time.
"top" wasn't updating its display, neither.

I will mention it in the BUGS section for a warning.

usr.bin/systat/proc.c
243

I had one wrong.
swapinit() returns 1 for success.

Thank you for reviews.

I happened to extract ports.tar.xz onto tmpfs and that caused long CPU spike and high memory usage.
systat took nearly 700MB.
I suppose it was procstat_getvmmap() took long time.
"top" wasn't updating its display, neither.

I will mention it in the BUGS section for a warning.

Most likely it was sysctl vm.objects and not per-proc vmmap sysctl that caused the intermediate hang and CPU usage. As I and Mark discussed it earlier, some more fine-grained API to retrieve e.g. just the shadow chain for given object handle probably would help there.

But before I implement this sysctl, can you recheck, please, that the following patch solves the transient hang issue?

 commit fd2da9e1515bb50fb1c8fb747d9987ee9b8a314d
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Sat May 8 01:13:29 2021 +0300

    sysctl vm.objects: yield if hog

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 1c4e879d82ea..b1fa3ebdf9ab 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -2614,6 +2614,7 @@ sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
 		kvo->kvo_structsize = roundup(kvo->kvo_structsize,
 		    sizeof(uint64_t));
 		error = SYSCTL_OUT(req, kvo, kvo->kvo_structsize);
+		maybe_yield();
 		mtx_lock(&vm_object_list_mtx);
 		if (error)
 			break;

Most likely it was sysctl vm.objects and not per-proc vmmap sysctl that caused the intermediate hang and CPU usage. As I and Mark discussed it earlier, some more fine-grained API to retrieve e.g. just the shadow chain for given object handle probably would help there.

But before I implement this sysctl, can you recheck, please, that the following patch solves the transient hang issue?

 commit fd2da9e1515bb50fb1c8fb747d9987ee9b8a314d
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Sat May 8 01:13:29 2021 +0300

    sysctl vm.objects: yield if hog

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 1c4e879d82ea..b1fa3ebdf9ab 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -2614,6 +2614,7 @@ sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
 		kvo->kvo_structsize = roundup(kvo->kvo_structsize,
 		    sizeof(uint64_t));
 		error = SYSCTL_OUT(req, kvo, kvo->kvo_structsize);
+		maybe_yield();
 		mtx_lock(&vm_object_list_mtx);
 		if (error)
 			break;

The patch didn't help.
By the way, I've been working on releng/13.0 branch with git cherry-pick.
If I need some changes only in main or stable/13 branches, I will either cherry-pick or switch to such branch.

I checked process with procstat as below.
kinfo_getvmobject() was the one taking long time.
xosview was still updating CPU, memory, and swap usage; it must be using different APIs than top/systat.

% procstat -k 13568

PID    TID COMM                TDNAME              KSTACK

13568 101023 systat - tmpfs_vptocnp_fill tmpfs_vptocnp vn_vptocnp vn_fullpath_any vn_fullpath sysctl_vm_object_list sysctl_root_handler_locked sysctl_root userland_sysctl kern___sysctlbyname sys___sysctlbyname syscall wmt_devclass
% procstat -k 13568

PID    TID COMM                TDNAME              KSTACK

13568 101023 systat - vn_vptocnp vn_fullpath_any vn_fullpath sysctl_vm_object_list sysctl_root_handler_locked sysctl_root userland_sysctl kern___sysctlbyname sys___sysctlbyname syscall wmt_devclass

Should we also try "sysctl improvements", https://www.freebsd.org/status/report-2021-01-2021-03/#_sysctl_improvements, to see if that helps?

Adjust systat.1 from the other review.

A few suggestions for wording improvements in the man page.

usr.bin/systat/systat.1
282

I would change the last "and" here to "as well as" to make this long sentence a bit more legible. Otherwise, it's just one long sentence connected by and's. I'm prone to do this myself, but I think we don't need to break up this sentence into two either.

285

s/percentage usage/usage percentage/

297

s/and/as well as/ (in this line here, see above)

I mean, when asking whether my one-line patch helped, did interactivity of the system during vm.objects retrieval improved? Of course it cannot optimize the running time of the sysctl, but it should allow for other threads to run some more while sysctl is gathering data and copying it out.

In D29754#677726, @kib wrote:

I mean, when asking whether my one-line patch helped, did interactivity of the system during vm.objects retrieval improved? Of course it cannot optimize the running time of the sysctl, but it should allow for other threads to run some more while sysctl is gathering data and copying it out.

I do not see any differences with or without patch. I ran "top" and "systat -swap" and after creating lots of tmpfs files, both systat spins and top is blocked.

In D29754#677726, @kib wrote:

I mean, when asking whether my one-line patch helped, did interactivity of the system during vm.objects retrieval improved? Of course it cannot optimize the running time of the sysctl, but it should allow for other threads to run some more while sysctl is gathering data and copying it out.

I do not see any differences with or without patch. I ran "top" and "systat -swap" and after creating lots of tmpfs files, both systat spins and top is blocked.

Ok. Could you add the following to the kernel, please, and see if it makes any difference?

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index dbb09f67b4c1..cf8b3fcb8f9c 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -2478,7 +2478,7 @@ sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
 	struct vnode *vp;
 	struct vattr va;
 	vm_object_t obj;
-	vm_page_t m;
+//	vm_page_t m;
 	u_long sp;
 	int count, error;
 
@@ -2524,6 +2524,7 @@ sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
 		kvo->kvo_memattr = obj->memattr;
 		kvo->kvo_active = 0;
 		kvo->kvo_inactive = 0;
+#if 0
 		TAILQ_FOREACH(m, &obj->memq, listq) {
 			/*
 			 * A page may belong to the object but be
@@ -2539,6 +2540,7 @@ sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
 			else if (m->a.queue == PQ_INACTIVE)
 				kvo->kvo_inactive++;
 		}
+#endif
 
 		kvo->kvo_vn_fileid = 0;
 		kvo->kvo_vn_fsid = 0;
In D29754#680269, @kib wrote:

Ok. Could you add the following to the kernel, please, and see if it makes any difference?

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index dbb09f67b4c1..cf8b3fcb8f9c 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -2478,7 +2478,7 @@ sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
 	struct vnode *vp;
 	struct vattr va;
 	vm_object_t obj;
-	vm_page_t m;
+//	vm_page_t m;
 	u_long sp;
 	int count, error;
 
@@ -2524,6 +2524,7 @@ sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
 		kvo->kvo_memattr = obj->memattr;
 		kvo->kvo_active = 0;
 		kvo->kvo_inactive = 0;
+#if 0

Did you mean both the 1st patch and 2nd patch together?

I so far tested the 2nd one only; I had reverted the 1st one.
With 2nd one only, I don't see visible improvements.

Update systat.1 based on suggestions.

I tested the 2 kernel changes both applied. In short, it helps since the 2nd refresh.

With or without the kernel changes, systat takes long time to return from the call. During that period, top, ps and few others freeze, although xosview draws memory and swap usage fine. Without the kernel changes, each systat refresh takes same amount of time and causes some others to hung. On the other hand, with the kernel changes, systat redraws significantly faster for the 2nd refresh and thereafter and also others like top don't hung.
So, the first call is an expensive call.

To revive this stuff, please try D31163. You need to replace sysctl vm.objects with vm.swap_objects. No other changes to systat code is required. You might even try vm.swap_objects first, and if not available, still use vm.objects.

The new sysctl reports all but only swap objects in one go. This is probably the fastest we can get, except for computing swap use fully in kernel (but I prefer to avoid that). In particular, comparing with my initial proposal of doing per-vm_entry sysctl, this only requires one syscall to fetch relevant objects info.

So please apply D31163, change code to use new sysctl, and report if it is any faster. I think that we would move forward anyway, assuming that the per-process swap use calculation can be disabled.

gbe added a subscriber: gbe.

OK for the man page part, since all comments are addressed.

I tested the 2 kernel changes both applied. In short, it helps since the 2nd refresh.

With or without the kernel changes, systat takes long time to return from the call. During that period, top, ps and few others freeze, although xosview draws memory and swap usage fine. Without the kernel changes, each systat refresh takes same amount of time and causes some others to hung. On the other hand, with the kernel changes, systat redraws significantly faster for the 2nd refresh and thereafter and also others like top don't hung.
So, the first call is an expensive call.

Have you had a chance to try out Kostik's latest sysctl addition (D31163 to add sysctl vm.objects_swap) that returns only swap objects to
see if it reduces the overhead when you are running your `systat -swap'?

I started looking into https://reviews.freebsd.org/D31163.
I've been releng/13.0 based system (with cherrypick).
This time, the patch doesn't apply to 13 and I need to prepare main branch based environment.
I will need a bit time to build the latest.

  • systat: Stop displaying total if there is one swap devise.
  • systat: Implemented per-process swap display on -swap.
  • Avoid double counting and address review comments.
  • systat.1 updates
  • D31163 - Add sysctl vm.objects_swap.
  • Use new vm.objects_swap instead of vm.objects.

Added a patch from D31163. Unfortuantely, the new sysctl didn't improve
performance issue with tmpfs files.

  • systat: Stop displaying total if there is one swap devise.
  • systat: Implemented per-process swap display on -swap.
  • Avoid double counting and address review comments.
  • systat.1 updates
  • D31163 - Add sysctl vm.objects_swap.
  • Use new vm.objects_swap instead of vm.objects.

Added a patch from D31163. Unfortuantely, the new sysctl didn't improve
performance issue with tmpfs files.

Great to see that you have had some time to get back to this project.

While you say that D31163 did not improve the performance issue with tmpfs files, did it improve performance on other tests? Or did it fail to improve performance on any of your tests?

While you say that D31163 did not improve the performance issue with tmpfs files, did it improve performance on other tests? Or did it fail to improve performance on any of your tests?

tmpfs with lots of files slows down a lot. For example, when I extract ports.txz onto tmpfs, I observed sysctl("vm.objects") call takes significant amout of time with truss -o file. Each sysctl calls takes more than few minutes. On the contrary, when I have a large file on tmpfs, e.g. dd if=/dev/zero of=/tmpfs/large bs=1M count=10240, vm.objects calls are very quick.

I haven't be able to figure out which function during a system call taking long time, yet.