
Align the laundry and page out worker shortfall calculations
Needs Review · Public

Authored by jtl on Feb 5 2020, 4:01 PM.

Details

Reviewers
markj
jeff
kib
alc
Summary

Fix a race and a mismatch in the laundry system's shortfall mode.

The race occurs in accessing vmd->vmd_pageout_deficit. The page out worker accesses and clears this variable while it is calculating the shortfall. If it calculates a shortfall which vm_pageout_scan_inactive() cannot clear, vm_pageout_scan_inactive() will signal the laundry worker to enter "shortfall" mode. The laundry worker will try to use vmd->vmd_pageout_deficit in its calculation of the shortfall. However, because this variable has now been cleared (and its contents are unpredictable, as they could reflect anywhere from 0 to 10 ms of accumulation), the laundry worker's shortfall calculation is unpredictable.

The mismatch occurs in the calculation of the shortfall. r329882 added a PID controller to calculate the shortfall. When vm_pageout_scan_inactive() cannot clear the shortfall, vm_pageout_scan_inactive() will signal the laundry worker to enter "shortfall" mode. However, the laundry worker makes an independent calculation of the shortfall. Importantly, it does not consult the PID controller. This can lead to the perverse result that the laundry system cancels an in-progress background laundering operation, calculates that there is no shortfall, and does not even immediately resume background laundering.

This commit gives the laundry worker access to the shortfall calculated by the PID loop so the laundry worker will have the same shortfall information the pageout daemon has.

It is possible that vm_pageout_scan_inactive() will resolve a great deal of the shortfall. Therefore, this change may cause the laundry thread to be a touch too aggressive. However, I think the theory here is similar to the theory of using vmd->vmd_pageout_deficit: it is likely the shortfall will recur, so it is probably justified to be more aggressive in laundering pages as a one-time event.

(Note: there may be better fixes. However, I'm proposing this as a straw man.)

Test Plan

I've created artificial memory pressure on a test system by consuming a large amount of memory and repetitively touching each page in a loop. At the same time, I have background processes running which find large files and calculate their md5 checksum.

After applying this change, swap usage seemed appropriate for the memory load, the system seemed more stable, and free memory stayed within a fairly well-constrained range.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

I believe that for global system behavior, it is better to clean pages from the laundry earlier rather than later, even if, in principle, the more clean pages we have, the easier it is to survive a peak in allocations.
After your observation, I wonder whether vmd_shortage should be updated by adding the current shortage to it instead of directly assigning to it.

So, the problem manifests as the laundry queue steadily growing without any swapping in response? And the theory is that shortfall requests are interrupting periodic background laundering, and the laundry thread fails to do anything because it frequently reads vmd_free_count after the page daemon has brought it up above vmd_free_target?

Note that the page daemon uses vmd_free_target as the PID controller set point, but its target may be larger than the instantaneous difference vmd_free_target - vmd_free_count. So if it manages to free enough pages to satisfy vm_paging_target(), but not enough to satisfy the PID controller target, it'll trigger the laundry thread's shortfall mode, which then does nothing (unless pageout_deficit happens to be bigger than the negative difference). This suggests to me that the page daemon should be storing the value of page_shortage in the vmd_shortage field. In other words, the page daemon has failed to meet its target by page_shortage pages, and the laundry thread should try and make up that difference.

Assuming my understanding is right, I'm surprised that you're seeing shortfall requests so regularly. Isn't the inactive queue usually quite large in steady state operation?

So, the problem manifests as the laundry queue steadily growing without any swapping in response?

We noticed this when we tried enabling encrypted swap. On the console, we see a string of processes killed due to the server being out of memory. Then, eventually, the watchdog timer fires and kills the system. I don't know what triggers this cycle. It seems to happen on a small percentage of systems hours to days after boot. To the best of my knowledge, we have not been able to observe a system descend into this naturally.

We tried to recreate the problem by artificially creating memory pressure. We ran a program that allocates a lot of memory (equal to the sum of the free and inactive sizes) and sequentially writes to each page in a loop. When we did this, we saw:

  1. The laundry size is growing.
  2. Processes are continually killed due to low memory.
  3. Finally, the watchdog kicks in and reboots the system.

Now, we may have recreated a different problem that has similar symptoms. But, at minimum, it seems like this is showing *a* bug.

And the theory is that shortfall requests are interrupting periodic background laundering, and the laundry thread fails to do anything because it frequently reads vmd_free_count after the page daemon has brought it up above vmd_free_target?

My theory is that the PID controller is predicting we will need additional free pages in the future, so it sets a large shortfall. vm_pageout_scan_inactive() is able to clear some of the shortfall and leaves vmd_free_count above vmd_free_target, but it is not able to completely satisfy all the shortfall predicted by the PID controller. So, vm_pageout_scan_inactive() still thinks there is a shortfall and sets VM_LAUNDRY_SHORTFALL.

When the laundry thread wakes up, it sees VM_LAUNDRY_SHORTFALL. However, when it calculates the shortfall, it comes up with a negative number. This interrupts the current background laundering (setting the target to 0), but functionally does not actually cause a "shortfall" laundering. I suspect the most likely scenario is that it starts a new background laundering; however, it looks like it is possible that there could be a perverse set of circumstances where the laundering actually goes back to idle.

This theory is supported by a core I gathered during an artificial recreation.

The laundry thread was sitting idle:

Thread 592 (Thread 100424):
#0  sched_switch (td=0xfffff8000483f6e0, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/sched_ule.c:2148
#1  0xffffffff8059d562 in mi_switch (flags=0x104) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:525
#2  0xffffffff805ec93f in sleepq_timedwait (wchan=<unavailable>, pri=<unavailable>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/subr_sleepqueue.c:705
#3  0xffffffff8059cd7b in _sleep (ident=0xffffffff80a6a89e <pause_wchan+14>, lock=0x0, priority=0x0, wmesg=0xffffffff809dfe8d "laundp", sbt=0x1999997c, pr=0x0, flags=0x100) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:216
#4  0xffffffff8059d284 in pause_sbt (wmesg=<unavailable>, sbt=0x1999997c, pr=<unavailable>, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:335
#5  0xffffffff8089ed10 in vm_pageout_laundry_worker (arg=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:1114
#6  0xffffffff8055660e in fork_exit (callout=0xffffffff8089dff0 <vm_pageout_laundry_worker>, arg=0x0, frame=0xfffffe038cd66480) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_fork.c:1058
#7  <signal handler called>

When it wakes up, it should find there is a shortfall, since vmd->vmd_laundry_request was set to VM_LAUNDRY_SHORTFALL:

  vmd_laundry_request = VM_LAUNDRY_SHORTFALL,

It should then calculate the shortfall as ((vmd->vmd_free_target - vmd->vmd_free_count) + vmd->vmd_pageout_deficit).

In this case, the core shows:

  vmd_free_target = 0x547a0, 
  vmd_free_count = 0x54a9f,
  vmd_pageout_deficit = 0x0,

Therefore, the "shortfall" would be calculated as -767.

Note that the page daemon uses vmd_free_target as the PID controller set point, but its target may be larger than the instantaneous difference vmd_free_target - vmd_free_count. So if it manages to free enough pages to satisfy vm_paging_target(), but not enough to satisfy the PID controller target, it'll trigger the laundry thread's shortfall mode, which then does nothing (unless pageout_deficit happens to be bigger than the negative difference). This suggests to me that the page daemon should be storing the value of page_shortage in the vmd_shortage field. In other words, the page daemon has failed to meet its target by page_shortage pages, and the laundry thread should try and make up that difference.

@gallatin said something similar. I tried using page_shortage instead of the PID controller target. In the artificial stress test we used to recreate this, the machine encountered significant problems due to failed memory allocations. In that respect, using page_shortage seemed to perform much worse than the version I've put in this review. @gallatin made the point that this version will probably over-estimate the number of pages which need to be laundered in normal circumstances. However, it appears there exist conditions where using page_shortage will not launder enough pages quickly enough.

I'm not married to this exact mechanism for calculating the shortfall. Indeed, I agree with @gallatin that this mechanism will probably over-estimate the number of pages which need to be laundered. However, the question then becomes: what is a better metric?

@kib said above that it may be better to launder pages earlier rather than later. If that is true, it might be better to overestimate the shortfall than risk underestimating it. OTOH, I suspect it is possible that overestimating the shortfall too much would cause its own performance problems.

Assuming my understanding is right, I'm surprised that you're seeing shortfall requests so regularly. Isn't the inactive queue usually quite large in steady state operation?

Yes, that is true; however, we admittedly don't know what is triggering this. It seems possible that this is caused by one of the periodic processes we run suddenly taking too much memory too quickly. Also, anecdotally, this seems to occur at least some of the time during the window where we are downloading content and writing it to the disk, rather than primarily serving content from the server. So, it is possible that this is occurring at a time when the inactive queue is relatively low. However, that is all conjecture.

Note that the page daemon uses vmd_free_target as the PID controller set point, but its target may be larger than the instantaneous difference vmd_free_target - vmd_free_count. So if it manages to free enough pages to satisfy vm_paging_target(), but not enough to satisfy the PID controller target, it'll trigger the laundry thread's shortfall mode, which then does nothing (unless pageout_deficit happens to be bigger than the negative difference). This suggests to me that the page daemon should be storing the value of page_shortage in the vmd_shortage field. In other words, the page daemon has failed to meet its target by page_shortage pages, and the laundry thread should try and make up that difference.

As I noted earlier, using page_shortage seemed to perform much worse than the version I've put in this review. However, it occurred to me that there could be cases where vmd_free_target - vmd_free_count might be very small, leading the shortfall laundry to be less aggressive than the background laundry. So, I'm testing out a version which calculates the shortfall using imax(page_shortage, vmd_background_launder_target). I'll report back once I have some data on this.

In D23517#516336, @jtl wrote:

So, the problem manifests as the laundry queue steadily growing without any swapping in response?

We noticed this when we tried enabling encrypted swap. On the console, we see a string of processes killed due to the server being out of memory. Then, eventually, the watchdog timer fires and kills the system. I don't know what triggers this cycle. It seems to happen on a small percentage of systems hours to days after boot. To the best of my knowledge, we have not been able to observe a system descend into this naturally.

We tried to recreate the problem by artificially creating memory pressure. We ran a program that allocates a lot of memory (equal to the sum of the free and inactive sizes) and sequentially writes to each page in a loop. When we did this, we saw:

  1. The laundry size is growing.
  2. Processes are continually killed due to low memory.
  3. Finally, the watchdog kicks in and reboots the system.

Now, we may have recreated a different problem that has similar symptoms. But, at minimum, it seems like this is showing *a* bug.

I've done tests like this in the past. Indeed, if the program is dirtying pages quickly enough we might give up and attempt an OOM kill, but we should definitely be targeting the runaway process first. It's possible that we may have swapped its kernel stack out, in which case I believe we have to swap it back in to reclaim anything, since reclamation happens during SIGKILL-triggered process exit. (I don't see offhand why another thread couldn't call pmap_remove_pages() on the target process before it is swapped back in, though.)

I have not tried testing with a GELI-backed swap device though. Presumably you were using one? Do you see a difference in behaviour when swap is unencrypted?

And the theory is that shortfall requests are interrupting periodic background laundering, and the laundry thread fails to do anything because it frequently reads vmd_free_count after the page daemon has brought it up above vmd_free_target?

My theory is that the PID controller is predicting we will need additional free pages in the future, so it sets a large shortfall. vm_pageout_scan_inactive() is able to clear some of the shortfall and leaves vmd_free_count above vmd_free_target, but it is not able to completely satisfy all the shortfall predicted by the PID controller. So, vm_pageout_scan_inactive() still thinks there is a shortfall and sets VM_LAUNDRY_SHORTFALL.

When the laundry thread wakes up, it sees VM_LAUNDRY_SHORTFALL. However, when it calculates the shortfall, it comes up with a negative number. This interrupts the current background laundering (setting the target to 0), but functionally does not actually cause a "shortfall" laundering. I suspect the most likely scenario is that it starts a new background laundering; however, it looks like it is possible that there could be a perverse set of circumstances where the laundering actually goes back to idle.

This theory is supported by a core I gathered during an artificial recreation.

The laundry thread was sitting idle:

Thread 592 (Thread 100424):
#0  sched_switch (td=0xfffff8000483f6e0, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/sched_ule.c:2148
#1  0xffffffff8059d562 in mi_switch (flags=0x104) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:525
#2  0xffffffff805ec93f in sleepq_timedwait (wchan=<unavailable>, pri=<unavailable>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/subr_sleepqueue.c:705
#3  0xffffffff8059cd7b in _sleep (ident=0xffffffff80a6a89e <pause_wchan+14>, lock=0x0, priority=0x0, wmesg=0xffffffff809dfe8d "laundp", sbt=0x1999997c, pr=0x0, flags=0x100) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:216
#4  0xffffffff8059d284 in pause_sbt (wmesg=<unavailable>, sbt=0x1999997c, pr=<unavailable>, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:335
#5  0xffffffff8089ed10 in vm_pageout_laundry_worker (arg=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:1114
#6  0xffffffff8055660e in fork_exit (callout=0xffffffff8089dff0 <vm_pageout_laundry_worker>, arg=0x0, frame=0xfffffe038cd66480) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_fork.c:1058
#7  <signal handler called>

When it wakes up, it should find there is a shortfall, since vmd->vmd_laundry_request was set to VM_LAUNDRY_SHORTFALL:

  vmd_laundry_request = VM_LAUNDRY_SHORTFALL,

It should then calculate the shortfall as ((vmd->vmd_free_target - vmd->vmd_free_count) + vmd->vmd_pageout_deficit).

In this case, the core shows:

  vmd_free_target = 0x547a0, 
  vmd_free_count = 0x54a9f,
  vmd_pageout_deficit = 0x0,

Therefore, the "shortfall" would be calculated as -767.

Note that the page daemon uses vmd_free_target as the PID controller set point, but its target may be larger than the instantaneous difference vmd_free_target - vmd_free_count. So if it manages to free enough pages to satisfy vm_paging_target(), but not enough to satisfy the PID controller target, it'll trigger the laundry thread's shortfall mode, which then does nothing (unless pageout_deficit happens to be bigger than the negative difference). This suggests to me that the page daemon should be storing the value of page_shortage in the vmd_shortage field. In other words, the page daemon has failed to meet its target by page_shortage pages, and the laundry thread should try and make up that difference.

@gallatin said something similar. I tried using page_shortage instead of the PID controller target. In the artificial stress test we used to recreate this, the machine encountered significant problems due to failed memory allocations. In that respect, using page_shortage seemed to perform much worse than the version I've put in this review. @gallatin made the point that this version will probably over-estimate the number of pages which need to be laundered in normal circumstances. However, it appears there exist conditions where using page_shortage will not launder enough pages quickly enough.

As in, you still see OOM kills with this solution?

I'm not married to this exact mechanism for calculating the shortfall. Indeed, I agree with @gallatin that this mechanism will probably over-estimate the number of pages which need to be laundered. However, the question then becomes: what is a better metric?

@kib said above that it may be better to launder pages earlier rather than later. If that is true, it might be better to overestimate the shortfall than risk underestimating it. OTOH, I suspect it is possible that overestimating the shortfall too much would cause its own performance problems.

The shortfall state basically means, launder pages until we hit the free page target. (With pauses in between scans to allow pending swap operations to finish.) I don't think there's a lot of harm in overshooting that target. We could perhaps ask the PID controller for its target instead.

Assuming my understanding is right, I'm surprised that you're seeing shortfall requests so regularly. Isn't the inactive queue usually quite large in steady state operation?

Yes, that is true; however, we admittedly don't know what is triggering this. It seems possible that this is caused by one of the periodic processes we run suddenly taking too much memory too quickly. Also, anecdotally, this seems to occur at least some of the time during the window where we are downloading content and writing it to the disk, rather than primarily serving content from the server. So, it is possible that this is occurring at a time when the inactive queue is relatively low. However, that is all conjecture.

I think when @gallatin and I discussed this in the past, I suggested trying setting vm.panic_on_oom=1. Is that not viable?

In D23517#516336, @jtl wrote:

So, the problem manifests as the laundry queue steadily growing without any swapping in response?

We noticed this when we tried enabling encrypted swap. On the console, we see a string of processes killed due to the server being out of memory. Then, eventually, the watchdog timer fires and kills the system. I don't know what triggers this cycle. It seems to happen on a small percentage of systems hours to days after boot. To the best of my knowledge, we have not been able to observe a system descend into this naturally.

We tried to recreate the problem by artificially creating memory pressure. We ran a program that allocates a lot of memory (equal to the sum of the free and inactive sizes) and sequentially writes to each page in a loop. When we did this, we saw:

  1. The laundry size is growing.
  2. Processes are continually killed due to low memory.
  3. Finally, the watchdog kicks in and reboots the system.

Now, we may have recreated a different problem that has similar symptoms. But, at minimum, it seems like this is showing *a* bug.

I've done tests like this in the past. Indeed, if the program is dirtying pages quickly enough we might give up and attempt an OOM kill, but we should definitely be targeting the runaway process first. It's possible that we may have swapped its kernel stack out, in which case I believe we have to swap it back in to reclaim anything, since reclamation happens during SIGKILL-triggered process exit. (I don't see offhand why another thread couldn't call pmap_remove_pages() on the target process before it is swapped back in, though.)

I have not tried testing with a GELI-backed swap device though. Presumably you were using one? Do you see a difference in behaviour when swap is unencrypted?

I'll need to recreate this. ISTR that the process was running for 10 or more seconds while it was being cleaned up. I don't recall the wait chan. But, I need to recreate it to gather more info.

Note that the page daemon uses vmd_free_target as the PID controller set point, but its target may be larger than the instantaneous difference vmd_free_target - vmd_free_count. So if it manages to free enough pages to satisfy vm_paging_target(), but not enough to satisfy the PID controller target, it'll trigger the laundry thread's shortfall mode, which then does nothing (unless pageout_deficit happens to be bigger than the negative difference). This suggests to me that the page daemon should be storing the value of page_shortage in the vmd_shortage field. In other words, the page daemon has failed to meet its target by page_shortage pages, and the laundry thread should try and make up that difference.

@gallatin said something similar. I tried using page_shortage instead of the PID controller target. In the artificial stress test we used to recreate this, the machine encountered significant problems due to failed memory allocations. In that respect, using page_shortage seemed to perform much worse than the version I've put in this review. @gallatin made the point that this version will probably over-estimate the number of pages which need to be laundered in normal circumstances. However, it appears there exist conditions where using page_shortage will not launder enough pages quickly enough.

As in, you still see OOM kills with this solution?

Actually, the root file system went away.

(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=61942f 0 0 0 0 0
(nda0:nvme0:0:0:1): CAM status: Resource Unavailable
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
GEOM_MIRROR: Cannot update metadata on disk nda0p2 (error=5).
GEOM_MIRROR: Device prim: provider nda0p2 disconnected.
g_vfs_done():mirror/prim[WRITE(offset=327680, length=65536)]error = 45
g_vfs_done():mirror/prim[WRITE(offset=393216, length=65536)]error = 45
g_vfs_done():mirror/prim[WRITE(offset=524288, length=65536)]error = 45
g_vfs_done():mirror/prim[WRITE(offset=786432, length=65536)]error = 45
g_vfs_done():mirror/prim[WRITE(offset=851968, length=65536)]error = 45
g_vfs_done():mirror/prim[WRITE(offset=917504, length=65536)]error = 45
g_vfs_done():mirror/prim[WRITE(offset=983040, length=65536)]error = 45
g_vfs_done():mirror/prim[WRITE(offset=1048576, length=65536)]error = 45
g_vfs_done():mirror/prim[WRITE(offset=1114112, length=65536)]error = 45
vm_fault: pager read error, pid 1191 (sshd)

I recognize that this is - on the surface - not obviously traceable to a memory shortage. However, I suspect it is related as I saw other memory-shortage messages intermingled with CAM events like this in the minutes leading up to the root drive disappearing. (And, there is no other indication that there is any problem with any of the many disks associated with these errors.)

I'm not married to this exact mechanism for calculating the shortfall. Indeed, I agree with @gallatin that this mechanism will probably over-estimate the number of pages which need to be laundered. However, the question then becomes: what is a better metric?

@kib said above that it may be better to launder pages earlier rather than later. If that is true, it might be better to overestimate the shortfall than risk underestimating it. OTOH, I suspect it is possible that overestimating the shortfall too much would cause its own performance problems.

The shortfall state basically means, launder pages until we hit the free page target. (With pauses in between scans to allow pending swap operations to finish.) I don't think there's a lot of harm in overshooting that target. We could perhaps ask the PID controller for its target instead.

That's what I was attempting to do with this review; however, I may have missed some subtleties. In particular, the PID controller target could be up to 10ms old by the time the laundry thread reads it.

Assuming my understanding is right, I'm surprised that you're seeing shortfall requests so regularly. Isn't the inactive queue usually quite large in steady state operation?

Yes, that is true; however, we admittedly don't know what is triggering this. It seems possible that this is caused by one of the periodic processes we run suddenly taking too much memory too quickly. Also, anecdotally, this seems to occur at least some of the time during the window where we are downloading content and writing it to the disk, rather than primarily serving content from the server. So, it is possible that this is occurring at a time when the inactive queue is relatively low. However, that is all conjecture.

I think when @gallatin and I discussed this in the past, I suggested trying setting vm.panic_on_oom=1. Is that not viable?

It caught things when we were in low memory situations, but not necessarily when we're in this unrecoverable death spiral where processes are regularly being killed. I modified the vm.panic_on_oom handling so it takes a count and panics after that number of processes are killed. That let me get a better view of what's happening. However, this recreation still occurred with a good deal of artificial changes, so it doesn't answer all of the questions about what is happening in a more realistic case.

In this artificial recreation, it looks like:

  1. The laundry thread is waiting for the last swap I/O to complete:
Thread 602 (Thread 100426):
#0  sched_switch (td=0xfffff8000b0b3000, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/sched_ule.c:2148
#1  0xffffffff8059d562 in mi_switch (flags=0x104) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:525
#2  0xffffffff8059cd9b in _sleep (ident=0xffffffff80ec34c4 <nsw_wcount_async>, lock=0xffffffff80ec34d0 <swbuf_mtx>, priority=0x54, wmesg=0xffffffff809fb018 "swbufa", sbt=0x0, pr=0x0, flags=0x100) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:220
#3  0xffffffff8086fabc in swap_pager_putpages (object=0xfffff804559d3318, ma=0xfffffe038cd70050, count=0x5, flags=<optimized out>, rtvals=0xfffffe038cd6fec0) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/swap_pager.c:1464
#4  0xffffffff8089bad7 in vm_pager_put_pages (object=<optimized out>, m=0xfffffe038cd70050, count=0x5, flags=0x4, rtvals=0xfffffe038cd6fec0) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_pager.h:135
#5  vm_pageout_flush (mc=0xfffffe038cd70050, count=0x5, flags=0x4, mreq=0x0, prunlen=0x0, eio=0x0) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:509
#6  0xffffffff8089f4e5 in vm_pageout_cluster (m=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:460
#7  0xffffffff8089e81a in vm_pageout_clean (m=<optimized out>, numpagedout=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:708
#8  vm_pageout_launder (vmd=<optimized out>, launder=0x10ee61, in_shortfall=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:934
#9  vm_pageout_laundry_worker (arg=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_pageout.c:1116
#10 0xffffffff8055660e in fork_exit (callout=0xffffffff8089e000 <vm_pageout_laundry_worker>, arg=0x0, frame=0xfffffe038cd70480) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_fork.c:1058
#11 <signal handler called>
  2. The geli worker is waiting to malloc memory:
Thread 472 (Thread 100688):
#0  sched_switch (td=0xfffff801e3ef5000, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/sched_ule.c:2148
#1  0xffffffff8059d562 in mi_switch (flags=0x104) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:525
#2  0xffffffff8059cd9b in _sleep (ident=0xffffffff80c48340 <vm_min_domains>, lock=0xffffffff80c44300 <vm_domainset_lock>, priority=0x254, wmesg=0xffffffff809dfe72 "vmwait", sbt=0x0, pr=0x0, flags=0x100) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:220
#3  0xffffffff80898340 in vm_wait_doms (wdoms=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_page.c:3142
#4  0xffffffff8087b047 in keg_fetch_slab (keg=<optimized out>, zone=0xfffff800022ab000, rdomain=0xffffffff, flags=0x2) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma_core.c:3226
#5  zone_import (arg=<optimized out>, bucket=0xfffffe09ed1942a0, max=<optimized out>, domain=0xffffffff, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma_core.c:3287
#6  0xffffffff80876b29 in zone_alloc_item_locked (zone=0xfffff800022ab000, udata=0x0, domain=<unavailable>, flags=0x2) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma_core.c:3454
#7  0xffffffff80875d41 in uma_zalloc_arg (zone=0xfffff800022ab000, udata=0x0, flags=0x2) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma_core.c:2964
#8  0xffffffff80570fbf in uma_zalloc (zone=0xfffff800022ab000, flags=0x2) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma.h:348
#9  malloc (size=0x1100, mtp=<optimized out>, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_malloc.c:635
#10 0xffffffff805014ff in g_eli_crypto_run (wr=0xfffff8005ea28000, bp=0xfffff8048849f000) at /usr/home/jtl/ocafirmware/FreeBSD/sys/geom/eli/g_eli_privacy.c:265
#11 0xffffffff804fadf8 in g_eli_worker (arg=0xfffff8005ea28000) at /usr/home/jtl/ocafirmware/FreeBSD/sys/geom/eli/g_eli.c:692
#12 0xffffffff8055660e in fork_exit (callout=0xffffffff804faab0 <g_eli_worker>, arg=0xfffff8005ea28000, frame=0xfffffe09ed194480) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_fork.c:1058
#13 <signal handler called>

The VM system is obviously under severe pressure:

(kgdb) print vm_min_waiters
$5 = 0xb
(kgdb) print vm_severe_waiters
$6 = 0x0
(kgdb) print vm_dom[0].vmd_free_count
$7 = 0xf3b6
(kgdb) print vm_dom[0].vmd_free_min  
$8 = 0x1900d
(kgdb) print vm_dom[0].vmd_free_severe
$9 = 0xf174

I still think we should update the laundry thread to be more aggressive in shortfall situations, to make sure we stay ahead of the demand. However, I think I'm also going to propose some changes to the geli code to make it less likely it will need to allocate space for encrypted swap in the critical path.

In D23517#517180, @jtl wrote:
  2. The geli worker is waiting to malloc memory:
Thread 472 (Thread 100688):
#0  sched_switch (td=0xfffff801e3ef5000, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/sched_ule.c:2148
#1  0xffffffff8059d562 in mi_switch (flags=0x104) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:525
#2  0xffffffff8059cd9b in _sleep (ident=0xffffffff80c48340 <vm_min_domains>, lock=0xffffffff80c44300 <vm_domainset_lock>, priority=0x254, wmesg=0xffffffff809dfe72 "vmwait", sbt=0x0, pr=0x0, flags=0x100) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_synch.c:220
#3  0xffffffff80898340 in vm_wait_doms (wdoms=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/vm_page.c:3142
#4  0xffffffff8087b047 in keg_fetch_slab (keg=<optimized out>, zone=0xfffff800022ab000, rdomain=0xffffffff, flags=0x2) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma_core.c:3226
#5  zone_import (arg=<optimized out>, bucket=0xfffffe09ed1942a0, max=<optimized out>, domain=0xffffffff, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma_core.c:3287
#6  0xffffffff80876b29 in zone_alloc_item_locked (zone=0xfffff800022ab000, udata=0x0, domain=<unavailable>, flags=0x2) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma_core.c:3454
#7  0xffffffff80875d41 in uma_zalloc_arg (zone=0xfffff800022ab000, udata=0x0, flags=0x2) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma_core.c:2964
#8  0xffffffff80570fbf in uma_zalloc (zone=0xfffff800022ab000, flags=0x2) at /usr/home/jtl/ocafirmware/FreeBSD/sys/vm/uma.h:348
#9  malloc (size=0x1100, mtp=<optimized out>, flags=<optimized out>) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_malloc.c:635
#10 0xffffffff805014ff in g_eli_crypto_run (wr=0xfffff8005ea28000, bp=0xfffff8048849f000) at /usr/home/jtl/ocafirmware/FreeBSD/sys/geom/eli/g_eli_privacy.c:265
#11 0xffffffff804fadf8 in g_eli_worker (arg=0xfffff8005ea28000) at /usr/home/jtl/ocafirmware/FreeBSD/sys/geom/eli/g_eli.c:692
#12 0xffffffff8055660e in fork_exit (callout=0xffffffff804faab0 <g_eli_worker>, arg=0xfffff8005ea28000, frame=0xfffffe09ed194480) at /usr/home/jtl/ocafirmware/FreeBSD/sys/kern/kern_fork.c:1058
#13 <signal handler called>

For this case, perhaps geli should stop using M_WAITOK altogether and return ENOMEM up the stack. GEOM already handles ENOMEM, and that handling is somewhat better than just sleeping.

I am researching and typing a larger note regarding this but I wanted to make these comments first.

There was a bug for a few weeks after my vm_fault refactoring that triggered OOM more aggressively. I would make sure this isn't compounding the problem.

vm_pageout_deficit is really a small-memory/large-allocation fix. Imagine that you need to allocate more than free_target pages. The pageout daemon will never free enough on its own, so the deficit bumps the target up. In practice it should hold relatively small values, or zero. Now that we have cell phones with gigabytes of memory it feels a little anachronistic, but Alan preferred to keep it and it's not a major hindrance.

As noted elsewhere, encrypted geli needs to do better with out-of-memory conditions. We may also need to detect that it is doing work on behalf of the swap pager and allow it to dig a little deeper so that it can succeed. This is somewhat systemic: it is hard to identify all of the allocations that may result in memory deadlock if they do not succeed.

I also want to point out that the laundry may hold vnode-backed or swap-backed data, and we likely want different regulation policies for each. I believe these two cases are different because vnodes are final storage, where the dirty pages are likely to end up regardless of pressure; there is little reason to delay writing them as long as there is for swap. In contrast, it is completely reasonable for an application consisting mostly of heap to consume a large fraction of memory with dirty pages without incurring the write overhead to the swap device. In practice this is not too important because msync runs periodically to flush dirty file data, and filesystems aren't placing dirty data in the laundry queue from the file I/O path.

I think a large part of the problem here is that the laundry thread is trying to balance too many plates in its control system. It really should be responsible for keeping the write workload sane, while the inactive queue scan should be responsible for freeing memory. I'm not sure we made a clean enough separation of responsibilities when we split the two, and it has created an awkward interdependency.

I also note that vm_page_setdirty() does not place pages on the laundry queue so we may be underestimating its size depending on how the test program generates dirty pages.

In D23517#517851, @jeff wrote:

> I am researching and typing a larger note regarding this but I wanted to make these comments first.

> There was a bug for a few weeks after my vm_fault refactoring that triggered OOM more aggressively. I would make sure this isn't compounding the problem.

To be specific, the affected revisions are r357026-r357252.

> vm_pageout_deficit is really a small-memory/large-allocation fix. Imagine that you need to allocate more than free_target pages. The pageout daemon will never free enough on its own, so the deficit bumps the target up. In practice it should hold relatively small values, or zero. Now that we have cell phones with gigabytes of memory it feels a little anachronistic, but Alan preferred to keep it and it's not a major hindrance.

> As noted elsewhere, encrypted geli needs to do better with out-of-memory conditions. We may also need to detect that it is doing work on behalf of the swap pager and allow it to dig a little deeper so that it can succeed. This is somewhat systemic: it is hard to identify all of the allocations that may result in memory deadlock if they do not succeed.

> I also want to point out that the laundry may hold vnode-backed or swap-backed data, and we likely want different regulation policies for each. I believe these two cases are different because vnodes are final storage, where the dirty pages are likely to end up regardless of pressure; there is little reason to delay writing them as long as there is for swap. In contrast, it is completely reasonable for an application consisting mostly of heap to consume a large fraction of memory with dirty pages without incurring the write overhead to the swap device. In practice this is not too important because msync runs periodically to flush dirty file data, and filesystems aren't placing dirty data in the laundry queue from the file I/O path.

> I think a large part of the problem here is that the laundry thread is trying to balance too many plates in its control system. It really should be responsible for keeping the write workload sane, while the inactive queue scan should be responsible for freeing memory. I'm not sure we made a clean enough separation of responsibilities when we split the two, and it has created an awkward interdependency.

I tend to agree. The situation is worse now that the page daemon is regulated by the PID controller. When the inactive queue scan target was static the laundry thread's shortfall handling basically just tried to ensure that the next inactive queue scan would be able to meet its target. I think it makes more sense now to make the page daemon control the laundry thread target, and allow the laundry thread to handle regulation, so this patch seems to head in the right direction. That said, in my own testing the laundry thread can saturate an SSD with 200-300 MB/s of swap writes even with fairly conservative targets. If making the laundry thread's shortfall handling more aggressive works as a short-term solution, I am not opposed to it, but I am suspicious of GELI and of recent regressions like the OOM bug mentioned above.

> I also note that vm_page_setdirty() does not place pages on the laundry queue so we may be underestimating its size depending on how the test program generates dirty pages.