Details

Reviewers

jhb
allanjude
kib
jtl

Commits

rG2dbc9a388eeb: Fix memory deadlock when GELI partition is used for swap.
rG183f8e1e575b: Externalize nsw_cluster_max and initialize it early.
rGc6213beff4f5: Add flag BIO_SWAP to mark IOs that are associated with swap.

Summary

When deploying encrypted swap to a large number of machines with memory pressure, we noticed periodic hangs where the machine would kill off process after process due to being unable to allocate memory for the process.

It looks like one source of this problem might be the circular dependency inherent in allocating memory in the swap path. When we get low on memory, the VM system tries to free some by swapping pages. However, if we are so low on free pages that GELI allocations block, then the swapout operation cannot complete. This keeps the VM system from being able to free enough memory so the allocation can complete.

To alleviate this, keep a UMA pool at the GELI layer which is used for data buffer allocation in the fast path, and reserve some of that memory for swap operations. Signal to the GELI layer that a BIO is part of a swap operation. If so, use the reserved memory. If the allocation still fails, return ENOMEM instead of blocking.

For non-swap allocations, change the default to using M_NOWAIT. In general, this *should* be better, since it gives upper layers a signal of the memory pressure and a chance to manage their failure strategy appropriately. However, a user can set the kern.geom.eli.blocking_malloc sysctl/tunable to restore the previous M_WAIT strategy.

Test Plan

I torture-tested a system for 48+ hours with enough memory pressure to cause repeated ENOMEM errors. However, we did not see the previous behavior where the machine would kill off process after process due to low memory.

A predecessor to this code is widely deployed and appears to be working correctly.

This code has received substantial testing without any known problems.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jtl created this revision.Apr 13 2020, 7:57 PM

Harbormaster completed remote builds in B30469: Diff 70512.Apr 13 2020, 7:57 PM

I do not have any objections against non-geli part, I do not have intent to look into geli.

The unstaticizing nsw_cluster_max and moving initialization probably should be done as separate commit. Might be, introduction of BIO_SWAP as well.

This revision is now accepted and ready to land.Apr 13 2020, 9:30 PM

jhb added inline comments.Apr 13 2020, 10:21 PM

sys/geom/eli/g_eli.c
679 ↗	(On Diff #70512)	This looks unrelated and should perhaps be a separate change? We should probably use the same condition here we use for ktls and crypto threads which also kick in for aarch64 IIRC.
sys/geom/eli/g_eli_integrity.c
403 ↗	(On Diff #70512)	I really worry that while this may be ok for swap to get `ENOMEM` errors, it will not be ok for a GELI disk holding real data. UFS is rather notorious for not handling `EIO` errors from disks well, and even the fixes that have gone in there are not meant to handle temporary errors but to instead degrade to a permanent read-only mode discarding any in-flight data AFAIK. I think if there are use cases where it might be ok for a disk to be lossy, it might need to be something that is opted into (e.g. an argument to geli(8)'s "attach" or "onetime" commands) that is per-disk and not on by default.

To be clear, you only tested encrypted swap? Did you do any testing with encrypted volumes meant to hold persistent data after a reboot (e.g. holding a UFS volume on a disk) and seeing how it was impacted by ENOMEM?

In D24400#536778, @jhb wrote:

To be clear, you only tested encrypted swap? Did you do any testing with encrypted volumes meant to hold persistent data after a reboot (e.g. holding a UFS volume on a disk) and seeing how it was impacted by ENOMEM?

I only tested swap. I'm happy to switch things so that the current behavior is maintained for non-swap partitions until someone can test it more thoroughly.

sys/geom/eli/g_eli.c
679 ↗	(On Diff #70512)	Ah, yes. Sorry. This should be a separate commit. The commit message will say something like: Declare the GELI threads to be kernel FPU threads. This keeps them from fighting over per-CPU locks in the AESNI code. Among other things, the code at the AESNI layer may hold the per-CPU lock while doing an M_NOWAIT memory allocation. While that should not block, it may take some time to gather and map the memory while the VM system is under enough pressure that it is using the swap code.
sys/geom/eli/g_eli_integrity.c
403 ↗	(On Diff #70512)	OK. If I change the default for kern.geom.eli.blocking_malloc to true, that will maintain the current (blocking) behavior for everything except swap. I really feel strongly that swap should not block. I don't feel strongly about anything else, so it will be easy to convince me to leave the current behavior alone.

Address comments by @jhb:

Delete the code to declare the GELI threads as kernel FPU threads. (I'll open a separate review for that.)
Switch the default to blocking mallocs for everything except swap requests.

This revision now requires review to proceed.Apr 13 2020, 10:53 PM

Harbormaster completed remote builds in B30476: Diff 70525.Apr 13 2020, 10:53 PM

In D24400#536778, @jhb wrote:

To be clear, you only tested encrypted swap? Did you do any testing with encrypted volumes meant to hold persistent data after a reboot (e.g. holding a UFS volume on a disk) and seeing how it was impacted by ENOMEM?

Geom will retry the ENOMEM and the upper layers won't see it. we'll just get bad performance when there's a lot of memory pressure, which is fine.

sys/geom/eli/g_eli_integrity.c
403 ↗	(On Diff #70512)	ENOMEM is fine. geom will retry it. UFS won't see that error, so it won't trigger chs' latest work.

imp added inline comments.Apr 14 2020, 5:28 PM

sys/geom/eli/g_eli_integrity.c
403 ↗	(On Diff #70512)	Though I agree we should make sure that my bold assertions work in practice :)

kib added inline comments.Apr 14 2020, 11:07 PM

sys/geom/eli/g_eli_integrity.c
403 ↗	(On Diff #70512)	ENOMEM processing is very special in the geom. It returns error to the upper layer but also set the pace variable which stops processing of the new bios for some time. I remember this because I was not able to make ENOMEM error work as I need for situations where transient map was full for unmapped bios. I ended up with some other error which avoided this stall mechanics and allowed the transient map to free some space.

With the change I made to keep the current behavior for everything except swap (which is fairly well tested), are there additional concerns?

gbe added a subscriber: gbe.May 1 2020, 1:23 PM

This needs rebase now after the zeroing changes

Jonathan asked me to finalize this review.

This is rebase of the D24400 to recent FreeBSD. It also matches the
code that has been running at Netflix for over a year.

Harbormaster completed remote builds in B41625: Diff 95458.Sep 21 2021, 4:53 PM

Given that @kib accepted the change in the past and that the last version from Jonathan would fallback to malloc() for non-swap requests, I plan to commit this in few days, unless any objections arise.

kib added inline comments.Sep 21 2021, 5:33 PM

sys/vm/swap_pager.c
387	I suspect this is no longer correct.

glebius added inline comments.Sep 22 2021, 2:15 AM

sys/vm/swap_pager.c
387	Oh, thanks for noticing!

Catch up on maxphys being a variable.

Harbormaster completed remote builds in B41637: Diff 95485.Sep 22 2021, 2:24 AM

glebius marked an inline comment as done.Sep 22 2021, 2:24 AM

kib added inline comments.Sep 22 2021, 1:48 PM

sys/vm/swap_pager.c
387	This cannot work either, maxphys is initialized only in init_param2. I am not even sure that non-constant initialization is tolerated (i.e. does it compile)? Why the two first chunks for swap_pager.c are needed at all?

jtl added inline comments.Sep 22 2021, 3:30 PM

sys/vm/swap_pager.c
387	The idea is to make nsw_cluster_max externally visible so we can calculate the maximum-sized swap buffer we are likely to see. See g_eli_init_uma() in g_eli.c

Put back the initialization of nsw_cluster_max. Just leave its
visibiliti for geli.

Harbormaster completed remote builds in B41639: Diff 95507.Sep 22 2021, 4:06 PM

I am fine with the sys/swap_pager.c diff. I did not checked the initialization order, i.e. it is up to you to recheck that geli sees nsw_cluster_max initialized.

In D24400#723433, @kib wrote:

I am fine with the sys/swap_pager.c diff. I did not checked the initialization order, i.e. it is up to you to recheck that geli sees nsw_cluster_max initialized.

Initialization order is actually a problem. The swap pager initializes the variable too late :( Looking for a solution.

Move nsw_cluster_max initialization to swap_pager_init(), which
is called in the very early SYSINIT, but after init_param2()
where maxphys is set.

Harbormaster completed remote builds in B41642: Diff 95512.Sep 22 2021, 6:33 PM

kib added inline comments.Sep 22 2021, 7:04 PM

sys/vm/swap_pager.c
594	I would put there a replacement sentence that nsw_cluster_max is initialized in swap_pager_init()

glebius added inline comments.Sep 24 2021, 12:50 AM

sys/vm/swap_pager.c
594	I'd rather expand the comment in swap_pager_init() to explain why we are doing this early. My change makes this comment block dedicated solely to explanation of what nsw_wcount_async is, which I find good. Putting back a note about nsw_cluster_max into the middle of that explanation would disrupt it. Do you agree?

kib added inline comments.Sep 24 2021, 12:56 AM

sys/vm/swap_pager.c
594	Of course I do not object against expanding the comment in swap_pager_init(). If you do not like putting a reference in the middle of this comment, put it at the end: I want something there that points out for the complete set of nsw_* controls.