Make swapoff reliable.
ClosedPublic
Actions

Authored by kib on Aug 29 2016, 8:02 AM.

Details

Reviewers

alc
markj

Commits

rS305129: Make swapoff reliable.

Summary

swap_pager_swapoff() does trylock for pagein, which means that either io to md(4) over swap, or intensive page faults can prevent swapoff() from any progress. Then the retry < 100 check fails and machine panics.

If trylock fails, acquire the object lock in the blockable way and restart the hash bucket walk. I kept retries logic for now.

Another improvement there would be to check the ENOMEM conditions inside swap_pager_swapoff() before each page-in attempt. similar to the ENOMEM check in swapoff_one(), but this is non-trivial because swapoff_one() fills sw_blist, which must be undone on failure.

Test Plan

Peter has specific test that triggers the panic in swap_pager_swapoff(): failed to locate XXX swap blocks. The patched system survives it.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kib updated this revision to Diff 19782.Aug 29 2016, 8:02 AM

kib retitled this revision from to Make swapoff reliable..

kib updated this object.

kib edited the test plan for this revision. (Show Details)

kib added reviewers: alc, markj.

kib set the repository for this revision to rS FreeBSD src repository - subversion.

kib added a subscriber: pho.

Herald added a subscriber: imp. · View Herald TranscriptAug 29 2016, 8:02 AM

kib updated this object.Aug 29 2016, 8:03 AM

This seems ok to me. Is there a specific reason to keep the retry logic?

sys/vm/swap_pager.c
1691 ↗	(On Diff #19782)	Missing period.

This revision is now accepted and ready to land.Aug 29 2016, 5:36 PM

In D7688#159797, @markj wrote:

This seems ok to me. Is there a specific reason to keep the retry logic?

One vague reason is that I am not completely convinced that retry per bucket logic is enough.

But more concrete thought is that this logic would allow for more reliability later. Please look at the swp_pager_force_pagein(). Right now this function simply panic if any error is returned by getpages(). By passing the error and retrying if needed we could improve the case.

On my previous job we did see "swap_pager: indefinite wait buffer" on heavily loaded machines.

kib marked an inline comment as done.Aug 29 2016, 6:24 PM

emaste added a subscriber: emaste.Aug 30 2016, 2:31 PM

alc accepted this revision.Aug 31 2016, 12:15 AM

alc edited edge metadata.

Closed by commit rS305129: Make swapoff reliable. (authored by kib). · Explain WhyAug 31 2016, 2:50 PM

This revision was automatically updated to reflect the committed changes.