
tmpfs: Include inactive memory in "available" calculation
Accepted · Public

Authored by jhibbits on Wed, Nov 5, 8:04 PM.

Details

Reviewers
markj
kib
mjg
Summary

Accounting only for available swap and explicitly free memory ignores
potentially free (or freeable) memory, making the calculation very
pessimistic. Since the laundry thread counts inactive memory as "free"
in its watermark calculation, it should be safe to do so for tmpfs as
well.
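For illustration only, the change amounts to adding the inactive page count to the terms already used in tmpfs_mem_avail() (sys/fs/tmpfs/tmpfs_subr.c). The counters below (vm_free_count(), swap_pager_avail, vm_inactive_count()) are existing kernel symbols, but the body is a simplified sketch, not the diff itself:

size_t
tmpfs_mem_avail(void)
{
	/*
	 * Sketch only: treat explicitly free pages, free swap, and now
	 * the inactive queue as memory available to tmpfs.
	 */
	return (vm_free_count() + swap_pager_avail + vm_inactive_count());
}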

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

I do not object. tmpfs_mem_avail() is too naive with or without this augmentation.

Without it, it might be too pessimistic; with the change, it would be too optimistic. It effectively assumes that all inactive pages can be either freed or swapped out. Setting aside the feasibility of such an action, note that there might not be enough swap to write out all inactive anonymous pages.

This revision is now accepted and ready to land. Thu, Nov 6, 3:58 AM
sys/fs/tmpfs/tmpfs_subr.c
460

I think it doesn't make sense to include both swap_pager_avail and vm_inactive_count(). Inactive pages may be dirty, in which case they require swap space in order to be reclaimed, so here we're effectively double-counting.

It would be better if we had separate queues/accounting for file- and swap-backed pages. That is, it should be okay to include inactive filesystem pages in this calculation.

sys/fs/tmpfs/tmpfs_subr.c
460

I agree this makes it very optimistic, and it would be better to have separate file- and swap-backed accounting, but at this point we don't have that, as far as I'm aware. I chose vm_inactive_count() because it's used as "free" for the laundry calculation, so should suffice here as well. We can't completely disregard the free swap, so until we do have separate accounting, would it be better to use MAX(swap_pager_avail, vm_inactive_count())?
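For illustration, the MAX() variant floated above would look roughly like the following sketch (MAX() is the sys/param.h macro; the body is simplified and not a proposed patch):

size_t
tmpfs_mem_avail(void)
{
	/*
	 * Sketch of the alternative: free swap and the inactive queue
	 * overlap (dirty inactive pages need swap to be reclaimed), so
	 * take the larger of the two rather than their sum.
	 */
	return (vm_free_count() +
	    MAX((size_t)swap_pager_avail, (size_t)vm_inactive_count()));
}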

sys/fs/tmpfs/tmpfs_subr.c
460

I chose vm_inactive_count() because it's used as "free" for the laundry calculation, so should suffice here as well

What exactly are you referring to?

We can't completely disregard the free swap, so until we do have separate accounting, would it be better to use MAX(swap_pager_avail, vm_inactive_count())?

I think that'd be better than just adding them together, yes.

sys/fs/tmpfs/tmpfs_subr.c
460

I chose vm_inactive_count() because it's used as "free" for the laundry calculation, so should suffice here as well

What exactly are you referring to?

In vm_pageout_laundry_worker(), nclean is the sum of the domain's free count and the domain's inactive queue count.
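Paraphrasing from memory (the field names may not match the tree exactly), the computation being referred to is roughly:

/* sys/vm/vm_pageout.c, vm_pageout_laundry_worker(), paraphrased: */
nclean = vmd->vmd_free_count +
    vmd->vmd_pagequeues[PQ_INACTIVE].pq_cnt;
ndirty = vmd->vmd_pagequeues[PQ_LAUNDRY].pq_cnt;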

sys/fs/tmpfs/tmpfs_subr.c
460

Sure, but that's a metric used to decide whether we should launder pages to keep pace with the activity of the system. It's not saying that that many pages are "free" per se, just that we should launder more frequently as that metric gets smaller. IMO it's not right to use the same threshold here and point to vm_pageout_laundry_worker() as justification without explaining why that makes sense.

Here we're trying to ensure that tmpfs isn't consuming more RAM than is available. The current limit is too conservative, and your proposed limit is too relaxed. Consider a system without swap: is it right to treat all inactive pages as reclaimable? Maybe yes, maybe no, we just don't have enough information to say. Those inactive pages may well belong to tmpfs itself.

To come up with a better policy, it might help to give some concrete examples where the current limit is inappropriate.

sys/fs/tmpfs/tmpfs_subr.c
460

That all makes sense, yes, and I agree this isn't the same scenario as the laundry thread. At HPE we sometimes run into problems where we temporarily hit the 95% default threshold of (swap + free) (and several of our devices don't even have swap), so config commits fail, but retrying a few seconds later succeeds without any other intervention (so the laundry thread likely ran in between). Looking at the top output, we see a very high Inactive count at the time of failure (>2GB on a 4GB system), while tmpfs accounts for less than 20MB of that.

On a system with no swap, would dirty non-filesystem pages ever get moved to the Inactive queue? There's nowhere for them to go from there.

sys/fs/tmpfs/tmpfs_subr.c
460

On a system with no swap, would dirty non-filesystem pages ever get moved to the Inactive queue?

They may initially end up there, yes. If the page daemon scans those pages, they'll get moved to the laundry queue. They effectively stay there forever until they are freed.

sys/fs/tmpfs/tmpfs_subr.c
460

Or until more swap is added or freed. In fact, I do not remember whether a periodic rescan is still done for pages stuck in the laundry queue.

sys/fs/tmpfs/tmpfs_subr.c
460

There is no periodic rescan. If we try to launder a swap-backed page and there are no swap devices, we will move the page to PQ_UNSWAPPABLE. If a swap device is added later, we will move everything back to the laundry queue.

Going back to the original problem, I think splitting each page queue into two, for swap-backed pages and filesystem-backed pages, will help. This can be implemented fairly cheaply, and we have lots of unused space in the queue field of struct vm_page. Then we need a policy which defines how to reclaim pages from each queue. Naively, I would reclaim in proportion to the relative sizes of the queues, e.g., if PQ_INACTIVE_SWAP contains 10 pages and PQ_INACTIVE_VNODE contains 90 pages and there is a shortage of 10 pages, then 1 swap page and 9 vnode pages should be reclaimed.
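To make the proportional policy concrete, here is a hypothetical helper; PQ_INACTIVE_SWAP, PQ_INACTIVE_VNODE, and every name below are illustrative only and do not exist in the tree today:

/*
 * Hypothetical: given the lengths of a swap-backed and a vnode-backed
 * inactive queue and a page shortage, split the reclaim target in
 * proportion to the queue sizes.  With 10 swap-backed pages, 90
 * vnode-backed pages, and a shortage of 10, the targets are 1 and 9.
 */
static void
inactive_reclaim_targets(u_int swap_cnt, u_int vnode_cnt, u_int shortage,
    u_int *swap_target, u_int *vnode_target)
{
	u_int total;

	total = swap_cnt + vnode_cnt;
	if (total == 0) {
		*swap_target = *vnode_target = 0;
		return;
	}
	*swap_target = (u_int)(((uint64_t)shortage * swap_cnt) / total);
	*vnode_target = shortage - *swap_target;
}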

sys/fs/tmpfs/tmpfs_subr.c
460

For splitting, do you mean to split based on the object type (OBJT_SWAP) or on anonymity (OBJ_ANON)? I would expect that policies for named swap objects should not differ from the policies for vnode objects, but named swap objects still take swap space.

If going with typed inactive queues, it sounds to me as if we need to maintain LRU for the queues as well as for the queues' contents. I.e., if a page is moved to the tail of a specific queue, that queue should lose its position in the order of processing by the pagedaemon thread.

sys/fs/tmpfs/tmpfs_subr.c
460

For splitting, do you mean to split based on the object type (OBJT_SWAP) or on anonymity (OBJ_ANON)? I would expect that policies for named swap objects should not differ from the policies for vnode objects, but named swap objects still take swap space.

Good point. For the purpose of counting pages that are reclaimable without consuming swap, the split should be determined by OBJ_SWAP, not OBJ_ANON. But yes, I am unsure about the implications of treating, e.g., tmpfs pages differently from UFS pages.

OTOH, if we simply reclaim from both queues in proportion to the relative sizes of the queues, I am not sure that it really matters very much.

If going with typed inactive queues, it sounds to me as if we need to maintain LRU for the queues as well as for the queues' contents. I.e., if a page is moved to the tail of a specific queue, that queue should lose its position in the order of processing by the pagedaemon thread.

Yes, I think I agree. At the beginning of a scan we could put a marker at the end of each queue, and refuse to scan past the marker until we have reached it in both queues.
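As a purely illustrative sketch of the marker idea (none of the names or types below exist in the tree; plain TAILQs stand in for the real page queues): each pass inserts a sentinel at the tail of both typed queues, interleaves the two queues, and stops at each queue's sentinel so that neither queue is scanned past the point where the pass began.

#include <sys/queue.h>
#include <stdbool.h>

struct page {
	TAILQ_ENTRY(page) link;
	bool marker;
};
TAILQ_HEAD(pageq, page);

static void
scan_pass(struct pageq *qa, struct pageq *qb, void (*scan)(struct page *))
{
	struct page ma = { .marker = true }, mb = { .marker = true };
	struct page *p;
	bool done_a = false, done_b = false;

	/* Place a sentinel at the current tail of each queue. */
	TAILQ_INSERT_TAIL(qa, &ma, link);
	TAILQ_INSERT_TAIL(qb, &mb, link);

	/* Alternate between the queues until both sentinels are reached. */
	while (!done_a || !done_b) {
		if (!done_a) {
			p = TAILQ_FIRST(qa);
			TAILQ_REMOVE(qa, p, link);
			if (p == &ma)
				done_a = true;
			else
				scan(p);
		}
		if (!done_b) {
			p = TAILQ_FIRST(qb);
			TAILQ_REMOVE(qb, p, link);
			if (p == &mb)
				done_b = true;
			else
				scan(p);
		}
	}
}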