
allow M_USE_RESERVE | M_WAITOK, use that to emulate illumos KM_PUSHPAGE
Needs ReviewPublic

Authored by avg on Feb 27 2017, 10:46 AM.

Details

Reviewers
kib
alc
markj
Summary

Use this combination to mean that if the calling thread runs into a page
shortage and has to wait in VM_WAIT, then it can be woken up and allowed
to try to grab memory when the number of available pages goes above
v_pageout_free_min, as opposed to the much higher (by default) v_free_min
threshold. Thus, M_USE_RESERVE | M_WAITOK sets the same threshold as is
automatically given to the page daemon.

Use M_USE_RESERVE | M_WAITOK for illumos KM_PUSHPAGE, which is used in
the ZFS ARC code and can be critical to pageout and ARC size reduction.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

avg retitled this revision from to allow M_USE_RESERVE | M_WAITOK, use that to emulate illumos KM_PUSHPAGE.
avg updated this object.
avg edited the test plan for this revision. (Show Details)
avg added reviewers: kib, markj.
avg added a subscriber: ZFS.

Is alc subscribed here? I couldn't find him.

sys/amd64/amd64/uma_machdep.c
59

Why did you only handle amd64? At least arm64 also needs this change.

sys/vm/vm_page.c
2508

You do not need vm_wait_flags(); add the argument to vm_wait() and define VM_WAIT as vm_wait(0). Where needed, e.g. in uma_small_alloc(), explicitly call vm_wait().

Kostik, how does the general idea sound to you?
Thank you for the review!

sys/amd64/amd64/uma_machdep.c
59

When I wrote this code there was no arm64.
I'll handle it.

sys/vm/vm_page.c
2508

Okay, I'll do it that way.

In D9823#202569, @avg wrote:

Kostik, how does the general idea sound to you?

I do not have objections to the change. I would care more about ensuring that M_WAITOK | M_USE_RESERVE is used only where really needed and does not become common practice.

At a high level, I think that this change is fine. I'll try to give some further thought to whether there are any low-level problems.

Is this a somewhat extreme interpretation of what KM_PUSHPAGE does in illumos? vm_pageout_free_min is quite small (34 pages), and in particular, is smaller than the item size of some of the zio buf zones. From looking at illumos code, v_free_severe might be a closer approximation. OTOH, we have no existing mechanism to wake up sleepers upon a transition across this threshold.

In D9823#203075, @markj wrote:

Is this a somewhat extreme interpretation of what KM_PUSHPAGE does in illumos?

Yes, it says so in the summary :-)

vm_pageout_free_min is quite small (34 pages), and in particular, is smaller than the item size of some of the zio buf zones. From looking at illumos code, v_free_severe might be a closer approximation. OTOH, we have no existing mechanism to wake up sleepers upon a transition across this threshold.

Yes, it is quite small. But I think that the worst that can happen is cycling back to vm_wait, as only the wakeup threshold is changed.

It is interesting that VM_ALLOC_SYSTEM allocations are allowed to succeed as long as v_free_count is greater than v_interrupt_free_min, that is, we have more than just two free pages. But if a thread hits that bottom, then it is forced to wait until there are v_free_min free pages, which is 25718 on my 16GB system. Quite a drastic difference, IMO. And it is "unfair" in the sense that while the free page count is between v_interrupt_free_min and v_free_min, the original "unlucky" thread would be sleeping in vm_wait while any other thread that makes a memory allocation would be allowed to do so.

In D9823#203128, @avg wrote:
In D9823#203075, @markj wrote:

Is this a somewhat extreme interpretation of what KM_PUSHPAGE does in illumos?

Yes, it says so in the summary :-)

Sure, I really just meant to ask if you agree with "extreme." :)

vm_pageout_free_min is quite small (34 pages), and in particular, is smaller than the item size of some of the zio buf zones. From looking at illumos code, v_free_severe might be a closer approximation. OTOH, we have no existing mechanism to wake up sleepers upon a transition across this threshold.

Yes, it is quite small. But I think that the worst that can happen is cycling back to vm_wait, as only the wakeup threshold is changed.

I'm just concerned by the fact that zio_buf_alloc() and zio_data_buf_alloc() specify KM_PUSHPAGE unconditionally, so it might be easy to get into a state where many threads are waiting in kmem_back() and are competing with the page daemon for free pages. I'm not sure if this ends up being problematic in practice.

It is interesting that VM_ALLOC_SYSTEM allocations are allowed to succeed as long as v_free_count is greater than v_interrupt_free_min, that is, we have more than just two free pages. But if a thread hits that bottom, then it is forced to wait until there are v_free_min free pages, which is 25718 on my 16GB system. Quite a drastic difference, IMO. And it is "unfair" in the sense that while the free page count is between v_interrupt_free_min and v_free_min, the original "unlucky" thread would be sleeping in vm_wait while any other thread that makes a memory allocation would be allowed to do so.

Some of the thresholds defined in vm_pageout_init() deserve to be revisited; that function still has handling for the case v_page_count <= 1024. IIRC, v_free_min is defined to be roughly ~0.5% of the total number of pages, but I think it'd be preferable to use a static value when the amount of system memory is larger than some threshold.

@markj Yes, I agree that the solution is somewhat extreme (is there such a thing?).
It's true that KM_PUSHPAGE is littered all over the ZFS code and all zio buffer allocations are done with it.
I would love to rework the ZFS code so that all allocations are moved outside locked sections, but the code is so entangled that that would be a huge project.

The good news is that the ARC tries to regulate itself and blocks new ARC buffer allocations when memory is low. So, in low-memory situations I would expect that it would be mostly pageout and ARC eviction trying to make temporary memory allocations.

But let me think more about this.
Maybe we don't need such a drastic change.

Finally, I agree with you on re-tuning the thresholds.
Thank you for the analysis!