Differential D19390

Split kernel and user wire accounting.
Abandoned, Public

Authored by markj on Feb 27 2019, 6:04 PM.

Details

Reviewers

alc
kib
jeff

Summary

Pages are user-wired by mlock(2) and indirectly by mlockall(2).
User-wired pages carry a reference in m->wire_count to ensure that they
are not freed by the page daemon; unlike kernel-wired pages they also
have a flag set in m->flags.

The main motivation for the change is to provide accounting of
user-wired pages. mlock(2) currently fails if the total number of
wired pages exceeds vm_page_wired_max; with this change only the number
of user-wired pages is compared against this limit. I also "fixed"
mlockall(2) to respect this limit, as documented in its man page. In
particular, mlockall(MCL_CURRENT) uses a racy check to determine if the
corresponding wiring would exceed the user-wired limit, and wirings
triggered by the MAP_WIREFUTURE flag are subject to the same limit.
The changes to make mlockall(2) respect the limit should perhaps be
committed separately since they have the potential to introduce
regressions. In this change, old_mlock is extended to disable the
global limit as well.
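
The effect of comparing only user-wired pages against the limit can be illustrated with a toy model. This is a sketch, not kernel code: the function names and all page counts are hypothetical, and the exact form of the comparison (current count plus request against the maximum) is an assumption made for illustration.

```python
def mlock_allowed_old(total_wired, request, wired_max):
    """Pre-change behaviour as described: all wired pages count against the limit."""
    return total_wired + request <= wired_max


def mlock_allowed_new(user_wired, request, wired_max):
    """Post-change behaviour as described: only user-wired pages count."""
    return user_wired + request <= wired_max


# Hypothetical system: the kernel has wired many pages (for example a large
# in-kernel cache), while userland has wired very few.
kernel_wired = 900
user_wired = 10
wired_max = 1000
request = 200

old_ok = mlock_allowed_old(kernel_wired + user_wired, request, wired_max)
new_ok = mlock_allowed_new(user_wired, request, wired_max)
```

With these numbers the old check rejects the request because kernel wirings have nearly exhausted the limit, while the new check admits it.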

Only managed physical pages are counted in the user wire count. This is
for two reasons: first, unmanaged or fictitious pages are unevictable
regardless of whether they are user-wired, so logically they should not
be counted against the limit. Second, we use pmap_page_wired_mappings()
to determine whether all user wirings are removed before clearing
PG_USER_WIRED; this does not work for unmanaged pages.

A couple of new KPIs are introduced: vm_page_wire_user() and
vm_page_unwire_user(). These respectively set and clear PG_USER_WIRED.
An alternative would be to account user wirings in the pmap layer, but
that approach is more complicated and I don't see any real benefits.
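
A minimal model of the accounting described above may help. It is a sketch under stated assumptions, not the patch itself: `Page`, `Counters`, `wire_user()` and `unwire_user()` are hypothetical Python stand-ins, the flag value is arbitrary, and `wired_mappings` merely imitates what pmap_page_wired_mappings() would report.

```python
PG_USER_WIRED = 0x1  # hypothetical bit value, chosen only for this model


class Page:
    def __init__(self, managed=True):
        self.managed = managed
        self.flags = 0
        self.wired_mappings = 0  # stands in for pmap_page_wired_mappings(m)


class Counters:
    def __init__(self):
        self.user_wire_count = 0  # stands in for v_user_wire_count


def wire_user(cnt, page):
    """Record one user-wired mapping; flag and count a managed page once."""
    page.wired_mappings += 1
    if page.managed and not (page.flags & PG_USER_WIRED):
        page.flags |= PG_USER_WIRED
        cnt.user_wire_count += 1


def unwire_user(cnt, page):
    """Drop one user-wired mapping; clear the flag when none remain."""
    page.wired_mappings -= 1
    # Unmanaged pages never have the flag set, so they are never consulted here.
    if (page.flags & PG_USER_WIRED) and page.wired_mappings == 0:
        page.flags &= ~PG_USER_WIRED
        cnt.user_wire_count -= 1


cnt = Counters()
shared = Page()                # managed page wired through two mappings
device = Page(managed=False)   # unmanaged page, never counted

wire_user(cnt, shared)
wire_user(cnt, shared)
wire_user(cnt, device)
count_after_wire = cnt.user_wire_count

unwire_user(cnt, shared)
count_after_first_unwire = cnt.user_wire_count
unwire_user(cnt, shared)
count_after_second_unwire = cnt.user_wire_count
```

In this model a managed page contributes to the count once regardless of how many mappings wire it, and the unmanaged page contributes nothing.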

There are some corner cases in this diff:

In sys_mlockall() we compare map->size with the global limit, but this check is racy since the map size may change before we call vm_map_wire(). The per-process RLIMIT_MEMLOCK check has the same race.
Suppose a range of VAs is user-wired, and then kernel-wired, e.g., by vslock(). Suppose then that the range is user-unwired. For m in the range, pmap_page_wired_mappings(m) > 0 even though we removed the last user wiring, so v_user_wire_count will not be decremented until the kernel wiring is removed.
As mentioned above, unmanaged and fictitious pages are not counted towards the total number of user-wired pages. However, when checking the size of a mapping against the system limit, we do not exclude unmanaged mappings.

I believe these cases are not likely to be problematic in practice.
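
The second corner case (a user wiring followed by a vslock()-style kernel wiring) can be traced with a small simulation. This is a hypothetical model, not the kernel's logic: it counts wirings rather than wired PTEs, and the names `wire()` and `unwire()` are invented for the sketch.

```python
class Page:
    def __init__(self):
        self.user_wired = False   # stands in for PG_USER_WIRED
        self.wired_mappings = 0   # stands in for pmap_page_wired_mappings(m)


user_wire_count = 0  # stands in for v_user_wire_count


def wire(page, user):
    """Add a wiring; only a user wiring sets the flag and bumps the count."""
    global user_wire_count
    page.wired_mappings += 1
    if user and not page.user_wired:
        page.user_wired = True
        user_wire_count += 1


def unwire(page):
    """Remove a wiring; the flag is cleared only when no wiring is left."""
    global user_wire_count
    page.wired_mappings -= 1
    if page.user_wired and page.wired_mappings == 0:
        page.user_wired = False
        user_wire_count -= 1


m = Page()
trace = []
wire(m, user=True)
trace.append(user_wire_count)   # after mlock()
wire(m, user=False)
trace.append(user_wire_count)   # after a vslock()-style kernel wiring
unwire(m)
trace.append(user_wire_count)   # after munlock(): still counted
unwire(m)
trace.append(user_wire_count)   # after the kernel wiring is released
```

The third entry of `trace` shows the lag: the page stays counted as user-wired after the munlock() step, and the count drops only once the kernel wiring goes away.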

The diff does not update any documentation yet; I will work on that if
there are no major objections to the approach taken here.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 22789
Build 21881: arc lint + arc unit

Event Timeline

markj created this revision. Feb 27 2019, 6:04 PM

Harbormaster completed remote builds in B22789: Diff 54489. Feb 27 2019, 6:04 PM

markj added reviewers: alc, kib, jeff. Feb 27 2019, 6:05 PM

kib added inline comments. Feb 28 2019, 5:02 PM

sys/vm/vm_fault.c
1198	I do not understand why you user-wire the copied page based on the user-wire status of the original page. I believe that you should only take into account the user-wire flag as passed by the caller of vm_fault.
sys/vm/vm_page.c
3593	Did you try to set PG_USER_WIRED in some wrapper for pmap_enter/pmap_enter_object instead?
3695	Can this function be called from vm_page_unwire() automatically?

markj added inline comments. Feb 28 2019, 7:54 PM

sys/vm/vm_fault.c
1198	I think you're right.
sys/vm/vm_page.c
3593	Note that pmap_enter_object()/enter_quick() do not create PTEs with PG_W set. I don't see much advantage to doing that. The pmap layer does not distinguish between user and kernel wirings. To make a wrapper we'd need to introduce PMAP_ENTER_USER_WIRED or so, but it would not be consumed by the MD code, so it would be a bit strange. Are you concerned that some code may be calling pmap_enter(PMAP_ENTER_WIRED) without setting PG_USER_WIRED appropriately?
3695	I don't think it's a good idea: pmap_page_wired_mappings() can be quite expensive. In the case where an mlock()ed page is frequently evicted from the buffer cache, the vm_page_unwire() will end up frequently calling into the pmap. IMO "wiring" is somewhat overloaded in the VM. It might be better to replace the "user_wire" identifier with "mlock".

[I am not suggesting to redo the patch, only discussing the possible approaches].

I somewhat dislike the per-page USER_WIRE flag, which requires so much hand-holding. By definition, user wire count is exactly equal to the count of pages in wired user map entries. Can we simply account there, in vm_map_wire(), vm_map_entry_wire_failure(), and vm_map_unwire(), without even doing anything with the page wire attribute?

It would somewhat over-count for private read-only mappings, like libc text mapped into more than one vmspace, and cannot distinguish between managed (that you count now) and unmanaged pages. But isn't it so much simpler that the difference does not matter?
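
To make the over-counting concrete, here is a toy comparison of the two accounting schemes. It is only a sketch: address spaces are modelled as lists of wired map entries, each entry as a list of hypothetical page names, and neither function corresponds to real kernel code.

```python
def count_by_map_entries(address_spaces):
    """Sum the sizes (in pages) of all user-wired map entries."""
    return sum(len(pages) for entries in address_spaces for pages in entries)


def count_by_pages(address_spaces):
    """Count each distinct physical page once, however many entries wire it."""
    return len({p for entries in address_spaces for pages in entries for p in pages})


# Hypothetical layout: two processes wire the same 4 shared text pages
# (think of libc text), and each also wires 2 private pages.
libc_text = ["libc0", "libc1", "libc2", "libc3"]
proc_a = [libc_text, ["a0", "a1"]]
proc_b = [libc_text, ["b0", "b1"]]

per_entry = count_by_map_entries([proc_a, proc_b])
per_page = count_by_pages([proc_a, proc_b])
```

Map-entry accounting reports 12 pages here while only 8 distinct pages are wired; the difference is exactly the shared text counted a second time.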

In D19390#416489, @kib wrote:

[I am not suggesting to redo the patch, only discussing the possible approaches].

I'm happy to rewrite the patch if the result may be cleaner.

I somewhat dislike the per-page USER_WIRE flag, which requires so much hand-holding. By definition, user wire count is exactly equal to the count of pages in wired user map entries. Can we simply account there, in vm_map_wire(), vm_map_entry_wire_failure(), and vm_map_unwire(), without even doing anything with the page wire attribute?

I don't disagree that it is somewhat ugly, but is it really that much hand-holding? The only places that require modification are the fault handler and code paths which (un)wire pages according to a user's request. Most of the complexity is in CoW paths where the user-wiring state must be propagated, and even there it is not too bad.

It would somewhat over-count for private read-only mappings, like libc text mapped into more than one vmspace, and cannot distinguish between managed (that you count now) and unmanaged pages. But isn't it so much simpler that the difference does not matter?

That would be fine for the immediate goal from D19247, but I don't really like it - I believe there is value in providing accurate system-wide accounting for mlock().

Looking further ahead, I would like us to stop counting wired pages (i.e., v_wire_count) directly. The reason is that I want struct vm_page's wire_count to become a reference counter, and it makes no sense to touch the per-CPU v_wire_count cacheline every time a transient reference is obtained or released. For accounting purposes, kernel memory usage can be defined to be v_page_count - sum(pagequeue sizes) - v_user_wire_count. Moreover, the notion of wiring for kernel pages is ambiguous: a page allocated with vm_page_alloc() is non-pageable regardless of whether VM_ALLOC_WIRED is specified. (For example, the amd64 pmap allocates some pages without VM_ALLOC_WIRED.) So what exactly does it mean for a page to be wired? IMO the concept should be defined at the mapping level, not at the level of the physical page; to effect wiring, the VM can use the reference counter to ensure that the page is not reclaimable, and the PG_USER_WIRED flag caches the information that at least one non-kernel mapping of the page is wired. This is similar in principle to, e.g., PGA_WRITEABLE.

If you still prefer the simpler route, I will implement it, but I am also trying to influence the long-term direction here.
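
For a sense of scale, the proposed formula can be evaluated on made-up numbers. Everything below is hypothetical: the page counts are invented and the choice of queues in the sum is an assumption; only the shape of the expression comes from the comment above.

```python
def kernel_pages(v_page_count, pagequeue_sizes, v_user_wire_count):
    """v_page_count - sum(pagequeue sizes) - v_user_wire_count, in pages."""
    return v_page_count - sum(pagequeue_sizes.values()) - v_user_wire_count


example = kernel_pages(
    v_page_count=1_000_000,
    pagequeue_sizes={"active": 300_000, "inactive": 400_000, "laundry": 50_000},
    v_user_wire_count=20_000,
)
```

Here 1,000,000 - 750,000 - 20,000 leaves 230,000 pages attributed to the kernel, computed on demand without maintaining a separate running counter.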

sys/vm/vm_fault.c
1198	I thought about this some more and no longer agree. Consider the case where a breakpoint is placed on an mlock()ed text page mapped read-execute. Because the page is not writeable, the mlock() call will not copy the page. proc_rwmem() will trigger a COW fault, and the snippet above ensures that the new copy inherits the wire state of the old copy. This copy will remain mapped even when the breakpoint is removed, I believe. Later, when the range is munlock()ed, vm_object_unwire() will encounter the page copy. Since the map entry is user-wired, the page copy must be user-wired as well. However, the code above is not right. We must only copy the user-wiring state if the map entry is user-wired. If it is system-wired (e.g., by vslock()), the boolean `wired` variable will be true, and if the original page is user-wired by a different mapping, we will incorrectly migrate the user-wired state to the copy.
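
The distinction drawn in this comment can be written as two small predicates. This is a hypothetical sketch of the decision only; the function and parameter names are invented and nothing here is taken from the actual fault handler.

```python
def copy_inherits_user_wiring(entry_user_wired, orig_page_user_wired):
    """Rule argued for above: propagate only when this map entry is user-wired."""
    return entry_user_wired and orig_page_user_wired


def copy_inherits_user_wiring_flawed(fault_wired, orig_page_user_wired):
    """The criticized rule: keyed on the fault's `wired` boolean, which is
    also true for system wirings such as vslock()."""
    return fault_wired and orig_page_user_wired


# Breakpoint case: the entry is user-wired, so the copy must be user-wired.
breakpoint_case = copy_inherits_user_wiring(True, True)

# vslock() case: the entry is only system-wired, but the original page is
# user-wired through some other mapping.
vslock_fixed = copy_inherits_user_wiring(False, True)
vslock_flawed = copy_inherits_user_wiring_flawed(True, True)
```

The two rules agree in the breakpoint case and differ only in the vslock() case, where the flawed rule migrates the user-wired state to the copy.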

In D19390#416570, @markj wrote:

IMO the concept should be defined at the mapping level, not at the level of the physical page; to effect wiring, the VM can use the reference counter to ensure that the page is not reclaimable, and the PG_USER_WIRED flag caches the information that at least one non-kernel mapping of the page is wired. This is similar in principle to, e.g., PGA_WRITEABLE.

This could be interpreted as an argument for maintaining PG_USER_WIRE in the pmap layer rather than MI code. That would probably be cleaner, though the diff would be larger.

emaste added a subscriber: emaste. Mar 5 2019, 3:38 PM

To get accurate user wiring accounting at the pmap layer, you need both the page flag (PG_USER_WIRED ?) and the mapping flag (PG_W), or yet another counter on the page. This seems to be too intrusive for such a small feature.
Your formula for 'kernel-used pages' (v_page_count - sum(pagequeue sizes) - v_user_wire_count) does not account for the pageable kernel mappings, i.e. pipe buffers and exec strings. Also, it does not behave well with the pages wired by the buffer cache.
Since I mentioned buffer cache, suppose userspace wired a file mapping which has buffers instantiated for the pages. If the file is unmapped (or the mapping is munlocked) while buffers are still not reclaimed, then it seems that the current patch could leak the user wiring.

About your notion that it is undesirable to touch v_wire_count on each transient reference, you can do that only on the 0->1 and 1->0 edges, and I do not see how you could avoid that.

My impression, after listing all the items above, is that all approaches except accounting at pmap are error-prone, either by leak or by overcounting, or by both. Then I really do not see much sense in doing a much more complicated patch instead of an almost trivial accounting of the total size of wired map entries.
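
The point about updating the global counter only on the 0->1 and 1->0 edges can be sketched as follows. This is a toy model with invented names (`SharedCounter`, `acquire()`, `release()`); it only counts how often the shared counter would be written.

```python
class SharedCounter:
    """Stands in for a global counter such as v_wire_count."""

    def __init__(self):
        self.value = 0
        self.touches = 0  # number of times the shared counter was written

    def add(self, delta):
        self.value += delta
        self.touches += 1


class Page:
    def __init__(self):
        self.refs = 0


def acquire(page, counter):
    page.refs += 1
    if page.refs == 1:      # 0 -> 1 edge
        counter.add(1)


def release(page, counter):
    page.refs -= 1
    if page.refs == 0:      # 1 -> 0 edge
        counter.add(-1)


def transient_refs(page, counter, n):
    """Take and drop a short-lived reference n times."""
    for _ in range(n):
        acquire(page, counter)
        release(page, counter)


# Page that already holds a long-lived reference: no edges are crossed.
held, held_counter = Page(), SharedCounter()
acquire(held, held_counter)
transient_refs(held, held_counter, 1000)

# Page with no other references: every transient reference crosses both edges.
idle, idle_counter = Page(), SharedCounter()
transient_refs(idle, idle_counter, 1000)
```

Edge-triggered updates cost nothing extra for a page that is already referenced, but a page with no other references still touches the shared counter twice per transient reference.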

In D19390#416928, @kib wrote:

To get accurate user wiring accounting at the pmap layer, you need both the page flag (PG_USER_WIRED ?) and the mapping flag (PG_W), or yet another counter on the page. This seems to be too intrusive for such a small feature.

The only entity in the kernel which creates system-wired mappings in a pmap != kernel_pmap is vslock(). I think we can either ignore this special case or change it to not be special (e.g., count vslock() wirings as user wirings).

We don't strictly need even PG_USER_WIRED. All it does is cache information available at the pmap layer.

Your formula for 'kernel-used pages' (v_page_count - sum(pagequeue sizes) - v_user_wire_count) does not account for the pageable kernel mappings, i.e. pipe buffers and exec strings. Also, it does not behave well with the pages wired by the buffer cache.

Ok, let's say "non-pageable kernel memory" instead of "kernel-used pages." Pipe buffers have their own accounting and exec strings account for only a small fraction of memory anyway.

Since I mentioned buffer cache, suppose userspace wired a file mapping which has buffers instantiated for the pages. If the file is unmapped (or the mapping is munlocked) while buffers are still not reclaimed, then it seems that the current patch could leak the user wiring.

How? Buffer cache mappings are not managed, and vm_object_unwire() always updates user-wiring state based on the number of wired mappings of the page. Indeed, a similar leak exists with vslock(), mentioned in the description, but because it is transient I do not think it is a major problem. Arguably pages wired by sysctl_wire_old_buffer() should be included in v_user_wire_count even if vslock() is not subject to that limit.

About your notion that it is undesirable to touch v_wire_count on each transient reference, you can do that only on the 0->1 and 1->0 edges, and I do not see how you could avoid that.

I think that v_wire_count should not be updated directly. It can be computed indirectly on demand.

My impression, after listing all the items above, is that all approaches except accounting at pmap are error-prone, either by leak or by overcounting, or by both. Then I really do not see much sense in doing a much more complicated patch instead of an almost trivial accounting of the total size of wired map entries.

In D19390#416973, @markj wrote:

In D19390#416928, @kib wrote:

To get accurate user wiring accounting at the pmap layer, you need both the page flag (PG_USER_WIRED ?) and the mapping flag (PG_W), or yet another counter on the page. This seems to be too intrusive for such a small feature.

The only entity in the kernel which creates system-wired mappings in a pmap != kernel_pmap is vslock(). I think we can either ignore this special case or change it to not be special (e.g., count vslock() wirings as user wirings).

We don't strictly need even PG_USER_WIRED. All it does is cache information available at the pmap layer.

Your formula for 'kernel-used pages' (v_page_count - sum(pagequeue sizes) - v_user_wire_count) does not account for the pageable kernel mappings, i.e. pipe buffers and exec strings. Also, it does not behave well with the pages wired by the buffer cache.

Ok, let's say "non-pageable kernel memory" instead of "kernel-used pages." Pipe buffers have their own accounting and exec strings account for only a small fraction of memory anyway.

Since I mentioned buffer cache, suppose userspace wired a file mapping which has buffers instantiated for the pages. If the file is unmapped (or the mapping is munlocked) while buffers are still not reclaimed, then it seems that the current patch could leak the user wiring.

How? Buffer cache mappings are not managed, and vm_object_unwire() always updates user-wiring state based on the number of wired mappings of the page. Indeed, a similar leak exists with vslock(), mentioned in the description, but because it is transient I do not think it is a major problem. Arguably pages wired by sysctl_wire_old_buffer() should be included in v_user_wire_count even if vslock() is not subject to that limit.

I see, I mis-remembered the patch as predicating vm_page_unwire_user() on m->wire_count, which it does not.

About your notion that it is undesirable to touch v_wire_count on each transient reference, you can do that only on the 0->1 and 1->0 edges, and I do not see how you could avoid that.

I think that v_wire_count should not be updated directly. It can be computed indirectly on demand.

My impression, after listing all the items above, is that all approaches except accounting at pmap are error-prone, either by leak or by overcounting, or by both. Then I really do not see much sense in doing a much more complicated patch instead of an almost trivial accounting of the total size of wired map entries.

sys/vm/vm_glue.c
187	Why did you remove the user wire count from the left side of '>'? Also, if any page in the range is already wired, we overcount there.
sys/vm/vm_mmap.c
1066	Again, same note as for vslock: if doing precise per-page user wiring accounting, it is incorrect to reject based on the vm_map_entries size. I am not saying that this is unacceptable (what I propose takes this to the extreme).
sys/vm/vm_object.c
1972	I think you need pmap_unwire_user() there.
sys/vm/vm_page.c
3702	Should this check become an assertion instead?

In light of r345382, I don't think this approach can work as-is: vm_page_unwire_user() uses the number of wired mappings to determine whether a page is user wired, but apparently it is possible to remove mappings of wired pages from a user pmap without unwiring.

To make progress I might have to follow the accounting approach that you suggested earlier. But, what exactly is the purpose of max_wired? Is it a seatbelt to ensure that a programming error doesn't lock all of RAM? RLIMIT_MEMLOCK serves that purpose. Is it to protect against a DoS from a malicious application? It is quite broken for that case since mlockall() does not apply the limit, and there are trivial ways for a user application to consume large amounts of kernel memory, bypassing max_wired.

max_wired is a frequent source of complaints when using the ZFS ARC, since that wires a large fraction of the system's RAM. Many tutorials and scripts just set it to -1, effectively disabling it.

markj added inline comments. Mar 22 2019, 3:21 PM

sys/vm/vm_glue.c
187	Note it's the same check as before. There was an old #if 0 block. Indeed, the check is not precise, but I'm not intentionally changing it here.
sys/vm/vm_object.c
1972	Indeed, but then the page is not counted as user-wired, so the max_wired limit may be bypassed.

In D19390#421392, @markj wrote:

In light of r345382, I don't think this approach can work as-is: vm_page_unwire_user() uses the number of wired mappings to determine whether a page is user wired, but apparently it is possible to remove mappings of wired pages from a user pmap without unwiring.

To make progress I might have to follow the accounting approach that you suggested earlier. But, what exactly is the purpose of max_wired? Is it a seatbelt to ensure that a programming error doesn't lock all of RAM? RLIMIT_MEMLOCK serves that purpose. Is it to protect against a DoS from a malicious application? It is quite broken for that case since mlockall() does not apply the limit, and there are trivial ways for a user application to consume large amounts of kernel memory, bypassing max_wired.

Yes, I believe max_wired is a safety belt. RLIMIT_MEMLOCK cannot substitute for max_wired, because RLIMIT_MEMLOCK is a per-process limit. For a user, the total limit is nprocs(per user) * RLIMIT_MEMLOCK, which is both too large in total and too low for an individual process.
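
A quick calculation shows why the per-process limit does not bound a user's total. The figures are hypothetical (1000 processes per user, a 64 MiB RLIMIT_MEMLOCK) and the helper name is invented for the sketch.

```python
def per_user_lock_ceiling(nprocs_per_user, rlimit_memlock_bytes):
    """Upper bound on locked memory for one user: nprocs * RLIMIT_MEMLOCK."""
    return nprocs_per_user * rlimit_memlock_bytes


ceiling = per_user_lock_ceiling(
    nprocs_per_user=1000,
    rlimit_memlock_bytes=64 * 1024 * 1024,
)
ceiling_gib = ceiling / 2**30
```

With these numbers a single user could in principle lock 62.5 GiB across processes, even though no one process may lock more than 64 MiB.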

max_wired is a frequent source of complaints when using the ZFS ARC, since that wires a large fraction of the system's RAM. Many tutorials and scripts just set it to -1, effectively disabling it.