Page MenuHomeFreeBSD

Use system wiring in vmm(4).
AbandonedPublic

Authored by markj on Apr 26 2019, 5:51 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Jan 14, 12:03 AM
Unknown Object (File)
Fri, Jan 10, 7:23 PM
Unknown Object (File)
Sun, Jan 5, 12:18 PM
Unknown Object (File)
Sat, Jan 4, 12:12 PM
Unknown Object (File)
Oct 3 2024, 1:28 AM
Unknown Object (File)
Sep 21 2024, 8:12 AM
Unknown Object (File)
Sep 8 2024, 10:09 AM
Unknown Object (File)
Sep 5 2024, 3:55 PM
Subscribers

Details

Reviewers
kib
rgrimes
Group Reviewers
bhyve
Summary

With D19908 we are checking all user wirings against the
vm_max_user_wire limit, whereas previously it applied only to mlock(2)
callers. In particular, without this diff, wired bhyve VMs will be
subject to the limit, which is undesireable. Change vmm(8) to mark
wired segments as system-wired instead.

Note that the VM's vmspace is not a process' vmspace; it is created and
managed by vmm(4). Thus I suspect that system wiring is conceptually
more correct anyway.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 24012
Build 22899: arc lint + arc unit

Event Timeline

But bhyve' guest memory is allocated on behalf of the user created the vm, and it is wired on some conditions for the user' utilization. I do think that accounting it as user-wired is correct.

In D20065#431511, @kib wrote:

But bhyve' guest memory is allocated on behalf of the user created the vm, and it is wired on some conditions for the user' utilization. I do think that accounting it as user-wired is correct.

Do you agree though that VM wirings should not be subject to the vm_max_user_wired limit? It is easy to add some mechanism to provide an exception for vmm(4).

In D20065#431511, @kib wrote:

But bhyve' guest memory is allocated on behalf of the user created the vm, and it is wired on some conditions for the user' utilization. I do think that accounting it as user-wired is correct.

Do you agree though that VM wirings should not be subject to the vm_max_user_wired limit? It is easy to add some mechanism to provide an exception for vmm(4).

Why not ? One of the purpose of that global limit is to avoid a situation where user wiring removed so much memory from the reusable pool of pages that the system limits become too aggressive. From this PoV, very large vmm wiring must be accounted for.

rgrimes requested changes to this revision.Apr 26 2019, 3:48 PM

I can see both sides of this, can I ask that the bhyve developers group which meets every 2 weeks be given a chance to discuss this and provide feedback on it? Thanks. Many of us well be at BSDCan a day early for bhyvecon if you wish to discuss it in person.

This revision now requires changes to proceed.Apr 26 2019, 3:48 PM

I think that the use cases that wire memory for a guest mean that the host administrator wants to set aside memory just for that. bhyve requires root anyway, but if we didn't require root for bhyve, I think we would still require root for a wired guest image as wiring memory either means you are doing passthrough or you want to hard-reserve resources for a guest. Does root have a different wired memory limit by default that is infinite? If so, then a user wiring might still be ok, but my gut feeling is that a system wiring is more correct.

sys/amd64/vmm/vmm.c
812

If vm_map_remove() doesn't handle this implicitly then it seems like this bug fix of adding a missing vm_map_unwire should go in regardless (and possibly as a separate change?) Hmm, it seems that vm_map_remove() can delete a user wiring implicitly, but it will block for a system wiring, so you only need an explicit unwire for a system wiring?

In D20065#431702, @jhb wrote:

I think that the use cases that wire memory for a guest mean that the host administrator wants to set aside memory just for that.

But how much memory did administrator reserved ? There is no practical limit for the number of processes or vms that user is allowed to create (for the purpose of the discussion, the limit is 'too many').

bhyve requires root anyway, but if we didn't require root for bhyve, I think we would still require root for a wired guest image as wiring memory either means you are doing passthrough or you want to hard-reserve resources for a guest. Does root have a different wired memory limit by default that is infinite? If so, then a user wiring might still be ok, but my gut feeling is that a system wiring is more correct.

There are two limits for user wired memory:

  • per-process RLIMIT_MEMLOCK (default is 64k AFAIR)
  • global user wire limit for everything

The later is actually a safety measure as was noted above, to prevent userspace from deadlock system with no free memory.

IMO it is clearly a user-wire because the lifetime of the wiring is fully controlled by the user. System wires are typically transient (like sysctl copyouts) and much less demanding in the absolute numbers.

In D20065#431534, @kib wrote:
In D20065#431511, @kib wrote:

But bhyve' guest memory is allocated on behalf of the user created the vm, and it is wired on some conditions for the user' utilization. I do think that accounting it as user-wired is correct.

Do you agree though that VM wirings should not be subject to the vm_max_user_wired limit? It is easy to add some mechanism to provide an exception for vmm(4).

Why not ? One of the purpose of that global limit is to avoid a situation where user wiring removed so much memory from the reusable pool of pages that the system limits become too aggressive. From this PoV, very large vmm wiring must be accounted for.

I mean, vmm-wired memory can be accounted in v_user_wire_count (so future mlock() calls may fail, for example), but I do not think it makes sense to allow VM creation to fail if wiring would cross the vm_page_max_user_wired threshold. We do not enforce that limit today, so it would simply be a regression if D19908 is committed. I would simply add another flag to vm_map_wire() which causes it to adjust v_user_wire_count without failing if the threshold is crossed, and use that flag in vmm.

Alternately, we can revisit the vm_page_max_user_wired policy. Today it does not permit wiring of more than 1/3 of physical memory, but obviously we should allow wired VMs to use more than 1/3 of physical memory by default. Or, vmm can use its own accounting for wired pages (e.g., only permit creation of a wired VM if there are enough free pages to avoid crossing the vm_free_min threshold). In any case, today, we do not place any limits on vmm-wired memory; I am simply trying to make a minimal change to avoid introducing a regression.

sys/amd64/vmm/vmm.c
812

Right, the explicit unwire is needed only because I changed vmm to use system wiring.

This block should also be wrapped in a check for VM_MEMMAP_F_WIRED.

I can see both sides of this, can I ask that the bhyve developers group which meets every 2 weeks be given a chance to discuss this and provide feedback on it? Thanks. Many of us well be at BSDCan a day early for bhyvecon if you wish to discuss it in person.

Why can't the discussion happen here?

The change doesn't aim to change bhyve's behaviour or architecture in any user-visible way, it just changes some internal detail in the way vmm interacts with the core kernel. I don't think an in-person meeting is required.

I can see both sides of this, can I ask that the bhyve developers group which meets every 2 weeks be given a chance to discuss this and provide feedback on it? Thanks. Many of us well be at BSDCan a day early for bhyvecon if you wish to discuss it in person.

Why can't the discussion happen here?

Some of it is, but we meet on a cyclic basis to have high bandwidth low delay discussions and I was just asking for the opportunity to allow that discussion to occur. Or if you referring to the in person option that was offered up "if you wish".

The change doesn't aim to change bhyve's behaviour or architecture in any user-visible way, it just changes some internal detail in the way vmm interacts with the core kernel. I don't think an in-person meeting is required.

It was not even being requested, it was being offered up as a possible option.

I mean, vmm-wired memory can be accounted in v_user_wire_count (so future mlock() calls may fail, for example), but I do not think it makes sense to allow VM creation to fail if wiring would cross the vm_page_max_user_wired threshold. We do not enforce that limit today, so it would simply be a regression if D19908 is committed. I would simply add another flag to vm_map_wire() which causes it to adjust v_user_wire_count without failing if the threshold is crossed, and use that flag in vmm.

The fact that we do not enforce this limit today is a bug, and starting enforcing it fixes the bug, as opposed to introduce a regression. The limit prevents a situation where user (administrator) over-consumes the wired memory making the system deadlock. If user does not care, she can increase the limit.

Alternately, we can revisit the vm_page_max_user_wired policy. Today it does not permit wiring of more than 1/3 of physical memory, but obviously we should allow wired VMs to use more than 1/3 of physical memory by default. Or, vmm can use its own accounting for wired pages (e.g., only permit creation of a wired VM if there are enough free pages to avoid crossing the vm_free_min threshold). In any case, today, we do not place any limits on vmm-wired memory; I am simply trying to make a minimal change to avoid introducing a regression.

It is up to administrator to decide how much of the RAM is safe to dedicate to pass-through vm instances.

The current value of 1/3 is somewhat reasonable for low-memory machines, say in 1-2G range, with bunch of mlockall(2) processes (ntpd, amd) running. Situation where the large machine (128+ G) is used for a lot of VMs which are given direct access to PCI devices and leave only several GBs to the host, would require some manual tuning. I do not see this as a regression, but as a restoration of the needed safety check for normal usage scenarios.

In D20065#431852, @kib wrote:

I mean, vmm-wired memory can be accounted in v_user_wire_count (so future mlock() calls may fail, for example), but I do not think it makes sense to allow VM creation to fail if wiring would cross the vm_page_max_user_wired threshold. We do not enforce that limit today, so it would simply be a regression if D19908 is committed. I would simply add another flag to vm_map_wire() which causes it to adjust v_user_wire_count without failing if the threshold is crossed, and use that flag in vmm.

The fact that we do not enforce this limit today is a bug, and starting enforcing it fixes the bug, as opposed to introduce a regression. The limit prevents a situation where user (administrator) over-consumes the wired memory making the system deadlock. If user does not care, she can increase the limit.

I agree with the statements, but how many users are affected by deadlocks today? I have not seen a report of it. Meanwhile, if wired VM creation starts failing once 1/3 of memory is user-wired, I am sure we will see reports of the problem even if the solution is straightforward.

Alternately, we can revisit the vm_page_max_user_wired policy. Today it does not permit wiring of more than 1/3 of physical memory, but obviously we should allow wired VMs to use more than 1/3 of physical memory by default. Or, vmm can use its own accounting for wired pages (e.g., only permit creation of a wired VM if there are enough free pages to avoid crossing the vm_free_min threshold). In any case, today, we do not place any limits on vmm-wired memory; I am simply trying to make a minimal change to avoid introducing a regression.

It is up to administrator to decide how much of the RAM is safe to dedicate to pass-through vm instances.

The current value of 1/3 is somewhat reasonable for low-memory machines, say in 1-2G range, with bunch of mlockall(2) processes (ntpd, amd) running. Situation where the large machine (128+ G) is used for a lot of VMs which are given direct access to PCI devices and leave only several GBs to the host, would require some manual tuning. I do not see this as a regression, but as a restoration of the needed safety check for normal usage scenarios.

It is a regression because reasonable configurations will no longer work out of the box. It is a weakness of FreeBSD that we require manual tuning to enable reasonable behaviour; IMO it's the same sort of problem as the ZFS arc_max tunable, which results in numerous discussions on the lists. In this case, if maintaining the status quo is not an option, it is my preference to revisit the max_user_wired policy instead of forcing bhyve to comply with the existing policy.

In D20065#431852, @kib wrote:

I mean, vmm-wired memory can be accounted in v_user_wire_count (so future mlock() calls may fail, for example), but I do not think it makes sense to allow VM creation to fail if wiring would cross the vm_page_max_user_wired threshold. We do not enforce that limit today, so it would simply be a regression if D19908 is committed. I would simply add another flag to vm_map_wire() which causes it to adjust v_user_wire_count without failing if the threshold is crossed, and use that flag in vmm.

The fact that we do not enforce this limit today is a bug, and starting enforcing it fixes the bug, as opposed to introduce a regression. The limit prevents a situation where user (administrator) over-consumes the wired memory making the system deadlock. If user does not care, she can increase the limit.

I agree with the statements, but how many users are affected by deadlocks today? I have not seen a report of it. Meanwhile, if wired VM creation starts failing once 1/3 of memory is user-wired, I am sure we will see reports of the problem even if the solution is straightforward.

I suspect that what happens is a learn through pain. User configures provisioning a machine with overcommit, and get hang. After that he sees that it was a clear misconfiguration and silently reduces the population.

If the limit is enforced, user would see failed start instead of deadlock, and that probably save him a reboot.

Alternately, we can revisit the vm_page_max_user_wired policy. Today it does not permit wiring of more than 1/3 of physical memory, but obviously we should allow wired VMs to use more than 1/3 of physical memory by default. Or, vmm can use its own accounting for wired pages (e.g., only permit creation of a wired VM if there are enough free pages to avoid crossing the vm_free_min threshold). In any case, today, we do not place any limits on vmm-wired memory; I am simply trying to make a minimal change to avoid introducing a regression.

It is up to administrator to decide how much of the RAM is safe to dedicate to pass-through vm instances.

The current value of 1/3 is somewhat reasonable for low-memory machines, say in 1-2G range, with bunch of mlockall(2) processes (ntpd, amd) running. Situation where the large machine (128+ G) is used for a lot of VMs which are given direct access to PCI devices and leave only several GBs to the host, would require some manual tuning. I do not see this as a regression, but as a restoration of the needed safety check for normal usage scenarios.

It is a regression because reasonable configurations will no longer work out of the box. It is a weakness of FreeBSD that we require manual tuning to enable reasonable behaviour; IMO it's the same sort of problem as the ZFS arc_max tunable, which results in numerous discussions on the lists. In this case, if maintaining the status quo is not an option, it is my preference to revisit the max_user_wired policy instead of forcing bhyve to comply with the existing policy.

May be put the limit enforcement under a knob, and may be even default it to disabled.

Take a different approach: allow vm_map_wire() callers to ignore the
vm_page_max_user_wired limit when user-wiring pages.

In D20065#432561, @kib wrote:

I agree with the statements, but how many users are affected by deadlocks today? I have not seen a report of it. Meanwhile, if wired VM creation starts failing once 1/3 of memory is user-wired, I am sure we will see reports of the problem even if the solution is straightforward.

I suspect that what happens is a learn through pain. User configures provisioning a machine with overcommit, and get hang. After that he sees that it was a clear misconfiguration and silently reduces the population.

If the limit is enforced, user would see failed start instead of deadlock, and that probably save him a reboot.

Ok, fair enough.

Alternately, we can revisit the vm_page_max_user_wired policy. Today it does not permit wiring of more than 1/3 of physical memory, but obviously we should allow wired VMs to use more than 1/3 of physical memory by default. Or, vmm can use its own accounting for wired pages (e.g., only permit creation of a wired VM if there are enough free pages to avoid crossing the vm_free_min threshold). In any case, today, we do not place any limits on vmm-wired memory; I am simply trying to make a minimal change to avoid introducing a regression.

It is up to administrator to decide how much of the RAM is safe to dedicate to pass-through vm instances.

The current value of 1/3 is somewhat reasonable for low-memory machines, say in 1-2G range, with bunch of mlockall(2) processes (ntpd, amd) running. Situation where the large machine (128+ G) is used for a lot of VMs which are given direct access to PCI devices and leave only several GBs to the host, would require some manual tuning. I do not see this as a regression, but as a restoration of the needed safety check for normal usage scenarios.

It is a regression because reasonable configurations will no longer work out of the box. It is a weakness of FreeBSD that we require manual tuning to enable reasonable behaviour; IMO it's the same sort of problem as the ZFS arc_max tunable, which results in numerous discussions on the lists. In this case, if maintaining the status quo is not an option, it is my preference to revisit the max_user_wired policy instead of forcing bhyve to comply with the existing policy.

May be put the limit enforcement under a knob, and may be even default it to disabled.

I did that in the latest patch, defaulting to disabled for now. The default vm_page_max_user_wired limit still does not make sense for VM hosts, though. Maybe a more graceful solution is to permit vm_map_wire() to fail if a page allocation fails; currently it will always block in vm_waitpfault().

To be clear, the limit isn't being applied today? From the existing code it seems like it should be already in force if I'm reading the diff to vm_map_wire correctly? Also, if bhyve is allowed to ignore the limit, the amount of wired memory it uses does mean that small requests by, say, gpg will then fail on the host, yes?

In D20065#432917, @jhb wrote:

To be clear, the limit isn't being applied today? From the existing code it seems like it should be already in force if I'm reading the diff to vm_map_wire correctly?

The diff is being applied on top of D19908. Without that change, the limit does not get applied to bhyve.

Also, if bhyve is allowed to ignore the limit, the amount of wired memory it uses does mean that small requests by, say, gpg will then fail on the host, yes?

Yes, that's right (and true in head today). In the original version of this diff, where bhyve was doing system wiring, that was not the case since system-wired pages are not counted towards the limit.

Well this in any way help with the fact we do not apply any pressure to the zfs arc_max and end up in a memory shortfall that causes high paging? I suspect not as most people do not wire there VM's in memory unless they know that is what they need to do to keep zfs from killing your box with OOM

Well this in any way help with the fact we do not apply any pressure to the zfs arc_max and end up in a memory shortfall that causes high paging? I suspect not as most people do not wire there VM's in memory unless they know that is what they need to do to keep zfs from killing your box with OOM

No, that's unrelated. My aim is to provide separate accounting for user-wired memory as a prerequisite for D19247. This change tries to ensure that applying the user-wiring limit to all vm_map_wire() callers, including bhyve, does not force administrators to manually tune the limit. I'm not sure how wiring a VM would prevent OOM kills.

In D20065#432917, @jhb wrote:

To be clear, the limit isn't being applied today? From the existing code it seems like it should be already in force if I'm reading the diff to vm_map_wire correctly?

The diff is being applied on top of D19908. Without that change, the limit does not get applied to bhyve.

Ok.

Also, if bhyve is allowed to ignore the limit, the amount of wired memory it uses does mean that small requests by, say, gpg will then fail on the host, yes?

Yes, that's right (and true in head today). In the original version of this diff, where bhyve was doing system wiring, that was not the case since system-wired pages are not counted towards the limit.

Ok. I still don't really know how I feel. I think I probably want all long-term wirings to honor the limit as kib@ says and that users will just have to up the limit when using wired VMs with bhyve. If we can get a suitably unique errno value back in whatever creates the wired memory for bhyve and emit a useful error message to the user, that is probably sufficient.

In D20065#432936, @jhb wrote:
In D20065#432917, @jhb wrote:

To be clear, the limit isn't being applied today? From the existing code it seems like it should be already in force if I'm reading the diff to vm_map_wire correctly?

The diff is being applied on top of D19908. Without that change, the limit does not get applied to bhyve.

Ok.

Also, if bhyve is allowed to ignore the limit, the amount of wired memory it uses does mean that small requests by, say, gpg will then fail on the host, yes?

Yes, that's right (and true in head today). In the original version of this diff, where bhyve was doing system wiring, that was not the case since system-wired pages are not counted towards the limit.

Ok. I still don't really know how I feel. I think I probably want all long-term wirings to honor the limit as kib@ says and that users will just have to up the limit when using wired VMs with bhyve. If we can get a suitably unique errno value back in whatever creates the wired memory for bhyve and emit a useful error message to the user, that is probably sufficient.

Ok. I guess we can just try and see if there's any fallout. I do think it means that D19908 is not MFCable.

In D20065#432936, @jhb wrote:
In D20065#432917, @jhb wrote:

To be clear, the limit isn't being applied today? From the existing code it seems like it should be already in force if I'm reading the diff to vm_map_wire correctly?

The diff is being applied on top of D19908. Without that change, the limit does not get applied to bhyve.

Ok.

Also, if bhyve is allowed to ignore the limit, the amount of wired memory it uses does mean that small requests by, say, gpg will then fail on the host, yes?

Yes, that's right (and true in head today). In the original version of this diff, where bhyve was doing system wiring, that was not the case since system-wired pages are not counted towards the limit.

Ok. I still don't really know how I feel. I think I probably want all long-term wirings to honor the limit as kib@ says and that users will just have to up the limit when using wired VMs with bhyve. If we can get a suitably unique errno value back in whatever creates the wired memory for bhyve and emit a useful error message to the user, that is probably sufficient.

Ok. I guess we can just try and see if there's any fallout. I do think it means that D19908 is not MFCable.

If you add a knob to disable accounting for bhyve requests, it can be merged.

In D20065#433092, @kib wrote:
In D20065#432936, @jhb wrote:
In D20065#432917, @jhb wrote:

To be clear, the limit isn't being applied today? From the existing code it seems like it should be already in force if I'm reading the diff to vm_map_wire correctly?

The diff is being applied on top of D19908. Without that change, the limit does not get applied to bhyve.

Ok.

Also, if bhyve is allowed to ignore the limit, the amount of wired memory it uses does mean that small requests by, say, gpg will then fail on the host, yes?

Yes, that's right (and true in head today). In the original version of this diff, where bhyve was doing system wiring, that was not the case since system-wired pages are not counted towards the limit.

Ok. I still don't really know how I feel. I think I probably want all long-term wirings to honor the limit as kib@ says and that users will just have to up the limit when using wired VMs with bhyve. If we can get a suitably unique errno value back in whatever creates the wired memory for bhyve and emit a useful error message to the user, that is probably sufficient.

Ok. I guess we can just try and see if there's any fallout. I do think it means that D19908 is not MFCable.

If you add a knob to disable accounting for bhyve requests, it can be merged.

Why would it need a knob, if this fixes one of our "I have run the system out of some resource dead locks but no error appears" problems I am in support of it. Right now IIRC we just silently dead lock in this situation?