Fix vm_waitpfault on numa
ClosedPublic
Actions

Authored by jeff on Jul 9 2018, 5:00 AM.

Details

Reviewers

kib
markj
mjg
alc

Commits

rS338510: Use the correct terminology.
rS338507: Avoid resource deadlocks when one domain has exhausted its memory. Attempt

Summary

waitpfault waited if any domain was in min.

I changed vm_fault to save the object dset to pass to the wait function. I also needed to do the same for the severe check else we'd skip the alloc and spin forever.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jeff created this revision.Jul 9 2018, 5:00 AM

Herald added a subscriber: imp. · View Herald TranscriptJul 9 2018, 5:00 AM

This does not fix the problem for me -- now things start wedging on 'vmwait'.

The affected machine has 512GB of ram and 4 nodes. Using the prog below with 256MB passed (./a.out 262144) reproduces it. mmacy can be prodded to test on the box.

#include <err.h>
#include <stdlib.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	unsigned long i, s;

	s = strtol(argv[1], (char **)NULL, 10);
	s *= (1024 * 1024);
	printf("%s %lu\n", argv[1], s);
	char *p = malloc(s);
	if (p == NULL)
		err(1, "malloc");
	for (i = 0; i < s; i += 4096)
		p[i] = 'A';
	printf("i %p s %d done\n", i , s);
	getchar();
}

In D16191#343412, @mjg wrote:

This does not fix the problem for me -- now things start wedging on 'vmwait'.

The affected machine has 512GB of ram and 4 nodes. Using the prog below with 256MB passed (./a.out 262144) reproduces it. mmacy can be prodded to test on the box.

Can you give me a stack trace? You mean 256GB right?

yes, GB. is there a problem reproducing the bug?

here is a sample process wedged while the testcase is running:

sched_switch() at sched_switch+0x8ad/frame 0xfffffe0302bae740
mi_switch() at mi_switch+0xe1/frame 0xfffffe0302bae770
sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe0302bae7a0
_sleep() at _sleep+0x237/frame 0xfffffe0302bae820
vm_wait_severe() at vm_wait_severe+0x84/frame 0xfffffe0302bae840
vm_forkproc() at vm_forkproc+0x95/frame 0xfffffe0302bae880
fork1() at fork1+0x1869/frame 0xfffffe0302bae930
sys_fork() at sys_fork+0x4c/frame 0xfffffe0302bae980
amd64_syscall() at amd64_syscall+0x369/frame 0xfffffe0302baeab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0302baeab0

no changes to policies and whatnot

In D16191#343542, @mjg wrote:

yes, GB. is there a problem reproducing the bug?

This is a different bug. Most of the vm_wait_min & vm_wait_severe users will need to be modified. I will consider that further. For now I would like to get this patch in.

The basic issue is that there are a handful of places where we test for 'any domain' in min or severe and it may need to be 'every domain we try to allocate from'.

The other thing that would help is to skip first-touch and fall back to round-robin once the domain is below min pages. This would prevent us from falling into severe. I feel reluctant to skip severe tests for things like fork because the pages the process will allocate becomes unpredictable.

markj added inline comments.Jul 9 2018, 11:43 PM

vm/vm_page.c
2963 ↗	(On Diff #45050)	We're leaking the lock if the condition is false, ditto below.

Am I right that each node has 128G, and we completely exhaust one domain in the test, and the machine does not have swap configured ? I do not think that we have a mechanism that would allow us to migrate the pages to other domains in this situation.

As result, at least one domain would be listed as severe and fork correctly blocks the process.

In D16191#343719, @kib wrote:

I do not think that we have a mechanism that would allow us to migrate the pages to other domains in this situation.

I think vm_page_reclaim_contig_domain() provides most of the machinery needed to implement such a mechanism, FWIW. If there is a NUMA allocation policy which only permits allocations from a specific domain set, we would also need a mechanism to indicate that a given page is "pinned" to that set, and cannot be relocated.

In D16191#343766, @markj wrote:

In D16191#343719, @kib wrote:

I do not think that we have a mechanism that would allow us to migrate the pages to other domains in this situation.

I think vm_page_reclaim_contig_domain() provides most of the machinery needed to implement such a mechanism, FWIW. If there is a NUMA allocation policy which only permits allocations from a specific domain set, we would also need a mechanism to indicate that a given page is "pinned" to that set, and cannot be relocated.

Do we need to stop forking if there is a severe domain ? IMO if there is one non-severe then we can allow the fork to proceed. The process with non-fitting policy would be stopped waiting for a free page in the severly-depleted domain anyway. I think this check is more about preventing the kernel allocators from blocking on fork.

In D16191#343788, @kib wrote:

In D16191#343766, @markj wrote:

In D16191#343719, @kib wrote:

I do not think that we have a mechanism that would allow us to migrate the pages to other domains in this situation.

I think vm_page_reclaim_contig_domain() provides most of the machinery needed to implement such a mechanism, FWIW. If there is a NUMA allocation policy which only permits allocations from a specific domain set, we would also need a mechanism to indicate that a given page is "pinned" to that set, and cannot be relocated.

Do we need to stop forking if there is a severe domain ? IMO if there is one non-severe then we can allow the fork to proceed. The process with non-fitting policy would be stopped waiting for a free page in the severly-depleted domain anyway. I think this check is more about preventing the kernel allocators from blocking on fork.

I still think the better question is, why are we allowing a domain preference to push into severe when another domain is completely unused? That's why I think the more general solution is on the allocator side.

For a single page it makes sense to look at the specific domains we may allocate from. But when we fork we have no idea what objects and policies may be involved. So I'm more reluctant to change that.

In D16191#343789, @jeff wrote:

In D16191#343788, @kib wrote:

In D16191#343766, @markj wrote:

In D16191#343719, @kib wrote:

I do not think that we have a mechanism that would allow us to migrate the pages to other domains in this situation.

I think vm_page_reclaim_contig_domain() provides most of the machinery needed to implement such a mechanism, FWIW. If there is a NUMA allocation policy which only permits allocations from a specific domain set, we would also need a mechanism to indicate that a given page is "pinned" to that set, and cannot be relocated.

Do we need to stop forking if there is a severe domain ? IMO if there is one non-severe then we can allow the fork to proceed. The process with non-fitting policy would be stopped waiting for a free page in the severly-depleted domain anyway. I think this check is more about preventing the kernel allocators from blocking on fork.

I still think the better question is, why are we allowing a domain preference to push into severe when another domain is completely unused? That's why I think the more general solution is on the allocator side.

For a single page it makes sense to look at the specific domains we may allocate from. But when we fork we have no idea what objects and policies may be involved. So I'm more reluctant to change that.

This actually strikes me as a scheduling problem. The forking thread should be temporarily migrated to an underutilized domain. That said, the right time to do that migration may be execve(), not fork().

In D16191#343791, @alc wrote:

In D16191#343789, @jeff wrote:

In D16191#343788, @kib wrote:

In D16191#343766, @markj wrote:

In D16191#343719, @kib wrote:

I do not think that we have a mechanism that would allow us to migrate the pages to other domains in this situation.

I think vm_page_reclaim_contig_domain() provides most of the machinery needed to implement such a mechanism, FWIW. If there is a NUMA allocation policy which only permits allocations from a specific domain set, we would also need a mechanism to indicate that a given page is "pinned" to that set, and cannot be relocated.

Do we need to stop forking if there is a severe domain ? IMO if there is one non-severe then we can allow the fork to proceed. The process with non-fitting policy would be stopped waiting for a free page in the severly-depleted domain anyway. I think this check is more about preventing the kernel allocators from blocking on fork.

I still think the better question is, why are we allowing a domain preference to push into severe when another domain is completely unused? That's why I think the more general solution is on the allocator side.

For a single page it makes sense to look at the specific domains we may allocate from. But when we fork we have no idea what objects and policies may be involved. So I'm more reluctant to change that.

This actually strikes me as a scheduling problem. The forking thread should be temporarily migrated to an underutilized domain. That said, the right time to do that migration may be execve(), not fork().

I agree but I still feel uneasy because a forked process may not use the thread's domain policy for all of its allocations. Anyhow, I think we're heading towards the following solutions together:

A fork time domain scheduler. This needs to be more interesting than just 'most free pages' because in a situation with a saturated page cache you won't be able to tell where the most load is.
Conversion of wait_min and wait_severe to accept a mask in all (or almost all) cases so we get specific waits.
A change to the allocator to skip min/severe domains while !min/!severe domains are available in the set. I think this is important. We shouldn't let preference push us into un-pageable shortfall.

Only #1 has any complexity to it. I will update this review with #2 and #3 unless there are objections.

kbowling added a subscriber: kbowling.Jul 19 2018, 6:46 AM

This uses a more correct filter for the vm_wait_severe in vm_glue.c.

I also implemented an approach that will skip domains that are under the min threshold on the first allocation pass. This means that if the allocation policy has a preference it will be ignored if we're under paging pressure for that domain. The pid controlled page daemon should prevent us from getting to min in most scenarios. This change should help prevent single-domain low memory deadlocks by moving allocations that can be satisfied elsewhere.

If the allocation avoidance fails we will still resolve the basic case of this bug with the more selective wait criteria in glue and pfault.

kib added inline comments.Aug 20 2018, 3:15 PM

vm/vm_page.c
2963 ↗	(On Diff #45050)	It still leaks, when msleep() is not executed ?

Fix the lock leak. I would like to get this committed to 12. I have not yet had feedback on the low domain avoidance code in the iterator. How do people feel about this?

vm/vm_page.c
2963 ↗	(On Diff #45050)	Thank you. I forgot this review when I updated the diff.

I think this is reasonable. The other vm_page_count_severe() callers don't stand out to me as needing immediate attention.

vm/vm_glue.c
562 ↗	(On Diff #47363)	This was the last vm_wait_severe() caller.

This revision is now accepted and ready to land.Aug 27 2018, 9:17 PM

kib accepted this revision.Aug 28 2018, 8:46 PM

kib added inline comments.

vm/vm_domainset.c
103 ↗	(On Diff #47363)	zones or domains ?
288 ↗	(On Diff #47363)	zones

My policy question aside, I see no reason not to commit this change ASAP, once the two comments by Kostik about replacing the word "zone" with "domain" are addressed.

vm/vm_domainset.c
104 ↗	(On Diff #47363)	Should this assignment be conditional on the VM_ALLOC_CLASS_MASK being VM_ALLOC_NORMAL?

kbowling added a parent revision: D17059: Enable options NUMA on amd64 GENERIC/MINIMAL:.Sep 6 2018, 8:14 AM

Closed by commit rS338507: Avoid resource deadlocks when one domain has exhausted its memory. Attempt (authored by markj). · Explain WhySep 6 2018, 7:29 PM

This revision was automatically updated to reflect the committed changes.

In D16191#362698, @alc wrote:

My policy question aside, I see no reason not to commit this change ASAP, once the two comments by Kostik about replacing the word "zone" with "domain" are addressed.

Hrm, I missed this part. Will fix.