Differential D16191: Fix vm_waitpfault on numa
Authored by jeff on Jul 9 2018, 5:00 AM.

Details

vm_waitpfault waited if any domain was below its min free-page threshold. I changed vm_fault to save the object's domain set (dset) to pass to the wait function. I also needed to do the same for the severe check, or else we would skip the allocation and spin forever.
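To make the change concrete, here is a small userspace C model of the wait criterion; this is a hedged sketch only, and none of the names, thresholds, or the bitmask layout are the kernel's. It contrasts the old predicate (wait if any domain is below min) with the new one (wait only if every domain in the faulting object's domain set is below min).

#include <stdbool.h>
#include <stdio.h>

#define NDOMAINS 4

/* Illustrative per-domain free-page counts and min thresholds. */
static unsigned long free_pages[NDOMAINS] = { 100000, 50, 80000, 90000 };
static unsigned long min_pages[NDOMAINS]  = { 1000, 1000, 1000, 1000 };

/* Old behavior: sleep if *any* domain is below min. */
static bool
any_domain_in_min(void)
{
	for (int d = 0; d < NDOMAINS; d++)
		if (free_pages[d] < min_pages[d])
			return (true);
	return (false);
}

/*
 * New behavior: sleep only if every domain in the object's domain set
 * (a bitmask here) is below min; otherwise the allocation can be
 * satisfied from some permitted domain and the fault should retry.
 */
static bool
all_set_domains_in_min(unsigned int dset_mask)
{
	for (int d = 0; d < NDOMAINS; d++)
		if ((dset_mask & (1u << d)) != 0 &&
		    free_pages[d] >= min_pages[d])
			return (false);
	return (true);
}

int
main(void)
{
	/* Object allowed to allocate from domains 0 and 2; domain 1 is depleted. */
	unsigned int dset_mask = (1u << 0) | (1u << 2);

	printf("old check would wait: %d\n", any_domain_in_min());
	printf("new check would wait: %d\n", all_set_domains_in_min(dset_mask));
	return (0);
}

The same restricted predicate is what the summary describes for the severe check in the fault path, so the handler does not skip the allocation and spin when only an unrelated domain is depleted.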
Event Timeline

Comment: This does not fix the problem for me -- now things start wedging on 'vmwait'. The affected machine has 512GB of RAM and 4 nodes. Using the program below with 256GB passed (./a.out 262144) reproduces it. mmacy can be prodded to test on the box.

#include <err.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	unsigned long i, s;

	s = strtol(argv[1], (char **)NULL, 10);
	s *= (1024 * 1024);
	printf("%s %lu\n", argv[1], s);
	char *p = malloc(s);
	if (p == NULL)
		err(1, "malloc");
	for (i = 0; i < s; i += 4096)
		p[i] = 'A';
	printf("i %lu s %lu done\n", i, s);
	getchar();
}

Comment: Yes, GB. Is there a problem reproducing the bug? Here is a sample process wedged while the testcase is running:

sched_switch() at sched_switch+0x8ad/frame 0xfffffe0302bae740

No changes to policies and whatnot.

Comment: This is a different bug. Most of the vm_wait_min and vm_wait_severe users will need to be modified. I will consider that further. For now I would like to get this patch in.

Comment: The basic issue is that there are a handful of places where we test for 'any domain' in min or severe, and it may need to be 'every domain we try to allocate from'. The other thing that would help is to skip first-touch and fall back to round-robin once the domain is below min pages. That would prevent us from falling into severe. I feel reluctant to skip severe tests for things like fork, because the pages the process will allocate are unpredictable.
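The first-touch fallback suggested above could look roughly like the following userspace model; this is only an illustration under assumed names and data, not the kernel's allocator. The idea is to keep the local-domain preference unless that domain has dropped below min, and then rotate over the remaining domains instead of pushing the preferred one toward severe.

#include <stdio.h>

#define NDOMAINS 4

static unsigned long free_pages[NDOMAINS] = { 100000, 50, 80000, 90000 };
static unsigned long min_pages[NDOMAINS]  = { 1000, 1000, 1000, 1000 };
static unsigned int rr_cursor;	/* round-robin state for the fallback */

/*
 * First-touch selection with a fallback: prefer the "local" domain,
 * but if it is below its min threshold, pick the next domain in
 * round-robin order that still has headroom.
 */
static int
select_domain(int local_domain)
{
	if (free_pages[local_domain] >= min_pages[local_domain])
		return (local_domain);
	for (int i = 0; i < NDOMAINS; i++) {
		int d = (rr_cursor + i) % NDOMAINS;
		if (d != local_domain && free_pages[d] >= min_pages[d]) {
			rr_cursor = d + 1;
			return (d);
		}
	}
	/* Every domain is below min; keep the preference and let the caller wait. */
	return (local_domain);
}

int
main(void)
{
	/* Domain 1 is depleted, so a thread local to it gets redirected. */
	printf("local 1 -> %d\n", select_domain(1));
	printf("local 0 -> %d\n", select_domain(0));
	return (0);
}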
Comment: Am I right that each node has 128GB, that we completely exhaust one domain in the test, and that the machine does not have swap configured? I do not think we have a mechanism that would allow us to migrate the pages to other domains in this situation. As a result, at least one domain would be listed as severe and fork correctly blocks the process.

Comment: I think vm_page_reclaim_contig_domain() provides most of the machinery needed to implement such a mechanism, FWIW. If there is a NUMA allocation policy which only permits allocations from a specific domain set, we would also need a mechanism to indicate that a given page is "pinned" to that set and cannot be relocated.

Comment: Do we need to stop forking if there is a severe domain? IMO, if there is one non-severe domain then we can allow the fork to proceed. A process with a non-fitting policy would be stopped waiting for a free page in the severely depleted domain anyway. I think this check is more about preventing the kernel allocators from blocking on fork.

Comment: I still think the better question is: why are we allowing a domain preference to push into severe when another domain is completely unused? That's why I think the more general solution is on the allocator side. For a single page it makes sense to look at the specific domains we may allocate from, but when we fork we have no idea what objects and policies may be involved, so I'm more reluctant to change that.

Comment: This actually strikes me as a scheduling problem. The forking thread should be temporarily migrated to an underutilized domain. That said, the right time to do that migration may be execve(), not fork().

Comment: I agree, but I still feel uneasy because a forked process may not use the thread's domain policy for all of its allocations. Anyhow, I think we're heading towards the following solutions together:
Only #1 has any complexity to it. I will update this review with #2 and #3 unless there are objections.

Comment: This uses a more correct filter for the vm_wait_severe in vm_glue.c. I also implemented an approach that will skip domains that are under the min threshold on the first allocation pass. This means that if the allocation policy has a preference, that preference will be ignored while we're under paging pressure in that domain. The PID-controlled page daemon should prevent us from getting to min in most scenarios. This change should help prevent single-domain low-memory deadlocks by moving allocations that can be satisfied elsewhere. If the allocation avoidance fails, we will still resolve the basic case of this bug with the more selective wait criteria in glue and pfault.
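The "skip under-min domains on the first pass" behavior described above can be modeled as a two-pass walk over the policy's domain set; the sketch below is a userspace illustration under assumed names and data, not the actual iterator in the patch. Pass 0 avoids domains below min, and pass 1 retries them so the allocation still succeeds (or the caller falls back to the more selective wait) when every permitted domain is under pressure.

#include <stdbool.h>
#include <stdio.h>

#define NDOMAINS 4

static unsigned long free_pages[NDOMAINS] = { 100000, 50, 80000, 90000 };
static unsigned long min_pages[NDOMAINS]  = { 1000, 1000, 1000, 1000 };

/* Illustrative stand-in for a page allocation attempt in one domain. */
static bool
try_alloc_from(int domain)
{
	if (free_pages[domain] == 0)
		return (false);
	free_pages[domain]--;
	return (true);
}

/*
 * Two-pass allocation over a domain-set bitmask.  Pass 0 skips domains
 * that are below min, so a preference for a depleted domain is ignored
 * while it is under pressure.  Pass 1 tries the skipped domains anyway;
 * if that also fails, the caller would wait on the restricted set.
 */
static int
alloc_from_set(unsigned int dset_mask)
{
	for (int pass = 0; pass < 2; pass++) {
		for (int d = 0; d < NDOMAINS; d++) {
			if ((dset_mask & (1u << d)) == 0)
				continue;
			if (pass == 0 && free_pages[d] < min_pages[d])
				continue;	/* avoid low domains first */
			if (try_alloc_from(d))
				return (d);
		}
	}
	return (-1);	/* all permitted domains exhausted: wait */
}

int
main(void)
{
	/* The set includes domain 1, which is under min, plus domain 2. */
	unsigned int dset_mask = (1u << 1) | (1u << 2);

	printf("allocated from domain %d\n", alloc_from_set(dset_mask));
	return (0);
}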
Comment: Fix the lock leak. I would like to get this committed to 12. I have not yet had feedback on the low-domain avoidance code in the iterator. How do people feel about this?

Comment: I think this is reasonable. The other vm_page_count_severe() callers don't stand out to me as needing immediate attention.

Comment: My policy question aside, I see no reason not to commit this change ASAP, once the two comments by Kostik about replacing the word "zone" with "domain" are addressed.

Comment: Hrm, I missed this part. Will fix.