Page MenuHomeFreeBSD

Handle contiguous memory allocation failures through page relocation.
ClosedPublic

Authored by alc on Dec 8 2015, 5:47 PM.
Tags
None
Referenced Files
Unknown Object (File)
Apr 17 2017, 10:42 AM
Unknown Object (File)
Apr 14 2017, 3:34 PM
Unknown Object (File)
Apr 12 2017, 4:04 AM
Unknown Object (File)
Apr 8 2017, 6:27 PM
Unknown Object (File)
Apr 8 2017, 1:54 PM
Unknown Object (File)
Apr 8 2017, 8:31 AM
Unknown Object (File)
Apr 7 2017, 6:16 AM
Unknown Object (File)
Mar 23 2017, 6:54 AM
Subscribers
None

Details

Summary

The title says it all. There are still a few loose ends, but I think it's close to being commit worthy.

Test Plan

I've been performing the sysctl periodically while a background workload, e.g., postgresql, buildworld, runs.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

alc retitled this revision from to Handle contiguous memory allocation failures through page relocation..
alc updated this object.
alc edited the test plan for this revision. (Show Details)
alc added reviewers: jhb, kib, markj.

A note about cached pages. On one hand, I've written this patch to handle cached pages, so in principle it could be MFCed. On the other hand, this patch eliminates the last remaining call to vm_page_cache() in vm_pageout.c. The only remaining calls to vm_page_cache() are from the I/O completion handlers and will be eliminated by completing and merging the PQ_LAUNDRY branch.

Once that is completed, we can remove support for cached pages from the page relocation code.

I've read through this a couple of times, and the overall approach makes sense to me. I definitely need to read through it a few more times though. :)

I'll try testing this change over the next few days as well.

vm/vm_page.c
2235 ↗(On Diff #10924)

Am I correct in thinking that m can only be free here if it was freed during the relocking above?

Why does m != vm_page_lookup(...) imply that the page had been cached prior to being freed?

2433 ↗(On Diff #10924)

The only reason for using vm_page_alloc_contig() instead of vm_page_alloc() is so that we can specify the [round_page(high), ~(vm_paddr_t)0) constraint, right?

2441 ↗(On Diff #10924)

Might be worth adding comments above the second and third vm_page_alloc_contig() calls that are in the same vein as the first one.

alc marked 3 inline comments as done.Dec 10 2015, 5:16 AM
alc added inline comments.
vm/vm_page.c
2235 ↗(On Diff #10924)

No. Conversion from cached to free only requires the free page queues lock, so it could happen while the object and page lock was held (here).

As for the second question, the page and object lock are held, so the page's object field and the object's radix tree should be stable. "m->object != NULL" yet the radix tree doesn't contain the page "m". Suppose that we got here because of a concurrent vm_page_alloc() call. The trylock would have failed because whoever is running vm_page_alloc() and set "m->object" holds an exclusive lock. We'll block on the VM_OBJECT_RLOCK until the vm_page_alloc() completes and adds the page to the radix tree. So, in this case, we have a contradiction because the page "m" would be in the radix tree. On the other hand, suppose that we got here because of a concurrent vm_page_remove(). However, the page lock is going to ensure that vm_page_remove() can't remove the page "m" from the radix tree. That leaves a cached page as the only possibility.

2433 ↗(On Diff #10924)

Yes.

2441 ↗(On Diff #10924)

Will do.

vm/vm_page.c
2235 ↗(On Diff #10924)

Should the return value from vm_page_lookup() be NULL then ?

2486 ↗(On Diff #10924)

In the ideal world, we would be able to install the new page into all mappings of the old page. BTW, this might allow us to handle wired pages.

Or, for the other extreme end, if the source page is clean, why bother copying it ? This could reduce the frequency of ENOMEM situations.

alc marked 5 inline comments as done.Dec 11 2015, 7:32 PM
alc added inline comments.
vm/vm_page.c
2235 ↗(On Diff #10924)

Suppose that "m" was, in fact, just freed. Then, the value "m->pindex" might reflect the reallocation of "m" to a different object. Moreover, the original object "object" might have a page at that offset. That's why I compare the returned value to "m" rather than NULL.

2486 ↗(On Diff #10924)

Yes, we could handle the case of wired mappings belonging to a user address space, but we wouldn't be able to handle other types of page wiring, e.g., a page being transmitted by sendfile(). Ultimately, I think that we need another VPSC_option that requests the buffer map to eject (and unwire) the page.

As to the second question, one of my original design objectives was for this code to be "I/O neutral". In other words, it wouldn't directly or indirectly lead to I/O operations. Moreover, if a page is going to be reclaimed in the strongest sense, i.e., its data is no longer in memory, then that's a decision for the page daemon to make. By doing relocations, this code doesn't have to care about whether a page is recently used or not.

I would also point out that we're willing to use any page as "m_new" that doesn't fall within "m_run". So, I think that ENOMEM is extremely unlikely to occur for reasons other than a transient memory shortage.

That said, I'll give a variant of your suggestion some more thought. Specifically, if the page is clean and we can't allocate a new page to copy into, then just freeing the original page.

vm/vm_page.c
2486 ↗(On Diff #10924)

Yes, we can only handle user-space wiring. Also I am not sure what VPSC is.

I agree with everything you pointed out.

alc marked an inline comment as done.Dec 11 2015, 9:06 PM
alc added inline comments.
vm/vm_page.c
2486 ↗(On Diff #10924)

I was referring to the flags VPSC_NORESERV, VPSC_NOSUPER, etc.

alc edited edge metadata.

Most of the changes are to the comments. In particular, I've updated some of the comments based on your feedback.

There is one functional change in vm_page_reclaim_run(). I no longer immediately break a reservation when I first encounter a free page belonging to a reservation. Instead, I let the reservation get automatically freed at the end of the function when we free the relocated pages. The problem with the early breaking of reservations was that it placed short runs of pages in the buddy allocator free lists where they could be allocated by another thread before we were done with vm_page_reclaim_run().

vm/vm_page.c
2538 ↗(On Diff #11180)

I'm thinking about flushing the "free" page list here, hoping that any coalescing will make the free pages less likely to be allocated before we are done with this function.

2540 ↗(On Diff #11180)

I'm thinking about whether vm_reserv_is_page_free() should return a count of free pages starting at "m" so that I could jump ahead.

One note: there are some VM objects whose pages cannot be moved. e.g. objects created to exported a wired buffer to userland that is directly accessed by a device via DMA. The current objects I'm aware of would not be moved I think due to hacks in their implementation (e.g. using contigmalloc() and then building an sglist and OBJT_SG as nvidia.ko does). However, a less ugly variant might be to use OBJT_PHYS and allocate suitable pages and stuff them into the object. I'm not sure if in that case you would need to have some explicit "don't move my pages" flag on the object? (Suppose said VM object was only ever mapped into user address spaces and never mapped into kmem, or it was only mapped in the kernel transiently during initialization).

One question: it appears that this does not do an "atomic" allocate with the reclaim which is why you reclaim multiple runs (up to 8 if I'm reading correctly) in the hope that at least one of those is still around when the caller retries it's allocation?

vm/vm_page.c
2186 ↗(On Diff #11180)

Side question: I wonder if it makes sense to add a function that does this? Something like:

vm_page_switch_lock(m, &page_mtx);

There are two other places that follow this paradigm in the tree now. (And this patch adds a few more.)

In D4444#96051, @jhb wrote:

One note: there are some VM objects whose pages cannot be moved. e.g. objects created to exported a wired buffer to userland that is directly accessed by a device via DMA. The current objects I'm aware of would not be moved I think due to hacks in their implementation (e.g. using contigmalloc() and then building an sglist and OBJT_SG as nvidia.ko does). However, a less ugly variant might be to use OBJT_PHYS and allocate suitable pages and stuff them into the object. I'm not sure if in that case you would need to have some explicit "don't move my pages" flag on the object? (Suppose said VM object was only ever mapped into user address spaces and never mapped into kmem, or it was only mapped in the kernel transiently during initialization).

I thought about adding a "don't relocate my pages" flag to the vm object's set of flags, but I didn't because I couldn't see a use case for it that wasn't covered by the existing restrictions on relocation. For example, OBJT_PHYS pages can't be relocated because their mappings can't be destroyed by pmap_remove_all(). In a nutshell, the rule that I implemented is that a page can be relocated if and only if it could be laundered or reclaimed by the page daemon.

One question: it appears that this does not do an "atomic" allocate with the reclaim which is why you reclaim multiple runs (up to 8 if I'm reading correctly) in the hope that at least one of those is still around when the caller retries it's allocation?

In the end, i.e., before 11.0, I'd like to provide an option to reclaim and allocate pages atomically. In other words, relocated pages wouldn't go into the free lists and free pages (within the run) would be pulled out of the free lists. And, in fact, I actually started to implement that option last month but stopped. The reason that I stopped was that more than half of the code that I was writing was to deal with PG_CACHED pages and that seemed like wasted effort. So, I decided to postpone implementing that option until PG_CACHED pages are gone.

I reclaim multiple times when either the value of "npages" is small, e.g., 1 or 2, or reclamation of a larger run failed. Reclamation ends when I've successfully reclaimed a total of 8 or more pages in one or more runs. The reason for the "8 or more pages" rule is so that functions like kmem_alloc_attr() won't have to call reclaim for every single page allocated.

In other words, for the larger runs, I'm really relying on the retry logic in the callers to deal with the allocation races, i.e., the free pages still existing when the vm_page_alloc_contig() call is repeated.

vm/vm_page.c
2186 ↗(On Diff #11180)

Yes, it does.

vm/vm_page.c
2538 ↗(On Diff #11180)

Handling all of the possible cases here that can arise in the buddy allocator when flushing the current list of free, relocated pages is rather complex. Moreover, I'm not seeing many cases where it would have payed off.

In D4444#96082, @alc wrote:
In D4444#96051, @jhb wrote:

One note: there are some VM objects whose pages cannot be moved. e.g. objects created to exported a wired buffer to userland that is directly accessed by a device via DMA. The current objects I'm aware of would not be moved I think due to hacks in their implementation (e.g. using contigmalloc() and then building an sglist and OBJT_SG as nvidia.ko does). However, a less ugly variant might be to use OBJT_PHYS and allocate suitable pages and stuff them into the object. I'm not sure if in that case you would need to have some explicit "don't move my pages" flag on the object? (Suppose said VM object was only ever mapped into user address spaces and never mapped into kmem, or it was only mapped in the kernel transiently during initialization).

I thought about adding a "don't relocate my pages" flag to the vm object's set of flags, but I didn't because I couldn't see a use case for it that wasn't covered by the existing restrictions on relocation. For example, OBJT_PHYS pages can't be relocated because their mappings can't be destroyed by pmap_remove_all(). In a nutshell, the rule that I implemented is that a page can be relocated if and only if it could be laundered or reclaimed by the page daemon.

Ok, it wasn't entirely clear to me if all of the possible cases I could think of were covered by the existing logic, but it sounds like they are.

One question: it appears that this does not do an "atomic" allocate with the reclaim which is why you reclaim multiple runs (up to 8 if I'm reading correctly) in the hope that at least one of those is still around when the caller retries it's allocation?

In the end, i.e., before 11.0, I'd like to provide an option to reclaim and allocate pages atomically. In other words, relocated pages wouldn't go into the free lists and free pages (within the run) would be pulled out of the free lists. And, in fact, I actually started to implement that option last month but stopped. The reason that I stopped was that more than half of the code that I was writing was to deal with PG_CACHED pages and that seemed like wasted effort. So, I decided to postpone implementing that option until PG_CACHED pages are gone.

I reclaim multiple times when either the value of "npages" is small, e.g., 1 or 2, or reclamation of a larger run failed. Reclamation ends when I've successfully reclaimed a total of 8 or more pages in one or more runs. The reason for the "8 or more pages" rule is so that functions like kmem_alloc_attr() won't have to call reclaim for every single page allocated.

In other words, for the larger runs, I'm really relying on the retry logic in the callers to deal with the allocation races, i.e., the free pages still existing when the vm_page_alloc_contig() call is repeated.

Ok. I had misread the MIN_RECLAIM logic then. This sounds reasonble, I was just asking for clarification.

I think the change looks ok as far as I understand it.

alc added a reviewer: adrian.

Again, mostly changes to the comments. I've asked adrian@ to test this patch on 32-bit MIPS, and so I've added him as a reviewer.

Aside from removing the sysctl that I've used for testing, I think that this patch is ready to be committed.

There is one functional change from the last revision that I posted. When I relocate a virtual page to a new physical address, I try to allocate the new physical page at an address above "high":

m_new = vm_page_alloc_contig(
    NULL, 0, req, 1,
    round_page(high),
    ~(vm_paddr_t)0,
    PAGE_SIZE, 0,
    VM_MEMATTR_DEFAULT);

Suppose that the caller wanted memory below 4 GB. Moreover, suppose that the machine only has 2 GB of RAM. The problem that I've fixed in this version is in vm_phys_alloc_contig(). Specifically, when the above call to vm_page_alloc_contig() in turn called vm_phys_alloc_contig(), a *lot* of time was wasted pointlessly scanning the buddy queues in vm_phys_alloc_contig(). In other words, vm_phys_alloc_contig() wasn't smart enough to realize quickly that there was no memory above 4 GB and return NULL. Now, vm_phys_alloc_contig()'s search for free memory is limited to the free lists that could contain suitable pages.

I was able to test a 32-bit MIPS kernel on gxemul last night. I'm inclined to commit this patch in the next 24 hours unless you guys want more time to look at it.

Should I wait or commit?

yeah, give me a couple days to fire this up in qemu and test it some more. :)

Sorry!

Kostik, Mark, do you guys have any other comments or questions?

In D4444#97505, @alc wrote:

Kostik, Mark, do you guys have any other comments or questions?

No, I don't think so. I've been testing this on amd64 with a little program that fragments memory and the debug sysctl that you added, and haven't seen behaviour that looks problematic or incorrect.

Mark, is there any reason for me not to commit this patch to the PQ_LAUNDRY branch?

In D4444#97509, @alc wrote:

Mark, is there any reason for me not to commit this patch to the PQ_LAUNDRY branch?

Nope. If anything that'd make it a bit easier for me to maintain my testing branches in git, so please go ahead.

kib edited edge metadata.
This revision is now accepted and ready to land.Dec 18 2015, 8:05 PM
This revision was automatically updated to reflect the committed changes.

Tried some buildworld'ing in a single-cpu mips qemu, with 64, 128, 512, 1024, 1536MB of RAM. So far so good.