
Add PG_CACHE_ALLOC.
ClosedPublic

Authored by markj on Jun 25 2019, 8:43 PM.

Details

Summary

This flag is set when a physical page structure is allocated from the
per-CPU cache. Use it to decide whether to free to the per-CPU cache.
This ensures that pages allocated from a now-broken reservation do not
end up in the per-CPU caches. Instead, they go to the physical memory
allocator, where they will get a chance to join pages freed by
vm_reserv_break().

While here, remove a useless assertion that m != NULL from
vm_page_alloc_domain_after(). It made more sense in earlier revisions
of the allocator, but it is now superfluous.

Test Plan

Here is the state of the physical memory allocator before and after
the patch. This snapshot is taken after poudriere has finished calculating
dependencies and before it has started building packages.

Before: https://reviews.freebsd.org/P276
After: https://reviews.freebsd.org/P275


Event Timeline

markj created this revision.Jun 25 2019, 8:43 PM
markj edited the test plan for this revision.Jun 25 2019, 8:44 PM
markj added reviewers: alc, dougm, kib.
markj added inline comments.Jun 26 2019, 4:24 PM
sys/vm/vm_page.c
3507

I think we could also use this flag to avoid the vm_reserv_free_page() call in vm_page_free_prep(). As I noted in the email thread, the access of the reservation structure usually results in a cache miss.

markj updated this revision to Diff 59098.Jun 27 2019, 1:51 PM

Simplify initialization of m->flags.

alc added inline comments.Jun 27 2019, 2:53 PM
sys/vm/vm_page.c
1872

If you initialize flags to PG_CACHE_ALLOC here ...

1874

and change this to flags |= PG_ZERO, then PG_CACHE_ALLOC would be "inherited" like PG_ZERO. And, so you would only set PG_CACHE_ALLOC in vm_phys_alloc_npages() and clear it in vm_page_release().

IMO, the downside to this suggested approach is that setting PG_CACHE_ALLOC in vm_phys_alloc_npages() feels like we would be blurring the abstraction boundary between vm_phys and vm_page. Thoughts?

dougm added inline comments.Jun 27 2019, 2:59 PM
sys/vm/vm_page.c
1815

The optimization of moving m = NULL to before 'again' is also available with 'm_ret' and 'again' in vm_page_alloc_contig_domain.

markj updated this revision to Diff 59102.Jun 27 2019, 3:35 PM

Initialize m_ret earlier in vm_page_alloc_contig_domain() per
Doug's suggestion.

markj added a comment.Jun 27 2019, 3:37 PM

(I'm open to suggestions on a better name than PG_CACHE_ALLOC.)

sys/vm/vm_page.c
1874

Indeed, I thought about setting it in vm_phys_alloc_npages() and decided against it because it felt like a layering violation. (And vm_phys_alloc_npages() could conceivably be used for non-pcpu cache allocations, for example in the vm_page_alloc_pages() function I wrote a while ago.)

If we set m->flags |= PG_CACHE_ALLOC after allocating from UMA and clear it in vm_page_release() we could do as you suggested above:

flags = PG_CACHE_ALLOC;
if ((req & VM_ALLOC_ZERO) != 0)
    flags |= PG_ZERO;
flags &= m->flags;
if ((req & VM_ALLOC_NODUMP) != 0)
    flags |= PG_NODUMP;
m->flags = flags;

There would be some asymmetry in that we set PG_CACHE_ALLOC after allocating from the cache and clear it only when releasing pages from the cache, but to me that seems preferable to setting PG_CACHE_ALLOC in vm_phys.

3507

... this would still not be optimal, e.g., since pages allocated with VM_ALLOC_NOOBJ never come from the per-CPU caches but will also never be allocated from a reservation.

alc added inline comments.Jun 28 2019, 4:25 PM
sys/vm/vm_page.c
3507

Is there any reason that we don't have a per-pool vmd_cache?

markj added inline comments.Jun 28 2019, 5:39 PM
sys/vm/vm_page.c
3507

I can't think of one. When I first implemented the UMA-based page cache I indeed had a zone per freepool.

I just asked Drew about it, and it turns out that Netflix still runs a version with one zone per freepool. It's necessary for their in-kernel TLS implementation, which does VM_ALLOC_NOOBJ allocations in the encrypted sendfile path. Given this information, I would propose bringing that extension (one zone per freepool) to FreeBSD. Then we can use this flag, or some variant of it, to elide vm_reserv_free_page() calls for many NOOBJ allocations as well.

markj added a reviewer: jeff.Jun 28 2019, 5:39 PM
jeff added a comment.Jun 28 2019, 5:47 PM

My recollection is that when I implemented the cache there simply wasn't a lot of traffic for the other free pools, and I didn't want to increase fragmentation. I have no objection to doing so now. I do believe we should have some separation for UMA NOFREE zones; whether that is a different pool or some other layer, I cannot say.

Allow me to suggest a slight alternative to the above that I think will cover more cases. If we invert the sense of the flag and put it in an atomic field we can then set it anywhere that we discover a page that is creating fragmentation or may do so. For example, in the defrag code we could discover a single page in a superpage that is presently allocated and set the bit. We would additionally set the bit whenever we break a reservation to allocate a page because we are low on memory, which is what this case handles.

You would name it something like PG_CACHEBYPASS or whatever you like.

alc added a comment (edited).Jul 1 2019, 8:43 PM
In D20763#450130, @jeff wrote:

My recollection is that when I implemented the cache there simply wasn't a lot of traffic for the other free pools, and I didn't want to increase fragmentation. I have no objection to doing so now. I do believe we should have some separation for UMA NOFREE zones; whether that is a different pool or some other layer, I cannot say.

I wrote an experimental patch last year to segregate NOFREE zones using a new vm_phys pool. You can find it at D16620. It's better than nothing. :-) However, its primary shortcomings are that (1) it only applies to architectures with a direct map and (2) that pools only provide "soft" segregation. In other words, I don't think vm_phys pools are the right solution.

Allow me to suggest a slight alternative to the above that I think will cover more cases. If we invert the sense of the flag and put it in an atomic field we can then set it anywhere that we discover a page that is creating fragmentation or may do so. For example, in the defrag code we could discover a single page in a superpage that is presently allocated and set the bit. We would additionally set the bit whenever we break a reservation to allocate a page because we are low on memory, which is what this case handles.

Yes, I think something like this flag could prove useful. However, in regards to reservations, when we break a reservation, the code is actually "clever" enough that it does not touch every vm_page structure. From the starting physical address of the reservation and the reservation's popmap we figure out the maximal order chunks of pages to reinsert into the buddy queues, and so vm_phys never winds up coalescing any of the pages from a broken reservation. Setting this proposed flag would introduce new vm_page structure touches.

You would name it something like PG_CACHEBYPASS or whatever you like.

alc accepted this revision.Jul 1 2019, 8:49 PM
alc added inline comments.
sys/vm/vm_page.c
1860–1862

I might keep this comment, but I agree with deleting the KASSERT.

This revision is now accepted and ready to land.Jul 1 2019, 8:49 PM
markj marked an inline comment as done.Jul 2 2019, 3:03 AM
In D20763#450948, @alc wrote:
In D20763#450130, @jeff wrote:

I wrote an experimental patch last year to segregate NOFREE zones using a new vm_phys pool. You can find it at D16620. It's better than nothing. :-) However, its primary shortcomings are that (1) it only applies to architectures with a direct map and (2) that pools only provide "soft" segregation. In other words, I don't think vm_phys pools are the right solution.

For (1), could we use a separate vmem arena for _NOFREE allocations like we do for M_EXEC?

Allow me to suggest a slight alternative to the above that I think will cover more cases. If we invert the sense of the flag and put it in an atomic field we can then set it anywhere that we discover a page that is creating fragmentation or may do so. For example, in the defrag code we could discover a single page in a superpage that is presently allocated and set the bit. We would additionally set the bit whenever we break a reservation to allocate a page because we are low on memory, which is what this case handles.

Yes, I think something like this flag could prove useful. However, in regards to reservations, when we break a reservation, the code is actually "clever" enough that it does not touch every vm_page structure. From the starting physical address of the reservation and the reservation's popmap we figure out the maximal order chunks of pages to reinsert into the buddy queues, and so vm_phys never winds up coalescing any of the pages from a broken reservation. Setting this proposed flag would introduce new vm_page structure touches.

I don't quite see how inverting the sense would make it possible to handle more cases. With this diff, we free to the per-CPU cache if and only if PG_CACHE_ALLOC is set. Defragmentation code, upon encountering an allocated page in an otherwise free 2MB chunk, could clear PG_CACHE_ALLOC to ensure that the page would bypass the per-CPU caches when it is freed by its owner. Today, vm_page_reclaim_run() frees pages directly to vm_phys and thus always bypasses the per-CPU cache; otherwise, it would need to be updated by this diff.

markj added a comment.Jul 2 2019, 3:13 AM
In D20763#450948, @alc wrote:
In D20763#450130, @jeff wrote:

I wrote an experimental patch last year to segregate NOFREE zones using a new vm_phys pool. You can find it at D16620. It's better than nothing. :-) However, its primary shortcomings are that (1) it only applies to architectures with a direct map and (2) that pools only provide "soft" segregation. In other words, I don't think vm_phys pools are the right solution.

For (1), could we use a separate vmem arena for _NOFREE allocations like we do for M_EXEC?

That's not as good, since it will only provide physical contiguity so long as there are free reservation-sized chunks of physical memory.

alc added a comment.Jul 2 2019, 4:57 AM
In D20763#450948, @alc wrote:
In D20763#450130, @jeff wrote:

I wrote an experimental patch last year to segregate NOFREE zones using a new vm_phys pool. You can find it at D16620. It's better than nothing. :-) However, its primary shortcomings are that (1) it only applies to architectures with a direct map and (2) that pools only provide "soft" segregation. In other words, I don't think vm_phys pools are the right solution.

For (1), could we use a separate vmem arena for _NOFREE allocations like we do for M_EXEC?

Allow me to suggest a slight alternative to the above that I think will cover more cases. If we invert the sense of the flag and put it in an atomic field we can then set it anywhere that we discover a page that is creating fragmentation or may do so. For example, in the defrag code we could discover a single page in a superpage that is presently allocated and set the bit. We would additionally set the bit whenever we break a reservation to allocate a page because we are low on memory, which is what this case handles.

Yes, I think something like this flag could prove useful. However, in regards to reservations, when we break a reservation, the code is actually "clever" enough that it does not touch every vm_page structure. From the starting physical address of the reservation and the reservation's popmap we figure out the maximal order chunks of pages to reinsert into the buddy queues, and so vm_phys never winds up coalescing any of the pages from a broken reservation. Setting this proposed flag would introduce new vm_page structure touches.

I don't quite see how inverting the sense would make it possible to handle more cases. With this diff, we free to the per-CPU cache if and only if PG_CACHE_ALLOC is set. Defragmentation code, upon encountering an allocated page in an otherwise free 2MB chunk, could clear PG_CACHE_ALLOC to ensure that the page would bypass the per-CPU caches when it is freed by its owner. Today, vm_page_reclaim_run() frees pages directly to vm_phys and thus always bypasses the per-CPU cache; otherwise, it would need to be updated by this diff.

Sorry, I wasn't clear. There are two aspects to Jeff's proposal: (1) inverting the sense of the flag and (2) making it an "aflag" so that it can be modified asynchronously. I don't think (1) helps. As I described, for breaking reservations, it would be a pessimization. However, I think (2) might prove beneficial. For example, it's not obvious to me that we could safely clear PG_CACHE_ALLOC asynchronously on an allocated page without the page and/or object lock.

alc added a comment.Jul 2 2019, 5:05 AM
In D20763#450948, @alc wrote:
In D20763#450130, @jeff wrote:

I wrote an experimental patch last year to segregate NOFREE zones using a new vm_phys pool. You can find it at D16620. It's better than nothing. :-) However, its primary shortcomings are that (1) it only applies to architectures with a direct map and (2) that pools only provide "soft" segregation. In other words, I don't think vm_phys pools are the right solution.

For (1), could we use a separate vmem arena for _NOFREE allocations like we do for M_EXEC?

That's not as good, since it will only provide physical contiguity so long as there are free reservation-sized chunks of physical memory.

For _NOFREE allocations, I think that we should be willing to do some page relocations to create contiguity. However, if we wait until allocation time to create that contiguity, we may add too much latency.

markj added a comment.Jul 2 2019, 3:12 PM
In D20763#451040, @alc wrote:

Sorry, I wasn't clear. There are two aspects to Jeff's proposal: (1) inverting the sense of the flag and (2) making it an "aflag" so that it can be modified asynchronously. I don't think (1) helps. As I described, for breaking reservations, it would be a pessimization. However, I think (2) might prove beneficial. For example, it's not obvious to me that we could safely clear PG_CACHE_ALLOC asynchronously on an allocated page without the page and/or object lock.

Ok. I propose that we commit this diff, add a page cache for VM_FREEPOOL_DIRECT, and use PG_CACHE_ALLOC to skip vm_reserv_free_page() calls in vm_page_free_prep(). Then we can try and come up with a policy for reserving contiguous memory for _NOFREE allocations, which might involve making PG_CACHE_ALLOC an aflag.

I'm not a big fan of the name PG_CACHE_ALLOC. How about PG_PCPU_CACHE?

alc added a comment.Jul 2 2019, 3:54 PM
In D20763#451040, @alc wrote:

Sorry, I wasn't clear. There are two aspects to Jeff's proposal: (1) inverting the sense of the flag and (2) making it an "aflag" so that it can be modified asynchronously. I don't think (1) helps. As I described, for breaking reservations, it would be a pessimization. However, I think (2) might prove beneficial. For example, it's not obvious to me that we could safely clear PG_CACHE_ALLOC asynchronously on an allocated page without the page and/or object lock.

Ok. I propose that we commit this diff, add a page cache for VM_FREEPOOL_DIRECT, and use PG_CACHE_ALLOC to skip vm_reserv_free_page() calls in vm_page_free_prep(). Then we can try and come up with a policy for reserving contiguous memory for _NOFREE allocations, which might involve making PG_CACHE_ALLOC an aflag.

Agreed.

I'm not a big fan of the name PG_CACHE_ALLOC. How about PG_PCPU_CACHE?

I think that PG_PCPU_CACHE is okay.

This revision was automatically updated to reflect the committed changes.