
Replace global swhash in swap pager with per-object trie to track swap blocks assigned to the object pages.
Closed, Public

Authored by kib on Jul 1 2017, 4:13 AM.
Details

Summary

The global swhash_mtx is removed; the tries are synchronized by the corresponding object locks.

The swp_pager_meta_free_all() function, used during object termination, is optimized by walking only the object's trie instead of searching the whole hash for the swap blocks owned by the object.

In swap_pager_swapoff(), instead of iterating over the swhash, the global object list has to be inspected. There, we have to ensure that we see valid trie content whenever we see that the object type is swap.

Sizing of the swblk zone is the same as for the swblock zone; each swblk maps SWAP_META_PAGES pages.
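
For reference, a minimal sketch of the per-object bookkeeping described above, pieced together from the discussion below; the exact field layout in the committed code may differ:

/*
 * Illustrative only.  Each swblk covers a SWAP_META_PAGES-aligned and
 * -sized run of page indices within one object and records, for every
 * page in the run, the assigned on-disk swap block or SWAPBLK_NONE.
 * The swblks of an object are kept in a pctrie (un_pager.swp.swp_blks)
 * keyed by the base page index of the run.
 */
struct swblk {
        vm_pindex_t     p;                      /* base pindex of the run */
        daddr_t         d[SWAP_META_PAGES];     /* swap block per page */
};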

Proposed by alc.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

There are a very large number of changes, so older changes are hidden.
kib marked an inline comment as done. Jul 15 2017, 7:21 AM

More of Mark's notes handled.

sys/vm/swap_pager.c
1718 ↗(On Diff #30372)

Oops, I see.

2013 ↗(On Diff #30284)

Sorry, that wasn't very clear. Right - we know that sb->p is divisible by SWAP_META_PAGES and sb->p <= pindex, so sb->p + i >= pindex is false when i < pindex % SWAP_META_PAGES, and thus we can micro-optimize by initializing i as you suggested.

sys/vm/vm_object.c
235 ↗(On Diff #30372)

Yeah, vm_object_zinit() sets only the object type. I think your proposal makes sense.

sys/vm/swap_pager.c
2013 ↗(On Diff #30284)

Hmm, in fact it is not true that sb->p <= pindex; we might find a swblk past pindex's block. Then using pindex mod SWAP_META_PAGES as the start index would skip some blocks that might be valid candidates to return.

It might be that your suggestion could be implemented by checking that sb->p <= pindex. If not, then an additional optimization is that the second LOOKUP_GE is not needed. I tried to rewrite this fragment, but I did not like the result.

sys/vm/swap_pager.c
2013 ↗(On Diff #30284)

Ugh, right, sorry for the noise. I think what you have now is fine then.

If sb->p > pindex after the first lookup, then we will never reach the second lookup since sb must map at least one block.

sys/vm/swap_pager.c
2465 ↗(On Diff #30807)

Given that we are now iterating over the radix trie to compute "bcount", there is no reason for this function to return an approximation. Instead, start "pi" at the offset specified by the map entry and stop iterating once "pi" gets to the end of the map entry. This may, in fact, reduce the number of iterations.

Then, line 2476 below simply becomes "count += bcount;"
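
A sketch of the bounded iteration being suggested; the names used here for the map entry's pindex bounds (e_start, e_end) are placeholders rather than the actual variables:

/* Illustrative only: count swap blocks backing just this map entry. */
bcount = 0;
pi = e_start;           /* first object pindex mapped by the entry */
while (pi < e_end && (sb = SWAP_PCTRIE_LOOKUP_GE(
    &object->un_pager.swp.swp_blks, rounddown(pi, SWAP_META_PAGES))) !=
    NULL && sb->p < e_end) {
        for (i = 0; i < SWAP_META_PAGES; i++) {
                if (sb->p + i < e_start || sb->p + i >= e_end)
                        continue;
                if (sb->d[i] != SWAPBLK_NONE)
                        bcount++;
        }
        pi = sb->p + SWAP_META_PAGES;
}
count += bcount;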

sys/vm/swap_pager.c
131 ↗(On Diff #30807)

This has reverted one of my recent changes. I intentionally decoupled SWAP_META_PAGES from MAX_PAGEOUT_CLUSTER. See line 137 on the left-hand side.

kib marked 2 inline comments as done.

Two of alc's notes handled.

sys/vm/swap_pager.c
2013 ↗(On Diff #30284)

Doesn't it suffice to wrap lines 2011 through 2018 in the "then" block of "if (sb->p < pindex)"? (Don't test for equality. The loop at line 2019 can handle the case where they are equal.) Then, you can safely write "for (i = pindex % SWAP_META_PAGES; ..." inside the "then" block.
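
Putting this subthread together, the shape being discussed is roughly the following. This is a sketch of the idea rather than the committed code; NOT_FOUND stands in for whatever the real function returns when no swap block exists at or after pindex:

sb = SWAP_PCTRIE_LOOKUP_GE(&object->un_pager.swp.swp_blks,
    rounddown(pindex, SWAP_META_PAGES));
if (sb == NULL)
        return (NOT_FOUND);
if (sb->p < pindex) {
        /* Same swblk as pindex: only slots at or after pindex qualify. */
        for (i = pindex % SWAP_META_PAGES; i < SWAP_META_PAGES; i++) {
                if (sb->d[i] != SWAPBLK_NONE)
                        return (sb->p + i);
        }
        /* Nothing left in this swblk; move on to the next one. */
        sb = SWAP_PCTRIE_LOOKUP_GE(&object->un_pager.swp.swp_blks,
            sb->p + SWAP_META_PAGES);
        if (sb == NULL)
                return (NOT_FOUND);
}
/* A swblk maps at least one block, so this loop always returns. */
for (i = 0; i < SWAP_META_PAGES; i++) {
        if (sb->d[i] != SWAPBLK_NONE)
                return (sb->p + i);
}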

Optimize swap_pager_find_least().

In D11435#241087, @kib wrote:

Optimize swap_pager_find_least().

I think this is not quite what Alan suggested - with this change, we're doing an unnecessary lookup when no swap block with rounddown(pindex, SWAP_META_PAGES) <= sb->p < roundup(pindex, SWAP_META_PAGES) is present in the trie.

Eliminate unnecessary lookup.

sys/kern/subr_pctrie.c
682–687 ↗(On Diff #30925)

Could this be an inline alongside pctrie_is_empty() in sys/_pctrie.h?
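
If I follow the suggestion, something along these lines is meant; whether the root field is really named pt_root and whether zeroing it is sufficient are assumptions on my part:

/* Hypothetical inline for sys/_pctrie.h; name and body are assumed. */
static __inline void
pctrie_init(struct pctrie *ptree)
{

        ptree->pt_root = 0;
}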

sys/vm/swap_pager.c
140–143 ↗(On Diff #30925)

I'm not sure that there will be anything actionable in this comment, but I think that it is worth considering how these "fat" leaves interact with the pctrie code.

Suppose that we have a densely populated trie, i.e., the trie has allocated, contiguous leaves. On amd64, where the radix is 16 (and SWAP_META_PAGES is 32), I believe that the interior nodes just above the leaves will have every other slot empty. In other words, 8 out of 16 slots will be wasted.

On i386, the radix is only 8, so we waste 6 out of 8 slots in a leaf's parent.

1579 ↗(On Diff #30925)

Does this function actually have callers? If not, could we just eliminate it?

1919 ↗(On Diff #30925)

"+ SWAP_META_PAGES" instead? That's the way that you've typically written this elsewhere.

kib marked 3 inline comments as done. Jul 19 2017, 6:26 AM
kib added inline comments.
sys/vm/swap_pager.c
140–143 ↗(On Diff #30925)

Do you mean that we should align the d[] run size with the pctrie level width? At least, this is the best idea I could come up with.

Remove unused function, other minor rearrangements.

Try to define SWAP_META_PAGES according to pctrie width.
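
The resulting definition is not quoted in this thread; a plausible sketch, assuming the pctrie header exports its per-node slot count under a name such as PCTRIE_COUNT:

/*
 * Sketch only: size the d[] run to exactly one pctrie node (16 entries
 * on 64-bit, 8 on 32-bit) so that densely populated tries do not leave
 * empty slots in the leaves' parent nodes, as discussed above.
 */
#define SWAP_META_PAGES         PCTRIE_COUNT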

I did some digging in the swap pager's history. swap_pager_isswapped() was introduced to implement "swapoff". Specifically, to determine whether the U area for a process needed to be swapped in. As such, it probably should have been deleted when most U area code was removed in r137910. I think that it would be fine to commit a change to HEAD eliminating it. It's been dead code for a very long time.

Removed swap_pager_isswapped().

sys/sys/_pctrie.h
51 ↗(On Diff #30943)

I'd like to have the same inline for the vm_radix version and use it in vm_object.c. There, however, the corresponding function name vm_radix_init() is taken already. Ideas?

(In short, I'm suggesting that the next commit be a coordinated change to add this function (by whatever name) and a vm_radix version.)

This revision is now accepted and ready to land. Jul 23 2017, 7:55 PM
sys/sys/pctrie.h
79 ↗(On Diff #30372)

I tried removing it from all of the definitions and got:

../../../kern/vfs_subr.c:343:1: warning: unused function 'BUF_PCTRIE_RECLAIM' [-Wunused-function]
PCTRIE_DEFINE(BUF, buf, b_lblkno, buf_trie_alloc, buf_trie_free);
^
../../../sys/pctrie.h:93:28: note: expanded from macro 'PCTRIE_DEFINE'
static __inline void \

^

<scratch space>:168:1: note: expanded from here
BUF_PCTRIE_RECLAIM
^
1 warning generated.

sys/sys/pctrie.h
79 ↗(On Diff #30372)

To be clear, I think that this snippet could be committed now.

kib edited edge metadata.

Update after r321512.

This revision now requires review to proceed. Jul 26 2017, 6:47 AM
sys/vm/swap_pager.c
1683 ↗(On Diff #30372)

At the very least, we ought to segregate the NOFREE allocations from the rest, so that they are not causing either virtual (kmem arena) or physical memory fragmentation.

Kostik, I suggest committing the remaining pctrie changes.

sys/vm/swap_pager.c
546–547 ↗(On Diff #31210)

If we had to reduce the reservation for the swblk zone, then aren't we almost certainly going to fail here?

Taking a step back, I want to ask if we even need these reservations. The sizing of the reservations is strange in the sense that it is based on the number of physical pages that we have. Arguably, it should be based on the amount of swap space that we have, which we don't know at this point.

I am tempted to suggest that we have Peter explore what happens without the reservations under a PAE configuration.

sys/vm/swap_pager.c
546–547 ↗(On Diff #31210)

Do you mean testing the following?

diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
index c906b96eee4..f6d187355df 100644
--- a/sys/vm/swap_pager.c
+++ b/sys/vm/swap_pager.c
@@ -526,28 +526,32 @@ swap_pager_swap_init(void)
 	swblk_zone = uma_zcreate("swblk", sizeof(struct swblk), NULL, NULL,
 	    NULL, NULL, _Alignof(struct swblk) - 1,
 	    UMA_ZONE_NOFREE | UMA_ZONE_VM);
 	if (swblk_zone == NULL)
 		panic("failed to create swap blk zone.");
 	n2 = n;
+#if 0
 	do {
 		if (uma_zone_reserve_kva(swblk_zone, n))
 			break;
 		/*
 		 * if the allocation failed, try a zone two thirds the
 		 * size of the previous attempt.
 		 */
 		n -= ((n + 2) / 3);
 	} while (n > 0);
+#endif
 	if (n2 != n)
 		printf("Swap blk zone entries reduced from %lu to %lu.\n",
 		    n2, n);
 	swap_maxpages = n * SWAP_META_PAGES;
 	swzone = n * sizeof(struct swblk);
+#if 0
 	if (!uma_zone_reserve_kva(swpctrie_zone, n))
 		panic("Cannot reverse swap pctrie zone entries.");
+#endif
 }
 
 static vm_object_t
 swap_pager_alloc_init(void *handle, struct ucred *cred, vm_ooffset_t size,
     vm_ooffset_t offset)
 {

Yes. However, there is a call to uma_zone_get_max() where we add swap space that you might also need to deal with, because I think that uma_zone_get_max() will return non-sense without the reservation or a call to uma_zone_set_max().

In D11435#243657, @alc wrote:

Yes. However, there is a call to uma_zone_get_max() where we add swap space that you might also need to deal with, because I think that uma_zone_get_max() will return non-sense without the reservation or a call to uma_zone_set_max().

I always return 0 from swapon_check_swzone(); it seems to be the simplest way to ignore the check.

diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
index c906b96eee4..77fcd6f89d7 100644
--- a/sys/vm/swap_pager.c
+++ b/sys/vm/swap_pager.c
@@ -526,28 +526,32 @@ swap_pager_swap_init(void)
 	swblk_zone = uma_zcreate("swblk", sizeof(struct swblk), NULL, NULL,
 	    NULL, NULL, _Alignof(struct swblk) - 1,
 	    UMA_ZONE_NOFREE | UMA_ZONE_VM);
 	if (swblk_zone == NULL)
 		panic("failed to create swap blk zone.");
 	n2 = n;
+#if 0
 	do {
 		if (uma_zone_reserve_kva(swblk_zone, n))
 			break;
 		/*
 		 * if the allocation failed, try a zone two thirds the
 		 * size of the previous attempt.
 		 */
 		n -= ((n + 2) / 3);
 	} while (n > 0);
+#endif
 	if (n2 != n)
 		printf("Swap blk zone entries reduced from %lu to %lu.\n",
 		    n2, n);
 	swap_maxpages = n * SWAP_META_PAGES;
 	swzone = n * sizeof(struct swblk);
+#if 0
 	if (!uma_zone_reserve_kva(swpctrie_zone, n))
 		panic("Cannot reverse swap pctrie zone entries.");
+#endif
 }
 
 static vm_object_t
 swap_pager_alloc_init(void *handle, struct ucred *cred, vm_ooffset_t size,
     vm_ooffset_t offset)
 {
@@ -2069,12 +2073,13 @@ done:
  * exceed half the theoretical maximum.  If it does, print a warning
  * message and return -1; otherwise, return 0.
  */
 static int
 swapon_check_swzone(unsigned long npages)
 {
+#if 0
 	unsigned long maxpages;
 
 	/* absolute maximum we can handle assuming 100% efficiency */
 	maxpages = uma_zone_get_max(swblk_zone) * SWAP_META_PAGES;
 
 	/* recommend using no more than half that amount */
@@ -2083,12 +2088,13 @@ swapon_check_swzone(unsigned long npages)
 		    "exceeds maximum recommended amount (%lu pages).\n",
 		    npages, maxpages / 2);
 		printf("warning: increase kern.maxswzone "
 		    "or reduce amount of swap.\n");
 		return (-1);
 	}
+#endif
 	return (0);
 }
 
 static void
 swaponsomething(struct vnode *vp, void *id, u_long nblks,
     sw_strategy_t *strategy, sw_close_t *close, dev_t dev, int flags)

Yes, let's try this patched version on a swap-intensive workload.

sys/vm/swap_pager.c
1683 ↗(On Diff #30372)

Should we perhaps have NOFREE zones use larger slab sizes? It looks like all the NOFREE zones on my system have uk_ppera == 1, so they'll be using uma_small_malloc() and thus competing with non-NOFREE single-page UMA allocations from physical memory chunks in FREEPOOL_DIRECT.

I guess the question then would be what the slab size of each NOFREE zone should be. If it's too big then we might end up wasting memory over time as a result of internal fragmentation. Maybe instead we should have NOFREE allocations use a different freepool, at least on architectures with a direct map. Then the physical memory allocator will attempt to allocate 16MB chunks that get shared among all NOFREE zones, but free pages in the pool can be repurposed during memory shortages.

sys/vm/swap_pager.c
1683 ↗(On Diff #30372)

UMA has never implemented a rather basic and fundamental feature of Bonwick's original slab allocator: Setting the slab size based on the object size so that there is never more than 1/8 of the slab wasted on fragmentation. In Jeff's defense, when he first implemented UMA, vm_map_findspace() was O(n) in the number of map entries, and he was justifiably afraid of the number of map entries becoming large if multipage kmem allocations became common. Today, with vmem, I don't think that we have a reason to be concerned about this.

As motivation, consider where we are today with the vmspace zone. It is a NOFREE zone and the object size is about 2.5KB.

And, as you suggest, I could see the point of increasing Bonwick's default fragmentation threshold for NOFREE zones.

I suspect that you probably know the UMA code better than I do. How easy would it be to implement Bonwick's fragmentation control strategy? At the very least, I am intensely curious how many zones would be affected.

In D11435#243818, @alc wrote:

Yes, let's try this patched version on a swap-intensive workload.

This somehow got lost in my private inbox.

@pho wrote:
I ran tests for four hours, without finding anything interesting, I think.
I ran the PAE test with the #if 0 added. On amd64 I ran the original patch.

sys/vm/swap_pager.c
1881–1885 ↗(On Diff #31262)

Can't this be simplified to

while ((sb = SWAP_PCTRIE_LOOKUP_GE(...)) != NULL) {

?

sys/vm/swap_pager.c
1669–1673 ↗(On Diff #31262)

Similarly, a while loop would seem more natural here.

pi = 0;
while ((sb = SWAP...
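
A sketch of the resulting loop shape (illustrative only; the loop body is a placeholder for whatever per-block work the surrounding function does):

pi = 0;
while ((sb = SWAP_PCTRIE_LOOKUP_GE(&object->un_pager.swp.swp_blks,
    pi)) != NULL) {
        for (i = 0; i < SWAP_META_PAGES; i++) {
                if (sb->d[i] != SWAPBLK_NONE) {
                        /* ... handle the swap block at pindex sb->p + i ... */
                }
        }
        pi = sb->p + SWAP_META_PAGES;
}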

Rearrange loops. Update comment.

sys/vm/swap_pager.c
1656 ↗(On Diff #31262)

I think this fixes a race that Peter Holm reported at Isilon. When an object is collapsed into its shadow, we first clean its memq and then call swap_pager_copy() to update swap blocks that now belong to the shadow. swap_pager_copy() may sleep, at which point a concurrent swapoff might bring pages into the parent's memq. This leads to the object being freed with a non-empty memq.

We set OBJ_DEAD before starting the collapse, so now the swapoff will just ignore blocks belonging to the parent. The unchanged retry logic will hopefully give us enough time to finish the swap_pager_copy() call.

sys/vm/swap_pager.c
131 ↗(On Diff #31466)

One of my earlier commits replaced the space after the #define with a tab, which I thought was the correct style. Now we're back to a space. :-)

132 ↗(On Diff #31466)

I think that SWAP_META_MASK can be deleted. I don't think that it is used anymore.

135–138 ↗(On Diff #31466)

I think that the last sentence belongs somewhere else.

For the first two sentences, I would suggest:
"A swblk structure maps each page index within a SWAP_META_PAGES-aligned and sized range to the address of an on-disk swap block (or SWAPBLK_NONE). The collection of these mappings for an entire vm object is implemented as a pc-trie."

563 ↗(On Diff #31466)

"The un_pager.swp.swp_blks trie ...

564 ↗(On Diff #31466)

"... to ensure the correct order ...

565 ↗(On Diff #31466)

"visibility to other threads."

kib marked 6 inline comments as done.

Update comments.

sys/vm/swap_pager.c
515 ↗(On Diff #31511)

Isn't this sentence, specifically, the 32MB number, bogus? I'm not a fan of such comments because they don't age well.

1742 ↗(On Diff #31511)

"swap_pager_swapoff()" -> "swap_pager_swapoff()'s"

1743 ↗(On Diff #31511)

-> "... see a garbage ..."

2421 ↗(On Diff #31511)

Elsewhere, you write "sb->p + SWAP_META_PAGES" instead of "sb->p + 1".

546–547 ↗(On Diff #31210)

Yes. To be transparent, the zone reservations are the only aspect of this patch that I'm still thinking about.

kib marked 4 inline comments as done.

Edit comments, use consistent loop iteration construct.

By the way, once upon a time, you asked me about whether we should continue using the blist allocator for swap space or switch to vmem. Perhaps the most compelling reason, which I failed to give at the time, is that every allocated range from an arena consumes a boundary tag. In other words, while vmem coalesces free ranges, it maintains a boundary tag for each allocated range.

Also, from a functionality standpoint, with vmem, you have to free an entire allocation at once. In order to support the pager's unswapped function, we would have to allocate swap space one page at a time; so in fact we would wind up with a boundary tag in use for every page of allocated swap space.

Does that make sense?

In D11435#249799, @alc wrote:

By the way, once upon a time, you asked me about whether we should continue using the blist allocator for swap space or switch to vmem. Perhaps the most compelling reason, which I failed to give at the time, is that every allocated range from an arena consumes a boundary tag. In other words, while vmem coalesces free ranges, it maintains a boundary tag for each allocated range.

Ok.

Also, from a functionality standpoint, with vmem, you have to free an entire allocation at once. In order to support the pager's unswapped function, we would have to allocate swap space one page at a time; so in fact we would wind up with a boundary tag in use for every page of allocated swap space.

Or enhance vmem to allow partial free, with a specialized interface. I think that this could be done in a way similar to the stack gap handling for the edges, with new boundary-tag allocations if the freed range falls in the middle.
I do not propose to do this; I am just discussing a hypothetical possibility.

In D11435#249802, @kib wrote:
In D11435#249799, @alc wrote:

By the way, once upon a time, you asked me about whether we should continue using the blist allocator for swap space or switch to vmem. Perhaps the most compelling reason, which I failed to give at the time, is that every allocated range from an arena consumes a boundary tag. In other words, while vmem coalesces free ranges, it maintains a boundary tag for each allocated range.

Ok.

Also, from a functionality standpoint, with vmem, you have to free an entire allocation at once. In order to support the pager's unswapped function, we would have to allocate swap space one page at a time; so in fact we would wind up with a boundary tag in use for every page of allocated swap space.

Or enhance vmem to allow partial free, with a specialized interface. I think that this could be done in a way similar to the stack gap handling for the edges, with new boundary-tag allocations if the freed range falls in the middle.
I do not propose to do this; I am just discussing a hypothetical possibility.

My recollection from Bonwick's paper, which I haven't looked at in a very long time, is that vmem doesn't coalesce allocated ranges for at least two reasons. Chief among them, it allows vmem to trivially guarantee that a free will never fail because of failure to allocate a boundary tag. Any new interface that allowed for partially freeing an allocated range would have to allow for the possibility of failure.

That said, maintaining boundary tags in vmem for UMA _NOFREE slabs is downright silly.

sys/vm/swap_pager.c
529–541 ↗(On Diff #32119)

With the addition of the below reservation, I think that this entire block of code is pointless. Let's consider 32-bit architectures, where this autoreduction could matter. I believe that the sizes of a swblk and a trie node are 72 and 44, respectively.

The trie node reservation could fail regardless of whether we autoreduce the swblk reservation or not. Let's consider the case where we did autoreduce it. In those cases, we reduce the swblk reservation size by 1/3. The trouble is that 1/3 of 72 is only 24. So, I claim that the subsequent trie node reservation is certain to fail.

Can we instead turn these panics into diagnostic output saying that you need to reduce your swzone manually? And, let the machine come up with swapping disabled?

sys/vm/swap_pager.c
529–541 ↗(On Diff #32119)

I think that disabling swapping is too much of an overreaction. I believe it is enough to replace the panic with a printf:

 	swzone = n * sizeof(struct swblk);
 	if (!uma_zone_reserve_kva(swpctrie_zone, n))
-		panic("Cannot reverse swap pctrie zone entries.");
+		printf("Cannot reverse swap pctrie zone, "
+		    "reduce kern.maxswzone.\n");
 }

Just an FYI, I'm doing some performance testing while sipping my morning coffee.

I've tried to create a test where the vm object has discontiguous swap space usage and the radix tree should be advantageous. In particular, I've tried to create a situation where vm object destruction should be faster, but I'm getting inconsistent results.

In regards to memory utilization, I've looked at a best-case scenario for the old code, where you have contiguous swap space usage. (To be clear, I'm talking about contiguous page index ranges within the vm object being swapped out, not contiguous on-disk storage.) My scenario was using about 48GB of swap space, and the total kernel memory used by the new radix trie code was slightly less than the old code if you account for the size of the swhash array.

In D11435#251395, @alc wrote:

I've tried to create a test where the vm object has discontiguous swap space usage and the radix tree should be advantageous. In particular, I've tried to create a situation where vm object destruction should be faster, but I'm getting inconsistent results.

In regards to memory utilization, I've looked at a best-case scenario for the old code, where you have contiguous swap space usage. (To be clear, I'm talking about contiguous page index ranges within the vm object being swapped out, not contiguous on-disk storage.) My scenario was using about 48GB of swap space, and the total kernel memory used by the new radix trie code was slightly less than the old code if you account for the size of the swhash array.

Isn't the main advantage of the new code the fact that the per-object tracking of the pindex->swap block relation is protected by the vm object lock instead of the global hash lock? If the memory usage by the new code is approximately the same, as expected, then this reason alone should be significant.

In D11435#251430, @kib wrote:
In D11435#251395, @alc wrote:

I've tried to create a test where the vm object has discontiguous swap space usage and the radix tree should be advantageous. In particular, I've tried to create a situation where vm object destruction should be faster, but I'm getting inconsistent results.

In regards to memory utilization, I've looked at a best-case scenario for the old code, where you have contiguous swap space usage. (To be clear, I'm talking about contiguous page index ranges within the vm object being swapped out, not contiguous on-disk storage.) My scenario was using about 48GB of swap space, and the total kernel memory used by the new radix trie code was slightly less than the old code if you account for the size of the swhash array.

Isn't the main advantage of the new code the fact that the per-object tracking of the pindex->swap block relation is protected by the vm object lock instead of the global hash lock? If the memory usage by the new code is approximately the same, as expected, then this reason alone should be significant.

Originally, I proposed the conversion from a single, global hash table to a per-object radix trie because I thought it would make operations like swap_pager_find_least() and swp_pager_meta_free_all() faster because they would no longer have to probe the hash table iteratively. However, I don't disagree with your statement about the change in locking being important. For example, vm_object_scan_all_shadowed() should hypothetically benefit from both a more efficient swap_pager_find_least() and the finer-grained locking.

Returning to memory utilization, the new code should deal with fragmented swap usage within vm objects more efficiently than the old code because the size of swblk is smaller than swblock, especially on 32-bit architectures where it is roughly 1/4 the size.

sys/vm/swap_pager.c
534–535 ↗(On Diff #32119)

Isn't this comment wrong? Aren't we subtracting 2/3rds of "n" from "n" below?

sys/vm/swap_pager.c
534–535 ↗(On Diff #32119)

We are subtracting 1/3 of n, rounded up.

I replaced the panic on swpctrie_zone reservation failure with a printf in my repo some time ago. Update the review.

sys/vm/swap_pager.c
534–535 ↗(On Diff #32119)

I read the "+ 2" as "* 2".

sys/vm/swap_pager.c
546–547 ↗(On Diff #31210)

"reverse" should be "reserve".

I've now completed 8+ hours of testing under a "make -j7 buildworld" workload on 1.5GB of RAM swapping to a Samsung 850 PRO, both with and without the patch. I think that there is too much variability in the execution time to conclude anything from it. However, here is the memory utilization story:

With the patch: 4699560 bytes of swblk and swpctrie structures

Without the patch: 5399776 bytes of SWAPMETA structures and swhash array

I'm pretty happy with this result.

This revision is now accepted and ready to land. Aug 25 2017, 5:33 PM
This revision was automatically updated to reflect the committed changes.