Page MenuHomeFreeBSD

pctrie: create iterator
ClosedPublic

Authored by dougm on Jun 18 2024, 8:05 AM.
Tags
None
Referenced Files
F108311958: D45627.id142135.diff
Thu, Jan 23, 6:51 PM
F108284662: D45627.id140886.diff
Thu, Jan 23, 12:25 PM
Unknown Object (File)
Wed, Jan 22, 2:35 AM
Unknown Object (File)
Wed, Jan 22, 2:32 AM
Unknown Object (File)
Wed, Jan 22, 2:31 AM
Unknown Object (File)
Mon, Jan 20, 9:13 PM
Unknown Object (File)
Mon, Jan 20, 2:57 PM
Unknown Object (File)
Sun, Jan 19, 4:04 PM

Details

Summary

Define a pctrie iterator type. A pctrie iterator is a wrapper around a pctrie that remembers a position in the trie where the last search left off, and where a new search can resume. When the next search is for an item very near in the trie to where the last search left off, iter-based search is faster because instead of starting from the root, the search usually only has to back up one or two steps up the root-to-last-search path to find the branch that leads to the new search target.

Every kind of lookup (plain, lookup_ge, lookup_le) that can begin with the trie root can begin with an iterator instead. An iterator can also do a relative search ("look for the item 4 greater than the last item I found") because it remembers where that last search ended. It can also search within limits ("look for the item bigger than this one, but it has to be less than 100"), which can save time when the next item beyond the limits and that is known before we actually know what that item it is. An iterator can also be used to remove an item that has already been found, without having to search for it again.

Iterators are vulnerable to unsynchronized data changes. If the iterator is created with a lock held, and that lock is released and acquired again, there's no guarantee that the iterator path remains valid.

Test Plan

I've run as much of the stress2 test suite as I can before my machine runs out of disk space.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dougm edited the summary of this revision. (Show Details)

Drop a useless variable.

Here's a panic with D45627.141818.patch:

20240805 10:00:41 all (4/17): pfl2.sh
Aug  5 10:06:17 mercat1 kernel: pid 60195 (swap), jid 0, uid 2007, was killed: a thread waited too long to allocate a page
panic: ASan: Invalid access, 2-byte read at 0xfffffe0182681998, UMAUseAfterFree(fd)
cpuid = 7
time = 1722845232
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0xa5/frame 0xfffffe014d3dd3d0
kdb_backtrace() at kdb_backtrace+0xc6/frame 0xfffffe014d3dd530
vpanic() at vpanic+0x226/frame 0xfffffe014d3dd6d0
panic() at panic+0xb5/frame 0xfffffe014d3dd790
kasan_report() at kasan_report+0xdf/frame 0xfffffe014d3dd860
pctrie_iter_jump_ge_limit() at pctrie_iter_jump_ge_limit+0xe5/frame 0xfffffe014d3dd8b0
swp_pager_meta_free() at swp_pager_meta_free+0x1cf/frame 0xfffffe014d3ddab0
vm_object_madvise() at vm_object_madvise+0x192/frame 0xfffffe014d3ddb30
vm_map_madvise() at vm_map_madvise+0x485/frame 0xfffffe014d3ddc50
sys_madvise() at sys_madvise+0x16a/frame 0xfffffeamd64_syscall() at amd64_syscall+0x39e/frame 0xfffffe014d3ddf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe014d3ddf30
--- syscall (75, FreeBSD ELF64, madvise), rip = 0x3baae8268a5a, rsp = 0x3baae5a540c8, rbp = 0x3baae5a540d0 ---
KDB: enter: panic
[ thread pid 60231 tid 100566 ]
Stopped

https://people.freebsd.org/~pho/stress/log/log0544.txt

Deleted nodes must not be left on the path and detected by their popmaps, iter_remove has to shorten the path to remove them.

dougm edited the summary of this revision. (Show Details)
dougm edited the test plan for this revision. (Show Details)
dougm edited reviewers, added: markj, kib; removed: alc, rlibby.
dougm edited subscribers, added: alc, rlibby; removed: markj.

Use iterators again for swap_meta_transfer. Use helper functions to wrap the pctrie operators to make the swap pager code more readable. Separate the iter initialization function from iter lookup functions.

cc -c -O2 -pipe  -fno-strict-aliasing  -g -nostdinc  -I. -I../../.. -I../../../contrib/ck/include -I../../../contrib/libfdt -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -MD  -MF.depend.subr_rangeset.o -MTsubr_rangeset.o -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include -fdebug-prefix-map=./x86=/usr/src/sys/x86/include -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding -fwrapv -fstack-protector -gdwarf-4 -Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef -Wno-pointer-sign -D__printf__=__freebsd_kprintf__ -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas -Wswitch -Wno-error=tautological-compare -Wno-error=empty-body -Wno-error=parentheses-equality -Wno-error=unused-function -Wno-error=pointer-sign -Wno-error=shift-negative-value -Wno-address-of-packed-member -Wno-format-zero-length   -mno-aes -mno-avx  -std=gnu99 -Werror ../../../kern/subr_rangeset.c
../../../kern/subr_rangeset.c:329:25: error: use of undeclared identifier 'src_rs'
  329 |         pctrie_iter_init(&it, &src_rs->rs_trie);
      |                                ^
1 error generated.
*** Error code 1

Fix error in DIAGNOSTIC compile.

There are witness_norelease() and witness_releaseok() functions which could be used to mark a region where specific lock must not be dropped. They might be useful to ensure that there is no parallel invalidation of iterators?

Initialize pctrie after reacquiring lock before lookup.

In D45627#1054575, @kib wrote:

There are witness_norelease() and witness_releaseok() functions which could be used to mark a region where specific lock must not be dropped. They might be useful to ensure that there is no parallel invalidation of iterators?

I don't find them used anywhere in the source now, so I don't know how they might be useful.

In D45627#1054575, @kib wrote:

There are witness_norelease() and witness_releaseok() functions which could be used to mark a region where specific lock must not be dropped. They might be useful to ensure that there is no parallel invalidation of iterators?

I don't find them used anywhere in the source now, so I don't know how they might be useful.

I do not quite understand this statement. The functions statically mark a region where the specific lock must not be dropped. Since typical problem affecting the patch seems to be an internal re-lock of the lock protecting pctrie, application of such marks should help detect unsafe places.

In D45627#1054949, @kib wrote:
In D45627#1054575, @kib wrote:

There are witness_norelease() and witness_releaseok() functions which could be used to mark a region where specific lock must not be dropped. They might be useful to ensure that there is no parallel invalidation of iterators?

I don't find them used anywhere in the source now, so I don't know how they might be useful.

I do not quite understand this statement. The functions statically mark a region where the specific lock must not be dropped. Since typical problem affecting the patch seems to be an internal re-lock of the lock protecting pctrie, application of such marks should help detect unsafe places.

Can you direct me to an example of the use of these functions in the kernel source code that I might use as an example to follow?

Suppose that I want to ensure that swp_pager_meta_transfer() does not drop srcobject lock when called from swap_pager_copy(). (This is not true, and running the patch below should catch it). Then I would do the following:

diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
index 59d947c71279..9c93d7ac0861 100644
--- a/sys/vm/swap_pager.c
+++ b/sys/vm/swap_pager.c
@@ -1084,8 +1084,10 @@ swap_pager_copy(vm_object_t srcobject, vm_object_t dstobject,
 	/*
 	 * Transfer source to destination.
 	 */
+	WITNESS_NORELEASE(&srcobject->lock);
 	swp_pager_meta_transfer(srcobject, dstobject, offset, dstobject->size,
 	    NULL);
+	WITNESS_RELEASEOK(&srcobject->lock);
 
 	/*
 	 * Free left over swap blocks in source.

Running the load which causes the unlock inside swp_pager_meta_transfer() should result in witness message 'forbidden unlock' and then panic, with WITNESS kernel option.

Add some NORELEASE/RELEASEOK witnesses to swap_pager.c. Verified by rerunning the *swap*.sh stress tests.

sys/kern/subr_pctrie.c
791–795

There is a macro DEBUG_POISON_POINTER(). Note that it leaves the value uninitialized in the non-INVARIANT builds.

Use POISON_POINTER instead of 0xdeadbeef to initialize meaningless value.

The functions that call swap_pager_find_least usually do it iteratively, with steadily increasing values for the pindex parameter. To exploit that, have the callers initialize an iterator and pass it to the function, instead of passing the object.

Add asserts to swblk_iter_init.

Give iter variables more domain-specific names than just 'it'.

Fix typo in swp_pager_meta_free.

Times for arm64 buildworld reflect some small benefit from this change.
Before:

World build completed on Thu Aug 15 03:30:02 CDT 2024
World built in 2733 seconds, ncpu: 16, make -j24


real 45m32.919s
user 675m22.247s
sys 21m49.799s

After:

World build completed on Thu Aug 15 04:21:10 CDT 2024
World built in 2715 seconds, ncpu: 16, make -j24


real 45m14.819s
user 673m54.010s
sys 21m19.693s

Ran a 8 hour test with D45627.142095.patch. No problems seen.

Can this review split into core + one review per subsystem (swap, tmpfs, arm64 pmap, amd64 pmap ...)?

sys/vm/swap_pager.c
2296

I do not like this interface change; you always pass some_object->size as size. This is because you otherwise do not need the object at all, right?

Another issue with iterators that I somewhat do not like is that it is possible to work on wrong object with an iterator. In fact, I think I want to add some field of type void * in the iterator, pointing back at the pctrie owner. Then e.g. swap_pager_find_least() can still access object.

In D45627#1056186, @kib wrote:

Can this review split into core + one review per subsystem (swap, tmpfs, arm64 pmap, amd64 pmap ...)?

I can easily split it into three pieces, though that will make testing more challenging.

sys/vm/swap_pager.c
2296

I don't need the object at all. In fact, I don't really need the size at all; I can just return OBJ_MAX_SIZE; neither caller is using the return value in a way where that's a problem. So I'll drop the size argument.

Adding a void* token is therefore not necessary to address your concern.

dougm edited the summary of this revision. (Show Details)

Discard everything but the pctrie_iter interface and implementation. The use of pctrie_iters for ranges is available for review in [D46314], and the use of pctrie_iters for swap-paging is available for review in [D46315]. Feel free to add yourself as reviewer or watcher to either of them, if you are so inclined.

Add a KASSERT to _pctrie_iter_lookup_node to detect freed nodes in the iter path, as might appear if an item is removed without using iter_remove, and frees the top path node.

I don't intend to make any more updates to this proposed change, and would appreciate a final verdict sometime soon.

I ran a 14 hour test with D45627.142138.patch without finding any issues.

sys/sys/pctrie.h
336

Rather than having _limit variants of various operations, did you consider storing a limit in the iterator itself and conditionally checking it? Since limit is an exclusive limit in this interface, that'd require another conditional, which we'd want to avoid if possible. But, it'd make the interface somewhat simpler. For example, the repetition of last in the loop below seems undesirable:

for (sb = swblk_start_limit(&blks, pindex, last),
            start = (sb != NULL && sb->p < pindex) ? pindex - sb->p : 0;
            sb != NULL; sb = swblk_next_limit(&blks, last), start = 0) {
sys/sys/pctrie.h
336

I did consider that, but it seemed to me that it would make those who don't need limits to pay a price for them. Because they will end up calling pctrie_iter_lookup_ge_limit instead of pctrie_iter_lookup_ge, they won't get the chance to have the compiler optimize out all the limit checks from _pctrie_iter_lookup_ge that they don't need.

I suppose that I could define two kinds of iterators, limited and unlimited, with the former storing a limit value and the latter avoiding the needless tests, with the iterator type determining whether the plain or limit functions get called. Is that a simplification or a complication?

Create a second iterator type that stores a limit, and provide interfaces that use that type, rather than an extra limit parameter.

This does mean that implementations that use the limit-storing iterator sometimes have to pull the regular iterator out of it to do some things. I have not added things like plain lookup to the limited-lookup iterator, so in rangesets, for example, there are places where the iterator has to be found within the limiting iterator to do a lookup call.

dougm added a reviewer: cem.

I think a good test of this interface would be to use it to iterate over the resident pages of VM objects. To make integration easier it would be reasonable IMO to commit to the interface you've implemented, and revise later if necessary based on experience gained from converting the vm_radix consumer, among others.

I left some comments, though I haven't fully read through and understood some of the pctrie subroutines. If you're happy with the implementation and interface so far, I think you should go ahead and commit it.

sys/kern/subr_pctrie.c
502

This part of the change (including the addition of pctrie_match_value()) should be committed separately. Or is it actually related to the rest of the patch?

636

Is this used anywhere? As far as I can see, iteration must be serialized by a mutex (i.e., the PCTRIE_LOCKED case); it's not safe to access the value from an iterator without that.

656

Would this benefit from a helper similar to pctrie_match_value()?

855

Shouldn't this be PCTRIE_LOCKED?

sys/sys/pctrie.h
44

Why is this needed? I'd expect an unused function wouldn't generate a warning if it's defined as inline.

336

As implemented, it's not obvious to me that the small optimization of omitting a limit check is worth the increase in complexity of the interface.

339

I wish there was a way to make the API surface smaller. _Generic is potentially helpful, but it can only really be used in a macro, and we can't generate macros here. Is there another way? I guess you could define something like:

#define PCTRIE_ITER_LOOKUP_LE(name, it, index) \
    name##_PCTRIE_VAL2PTR(_Generic(*(it), \
        struct pctrie_iter: pctrie_iter_lookup_le(it, index), \
        struct pctrie_iter_limit: pctrie_iter_limit_lookup_le(it, index)))

but that's inconsistent with the existing pctrie interface.

399

The indentation here is wrong.

402

Since 0 is used as a sentinel value for limit in the iterator implementation, should we assert here that limit > 0?

dougm added inline comments.
sys/kern/subr_pctrie.c
502

I'm using pctrie_match_value several places besides this. I didn't have to use it here, but it works here.

636

I had a plan for it once, but I've changed plans, so it is removed.

656

In lookup_le, the comparisons are '>'. So I don't see the benefit.

855

Okay.

sys/sys/pctrie.h
44

I was correcting an inconsistency. Almost every macro-generated inline function was already labelled unused. But I don't see its removal introducing any warnings, so it's removed.

336

I surrender.

339

If every pctrie function that returns a uint64_t that gets fed into VAL2PTR instead returns a void* and had an extra macro-invocation-generated argument to be passed to it, then a call written now as

SWAP_PCTRIE_LOOKUP(ptree, index)

would be written as

pctrie_lookup(SWAP_PCTRIE, ptree, index);

and we'd avoid all these 1-line wrapper functions.

I'd do it, if it would be accepted.

402

No. Consider how 'end' is used in vm_object_page_noreuse.

dougm marked 7 inline comments as done.

Discard separate structs and functions for seaching with limits. Now, everybody can have a limit.

Discard a few pieces unlikely to be useful.

sys/sys/pctrie.h
399

Presumably this assignment is supposed to be in pctrie_iter_init().

dougm marked an inline comment as done.

Fix iter initialization typo.
Add limit check to iter_stride.

I ran a full stress2 test with D45627.143204.patch added and saw no (new) issues.

I'm happy to see this land so that we can try converting consumers and get a feel for the interface.

This revision is now accepted and ready to land.Sep 13 2024, 2:56 PM
This revision was automatically updated to reflect the committed changes.