
Do not loop infinitely in vm_map_find().
Abandoned · Public

Authored by kib on Nov 19 2017, 12:19 PM.

Details

Reviewers
alc
markj
jhb
Summary

If we do not change the start address, there is no reason to retry when the previous attempt failed.

I believe that the loop there should not iterate more than twice, but I prefer to check for start modifications instead, to keep the loop structure for further changes.
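
A minimal sketch of the guard this describes, assuming an illustrative local prev_start and that the check sits inside the existing do-while loop; the actual diff may place it differently:

    result = KERN_SUCCESS;
    prev_start = 0;
    start = initial_addr;
    do {
        /* vm_map_findspace() and the alignment code may advance "start". */
        ...
        /*
         * If the previous attempt failed but the start address did not
         * change, retrying at the same place cannot succeed.
         */
        if (result != KERN_SUCCESS && start <= prev_start) {
            vm_map_unlock(map);
            return (KERN_NO_SPACE);
        }
        prev_start = start;
        /* vm_map_stack_locked() or vm_map_insert() sets "result". */
        ...
    } while (result == KERN_NO_SPACE && find_space != VMFS_NO_SPACE &&
        find_space != VMFS_ANY_SPACE);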

Test Plan

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

markj added inline comments.
sys/vm/vm_map.c
1501–1502

Why do we drop the map lock here in the find_space == VMFS_OPTIMAL_SPACE case?

This revision is now accepted and ready to land. Nov 20 2017, 2:22 AM
sys/vm/vm_map.c
1501–1502

It comes from jhb's r253471; I do not know why he decided to do the fallback this way.

sys/vm/vm_map.c
1501–1502

I ask because I believe we otherwise hold the map lock for the duration of the loop. If we entirely avoid dropping the lock, then we can be sure that the map has not changed between successive iterations, so a single attempt at a given address is sufficient.

sys/vm/vm_map.c
1501–1502

I think that dropping the lock can be avoided, but the second iteration cannot. Look at the VMFS_OPTIMAL_SPACE case, where we retry with the VMFS_ANY_SPACE specifier if our allocation failed.

sys/vm/vm_map.c
1501–1502

I don't recall there being any particular reason for dropping the map lock.

I'm presuming that this change is motivated by something in the ASLR/ASR patch. Otherwise, in isolation, i.e., not looking at the ASLR/ASR patch, I don't really see the point of this change. There is nothing incorrect about the added explicit test, but I would characterize it as pessimizing the expected case. In other words, the expected case is that vm_map_insert() succeeds. Failures should be the exception and the run-time cost of a failure is not great. So, why introduce a test that will be false in the expected case?

I think that the change to avoid dropping the lock is straightforward, i.e., it doesn't make the control flow any more convoluted.

Index: vm/vm_map.c
===================================================================
--- vm/vm_map.c (revision 326016)
+++ vm/vm_map.c (working copy)
@@ -1511,18 +1511,18 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_o
        } else
                alignment = 0;
        initial_addr = *addr;
+       vm_map_lock(map);
 again:
        start = initial_addr;
-       vm_map_lock(map);
        do {
                if (find_space != VMFS_NO_SPACE) {
                        if (vm_map_findspace(map, start, length, addr) ||
                            (max_addr != 0 && *addr + length > max_addr)) {
-                               vm_map_unlock(map);
                                if (find_space == VMFS_OPTIMAL_SPACE) {
                                        find_space = VMFS_ANY_SPACE;
                                        goto again;
                                }
+                               vm_map_unlock(map);
                                return (KERN_NO_SPACE);
                        }
                        switch (find_space) {
In D13155#274552, @alc wrote:

I'm presuming that this change is motivated by something in the ASLR/ASR patch. Otherwise, in isolation, i.e., not looking at the ASLR/ASR patch, I don't really see the point of this change. There is nothing incorrect about the added explicit test, but I would characterize it as pessimizing the expected case. In other words, the expected case is that vm_map_insert() succeeds. Failures should be the exception and the run-time cost of a failure is not great. So, why introduce a test that will be false in the expected case?

It came out of this bug report: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223732

In particular, the do-while loop's test doesn't really handle the case where find_space specifies an alignment.

In D13155#274558, @alc wrote:

I think that the change to avoid dropping the lock is straightforward, i.e., it doesn't make the control flow any more convoluted.

That change looks ok to me.

In D13155#274552, @alc wrote:

I'm presuming that this change is motivated by something in the ASLR/ASR patch. Otherwise, in isolation, i.e., not looking at the ASLR/ASR patch, I don't really see the point of this change. There is nothing incorrect about the added explicit test, but I would characterize it as pessimizing the expected case. In other words, the expected case is that vm_map_insert() succeeds. Failures should be the exception and the run-time cost of a failure is not great. So, why introduce a test that will be false in the expected case?

Yes, I fixed this in a way which does not require a complete rewrite of the ASR patch. I tried to note this in the revision annotation, in a somewhat vague form.

The relock elimination should be the next change, or you may commit it now; it looks fine.

In D13155#274552, @alc wrote:

I'm presuming that this change is motivated by something in the ASLR/ASR patch. Otherwise, in isolation, i.e., not looking at the ASLR/ASR patch, I don't really see the point of this change. There is nothing incorrect about the added explicit test, but I would characterize it as pessimizing the expected case. In other words, the expected case is that vm_map_insert() succeeds. Failures should be the exception and the run-time cost of a failure is not great. So, why introduce a test that will be false in the expected case?

It came out of this bug report: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223732

In particular, the do-while loop's test doesn't really handle the case where find_space specifies an alignment.

Hmm, the assumption was that when aligning a request, if the alignment moved the address such that it didn't fit, it would fail to insert and we would retry starting at the aligned address and progressively moving "farther" along in the address space until vm_findspace() failed. VMFS_SUPER_SPACE and VMFS_OPTIMAL_SPACE work the same way (pmap_align_superpage() just moves the address up). Do we know why vm_map_stack_locked() is failing in the looping case?

sys/vm/vm_map.c
1501–1502

I don't recall either, I think removing the lock drop is fine (and I think Alan's suggestion of doing that standalone first is probably a decent first change)

1525

This probably breaks VMFS_OPTIMAL_SPACE since it would not retry with VMFS_ANY_SPACE if we failed to find a superpage aligned region?

In D13155#274593, @jhb wrote:

Hmm, the assumption was that when aligning a request, if the alignment moved the address such that it didn't fit, it would fail to insert and we would retry starting at the aligned address and progressively moving "farther" along in the address space until vm_findspace() failed. VMFS_SUPER_SPACE and VMFS_OPTIMAL_SPACE work the same way (pmap_align_superpage() just moves the address up). Do we know why vm_map_stack_locked() is failing in the looping case?

The request was made for a large allocation with a large alignment.

[Removed nonsense]

sys/vm/vm_map.c
1525

If the VMFS_OPTIMAL_SPACE case failed, we execute goto again, which resets result to KERN_SUCCESS.

In D13155#274596, @kib wrote:
In D13155#274593, @jhb wrote:

Hmm, the assumption was that when aligning a request, if the alignment moved the address such that it didn't fit, it would fail to insert and we would retry starting at the aligned address and progressively moving "farther" along in the address space until vm_findspace() failed. VMFS_SUPER_SPACE and VMFS_OPTIMAL_SPACE work the same way (pmap_align_superpage() just moves the address up). Do we know why vm_map_stack_locked() is failing in the looping case?

The request was made for a large allocation with a large alignment.

I was more asking about which of the 'return (foo)' paths triggered in vm_map_stack_locked(), as the process of progressively walking the address space is used for VMFS_SUPER_SPACE and VMFS_OPTIMAL_SPACE (which is the default for mmap), not just for MAP_ALIGNED(). It's not clear to me why you couldn't get a similar hang if you used a large allocation with MAP_STACK with just VMFS_OPTIMAL_SPACE? The infinite loop in that case should result in never falling back to VMFS_ANY_SPACE?

sys/vm/vm_map.c
1525

Ah, ok.

In D13155#274622, @jhb wrote:
In D13155#274596, @kib wrote:
In D13155#274593, @jhb wrote:

Hmm, the assumption was that when aligning a request, if the alignment moved the address such that it didn't fit, it would fail to insert and we would retry starting at the aligned address and progressively moving "farther" along in the address space until vm_findspace() failed. VMFS_SUPER_SPACE and VMFS_OPTIMAL_SPACE work the same way (pmap_align_superpage() just moves the address up). Do we know why vm_map_stack_locked() is failing in the looping case?

The request was made for a large allocation with a large alignment.

I was more asking about which of the 'return (foo)' paths triggered in vm_map_stack_locked(), as the process of progressively walking the address space is used for VMFS_SUPER_SPACE and VMFS_OPTIMAL_SPACE (which is the default for mmap), not just for MAP_ALIGNED(). It's not clear to me why you couldn't get a similar hang if you used a large allocation with MAP_STACK with just VMFS_OPTIMAL_SPACE? The infinite loop in that case should result in never falling back to VMFS_ANY_SPACE?

Mmm, yes. In the case of the test program I have, the issue is that KERN_NO_SPACE is returned for a guard page size check failure; see D13186. So D13186, which returns a code different from KERN_NO_SPACE, would break the loop in this case.

What I do not like is that this loop allows KERN_NO_SPACE only when there is an intersection between the tried mapping and some existing map entry. We usually allow for a more flexible interpretation of the error codes, not prohibiting an error code just because some place makes specific use of it. So I think that a check to avoid a stuck step in the vm_map_find() loop is the good generic fix, while D13186 covers only one very specific bug; there might be more, just undetected.

Should we have an explicit check for address wrap resulting from the various alignment options? On architectures where the kernel address space uses the upper portion of the virtual address space and we disallow the use of virtual page 0, I believe that we are inherently protected from all cases of address wrap within the user address space. However, should this code be implicitly relying on those assumptions?

In D13155#274810, @alc wrote:

Should we have an explicit check for address wrap resulting from the various alignment options? On architectures where the kernel address space uses the upper portion of the virtual address space and we disallow the use of virtual page 0, I believe that we are inherently protected from all cases of address wrap within the user address space. However, should this code be implicitly relying on those assumptions?

We can reserve the page at the top of the KVA to avoid the wrap, automatically and without the added checks.

In D13155#274830, @kib wrote:
In D13155#274810, @alc wrote:

Should we have an explicit check for address wrap resulting from the various alignment options? On architectures where the kernel address space uses the upper portion of the virtual address space and we disallow the use of virtual page 0, I believe that we are inherently protected from all cases of address wrap within the user address space. However, should this code be implicitly relying on those assumptions?

We can reserve the page at the top of the KVA to avoid the wrap, automatically and without the added checks.

Yes, most of the time that approach will work, but what about the following scenario. Suppose that X is the requested alignment, and the specified address is greater than (unsigned)-X but less than the maximum possible address. Address wrap will occur before the first call to vm_map_insert(), so we also have to ensure that the kernel map never includes virtual page 0.

Let's do this.
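
For illustration only, a small standalone program showing the wrap described above; the 1 MB alignment and the particular starting address are made-up values chosen just to trigger the arithmetic:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t align = 1ULL << 20;               /* requested alignment X */
    uint64_t addr = UINT64_MAX - (1ULL << 19); /* > (unsigned)-X, < max address */

    /* The same rounding the alignment code performs. */
    if ((addr & (align - 1)) != 0) {
        addr &= ~(align - 1);
        addr += align;                         /* wraps past zero here */
    }
    /* Prints 0: the "aligned" address wrapped to the bottom of the space. */
    printf("aligned address: %#jx\n", (uintmax_t)addr);
    return (0);
}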

In D13155#274862, @alc wrote:
In D13155#274830, @kib wrote:
In D13155#274810, @alc wrote:

Should we have an explicit check for address wrap resulting from the various alignment options? On architectures where the kernel address space uses the upper portion of the virtual address space and we disallow the use of virtual page 0, I believe that we are inherently protected from all cases of address wrap within the user address space. However, should this code be implicitly relying on those assumptions?

We can reserve the page at the top of the KVA to avoid the wrap, automatically and without the added checks.

Yes, most of the time that approach will work, but what about the following scenario. Suppose that X is the requested alignment, and the specified address is greater than (unsigned)-X but less than the maximum possible address. Address wrap will occur before the first call to vm_map_insert(), so we also have to ensure that the kernel map never includes virtual page 0.

Ok.

Let's do this.

Could you please be more specific about 'this'? What do you propose to do:

  1. Reserve a page at top of KVA.
  2. Reserve a page at address 0 (how? the kernel map does not include the userspace address range, but I did not recheck)
  3. Both 1 and 2.
  4. Add explicit checks for wrap.

I would perhaps modify this patch to panic (maybe KASSERT()) in debug kernels as I think an infinite loop probably indicates a bug elsewhere, but I think it is fine for production kernels to fail with an error from mmap() rather than hanging.

My guess, Konstantin, at the answer to your last question from reading Alan's comments is 4): that Alan would like explicit checks. I think that is pretty simple to do, as you would just cache '*addr' before the switch on find_space, check for '*addr < saved_addr' after the switch, and fail with KERN_NO_SPACE if it is true.
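
A hedged sketch of the explicit check described above, reusing the alignment switch already present in vm_map_find(); saved_addr is an assumed local name, and the exact failure path (unlocking and returning KERN_NO_SPACE) is a guess:

    saved_addr = *addr;
    switch (find_space) {
    case VMFS_SUPER_SPACE:
    case VMFS_OPTIMAL_SPACE:
        pmap_align_superpage(object, offset, addr, length);
        break;
    case VMFS_ANY_SPACE:
        break;
    default:
        if ((*addr & (alignment - 1)) != 0) {
            *addr &= ~(alignment - 1);
            *addr += alignment;
        }
        break;
    }
    /*
     * The alignment moved the candidate address past the end of the
     * address space and it wrapped; fail instead of retrying with a
     * bogus address.
     */
    if (*addr < saved_addr) {
        vm_map_unlock(map);
        return (KERN_NO_SPACE);
    }
    start = *addr;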

Follow John's explanations:

  1. Convert error termination on no progress into assert.
  2. Add overflow checks into the loop.

An issue with the patch is that prev_start is write-only in non-debugging kernels.

This revision now requires review to proceed. Nov 27 2017, 8:04 PM

Mmm, you'd have to put prev_start under #ifdef INVARIANTS, I guess; either that or do this:

if (result != KERN_SUCCESS && start <= prev_start) {
#ifdef INVARIANTS
    panic("vm_map_findspace infinite loop");
#endif
    vm_map_unlock(map);
    return (KERN_NO_SPACE);
}
In D13155#276876, @jhb wrote:

Mmm, you'd have to put prev_start under #ifdef INVARIANTS, I guess; either that or do this:

if (result != KERN_SUCCESS && start <= prev_start) {
#ifdef INVARIANTS
    panic("vm_map_findspace infinite loop");
#endif
    vm_map_unlock(map);
    return (KERN_NO_SPACE);
}

Do we ever have a place where a non-debugging kernel works (and perhaps returns an error to userspace), but panics in a debugging config? Note that this is not how a normal assertion works: we do not detect a situation and then handle it for non-debug; we only detect it to panic, as an additional service for the developer.

Hide prev_start under INVARIANTS.

We do not; I just wasn't sure how careful/paranoid we wanted to be. I think it's fine to just use an assert.

This revision is now accepted and ready to land. Nov 28 2017, 5:59 PM
In D13155#274886, @kib wrote:
In D13155#274862, @alc wrote:
In D13155#274830, @kib wrote:
In D13155#274810, @alc wrote:

Should we have an explicit check for address wrap resulting from the various alignment options? On architectures where the kernel address space uses the upper portion of the virtual address space and we disallow the use of virtual page 0, I believe that we are inherently protected from all cases of address wrap within the user address space. However, should this code be implicitly relying on those assumptions?

We can reserve the page at the top of the KVA to avoid the wrap, automatically and without the added checks.

Yes, most of the time that approach will work, but what about the following scenario. Suppose that X is the requested alignment, and the specified address is greater than (unsigned)-X but less than the maximum possible address. Address wrap will occur before the first call to vm_map_insert(), so we also have to ensure that the kernel map never includes virtual page 0.

Ok.

Let's do this.

Could you please be more specific about 'this'? What do you propose to do:

  1. Reserve a page at top of KVA.
  2. Reserve a page at address 0 (how? the kernel map does not include the userspace address range, but I did not recheck)
  3. Both 1 and 2.
  4. Add explicit checks for wrap.

Sorry, let me clarify my position. I don't think that #1 by itself solves address wrap problems resulting from the alignment code. I tried to describe why in the quoted text above. Instead, we would have to exclude address 0, i.e., implement #2. However, there are legitimate cases where we want to allow address 0 in the vm map, e.g., bhyve's guest physical address space. So, I see no other option besides #4, explicit checks.

Set aside the proposed change for a moment. Suppose that find_space == VMFS_OPTIMAL_SPACE, that we are approaching the end of the address space, and that vm_map_findspace() succeeds. However, the address is not optimally aligned, so pmap_align_superpage() adjusts the address (upward). I believe that vm_map_insert() could return KERN_INVALID_ARGUMENT because the adjusted address plus length wraps, and as a result the loop terminates whereas it should have restarted after updating find_space.

In D13155#277218, @alc wrote:

Set aside the proposed change for a moment. Suppose that find_space == VMFS_OPTIMAL_SPACE, that we are approaching the end of the address space, and that vm_map_findspace() succeeds. However, the address is not optimally aligned, so pmap_align_superpage() adjusts the address (upward). I believe that vm_map_insert() could return KERN_INVALID_ARGUMENT because the adjusted address plus length wraps, and as a result the loop terminates whereas it should have restarted after updating find_space.

I decided to remove the reciprocally nested loops from vm_map_find(). By coalescing the place where VMFS_OPTIMAL_SPACE is converted into a VMFS_ANY_SPACE attempt, this should solve the issue you pointed out.

Untangle the loops in vm_map_find().

This revision now requires review to proceed. Nov 29 2017, 11:26 AM
sys/vm/vm_map.c
1572

I would perhaps remove the loop entirely and just do:

result = vm_map_find_locked(...);
if (result != KERN_SUCCESS && find_space == VMFS_OPTIMAL_SPACE) {
    *addr  = initial_addr;
    result = vm_map_find_locked(..., VMFS_ANY_SPACE, ...);
}
sys/vm/vm_map.c
1572

I considered this. I structured the second attempt with the loop because it is not obvious to the reader that the second invocation is the same.

See the next function, vm_map_find_min(), which uses a loop for the same reason.

sys/vm/vm_map.c
1572

See the next function, vm_map_find_min(), which uses a loop for the same reason.

This comment led me to look at vm_map_find_min() for the first time in months, and I think that function would really benefit from having an explanatory comment added to the code. (The commit message for r320430 was quite clear, and could be a good starting point.)

Add comment explaining vm_map_find_min().

sys/vm/vm_map.c
1585

I suggest: "vm_map_find_min() is a variant of vm_map_find() that takes an additional parameter (min_addr) and treats the given address (*addr) differently. Specifically, it treats *addr as a hint and not as the minimum address where the mapping is created.

1589

"This function ... First, it tries to ...

1590

"... fails and the hint is greater than min_addr, it performs a second pass, replacing the hint with min_addr as the minimum address for the allocation.

kib marked 3 inline comments as done.

Edit vm_map_find_min() comments.

The vm_map_find_min() comment looks ready for committing.

sys/vm/vm_map.c
1587

Please add another space after the period.

Trim patch after r326424.

This revision is now accepted and ready to land. Dec 2 2017, 5:10 PM

In both the unmodified and modified versions of this code, it appears to me that a call requesting alignment can fail returning KERN_INVALID_ADDRESS. This is a minor issue compared to the others that we've discussed, but I still think that this function should only return KERN_INVALID_ADDRESS when VMFS_NO_SPACE was specified.

More generally, I've come to the conclusion that speculatively calling vm_map_insert() is more trouble than it's worth. I'm tinkering with an alternate implementation where the loop consists of two steps: (1) alignment and (2) vm_map_findspace(). The loop terminates when either step does not result in any change to the address, so the sanity check proposed here is a regular part of the control flow.

This approach does introduce an additional call to vm_map_findspace(), that the speculative vm_map_insert() call avoided. However, in situations where the speculative vm_map_insert() call would have succeeded, the extra call to vm_map_findspace() will complete in constant time, because the map entry we are checking to verify that there is still sufficient space at the now aligned address is at the root of the tree.

I'll post something shortly.

In D13155#278433, @alc wrote:

More generally, I've come to the conclusion that speculatively calling vm_map_insert() is more trouble than it's worth. I'm tinkering with an alternate implementation where the loop consists of two steps: (1) alignment and (2) vm_map_findspace(). The loop terminates when either step does not result in any change to the address, so the sanity check proposed here is a regular part of the control flow.

This approach does introduce an additional call to vm_map_findspace(), that the speculative vm_map_insert() call avoided. However, in situations where the speculative vm_map_insert() call would have succeeded, the extra call to vm_map_findspace() will complete in constant time, because the map entry we are checking to verify that there is still sufficient space at the now aligned address is at the root of the tree.

I also considered doing vm_map_findspace() after the realignment, but decided that it is a pessimization for the common case. My feeling is that it is rare to have a tightly fragmented address space or to allocate from the fragmented part of the address space, so typically vm_map_insert() would just succeed. Adding one more vm_map_findspace() (we cannot remove the first one without seriously degrading the search) did not feel right.

Here is what I have:

        if (find_space >> 8 != 0) {
                KASSERT((find_space & 0xff) == 0, ("bad VMFS flags"));
                alignment = (vm_offset_t)1 << (find_space >> 8);
        } else
                alignment = 0;
        vm_map_lock(map);
        if (find_space != VMFS_NO_SPACE) {
                min_addr = *addr;
again:
                if (vm_map_findspace(map, min_addr, length, addr) ||
                    (max_addr != 0 && *addr + length > max_addr)) {
                        rv = KERN_NO_SPACE;
                        goto done;
                }
                while (find_space != VMFS_ANY_SPACE) {
                        /*                                                                          
                         * At the start of every iteration, the free space at                       
                         * address "*addr" is at least "length" bytes.                              
                         */
                        orig_addr = *addr;
                        if (find_space == VMFS_SUPER_SPACE ||
                            find_space == VMFS_OPTIMAL_SPACE) {
                                pmap_align_superpage(object, offset, addr,
                                    length);
                        } else if ((*addr & (alignment - 1)) != 0) {
                                *addr &= ~(alignment - 1);
                                *addr += alignment;
                        }
                        aligned_addr = *addr;
                        if (aligned_addr == orig_addr) {
                                /*                                                                  
                                 * Alignment did not change "*addr", so                             
                                 * "*addr" must still provide sufficient                            
                                 * free space.                                                      
                                 */
                                break;
                        }

                        /*                                                                          
                         * Test for address wrap on "*addr".  A wrapped                             
                         * "*addr" could be a valid address, in which case                          
                         * vm_map_findspace() cannot be relied upon to fail.                        
                         */
                        if (aligned_addr < orig_addr ||
                            vm_map_findspace(map, aligned_addr, length,
                            addr) ||
                            (max_addr != 0 && *addr + length > max_addr)) {
                                if (find_space == VMFS_OPTIMAL_SPACE) {
                                        find_space = VMFS_ANY_SPACE;
                                        goto again;
                                }
                                rv = KERN_NO_SPACE;
                                goto done;
                        }
                        if (*addr == aligned_addr) {
                                /*                                                                  
                                 * If a successful call to vm_map_findspace()                       
                                 * did not change "*addr", then "*addr" must                        
                                 * still be aligned and provide sufficient                          
                                 * free space.                                                      
                                 */
                                break;
                        }
                }
        }
        if ((cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP)) != 0) {
                rv = vm_map_stack_locked(map, *addr, length, sgrowsiz, prot,
                    max, cow);
        } else {
                rv = vm_map_insert(map, object, offset, *addr, *addr + length,
                    prot, max, cow);
        }
done:
        vm_map_unlock(map);
        return (rv);

Essentially, the "redundant" call to vm_map_findspace() provides the rest of the address wrap checks:

/*                                                                                          
 * Request must fit within min/max VM address and must avoid                                
 * address wrap.                                                                            
 */
if (start < map->min_offset)
        start = map->min_offset;
if (start + length > map->max_offset || start + length < start)
        return (1);

/* Empty tree means wide open address space. */
if (map->root == NULL) {
        *addr = start;
        return (0);
}

/*                                                                                          
 * After splay, if start comes before root node, then there                                 
 * must be a gap from start to the root.                                                    
 */
map->root = vm_map_entry_splay(start, map->root);
if (start + length <= map->root->start) {
        *addr = start;
        return (0);
}

And, even if alignment changed the address, as long as the free space preceding the root is sufficiently large, the redundant call completes in constant time. The splay is not going to restructure the tree.

Should I create a new review for my patch?

In D13155#278471, @alc wrote:

Should I create a new review for my patch?

I think yes.