Page MenuHomeFreeBSD

Partially implement Linux MADV_DONTNEED semantics.
ClosedPublic

Authored by markj on Jun 17 2020, 6:39 PM.
Tags
None
Referenced Files
Unknown Object (File)
Dec 6 2024, 4:25 AM
Unknown Object (File)
Dec 1 2024, 4:58 PM
Unknown Object (File)
Nov 25 2024, 4:39 PM
Unknown Object (File)
Nov 25 2024, 4:39 PM
Unknown Object (File)
Nov 25 2024, 4:38 PM
Unknown Object (File)
Nov 25 2024, 4:38 PM
Unknown Object (File)
Nov 21 2024, 11:37 PM
Unknown Object (File)
Nov 21 2024, 11:37 PM
Subscribers

Details

Summary

This is based on Edward's D25272.

Linux MADV_DONTNEED is not advisory - it has side effects for anonymous
memory, and some system software depends on that. In particular,
MADV_DONTNEED causes anonymous pages to be discarded. If the mapping is
a private mapping of a named object then subsequent faults are to
repopulate the range from that object, otherwise pages will be
zero-filled. For non-anonymous pages, Linux MADV_DONTNEED can be
implemented in the same way as our MADV_DONTNEED.

Note, this implementation is not fully correct. If a private mapping of
a file is inherited through fork(), then Linux MADV_DONTNEED is supposed
to bypass the inherited copy-on-write mapping. That is, a subsequent
fault should not fetch pages from the shadowed object allocated by the
parent, but instead should go straight to the vnode object. To
implement this, we must descend through the shadow chain to the root,
and then re-shadow the root with a new anonymous object, or somehow
graft the leaf directly onto the root. This would require some support
from sys/vm, so it is not implemented for now. I do not know how common
this case is.

Test Plan

I wrote some test cases to verify Linux's behaviour, and tried them on
FreeBSD. Some cases result in different behaviour due to the missing
handling described above.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

markj requested review of this revision.Jun 17 2020, 6:39 PM
markj created this revision.
markj edited the test plan for this revision. (Show Details)
markj added reviewers: trasz, kib, alc, brooks, Linux Emulation.

For snmalloc, we use this call in the Linux PAL. On FreeBSD, we use mmap to map a new anonymous memory object over the old one, which has the same behaviour. We don't do this on shared memory mappings, but for sandboxing use cases we do need some equivalent mechanism (i.e. something that can zero a large range of shared memory, without requiring the OS to actually allocate physical pages for all of the zeroes). On Linux, we can do this with fallocate on the underlying memory object (I think we can do it with mem_fd objects, we can definitely do it in tmpfs). I am not sure what the correct mechanism for doing this in FreeBSD is.

sys/compat/linux/linux_mmap.c
303 ↗(On Diff #73234)

ONEMAPPING is not an indicator of the shared mapping, and it is only advisory.

322 ↗(On Diff #73234)

Hm, wouldn't this destroy dirty pages of the vnode when the mapping is a shared mapping of the file ?

This has obvious security implications.

For something like a device managed mappings it has other but equally interesting effects.

sys/compat/linux/linux_mmap.c
303 ↗(On Diff #73234)

If ONEMAPPING is set, then the object is privately mapped, but the converse statement is not true. I am trying to be conservative here: if it is possible that the mapping is shared, then we do not free any pages.

There is a race here though. OBJ_ANON is stable (i.e., never cleared during the object lifecycle), but ONEMAPPING is not. The flags must be checked under the object lock.

322 ↗(On Diff #73234)

I don't quite follow: to reach this point we must have OBJ_ANON set, which implies that the object type is OBJT_DEFAULT or _SWAP.

For snmalloc, we use this call in the Linux PAL. On FreeBSD, we use mmap to map a new anonymous memory object over the old one, which has the same behaviour. We don't do this on shared memory mappings, but for sandboxing use cases we do need some equivalent mechanism (i.e. something that can zero a large range of shared memory, without requiring the OS to actually allocate physical pages for all of the zeroes). On Linux, we can do this with fallocate on the underlying memory object (I think we can do it with mem_fd objects, we can definitely do it in tmpfs). I am not sure what the correct mechanism for doing this in FreeBSD is.

I'm not aware of any such mechanism other than ftruncate(), which is not sufficient if you need to zero a region that ends before the end of the object.

kib added inline comments.
sys/compat/linux/linux_mmap.c
322 ↗(On Diff #73234)

I missed this, ok.

This revision is now accepted and ready to land.Jun 18 2020, 5:42 PM

Fix racy OBJ_ONEMAPPING check.

This revision now requires review to proceed.Jun 18 2020, 9:41 PM
sys/compat/linux/linux_mmap.c
314 ↗(On Diff #73296)

I do not think this recheck is needed. The map is locked (even if for read), and since we saw OBJ_ONEMAPPING set, it cannot be cleared by any other way than forking or doing some operation on our map.

This revision is now accepted and ready to land.Jun 19 2020, 6:11 AM
sys/compat/linux/linux_mmap.c
314 ↗(On Diff #73296)

I believe you are right. This assumption is rather subtle though and will lead to some nasty bugs if it ever becomes false. Since rechecking is not particularly expensive, I prefer to keep it.

If we had some way of asserting that the corresponding vm_map lock is held when ONEMAPPING is checked or cleared, then I would be willing to change this. Maybe instead of checking the flag directly, we could have

bool
vm_map_owns_object(vm_map_t map, vm_map_entry_t entry)
{
    vm_object_t object;

    VM_MAP_ASSERT_LOCKED(map);
    object = entry->object.vm_object;
    return ((object->flags & OBJ_ONEMAPPING) != 0);
}

and a similar function to clear OBJ_ONEMAPPING which asserts that both the map and object locks are held.

kib added inline comments.
sys/compat/linux/linux_mmap.c
314 ↗(On Diff #73296)

I am mostly fine with it, but 'owns' is arguably the wrong term, and I cannot propose anything better.

Please go ahead with the current patch.