Page MenuHomeFreeBSD

Add linux_madvise(2).
ClosedPublic

Authored by trasz on Jun 14 2020, 9:59 PM.

Details

Summary

Add linux_madvise(2) instead of having Linux apps call native
FreeBSD madvise(2) directly. While some of the flag values match,
most don't.

PR: 230160

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

trasz requested review of this revision.Jun 14 2020, 9:59 PM
sys/compat/linux/linux_mmap.c
269 ↗(On Diff #73118)

Is Linux behavior to always invalidate/free the pages in DONTNEED range, or is it only optional ?

Can you return success there without doing anything at all ?

sys/compat/linux/linux_mmap.c
269 ↗(On Diff #73118)

From theraven's testcase (see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230160), it seems the zeroing should be immediate.

However, after some more testing, I've found that simply returning EINVAL breaks a whole lot of stuff. I think I'll add a sysctl to make it possible to enable the technically correct behaviour, but defaulting to a sane one instead, ie aliasing it for MADV_DONTNEED, or perhaps MADV_FREE.

sys/compat/linux/linux_mmap.c
269 ↗(On Diff #73118)

The thing that is not clear to me, assume that the range is backed by a shared file mapping that contains some dirty (not yet written back to the file) pages. What should happen after LINUX_MADV_DONTNEED ?

sys/compat/linux/linux_mmap.c
269 ↗(On Diff #73118)

Indeed, Linux has non-traditional semantics for DONTNEED. Anonymous pages must be zeroed, low-level system software like jemalloc depends on it (at least with older kernels that don't implement MADV_FREE). MADV_FREE does not provide this semantic.

It is not clear from the documentation what MADV_DONTNEED does for dirty file-backed pages (or named swap-backed pages). From reading the code and some hints on a few ML posts, I believe it always unmaps the pages but preserves dirty state, i.e., the page is marked dirty if PG_M was set. In other words, modifications to shared data are not discarded, and "repopulating the memory contents from the up-to-date contents of the underlying mapped file" may consist of simply mapping resident file pages.

I think we need a new madvise verb to implement these semantics. illumos added it and called it MADV_PURGE. Maybe MADV_LINUX_DONTNEED would be better, given that it is not really intended for use by native code.

sys/vm/vm_mmap.c
726 ↗(On Diff #73118)

Commit it separately?

sys/compat/linux/linux_mmap.c
269 ↗(On Diff #73118)

So it is unmap/fresh anon map for COW mappings (even if backed by a vnode down the shadow chain), and just unmap for shared vnode mappings ?

It is very weird, I do not think there is a reason to expose that in native madvise(), better to implement that in linuxolator.

sys/compat/linux/linux_mmap.c
269 ↗(On Diff #73118)

I believe that is right, but I am not yet certain.

The implementation would need to use vm_map_madvise() and vm_object_madvise(), wouldn't it? I do not see how it can be kept entirely separate without code duplication. We can avoid exposing the new madvise behaviour to native userspace, but this would make testing/fuzzing more challenging. I would propose simply leaving it undocumented instead.

sys/compat/linux/linux_mmap.c
269 ↗(On Diff #73118)

No I do not think that vm_object_madvise() is useful there.

If my interpretation is correct, we would just need to do pmap_remove() for vnode shared mapping, and vm_map_delete()/vm_map_insert() for CoW. Of course this should be accompanied by clipping and other usual VM API dances, but I do not see why cannot it be contained in linux_mmap.c.

Add support for arm64 and the compat.linux.madv_dontneed sysctl.

sys/amd64/linux/linux_machdep.c
147 ↗(On Diff #73152)

PTROUT is an odd choice here, it's not meant for this purpose. Just casting to uintptr_t seems like more the right thing.

sys/arm64/linux/syscalls.master
1314 ↗(On Diff #73152)

The void *addr, should be on a line on its own.

sys/vm/vm_mmap.c
719 ↗(On Diff #73152)

Since this in a new function, why not just expose there proper types and not have the temporaries? Is there a header pollution concern?

sys/vm/vm_mmap.c
719 ↗(On Diff #73152)

For consistency. The sys/syscallsubr.h header contains vm_offset_t and vm_size_t anyway; yet all the function definiotions except for one use the standard types instead.

sys/compat/linux/linux_mmap.c
268 ↗(On Diff #73152)

Instead of having a sysctl, let's just map it to FreeBSD MADV_DONTNEED for now. I will to write an implementation providing Linux's semantics and commit it as a followup. I can't imagine a scenario where changing this sysctl is going to have a useful effect for a user.

326 ↗(On Diff #73152)

I think this function would be a bit clearer if every case contained a return statement. We can call kern_madvise() directly from cases where a Linux madvise() verb maps to a FreeBSD madvise() verb.

sys/amd64/linux/linux_machdep.c
147 ↗(On Diff #73152)

This simply follows the code that's already there. However, I'm not sure what is the purpose of PTROUT - generally speaking, not just here; how is it different from just a uintptr_t typecast?

sys/compat/linux/linux_mmap.c
292 ↗(On Diff #73152)

DOFORK translates to VM_INHERIT_DEFAULT (i.e. VM_INHERIT_COPY), not _SHARE.

sys/compat/linux/linux_mmap.c
295 ↗(On Diff #73152)

MERGEABLE can simply return 0.

sys/compat/linux/linux_mmap.c
268 ↗(On Diff #73152)

Perfect, thank you!

292 ↗(On Diff #73152)

Thanks; I'll also change the spelling for MADV_KEEPONFORK.

295 ↗(On Diff #73152)

I'm not sure about this one. Note that this is not about merging map entries. Quoting Linux madvise(2):

Enable Kernel Samepage Merging (KSM) for the pages in the range specified by addr and length.  The kernel regularly scans those areas of user memory that have  been  marked
as  mergeable,  looking  for  pages  with identical content.  These are replaced by a single write-protected page (which is automatically copied if a process later wants to
update the content of the page).  KSM merges only private anonymous pages (see mmap(2)).

The KSM feature is intended for applications that generate many instances of the same data (e.g., virtualization systems such as KVM).  It can consume a lot  of  processing
power; use with care.  See the Linux kernel source file Documentation/vm/ksm.txt for more details.

The MADV_MERGEABLE and MADV_UNMERGEABLE operations are available only if the kernel was configured with CONFIG_KSM.
sys/compat/linux/linux_mmap.c
269 ↗(On Diff #73118)

I tried it. Clipping cannot be done outside of vm_map.c (and it should probably stay that way).

For CoW mappings, the situation is more complicated. See D25330. I think it is sufficient for jemalloc to work in most cases.

sys/amd64/linux/linux_machdep.c
147 ↗(On Diff #73152)

PTROUT is for cases where a pointer type is to be assigned to an integer when converting between struct types. e.g.:

struct foo {
  void *ptr;
  ...
};
struct foo32 {
  int32_t ptr;
  ...
};

void
foo2foo32(const struct foo *foo, struct foo32 *foo32)
{
  foo32->ptr = PTROUT(foo->ptr);
  ...
}

In practice PTROUT is almost always a sign of a bad API.

sys/vm/vm_mmap.c
719 ↗(On Diff #73152)

OK. I checked and in CheriBSD I went with:

int     kern_minherit(struct thread *td, vm_offset_t addr, vm_size_t size,
            int inherit);
sys/amd64/linux/linux_machdep.c
147 ↗(On Diff #73152)

Thank you, that makes sense.

I think I'd prefer going with the current state of code - with PTROUT, simply because of consistency with the code around it, which uses it all over the place - and then open a separate review to remove it throughout the linuxulator. Is that ok?

sys/vm/vm_mmap.c
719 ↗(On Diff #73152)

I think I still prefer keeping consistent with the rest of that API, ie kern_madvise(). I know, it might be my OCD speaking again :-)

trasz added inline comments.
sys/vm/vm_mmap.c
726 ↗(On Diff #73118)

Definitely.

markj added inline comments.
sys/compat/linux/linux_mmap.c
295 ↗(On Diff #73152)

Right, it is an advisory flag that tells the kernel it is allowed to deduplicate anonymous pages. The kernel is allowed to do nothing.

But reading the last sentence you quoted, I guess it is ok to leave it as-is, since we can also just pretend that we are a kernel without CONFIG_KSM defined.

This revision is now accepted and ready to land.Jun 18 2020, 2:33 PM
sys/vm/vm_mmap.c
726 ↗(On Diff #73118)

And of course I only remembered it five minutes after committing the whole thing. Hopefully it's ok as it is. Next time I should make it two separate reviews, I guess.