Page MenuHomeFreeBSD

Add linux_madvise(2).
ClosedPublic

Authored by trasz on Jun 14 2020, 9:59 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 4, 11:44 AM
Unknown Object (File)
Wed, Dec 25, 12:47 AM
Unknown Object (File)
Thu, Dec 12, 6:52 PM
Unknown Object (File)
Nov 22 2024, 12:57 AM
Unknown Object (File)
Nov 15 2024, 3:32 PM
Unknown Object (File)
Nov 6 2024, 11:04 PM
Unknown Object (File)
Oct 31 2024, 3:21 AM
Unknown Object (File)
Oct 24 2024, 2:17 PM
Subscribers

Details

Summary

Add linux_madvise(2) instead of having Linux apps call native
FreeBSD madvise(2) directly. While some of the flag values match,
most don't.

PR: 230160

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 31725
Build 29293: arc lint + arc unit

Event Timeline

trasz requested review of this revision.Jun 14 2020, 9:59 PM
sys/compat/linux/linux_mmap.c
353

Is Linux behavior to always invalidate/free the pages in DONTNEED range, or is it only optional ?

Can you return success there without doing anything at all ?

sys/compat/linux/linux_mmap.c
353

From theraven's testcase (see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230160), it seems the zeroing should be immediate.

However, after some more testing, I've found that simply returning EINVAL breaks a whole lot of stuff. I think I'll add a sysctl to make it possible to enable the technically correct behaviour, but defaulting to a sane one instead, ie aliasing it for MADV_DONTNEED, or perhaps MADV_FREE.

sys/compat/linux/linux_mmap.c
353

The thing that is not clear to me, assume that the range is backed by a shared file mapping that contains some dirty (not yet written back to the file) pages. What should happen after LINUX_MADV_DONTNEED ?

sys/compat/linux/linux_mmap.c
353

Indeed, Linux has non-traditional semantics for DONTNEED. Anonymous pages must be zeroed, low-level system software like jemalloc depends on it (at least with older kernels that don't implement MADV_FREE). MADV_FREE does not provide this semantic.

It is not clear from the documentation what MADV_DONTNEED does for dirty file-backed pages (or named swap-backed pages). From reading the code and some hints on a few ML posts, I believe it always unmaps the pages but preserves dirty state, i.e., the page is marked dirty if PG_M was set. In other words, modifications to shared data are not discarded, and "repopulating the memory contents from the up-to-date contents of the underlying mapped file" may consist of simply mapping resident file pages.

I think we need a new madvise verb to implement these semantics. illumos added it and called it MADV_PURGE. Maybe MADV_LINUX_DONTNEED would be better, given that it is not really intended for use by native code.

sys/vm/vm_mmap.c
726

Commit it separately?

sys/compat/linux/linux_mmap.c
353

So it is unmap/fresh anon map for COW mappings (even if backed by a vnode down the shadow chain), and just unmap for shared vnode mappings ?

It is very weird, I do not think there is a reason to expose that in native madvise(), better to implement that in linuxolator.

sys/compat/linux/linux_mmap.c
353

I believe that is right, but I am not yet certain.

The implementation would need to use vm_map_madvise() and vm_object_madvise(), wouldn't it? I do not see how it can be kept entirely separate without code duplication. We can avoid exposing the new madvise behaviour to native userspace, but this would make testing/fuzzing more challenging. I would propose simply leaving it undocumented instead.

sys/compat/linux/linux_mmap.c
353

No I do not think that vm_object_madvise() is useful there.

If my interpretation is correct, we would just need to do pmap_remove() for vnode shared mapping, and vm_map_delete()/vm_map_insert() for CoW. Of course this should be accompanied by clipping and other usual VM API dances, but I do not see why cannot it be contained in linux_mmap.c.

Add support for arm64 and the compat.linux.madv_dontneed sysctl.

sys/amd64/linux/linux_machdep.c
147

PTROUT is an odd choice here, it's not meant for this purpose. Just casting to uintptr_t seems like more the right thing.

sys/arm64/linux/syscalls.master
1314

The void *addr, should be on a line on its own.

sys/vm/vm_mmap.c
719

Since this in a new function, why not just expose there proper types and not have the temporaries? Is there a header pollution concern?

sys/vm/vm_mmap.c
719

For consistency. The sys/syscallsubr.h header contains vm_offset_t and vm_size_t anyway; yet all the function definiotions except for one use the standard types instead.

sys/compat/linux/linux_mmap.c
268

Instead of having a sysctl, let's just map it to FreeBSD MADV_DONTNEED for now. I will to write an implementation providing Linux's semantics and commit it as a followup. I can't imagine a scenario where changing this sysctl is going to have a useful effect for a user.

326

I think this function would be a bit clearer if every case contained a return statement. We can call kern_madvise() directly from cases where a Linux madvise() verb maps to a FreeBSD madvise() verb.

sys/amd64/linux/linux_machdep.c
147

This simply follows the code that's already there. However, I'm not sure what is the purpose of PTROUT - generally speaking, not just here; how is it different from just a uintptr_t typecast?

sys/compat/linux/linux_mmap.c
292

DOFORK translates to VM_INHERIT_DEFAULT (i.e. VM_INHERIT_COPY), not _SHARE.

sys/compat/linux/linux_mmap.c
295

MERGEABLE can simply return 0.

sys/compat/linux/linux_mmap.c
268

Perfect, thank you!

292

Thanks; I'll also change the spelling for MADV_KEEPONFORK.

295

I'm not sure about this one. Note that this is not about merging map entries. Quoting Linux madvise(2):

Enable Kernel Samepage Merging (KSM) for the pages in the range specified by addr and length.  The kernel regularly scans those areas of user memory that have  been  marked
as  mergeable,  looking  for  pages  with identical content.  These are replaced by a single write-protected page (which is automatically copied if a process later wants to
update the content of the page).  KSM merges only private anonymous pages (see mmap(2)).

The KSM feature is intended for applications that generate many instances of the same data (e.g., virtualization systems such as KVM).  It can consume a lot  of  processing
power; use with care.  See the Linux kernel source file Documentation/vm/ksm.txt for more details.

The MADV_MERGEABLE and MADV_UNMERGEABLE operations are available only if the kernel was configured with CONFIG_KSM.
sys/compat/linux/linux_mmap.c
353

I tried it. Clipping cannot be done outside of vm_map.c (and it should probably stay that way).

For CoW mappings, the situation is more complicated. See D25330. I think it is sufficient for jemalloc to work in most cases.

sys/amd64/linux/linux_machdep.c
147

PTROUT is for cases where a pointer type is to be assigned to an integer when converting between struct types. e.g.:

struct foo {
  void *ptr;
  ...
};
struct foo32 {
  int32_t ptr;
  ...
};

void
foo2foo32(const struct foo *foo, struct foo32 *foo32)
{
  foo32->ptr = PTROUT(foo->ptr);
  ...
}

In practice PTROUT is almost always a sign of a bad API.

sys/vm/vm_mmap.c
719

OK. I checked and in CheriBSD I went with:

int     kern_minherit(struct thread *td, vm_offset_t addr, vm_size_t size,
            int inherit);
sys/amd64/linux/linux_machdep.c
147

Thank you, that makes sense.

I think I'd prefer going with the current state of code - with PTROUT, simply because of consistency with the code around it, which uses it all over the place - and then open a separate review to remove it throughout the linuxulator. Is that ok?

sys/vm/vm_mmap.c
719

I think I still prefer keeping consistent with the rest of that API, ie kern_madvise(). I know, it might be my OCD speaking again :-)

trasz added inline comments.
sys/vm/vm_mmap.c
726

Definitely.

markj added inline comments.
sys/compat/linux/linux_mmap.c
295

Right, it is an advisory flag that tells the kernel it is allowed to deduplicate anonymous pages. The kernel is allowed to do nothing.

But reading the last sentence you quoted, I guess it is ok to leave it as-is, since we can also just pretend that we are a kernel without CONFIG_KSM defined.

This revision is now accepted and ready to land.Jun 18 2020, 2:33 PM
sys/vm/vm_mmap.c
726

And of course I only remembered it five minutes after committing the whole thing. Hopefully it's ok as it is. Next time I should make it two separate reviews, I guess.