Page MenuHomeFreeBSD

Add a file ops hook for mmap().
ClosedPublic

Authored by jhb on May 26 2015, 9:04 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jul 6, 7:37 PM
Unknown Object (File)
Sat, Jul 6, 1:51 AM
Unknown Object (File)
Sun, Jun 23, 3:48 AM
Unknown Object (File)
Sun, Jun 23, 3:45 AM
Unknown Object (File)
Sun, Jun 23, 3:45 AM
Unknown Object (File)
Sat, Jun 15, 10:41 PM
Unknown Object (File)
Jun 14 2024, 7:10 AM
Unknown Object (File)
Jun 10 2024, 1:45 PM
Subscribers

Details

Summary

Add a new file ops hook to allow file types to provide support for
type-specific mmap rather than having all the logic live in vm_mmap.c.
Currently file-type specific logic is split between both sys_mmap()
and vm_mmap(), so this requires reshuffling the order of certain
operations in the sys_mmap() flow. In particular, we now lookup the
VM object for a mapping request a bit sooner.

I would like to add a new file type for some work I'm doing and would
like this new file type to support mmap. The current interface to do
handle new file types between sys_mmap() and vm_mmap() is a bit clunky.

I considered putting vm_mmap_cdev() logic into a devfs-specific fo_mmap
method, but OBJT_DEVICE is used by existing vm_mmap() consumers, and
a devfs fo_mmap method would have to duplicate the rest of vn_mmap()
except for the call to vn_mmap_vnode().

There is at least one #if 0 to clean up yet. Also, the interface is
rather clunky having to pass so many args in. I also don't like having
to allow writing of cap_maxprot but vnodes need this to allow writes to
MAP_PRIVATE mappings. I guess the capabilities only disallow writing
for MAP_SHARED mappings? Would be nice if there was a cleaner way to
handle that in sys_mmap() rather than vn_mmap().

Test Plan
  • Booted a kernel to multiuser (this should test MAP_ANON and various files).
  • Used a simple test program I've used previously for shm_open().

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jhb retitled this revision from to Add a file ops hook for mmap()..
jhb updated this object.
jhb edited the test plan for this revision. (Show Details)
jhb added reviewers: alc, kib.
jhb added a subscriber: np.

Could you, please write 2-3 lines summary of the interface for fo_mmap() ? Possibly as a comment.

sys/kern/uipc_shm.c
869 ↗(On Diff #5716)

You are mixing styles. I usually test the flags explicitly with != 0.

Hmm, I just had a thought about a way to make this a lot less hacky perhaps. Part of what makes the interface so yucky is that it's a callout in the "middle" of the system call. Perhaps instead I should expose vm_mmap_shared() as a new vm_mmap_object() function and change the fo_mmap() routines to invoke that directly (MAP_ANON can be handled explicitly in sys_mmap()). This would remove several of the the variables that are passed back and forth (max_prot, obj, writecounted, etc.). Instead, fo_mmap() would just need to return an error and the virtual address on success and would need to take a few inputs such as the syscall args and cap_maxprot.

I will try updating this to do that and see if it is cleaner. I think this is more consistent with other fileops where the fileop is expected to call other routines to "do the work" rather than just being queried to "return some of the info needed for the work"

jhb edited edge metadata.
  • Expose vm_mmap_shared() as vm_mmap_object() and change fo_mmap methods to call it.
  • Tidy and diff-reduce.
  • Move the foff check back to the shared callback.
  • Remove the nested call to vm_mmap_cdev() by making vn_mmap() call it
  • Use consistent "test a flag" style.
jhb marked an inline comment as done.May 28 2015, 6:20 PM
jhb added inline comments.
sys/kern/uipc_shm.c
869 ↗(On Diff #5716)

In this case it is because the code comes from different functions whcih used different styles. :-/

I actually prefer to not use the != 0 for testing single flags, but I will update these. I think vn_mmap() likely has a mix of styles as well now for the same reason.

jhb marked an inline comment as done.May 28 2015, 6:43 PM
jhb added inline comments.
sys/vm/vm_mmap.c
1419 ↗(On Diff #5753)

I don't know why this check is here and why it returns 0 instead of EINVAL.

Humm, it goes all the way back to when vm_mmap.c was first imported into BSD in 1990. :-/

I don't think we pass a size of zero explicitly to vm_mmap(). Hmm, I thought sys_mmap() was filtering all these out, but it is not. Oddly enough, if you do something like this in an old binary:

p = mmap(NULL, 0, PROT_READ, MAP_SHARED, fd, PAGE_SIZE / 2);

You will get back a valid pointer for the first page of the file. If you pass a position of 0 you would get back some address that might have been mapped (but for something else) or might not have been.

I would be inclined to move the size == 0 handling up into sys_mmap() itself. Perhaps bail early with a 0 return value for uap->len == 0. However, that breaks the weird case above. Do we know if any binaries are doing zero-length mmaps with non-page-aligned file offsets and expecting to get a page? I guess I could always do the size == 0 check after the pageoff adjustment to size. I would then want vm_mmap() to fail with EINVAL because I don't think in-kernel drivers should be doing zero-length mappings.

1466 ↗(On Diff #5753)

I don't know if it would be better to perform the EINVAL checks later in this function before doing the racct stuff. There doesn't seem to be anything to unwind these racct reservations if the mapping fails.

Hmm, I guess if they fail later on it is sort of harmless since the next mapping attempt will "fix" the accounting.

Yes, this is much easier to understand, and IMO it is very good improvement into the current maze of the sys_mmap() top part.

sys/kern/vfs_vnops.c
2402 ↗(On Diff #5753)

What prevents the vnode reclaim and corresponding vp->v_mount from invalidation between two checks in the if () test ?

2442 ↗(On Diff #5753)

You call vm_object_deallocate() both in the fo_mmap implementations, e.g. it is done both in vn_mmap and shm_mmap, and in the error path of the vm_mmap_object(). Is it intended, or the deallocation happens one more time than it should ?

Also, why do you need to pass writecounted down to vm_mmap_object ? You get the control back in vn_mmap, so you could release write count there,

sys/kern/vfs_vnops.c
2402 ↗(On Diff #5753)

Mmm, is the ref count from the fp enough (fp is held by the caller (sys_mmap) for the duration of this function)? This is the same now as in the old code. I am also assuming that this reference from the fp is enough to make v_rdev stable for the call to vm_mmap_cdev() in vn_mmap().

2442 ↗(On Diff #5753)

Ah, that is true. I think I will remove it from vm_mmap_object() (I had fixed vm_mmap() to drop it on error as well). I think it is cleaner for the caller (e.g. fo_mmap methods) to drop the reference explicitly on error since they gained the reference.

Unfortunately I cannot move the writecounted hack since vm_mmap() with OBJT_VNODE needs to be able to do it as well. Well, I guess I could duplicate it in both vm_mmap() and vn_mmap(). That might actually be cleaner since it is a kind of cleanup only for vm_mmap_vnode() and would be done at the same level as the call to vm_mmap_vnode().

Which do you prefer? Doing the deallocates and writecounted checks in the callers will result in more duplication. The reason I can see in favor of doing it is that it makes the vm_mmap_object() API a bit simpler and keeps code changes at the same level (e.g. writecounted is only in functions that directly use vm_mmap_vnode()).

Actually, I've broken the writecounted updates now in that if the racct errors trigger then they won't get backed out. Moving the cleanup into the callers will fix this so I will do that.

  • Move cleanup on error out of vm_mmap_object() and update comments.
  • Fail internal mapping requests of size zero but allow it for old binaries.
  • Compile.
sys/kern/vfs_vnops.c
2403 ↗(On Diff #5755)

A struct mount is type stable; the vp->v_mount could only transition from valid pointer to NULL. Usual solution is to ensure that vp->v_mount is evaluated only once, i.e. cache it into local variable. Look for instance at the MNT_SHARED_WRITES() or MNT_EXTENDED_SHARED() inlines in the sys/mount.h.

2443 ↗(On Diff #5755)

I agree with the move of dereference to the callers.

sys/vm/vm_mmap.c
1482 ↗(On Diff #5755)

I am not aware of such cases, but I would also probably be extra-cautious. Consumers of the mmap interface tend to exercise all corner cases, usually.

jhb marked an inline comment as done.
  • Cache mount for MNT_NOEXEC check and split out devfs_mmap_f().
sys/fs/devfs/devfs_vnops.c
1767 ↗(On Diff #5756)

I kept the MNT_NOEXEC check to err on the side of caution though I know Konstantin stated on IRC he feels it isn't very useful.

I was originally worried that having this separate from vn_mmap would result in a lot of duplication, but I don't think it does. One thing I am assuming is that hwpmc doesn't care about executable mappings belonging to devices.

1792 ↗(On Diff #5756)

In theory we could actually dispense with devfs_fp_check() and just pass fp->f_data down as the cdev pointer to vm_mmap_cdev(). However, I think it is cleaner to be more consistent. This also results in td_fpop being handled the way it is for other operations (and sys_mmap() no longer has to mess with it).

sys/fs/devfs/devfs_vnops.c
1787 ↗(On Diff #5756)

Note that vm_mmap_cdev() also calls dev_refthread() internally. Although seemingly innocent, I do not like the recursion for si_threadcount.

Might be, vm_mmap_cdev() should be split into two functions, one which does refthread wrapping and calls another. Then another could be used there as well. Also, it should reduce the number of places where dev_relthread() is called in vm_mmap_cdev(), making the logic cleaner.

1788 ↗(On Diff #5756)

if (error != 0)

sys/vm/vm_mmap.c
328 ↗(On Diff #5756)

For modern binaries, the function returns an error earlier.

336 ↗(On Diff #5756)

Might be, we should finally add CTASSERTs() for this repeated claim.

1323 ↗(On Diff #5756)

Is this check and cleanup still needed in vm_mmap_vnode() ? Unless I am confused, there are two callers of vm_mmap_vnode(), and both of them check writecounted. vn_mmap() would need to do
error = vm_mmap_vnode(); if (error == 0) error = vm_mmap_object(); if (error != 0) { if (writecounted) ...}
to not skip writecounted cleanup, but this should be fine.

1448 ↗(On Diff #5756)

Aren't this removes support for mmapping shm objects using vm_mmap() ? Do we need the feature ?

jhb marked 3 inline comments as done.Jun 1 2015, 4:10 PM
jhb added inline comments.
sys/fs/devfs/devfs_vnops.c
1787 ↗(On Diff #5756)

I can push the refthread into vm_mmap() around vm_mmap_cdev().

I suspect that all the callers of vm_mmap() with OBJT_DEVICE already hold a thread reference on the cdev, but there doesn't appear to be way to assert that easily.

sys/vm/vm_mmap.c
328 ↗(On Diff #5756)

Ah, yes, the function returns the error, not the binary itself.

336 ↗(On Diff #5756)

The one other bit of confusion is that we aren't consistent with prot. vm_mmap() and fo_mmap() generally accept the mmap arg (PROT_*) not a vm_prot_t, but it is passed as a vm_prot_t. It might be more accurate for prot to be
an int that is always the mmap args and have vm_mmap_object be the only place doing the implicit conversion.

Actually, I think it might be better if I fix vm_mmap_cdev/vnode and fo_mmap to assume VM_PROT_* instead.

1323 ↗(On Diff #5756)

On error vm_mmap_vnode() doesn't return a valid object pointer (which is needed to pass to vnode_pager_update_writecount()).

Hmm, there appears to be an old bug here in that if vm_pager_allocate() fails and we had set writecounted previously, then we will pass a NULL object to vnode_pager_update_writecount.

1448 ↗(On Diff #5756)

We should not. I only added that for the system call to map them. I don't know of any drivers that are mapping them in userland address spaces via ioctls.

jhb marked 2 inline comments as done.
  • Move thread reference out of vm_mmap_cdev() into vm_mmap().
  • Comment tweak.
  • Use VM_PROT_* with vm_mmap().
  • Treat 'prot' passed to fo_mmap() and vm_mmap_*() as vm_prot_t (VM_PROT_*).
kib edited edge metadata.
kib added inline comments.
sys/fs/devfs/devfs_vnops.c
1787 ↗(On Diff #5866)

We can assert that si_threadcount > 0. It would not ensure that the counter owner is the caller, but is good enough. E.g., the same assert is used in VFS when held vnode must be passed around.

1796 ↗(On Diff #5866)

if (error != 0) there too and in several other places. I did not marked them all to not clutter the review.

sys/vm/vm_mmap.c
336 ↗(On Diff #5866)

Sounds fine.

This revision is now accepted and ready to land.Jun 1 2015, 4:52 PM
jhb marked an inline comment as done.Jun 1 2015, 5:09 PM
jhb added inline comments.
sys/fs/devfs/devfs_vnops.c
1787 ↗(On Diff #5866)

The only place that calls vm_mmap() with OBJT_DEVICE is actually drm_mapbufs(). Those are always done from an ioctl, and so devfs_ioctl_f() already holds a reference. OTOH, I'd have to get to the devsw somehow. Probably clearer to leave it as-is.

jhb edited edge metadata.
  • Explicitly test error against 0.
This revision now requires review to proceed.Jun 1 2015, 5:27 PM
sys/vm/vm_mmap.c
1516–1518 ↗(On Diff #5868)

There is a ')' missing.

I think that this is a good idea. I'm happy to see it go into the tree when you're done with it. If there is any specific part that you want me to take a really careful look at just let me know.

jhb edited edge metadata.
  • Comment fix.
jhb marked an inline comment as done.Jun 4 2015, 5:57 PM

I am happy for this to go in now (i.e. I think I am finished modulo any further review comments). I don't think there are any parts that I'm unsure of. fo_mmap() has also worked fine for my desired use case of adding a new file type that supports mmap in a child branch of this (though I also think that this is cleaner even without that extra flexibility).

jhb added a reviewer: jhb.
This revision is now accepted and ready to land.Jun 4 2015, 7:27 PM
This revision was automatically updated to reflect the committed changes.