Details

Reviewers

mav
trasz
markj
emaste

Commits

rGbbf7a4e878ed: O_PATH: allow vnode kevent filter on such files
rGf9b923af34a6: O_PATH: Allow to open symlink
rGa5970a529c2d: Make files opened with O_PATH to not block non-forced unmount
rG8d9ed174f3af: open(2): Implement O_PATH
rG437c241d0c78: vfs_vnops.c: Make vn_statfile() non-static
rG42be0a7b10b1: Style.

Summary

This was done from reading the Linux man page for open(2). I did not performed experiments on Linux to see how is this implementation is compatible with it.

As I understand, it was asked for by samba devs/porters.

Test program is available at https://gist.github.com/35e8084c99fcc9a2af0b1ae79dd845d7

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

kib added inline comments.Mar 30 2021, 4:35 PM

sys/kern/kern_descrip.c
4985	You mean O_EXLOCK is handled in my implementation? Yes I kept them explicitly, patch would be simpler if adv locking is ignored for O_PATH descriptors. Do you prefer to have adv locking ignored? I did not see a reason to disable that.

Disable adv locking.
As a safety measure, do not set any FREAD/FWRITE/FEXEC flags on O_PATH files, and not enable any caprights except LOOKUP.

markj added inline comments.Mar 31 2021, 9:46 PM

lib/libc/sys/open.2
324
333
sys/kern/kern_descrip.c
718	Does it make sense to permit F_GETLK for O_PATH descriptors?
3547	Should it be permitted for O_PATH?
sys/kern/vfs_syscalls.c
1112	Is it correct? For a file opened with O_PATH \| O_EXEC, we indeed want the CAP_FEXECVE right, no? Linux documentation states that O_PATH descriptors may be used with fexecve().

kib marked 5 inline comments as done.Mar 31 2021, 10:24 PM

kib added inline comments.

sys/kern/kern_descrip.c
718	Disabled.
3547	Disabled.
sys/kern/vfs_syscalls.c
1112	I did not knew that, I intended to disable fexecve(). Tried to handle O_EXEC then.

Man page editings.
Disable flock(2) and fcntl(F_GETLK) for O_PATH.
Enable fexecve(2) for O_PATH.

I started writing some test cases for O_PATH/AT_EMPTY_PATH BTW.

lib/libc/sys/open.2
323
sys/kern/vfs_syscalls.c
1026	Or handle it here if (flags & O_EXEC) { cap_rights_set_one(rightsp, CAP_FEXECVE); if (flags & O_PATH) return; } else { ...
1116	We should not allow O_CREAT either, I suspect.
sys/kern/vfs_vnops.c
417	I am not sure that we want to bypass MAC for O_PATH opens, especially since VEXEC access is possible.

kib marked 4 inline comments as done.Apr 1 2021, 6:57 PM

kib added inline comments.

sys/kern/vfs_vnops.c
417	Then O_VERIFY should be accepted as well, and VOP_ACCESS needs to be performed. I restructured the introductory part of vn_open_vnode() to handle all this.

Add MAC and access checks for O_EXEC | O_PATH case.
Restructure caprights calculation.

wulf added a subscriber: wulf.Apr 3 2021, 2:26 PM

wulf added inline comments.

sys/kern/kern_descrip.c
4985	Could we leave vn_kqfilter here? It looks like O_PATH is very helpful for kqueue-based file monitors. Especially for watching for file systems which have expensive open() like fuse and nfs.

Enable EVFILT_VNODE kqfilter for O_PATH files.
Make O_PATH references ephemeral, i.e. non-forces unmount should be allowed if only O_PATH files are opened (I did not tested that thing work at all after the change).
Improve the man page, listing all currently allowed ops in more structured way.

Fix some vrefact(9)s, O_PATH files only hold vnode, the vnode is not used.

arichardson added a subscriber: arichardson.Apr 5 2021, 3:39 PM

In D29323#658959, @walker.aj325_gmail.com wrote:
acl_get_fd(fd): Invalid argument
futimens(fd): Invalid argument
extattr_set_fd(fd): Invalid argument
Okay. These now fail as expected. errno is EINVAL rather than EBADF. In case of futimens() Linux fails with EBADF.

Would it be possible to include a basic test that does things like checks for EBADF?

I am able to open 600 root:wheel files as ordinary user with O_PATH flag added. Is it intended behavior?
If yes, we should deny kevent() for such a fds as it is a security hole.

In D29323#663235, @wulf wrote:

I am able to open 600 root:wheel files as ordinary user with O_PATH flag added. Is it intended behavior?

Yes intended. O_PATH should not result in any operation on the file not available through the path, but that's all.

If yes, we should deny kevent() for such a fds as it is a security hole.

Can you formulate exactly what should we deny? I do not see how to limit this usefully, unless you can provide a criteria, I will roll back enablement of the kqueue' vnode filter.

In D29323#663265, @kib wrote:

Can you formulate exactly what should we deny? I do not see how to limit this usefully, unless you can provide a criteria, I will roll back enablement of the kqueue' vnode filter.

At the present state of patch we can open() e.g. /etc/master.password as unprivileged user and start listening for kevents generated by superuser actions.
It was not possible earlier as open() was returning EACCES but it is possible now.
Linux guys do not count this as security hole. They allow unprivileged users to get inotify events on such a files when parent directory is listened, but BSDs did not allow this before.
IMO, we have two options to keep previous behavior.

Add extra fmode flag (FACCES) to force kevent() failures on such a file descriptors.
Add extra open() flag to process such a descriptors slightly differently from O_PATH.

I don't have a strong opinion which option is better.

In D29323#663322, @wulf wrote:

In D29323#663265, @kib wrote:

Can you formulate exactly what should we deny? I do not see how to limit this usefully, unless you can provide a criteria, I will roll back enablement of the kqueue' vnode filter.

At the present state of patch we can open() e.g. /etc/master.password as unprivileged user and start listening for kevents generated by superuser actions.
It was not possible earlier as open() was returning EACCES but it is possible now.
Linux guys do not count this as security hole. They allow unprivileged users to get inotify events on such a files when parent directory is listened, but BSDs did not allow this before.
IMO, we have two options to keep previous behavior.

Add extra fmode flag (FACCES) to force kevent() failures on such a file descriptors.

Add extra open() flag to process such a descriptors slightly differently from O_PATH.

I don't have a strong opinion which option is better.

If we have x access to the directory, we can stat files in it, or we can open files with O_PATH. In either case, we can notice:

writes, by observing changes in the mtime, size, and probably other stat fields
reads, mostly by the changes in atime. If atime updates are disabled for the volume, perhaps we cannot
linking and unlinking the file, by observing the nlink count
removal of the directory entry, by watching e.g. the parent directory

So for me, it seems that we already get the same knowledge just from the ability to do stat(2). Two quirks are that

with knotes we get notifications almost realtime, while with stat(2) we must poll
we do not get notifications about open/close, and depending on configuration, about reads.

Do we still consider it wrong? If yes, I can add some global sysctl knob to enable/disable vnode event filter on O_PATH files, but IMO it is overkill.

Do we still consider it wrong?

I do not have enough level of competence in security area to give solid answer to this question. IMO albeit "Security through obscurity" principle exists, it is better to have this feature than not to have it at all even at the price of realtime notification.

Also after some playing with O_PATH in libinotify, I found pair of issues:

Our current implementation does not open symlinks itself when O_PATH|O_NOFOLLOW open() flags are set simultaneously like Linux does. It returns EMLINK error instead of.
There is no faccess() call to check access permissions on application layer. But others miss this feature as well, so that is not our bug, but just a note.

In D29323#663984, @wulf wrote:

Also after some playing with O_PATH in libinotify, I found pair of issues:

Our current implementation does not open symlinks itself when O_PATH|O_NOFOLLOW open() flags are set simultaneously like Linux does. It returns EMLINK error instead of.

Should be fixed.

There is no faccess() call to check access permissions on application layer. But others miss this feature as well, so that is not our bug, but just a note.

Can you elaborate more? I am not aware what faccess() is, and there is no Linux man page either. Is this something like faccessat(fd, NULL, mode, AT_EMPTY_PATH) ?

Allow open(symlink, O_PATH|O_NOFOLLOW)

Can you elaborate more?

Linux checks for access rights to a file when establish an inotify watch on it (but ignores access rights for subwatches)

We misses that ability with O_PATH at least in raceless way as VOP_ACCESS stage is skipped at open() and there is no access() variant that have fd parameter passed in.

I am not aware what faccess() is, and there is no Linux man page either.

It is really called faccessx(), not faccess(). My bad, sorry: https://www.ibm.com/docs/en/i/7.4?topic=ssw_ibm_i_74/apis/faccessx.htm

Is this something like faccessat(fd, NULL, mode, AT_EMPTY_PATH) ?

AT_EMPTY_PATH looks promising. I should test it. Does it work when fd is not a directory?

In D29323#664079, @wulf wrote:

Can you elaborate more?

Linux checks for access rights to a file when establish an inotify watch on it (but ignores access rights for subwatches)

You mean, with O_PATH we should check the snapshot of current access rights when fd is passed to kqueue for vnode filter?

We misses that ability with O_PATH at least in raceless way as VOP_ACCESS stage is skipped at open() and there is no access() variant that have fd parameter passed in.

I am not aware what faccess() is, and there is no Linux man page either.

It is really called faccessx(), not faccess(). My bad, sorry: https://www.ibm.com/docs/en/i/7.4?topic=ssw_ibm_i_74/apis/faccessx.htm

I cannot find faccessx() on my Fedora 33 VM either. The reference seems to be about AIX?

Is this something like faccessat(fd, NULL, mode, AT_EMPTY_PATH) ?

AT_EMPTY_PATH looks promising. I should test it. Does it work when fd is not a directory?

Yes, it should work when fd is arbitrary file, this is the point of AT_EMPTY_PATH. It should be "" and not NULL in the second arg, now that I look at the snippet.

An attempt to fchdir() using an O_PATH | O_DIRECTORY descriptor yields:

VNASSERT failed: old > 0 not true at /usr/home/markj/src/freebsd-dev/sys/kern/vfs_subr.c:3031 (vrefact)
0xfffff8051d419000: type VDIR
    usecount 1, writecount 0, refcount 2 seqc users 0 mountedhere 0
    hold count flags ()
    flags (VMP_LAZYLIST)
    lock type ufs: UNLOCKED
        nlink=2, effnlink=2, size=512, extsize 0
        generation=67b06701, uid=1001, gid=0, flags=0x0
        ino 7867288, on dev ada2s1
panic: vrefact: wrong use count 0
cpuid = 21
time = 1617808572
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01ae1cf980
vpanic() at vpanic+0x181/frame 0xfffffe01ae1cf9d0
panic() at panic+0x43/frame 0xfffffe01ae1cfa30
vrefact() at vrefact+0x5e/frame 0xfffffe01ae1cfa50
sys_fchdir() at sys_fchdir+0x83/frame 0xfffffe01ae1cfac0
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe01ae1cfbf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01ae1cfbf0

markj added inline comments.Apr 7 2021, 4:11 PM

sys/kern/vfs_syscalls.c
3470	Linux does not permit fsync/fdatasync with O_PATH descriptors. I can't really see why it would be useful to diverge there.
sys/sys/fcntl.h
159	Should it be `(fflags & (O_EXEC \| O_PATH))`?
162	I think O_PATH should be included here, else it is not returned by fcntl(F_GETFL), contrary to Linux.

You mean, with O_PATH we should check the snapshot of current access rights when fd is passed to kqueue for vnode filter?

Yes! It would be ideal as it would eliminate any security concerns and make faccess* stuff unneeded. We also can cache result of VOP_ACCESS performed at open() time with VWRITE & VREAD flags kept and than use it at kevent() times to keep traditional kqueue access rights checking style but it could be an overkill.

I cannot find faccessx() on my Fedora 33 VM either. The reference seems to be about AIX?

Yes, it is AIX and OS/400 specific function.

In D29323#664210, @markj wrote:

An attempt to fchdir() using an O_PATH | O_DIRECTORY descriptor yields:

VNASSERT failed: old > 0 not true at /usr/home/markj/src/freebsd-dev/sys/kern/vfs_subr.c:3031 (vrefact)
0xfffff8051d419000: type VDIR
    usecount 1, writecount 0, refcount 2 seqc users 0 mountedhere 0
    hold count flags ()
    flags (VMP_LAZYLIST)
    lock type ufs: UNLOCKED
        nlink=2, effnlink=2, size=512, extsize 0
        generation=67b06701, uid=1001, gid=0, flags=0x0
        ino 7867288, on dev ada2s1
panic: vrefact: wrong use count 0
cpuid = 21
time = 1617808572
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01ae1cf980
vpanic() at vpanic+0x181/frame 0xfffffe01ae1cf9d0
panic() at panic+0x43/frame 0xfffffe01ae1cfa30
vrefact() at vrefact+0x5e/frame 0xfffffe01ae1cfa50
sys_fchdir() at sys_fchdir+0x83/frame 0xfffffe01ae1cfac0
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe01ae1cfbf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01ae1cfbf0

Should be fixed.

In D29323#664224, @wulf wrote:

You mean, with O_PATH we should check the snapshot of current access rights when fd is passed to kqueue for vnode filter?

Yes! It would be ideal as it would eliminate any security concerns and make faccess* stuff unneeded. We also can cache result of VOP_ACCESS performed at open() time with VWRITE & VREAD flags kept and than use it at kevent() times to keep traditional kqueue access rights checking style but it could be an overkill.

Why VWRITE?
And the behavior with the check for current access rights is absolutely non-typical for file descriptors, which should catch the result of check at open time.

I added the check on each VOP, but I probably drop it. It is there for now if you want to play with it and for others to state their opinion. I do not see a problem with allowing kevent(2) on O_PATH.

I cannot find faccessx() on my Fedora 33 VM either. The reference seems to be about AIX?

Yes, it is AIX and OS/400 specific function.

I believe faccessat(AT_EMPTY_PATH) gives this semantic.

sys/kern/vfs_syscalls.c
3470	But user can call sync(2) always, so what would be the point disallowing fsync(2)? I can but it is strange.

Fix fchdir()
Fix flags translation between ext<->int
Add checks for access on each VOP_KQFILTER (to be dropped)

Why VWRITE?

It depends on orthogonality of O_PATH
If we consider O_PATH flag to be orthogonal to other access mode flags like O_RDONLY, O_WRONLY and O_RDWR, we should pass VWRITE to VOP_ACCESS if O_WRONLY or O_RDWR is specified in open() flag list.
If O_PATH is just a next access mode in the list and it can not be combined with other aforementioned O_* flags, then only VREAD should be passed to VOP_ACCESS unconditionally.

And the behavior with the check for current access rights is absolutely non-typical for file descriptors, which should catch the result of check at open time.

To catch the result of check at open time is what I offered in second sentence of my previous reply. We can add second VOP_ACCESS(VREAD) call at open() and use its returned value in knfilter later.

I added the check on each VOP, but I probably drop it. It is there for now if you want to play with it and for others to state their opinion. I do not see a problem with allowing kevent(2) on O_PATH.

I tested It and found it working as expected after fixing a typo in VOP_ACCESS parameters (s/FREAD/VREAD/). Thanks!

Check access on open and allow kevent vnode filter for O_PATH files if VOP_ACCESS(VREAD) passed.

markj added inline comments.Apr 8 2021, 5:04 PM

sys/kern/vfs_syscalls.c
3470	I do not have a strong argument, it just seems more in line with expected semantics for O_PATH, and it is hard to imagine a scenario where something would want to call fsync() on an O_PATH fd. Having a minimal set of permitted interfaces for O_PATH descriptors makes it easier to think about their properties and how they interact with the rest of the system. In this case I agree it's a minor point and don't insist on anything. Capsicum even permits sync(2), though that seems strange to me.

Disable fsync(O_PATH)

Check access on open and allow kevent vnode filter for O_PATH files if VOP_ACCESS(VREAD) passed.

kevent() is always failing now on registering of O_PATH file descriptors.

Debugging printfs show that FKQALLOWED flag is not preserved between open() and kevent() calls.

Move FKQALLOWED to the final f_flag value.

Move FKQALLOWED to the final f_flag value.

FKQALLOWED flag is still unset in vn_kqfilter_opath()

In D29323#664971, @wulf wrote:

Move FKQALLOWED to the final f_flag value.

FKQALLOWED flag is still unset in vn_kqfilter_opath()

Could you please give me the min example?

Could you please give me the min example?

$ cat test.c

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <fcntl.h>
#include <err.h>
#include <errno.h>

int
main(int argc, char *argv[])
{
	int fd, kq, nev;
	struct kevent ev;
	static const struct timespec tout = { 1, 0 };

	if ((fd = open(argv[1], O_PATH | O_NOFOLLOW | O_NONBLOCK)) == -1)
		err(1, "Cannot open `%s'", argv[1]);

	if ((kq = kqueue()) == -1)
		err(1, "Cannot create kqueue");

	EV_SET(&ev, fd, EVFILT_VNODE, EV_ADD | EV_ENABLE | EV_CLEAR,
	    NOTE_DELETE|NOTE_WRITE|NOTE_EXTEND|NOTE_ATTRIB|NOTE_LINK|
	    NOTE_RENAME|NOTE_REVOKE|NOTE_READ|NOTE_OPEN|NOTE_CLOSE|
	    NOTE_CLOSE_WRITE, 0, 0);
	if (kevent(kq, &ev, 1, NULL, 0, &tout) == -1)
		err(1, "kevent %d", errno);
}

$ cc test.c -o test
$ ./test test.c
test: kevent 9: Bad file descriptor

Fix propagating FKQALLOWED

In D29323#665100, @kib wrote:

Fix propagating FKQALLOWED

It is working now. Thanks!

I would suggest explicitly prohibiting AIO operations on O_PATH descriptors, in aio_aqueue(). Today reads and writes are prohibited already, but fsyncs are not and who knows what might be added in the future.

We do not always return EBADF in response to invalid system calls, e.g., mmap() returns ENODEV. I am not sure if it is worth addressing.

sys/kern/kern_descrip.c
555	This case is a bit strange. The ioctl will always fail for O_PATH fds, but the flags may be modified by the loop above, in principle. Perhaps we should check for path_fileops before that and unconditionally fail. All flags settable with F_SETFL are related to I/O operations on the file.
774	I doubt we should permit O_PATH fds here, we disallow posix_fadvise() already.
sys/kern/vfs_syscalls.c
1147	Perhaps assert that O_PATH was not specified in this case.