Page MenuHomeFreeBSD

Expose eventfd in the native API/ABI using a new __specialfd syscall
Needs ReviewPublic

Authored by greg_unrelenting.technology on Oct 4 2020, 8:06 PM.

Details

Summary

eventfd is a Linux system call that produces special file descriptors for event notification.
When porting Linux software, it is currently usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues are not passable between processes! (And as noted by the author of epoll-shim, even if they were, the library state would also have to be passed somehow.) This came up when debugging strange HW video decode failures in Firefox. A native implementation would avoid these problems and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for the Linuxulator), it's pretty easy to expose it natively, which is what this patch does.

It's still slightly WIP (no manpages yet, and I also want to add procstat support) but I'm posting it now to get feedback on the whole thing and on sys/eventfd.h specifically.

Notes:

  • eventfd_read/write one-liners are from musl libc. I didn't make them static inline because they're usually symbols in Linux libcs and with symbols I'm more confident that software still using epoll-shim's implementation would correctly pick the right set of functions;
  • EFD_* flags are using Linux values of CLOEXEC and NONBLOCK to avoid extra conversions on the Linuxulator side. Illumos also did it like this btw;
  • The reuse of eventfd_ioctl in Linuxulator timerfd is no more, there's timerfd_ioctl now. I'm not sure if it's a valuable reuse really.
Test Plan

Example program from the Linux manpage is an okay example that works with the native implementation but not with epoll-shim:

% ./eventfd-test 1 2 4 7 14
Child writing 1 to efd
Child writing 2 to efd
Child writing 4 to efd
Child writing 7 to efd
Child writing 14 to efd
Child completed write loop
Parent about to read
Parent read 28 (0x1c) from efd
% cc -o eventfd-test-esh -I/usr/local/include -L/usr/local/lib -lepoll-shim eventfd-test.c
% ./eventfd-test-esh 1 2 4 7 14
Child writing 1 to efd
write: Bad file descriptor
Parent about to read
read: Operation not supported

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
lib/libc/stdio/Symbol.map
182 ↗(On Diff #77865)

I couldn't find any better place? sys doesn't seem to have anything with a .3 manpage.. (Though these could be just listed on eventfd(2)?)

sys/compat/linux/linux_event.c
850

This (and most other things) are just moved from linux_event.c, I didn't think changing them was appropriate.. okay, will do

sys/kern/sys_eventfd.c
1

Should be the linux_event.c copyright since the code is mostly from there?

120

I was thinking of just renaming LINUXEFD to EVENTFD, since in this patch the linux part just reuses the same type..?

306

Only direct users of kqueue would be able to use kn_data, right? The cast is ehh.. I don't know about the cast.

sys/kern/sys_eventfd.c
306

Only direct users of kqueue would be able to use kn_data, right? The cast is ehh.. I don't know about the cast.

Yes, only kqueue users would be able to do that. Right now, the kqueue integration is only used as an implementation detail for Linuxulator's epoll emulation, so setting kn_data doesn't matter at all. But if eventfd is native, the kqueue integration suddenly becomes part of the API surface.

sys/kern/sys_eventfd.c
1

That seems reasonable to me.

88

Should we check for unknown flags?

120

That sounds reasonable to me.

sys/compat/linux/linux_event.c
861

I guess the intention of the original code was to make FIOASYNC do nothing, but it seems fine to leave kern_ioctl to set it in f_flag, we just don't do anything with that bit anywhere.

greg_unrelenting.technology added inline comments.
sys/compat/linux/linux_event.c
861

(disregard that comment, I thought .fo_ioctl overrides kern_ioctl but no, it's called from there)

greg_unrelenting.technology added inline comments.
sys/kern/sys_eventfd.c
306

Well, I mean that kqueue is often used indirectly through libevent/libuv/etc too.

I guess the cast is okay if documented in kqueue(2). Interestingly the C cast operator does not guarantee that casting unsigned to signed would be a simple byte reinterpretation but on all our platforms it should be equivalent anyway..?

353

It was like this.. Looking at other stat implementations like kqueue_stat, seems like they bzero and return 0, guess I should go with that?

sys/kern/sys_eventfd.c
88

That's happening in sys_eventfd and linux_eventfd2

greg_unrelenting.technology edited the summary of this revision. (Show Details)
  • Cleaned various things up (based on feedback)
  • Added kqueue filter data (hopefully the casts are fine)
  • Switched constants to O_ native ones
  • Changed stat to non-error (same as the impl for kqueue)
  • Moved libc extra functions to the gen subdirectory (this was the feedback in D2172)
  • Added manpage (written from scratch to avoid licensing issues)
  • Added audit stuff
  • Added procstat stuff

This mostly looks ok to me, my comments are just nits.

lib/libc/sys/eventfd.2
80

This paragraph is mixing tenses. I think "is not set" is better, ditto below.

164

returns

sys/compat/linux/linux_event.c
670

It seems better to translate to the aliases, i.e., flags |= EFD_CLOEXEC or EFD_NONBLOCK.

sys/kern/sys_eventfd.c
149

I think you can instead just assert that f_type == DTYPE_EVENTFD.

181

Same here and below, I don't believe this needs to be checked.

189

The goto can be eliminated with a loop:

while (error == 0 && efd->efd_count == 0) {
...
357

The opening brace should be on the previous line.

sys/sys/eventfd.h
1

Missing a copyright header.

greg_unrelenting.technology added inline comments.
sys/kern/sys_eventfd.c
181

Yeah looks like this isn't checked for e.g. pipes. I wonder why this check was added originally — possibly due to the mixing of eventfd and timerfd in one file.

sys/sys/eventfd.h
1

With these trivial API headers, I don't even know who should have the copyright. I can put myself I guess..

I see various errors in the man page:

[XeNoN:freebsd-head-clean]# mandoc -Tlint lib/libc/sys/eventfd.2
mandoc: lib/libc/sys/eventfd.2:44:2: ERROR: skipping end of block that is not open: Fc
mandoc: lib/libc/sys/eventfd.2:188:2: ERROR: inserting missing end of block: Sh breaks Bl
mandoc: lib/libc/sys/eventfd.2:129:2: WARNING: moving paragraph macro out of list: Pp
mandoc: lib/libc/sys/eventfd.2:129:2: WARNING: skipping paragraph macro: Pp before Pp
mandoc: lib/libc/sys/eventfd.2:192:2: WARNING: unusual Xr order: close after select

Therefore, I will add bcr as a reviewer, since he is more competent in this area.
Thanks for working on this.

I am not sure that this feature deserves FreeBSD syscall. Could we get away with /dev/eventfd, which would return eventfd on open ? Libc wrapper then handles compat for eventfd(2).

lib/libc/gen/eventfd_rw.c
35 ↗(On Diff #78167)

return (value);

Is errno supposed to contain a valid value when -1 is returned ? It is currently not.

sys/compat/linux/linux_event.c
665

spaces should be put around both sides of binary op:
LINUX_O_CLOEXEC | LINUX_O_NONBLOCK | EFD_SEMAPHORE

671

Should this be LINUX_EFD_SEMAPHORE ?

sys/compat/linux/linux_event.h
59 ↗(On Diff #78167)

I do not think this constant should be removed. It is Linux ABI.

sys/kern/capabilities.conf
403

The file is ordered alphabetically.

sys/kern/sys_eventfd.c
259

Spaces around |.

sys/sys/user.h
439

I wonder if the contamination of the header with include of eventfd.h is worth it. IMO it is good enough to use uint64_t directly.

In D26668#596865, @kib wrote:

Could we get away with /dev/eventfd, which would return eventfd on open ? Libc wrapper then handles compat for eventfd(2).

Illumos gets away with that, I guess we could – but a syscall works much better with Capsicum. With the device based solution, eventfd creation inside a sandbox would require holding a dirfd to the parent directory of the magical device. (And automatically storing that capability surely won't be acceptable for libc so making eventfd(2) work sandboxed would require interposing a special impl over the libc symbol from LD_PRELOAD arghh. Welllllll maybe having a caph_init_eventfd() and having the normal eventfd() check for its stored fd would be okay.. still a bit dirty but not the worst.)

Ooh I just had a galaxy brain idea. What if we introduce one syscall, specialfd with args being an int for the type constant and a void* that points to per-type args structs. (Maybe also len to allow extending structs while keeping the same type.) This way, one syscall would allow adding more similar facilities later (like timerfd or something completely new.. and e.g. shm_open2 could have been this) without having a new syscall-worthiness debate each time!

In D26668#596865, @kib wrote:

Could we get away with /dev/eventfd, which would return eventfd on open ? Libc wrapper then handles compat for eventfd(2).

Illumos gets away with that, I guess we could – but a syscall works much better with Capsicum. With the device based solution, eventfd creation inside a sandbox would require holding a dirfd to the parent directory of the magical device. (And automatically storing that capability surely won't be acceptable for libc so making eventfd(2) work sandboxed would require interposing a special impl over the libc symbol from LD_PRELOAD arghh. Welllllll maybe having a caph_init_eventfd() and having the normal eventfd() check for its stored fd would be okay.. still a bit dirty but not the worst.)

I tend to prefer having separate system calls for exactly this reason. /dev/eventfd requires access to a global namespace, but nothing about eventfd functionality depends on that.

Ooh I just had a galaxy brain idea. What if we introduce one syscall, specialfd with args being an int for the type constant and a void* that points to per-type args structs. (Maybe also len to allow extending structs while keeping the same type.) This way, one syscall would allow adding more similar facilities later (like timerfd or something completely new.. and e.g. shm_open2 could have been this) without having a new syscall-worthiness debate each time!

If one went down this road I'd suggest an option to open /dev/null and perhaps /dev/zero.

That being said, I'm not a fan of untyped arguments to syscalls without length arguments. They complicate system call wrappers.

Ok, the capsicum argument is convincing enough to not go with /dev/eventfd route. But on the other hand having a single syscall that can also serve timerfd in future would be nice indeed. I do not think that shmfd should be plugged into this case.

No, please, let's keep it compatible with the rest of the world (ie Linux). Otherwise we'll only creating more work for Ports folks for no good reason.

No, please, let's keep it compatible with the rest of the world (ie Linux). Otherwise we'll only creating more work for Ports folks for no good reason.

Uh, which thing in the discussion above would make it not compatible?? (via the libc, which is the only way we can be compatible)

No, please, let's keep it compatible with the rest of the world (ie Linux). Otherwise we'll only creating more work for Ports folks for no good reason.

Uh, which thing in the discussion above would make it not compatible?? (via the libc, which is the only way we can be compatible)

I got that impression from arguments about special device nodes, or merging with timerfd.

Or it might be that I'm entirely out of sync with this discussion; if the public interface is compatible with Linux, then I'm happy :-)

Or it might be that I'm entirely out of sync with this discussion; if the public interface is compatible with Linux, then I'm happy :-)

Yes, the discussion is about the syscall under the hood of the libc function, which of course would be compatible, that's the whole point.

Or it might be that I'm entirely out of sync with this discussion; if the public interface is compatible with Linux, then I'm happy :-)

Yes, the discussion is about the syscall under the hood of the libc function, which of course would be compatible, that's the whole point.

But then the functionality of libc wrapper would have to be reimplemented in the kernel anyway, for the purpose of linuxulator.

But then the functionality of libc wrapper would have to be reimplemented in the kernel anyway, for the purpose of linuxulator.

There's only one kernel implementation, int eventfd_create(struct thread *td, uint32_t initval, int flags).
It is already used by linuxulator like: int linux_eventfd2(struct thread *td, struct linux_eventfd2_args *args) { ... return eventfd_create(...) }

The way we expose it in the native ABI does not affect linuxulator and does not change libc API. To sum up, the options are:

  • straightforward syscall, which is in the current version of this patch here: int sys_eventfd(struct thread *td, struct eventfd_args *args) { ... return eventfd_create(...) }
    • native libc has an auto generated syscall wrapper
  • special device (like /dev/eventfd on illumos) which would return the result of eventfd_create() on open
    • native libc would do int eventfd(..) { return open(that_dev_path) }
    • worst option because it's bad for Capsicum, of course
  • "multiplexed" syscall
    • libc would do: int eventfd(..) { struct eventfd_args args = {..}; return specialfd(SPECIAL_FD_EVENTFD, &args, sizeof(args)) }
    • the benefit: avoiding the "is this thing syscall worthy" debate for every new thing similar to eventfd (so timerfd etc), we could expose other existing things (e.g. SPECIAL_FD_NULL for /dev/null) for easy access in Capsicum
      • though an arbitrary new fd type might require other syscalls around it
        • though they could be turned into ioctls
    • the downside: complexity involved in processing structs from user memory safely

Note that the "functionality of libc wrapper" in options 2 and 3 above of course would not have to be "reimplemented in the kernel for linuxulator" because it would be purely plumbing for conforming the eventfd() API to the syscall ABI we choose.

So yeah, I thought the multiplexed syscall is a cool idea, but now the thought of actually processing passed structs in the kernel is making me doubt it :) So I'll leave this as-is. Well, addressing the latest couple nits.

greg_unrelenting.technology added inline comments.
lib/libc/gen/eventfd_rw.c
35 ↗(On Diff #78167)

All the linux manpage says is

The GNU C library defines an additional type, and two functions that attempt to abstract some of the details of reading and writing on an eventfd file descriptor:

typedef uint64_t eventfd_t;

int eventfd_read(int fd, eventfd_t *value);
int eventfd_write(int fd, eventfd_t value);

The functions perform the read and write operations on an eventfd file descriptor, returning 0 if the correct number of bytes was transferred, or -1 otherwise.

Just like the musl implementation I took, the glibc implementation does not care about errno:
https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/eventfd_read.c

sys/compat/linux/linux_event.c
665

It was like this.. maybe to keep the line under 80 characters :) but this file already has slightly longer lines so I guess that doesn't matter that much

sys/compat/linux/linux_event.h
59 ↗(On Diff #78167)

With our EFD_SEMAPHORE being equivalent I thought it would be an unnecessary duplication, but sure, let's keep it for consistency

xtouqh_icloud.com added inline comments.
lib/libc/sys/eventfd.2
45

How about:

The
.Fn eventfd
system call creates ...

59

I would use the proper list instead of literal one:

.Bl -tag -width "EFD_SEMAPHORE" -compact
.It Dv EFD_CLOEXEC
set FD_CLOEXEC on the file descriptor
.It Dv EFD_NONBLOCK
do not block on read/write operations
.It Dv EFD_SEMAPHORE
use semaphore semantics
.El

75

Use .Bl -bullet or .Bl -dash and .It without arguments instead of re-inventing one :)

176

The
.Fa flags
argument ...

This revision is now accepted and ready to land.Wed, Nov 4, 5:44 PM

So where we are WRT extending syscall to make it reusable for timerfd ?

In D26668#604707, @kib wrote:

So where we are WRT extending syscall to make it reusable for timerfd ?

As mentioned above, I'm not sure anymore whether it's worth writing "dangerous" user struct processing code just to save on syscall numbers, but I could try, sure.

Another thing. That syscall, would it have to be documented? Or would it be okay to pretend that we have the eventfd(2) "system call" when in reality it's a libc function calling into a private specialfd system call?

Another thing. That syscall, would it have to be documented? Or would it be okay to pretend that we have the eventfd(2) "system call" when in reality it's a libc function calling into a private specialfd system call?

It is fine to have eventfd(2) man page as you already did. Eventually you might want to add specialfd(2) but I do not think this is a hard requirement.

For instance, for long time we only had man pages for pthread_mutex*(3) etc until I sit and wrote __umtx_op(2).

Added specialfd. Okay, that doesn't look too bad. I haven't tested intentionally causing problems from userspace, but other syscalls seem to be doing the same thing as here with user structs, I guess everything that could go wrong is handled inside copyin?

I'm not sure how well I'm following conventions regarding "private" syscalls, but rpctls_syscall seems to be in a similar position (normal auto-generated wrapper just put into FBSDprivate_1.0 in the symbol map; no underscore prefix in the name).

Also applied the latest manpage suggestions, except the first one (not sure how I feel about the phrase "the eventfd system call" now that eventfd is actually a libc wrapper around a different syscall, specialfd)

This revision now requires review to proceed.Fri, Nov 6, 10:23 PM

Is specialfd supposed to be a public or private interface? If private I'd suggest __specialfd to make that clear.

sys/kern/sys_specialfd.c
46
49
sys/kern/syscalls.master
3248

While it has little effect in practice, please use size_t for things that should be initialized by sizeof().

sys/kern/sys_specialfd.c
48

Remove all the blank lines inside the case.

53

spaces around '|' as any other binary op.

sys/sys/_specialfd.h
41 ↗(On Diff #79286)

Again, no need for cdefs.h, it should come in my other means.

Name the header specialfd.h, without _.

44 ↗(On Diff #79286)

I think it is better to put the declaration for __sys_specialfd() to libc/include/libc_private.h instead.

Also do not export specialfd symbols in any way. It is enough that we generate syscall number symbol.

sys/sys/eventfd.h
48

This include is excessive. sys/types.h already includes cdefs.h (as well as almost all other important headers).

Now named with the __ convention and using libc_private (didn't know about it).

Interestingly, libc/include/libc_private.h did not contain any __sys___* before.

greg_unrelenting.technology retitled this revision from Add native system call for eventfd to Expose eventfd in the native API/ABI using a new __specialfd syscall.Mon, Nov 9, 4:41 PM