Page MenuHomeFreeBSD

Expose eventfd in the native API/ABI using a new __specialfd syscall
ClosedPublic

Authored by kib on Oct 4 2020, 8:06 PM.
Referenced Files
Unknown Object (File)
Thu, Jan 2, 3:42 AM
Unknown Object (File)
Thu, Jan 2, 2:37 AM
Unknown Object (File)
Tue, Dec 24, 3:21 PM
Unknown Object (File)
Sat, Dec 21, 1:46 PM
Unknown Object (File)
Dec 5 2024, 7:34 AM
Unknown Object (File)
Nov 17 2024, 4:32 PM
Unknown Object (File)
Nov 11 2024, 9:35 PM
Unknown Object (File)
Nov 11 2024, 9:33 PM

Details

Summary

eventfd is a Linux system call that produces special file descriptors for event notification.
When porting Linux software, it is currently usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues are not passable between processes! (And as noted by the author of epoll-shim, even if they were, the library state would also have to be passed somehow.) This came up when debugging strange HW video decode failures in Firefox. A native implementation would avoid these problems and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for the Linuxulator), it's pretty easy to expose it natively, which is what this patch does.

It's still slightly WIP (no manpages yet, and I also want to add procstat support) but I'm posting it now to get feedback on the whole thing and on sys/eventfd.h specifically.

Notes:

  • eventfd_read/write one-liners are from musl libc. I didn't make them static inline because they're usually symbols in Linux libcs and with symbols I'm more confident that software still using epoll-shim's implementation would correctly pick the right set of functions;
  • EFD_* flags are using Linux values of CLOEXEC and NONBLOCK to avoid extra conversions on the Linuxulator side. Illumos also did it like this btw;
  • The reuse of eventfd_ioctl in Linuxulator timerfd is no more, there's timerfd_ioctl now. I'm not sure if it's a valuable reuse really.
Test Plan

Example program from the Linux manpage is an okay example that works with the native implementation but not with epoll-shim:

% ./eventfd-test 1 2 4 7 14
Child writing 1 to efd
Child writing 2 to efd
Child writing 4 to efd
Child writing 7 to efd
Child writing 14 to efd
Child completed write loop
Parent about to read
Parent read 28 (0x1c) from efd
% cc -o eventfd-test-esh -I/usr/local/include -L/usr/local/lib -lepoll-shim eventfd-test.c
% ./eventfd-test-esh 1 2 4 7 14
Child writing 1 to efd
write: Bad file descriptor
Parent about to read
read: Operation not supported

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
lib/libc/sys/eventfd.2
80 ↗(On Diff #78030)

This paragraph is mixing tenses. I think "is not set" is better, ditto below.

164 ↗(On Diff #78030)

returns

sys/compat/linux/linux_event.c
669

It seems better to translate to the aliases, i.e., flags |= EFD_CLOEXEC or EFD_NONBLOCK.

sys/kern/sys_eventfd.c
182

Same here and below, I don't believe this needs to be checked.

190

The goto can be eliminated with a loop:

while (error == 0 && efd->efd_count == 0) {
...
358

The opening brace should be on the previous line.

val_packett.cool added inline comments.
sys/kern/sys_eventfd.c
182

Yeah looks like this isn't checked for e.g. pipes. I wonder why this check was added originally — possibly due to the mixing of eventfd and timerfd in one file.

sys/sys/eventfd.h
2

With these trivial API headers, I don't even know who should have the copyright. I can put myself I guess..

val_packett.cool marked an inline comment as done.

I see various errors in the man page:

[XeNoN:freebsd-head-clean]# mandoc -Tlint lib/libc/sys/eventfd.2
mandoc: lib/libc/sys/eventfd.2:44:2: ERROR: skipping end of block that is not open: Fc
mandoc: lib/libc/sys/eventfd.2:188:2: ERROR: inserting missing end of block: Sh breaks Bl
mandoc: lib/libc/sys/eventfd.2:129:2: WARNING: moving paragraph macro out of list: Pp
mandoc: lib/libc/sys/eventfd.2:129:2: WARNING: skipping paragraph macro: Pp before Pp
mandoc: lib/libc/sys/eventfd.2:192:2: WARNING: unusual Xr order: close after select

Therefore, I will add bcr as a reviewer, since he is more competent in this area.
Thanks for working on this.

I am not sure that this feature deserves FreeBSD syscall. Could we get away with /dev/eventfd, which would return eventfd on open ? Libc wrapper then handles compat for eventfd(2).

lib/libc/gen/eventfd_rw.c
35 ↗(On Diff #78167)

return (value);

Is errno supposed to contain a valid value when -1 is returned ? It is currently not.

sys/compat/linux/linux_event.c
662

spaces should be put around both sides of binary op:
LINUX_O_CLOEXEC | LINUX_O_NONBLOCK | EFD_SEMAPHORE

670

Should this be LINUX_EFD_SEMAPHORE ?

sys/compat/linux/linux_event.h
59

I do not think this constant should be removed. It is Linux ABI.

sys/kern/capabilities.conf
399

The file is ordered alphabetically.

sys/kern/sys_eventfd.c
260

Spaces around |.

sys/sys/user.h
441 ↗(On Diff #78167)

I wonder if the contamination of the header with include of eventfd.h is worth it. IMO it is good enough to use uint64_t directly.

In D26668#596865, @kib wrote:

Could we get away with /dev/eventfd, which would return eventfd on open ? Libc wrapper then handles compat for eventfd(2).

Illumos gets away with that, I guess we could – but a syscall works much better with Capsicum. With the device based solution, eventfd creation inside a sandbox would require holding a dirfd to the parent directory of the magical device. (And automatically storing that capability surely won't be acceptable for libc so making eventfd(2) work sandboxed would require interposing a special impl over the libc symbol from LD_PRELOAD arghh. Welllllll maybe having a caph_init_eventfd() and having the normal eventfd() check for its stored fd would be okay.. still a bit dirty but not the worst.)

Ooh I just had a galaxy brain idea. What if we introduce one syscall, specialfd with args being an int for the type constant and a void* that points to per-type args structs. (Maybe also len to allow extending structs while keeping the same type.) This way, one syscall would allow adding more similar facilities later (like timerfd or something completely new.. and e.g. shm_open2 could have been this) without having a new syscall-worthiness debate each time!

In D26668#596930, @greg_unrelenting.technology wrote:
In D26668#596865, @kib wrote:

Could we get away with /dev/eventfd, which would return eventfd on open ? Libc wrapper then handles compat for eventfd(2).

Illumos gets away with that, I guess we could – but a syscall works much better with Capsicum. With the device based solution, eventfd creation inside a sandbox would require holding a dirfd to the parent directory of the magical device. (And automatically storing that capability surely won't be acceptable for libc so making eventfd(2) work sandboxed would require interposing a special impl over the libc symbol from LD_PRELOAD arghh. Welllllll maybe having a caph_init_eventfd() and having the normal eventfd() check for its stored fd would be okay.. still a bit dirty but not the worst.)

I tend to prefer having separate system calls for exactly this reason. /dev/eventfd requires access to a global namespace, but nothing about eventfd functionality depends on that.

In D26668#596930, @greg_unrelenting.technology wrote:

Ooh I just had a galaxy brain idea. What if we introduce one syscall, specialfd with args being an int for the type constant and a void* that points to per-type args structs. (Maybe also len to allow extending structs while keeping the same type.) This way, one syscall would allow adding more similar facilities later (like timerfd or something completely new.. and e.g. shm_open2 could have been this) without having a new syscall-worthiness debate each time!

If one went down this road I'd suggest an option to open /dev/null and perhaps /dev/zero.

That being said, I'm not a fan of untyped arguments to syscalls without length arguments. They complicate system call wrappers.

Ok, the capsicum argument is convincing enough to not go with /dev/eventfd route. But on the other hand having a single syscall that can also serve timerfd in future would be nice indeed. I do not think that shmfd should be plugged into this case.

No, please, let's keep it compatible with the rest of the world (ie Linux). Otherwise we'll only creating more work for Ports folks for no good reason.

No, please, let's keep it compatible with the rest of the world (ie Linux). Otherwise we'll only creating more work for Ports folks for no good reason.

Uh, which thing in the discussion above would make it not compatible?? (via the libc, which is the only way we can be compatible)

In D26668#597982, @greg_unrelenting.technology wrote:

No, please, let's keep it compatible with the rest of the world (ie Linux). Otherwise we'll only creating more work for Ports folks for no good reason.

Uh, which thing in the discussion above would make it not compatible?? (via the libc, which is the only way we can be compatible)

I got that impression from arguments about special device nodes, or merging with timerfd.

Or it might be that I'm entirely out of sync with this discussion; if the public interface is compatible with Linux, then I'm happy :-)

Or it might be that I'm entirely out of sync with this discussion; if the public interface is compatible with Linux, then I'm happy :-)

Yes, the discussion is about the syscall under the hood of the libc function, which of course would be compatible, that's the whole point.

In D26668#598027, @greg_unrelenting.technology wrote:

Or it might be that I'm entirely out of sync with this discussion; if the public interface is compatible with Linux, then I'm happy :-)

Yes, the discussion is about the syscall under the hood of the libc function, which of course would be compatible, that's the whole point.

But then the functionality of libc wrapper would have to be reimplemented in the kernel anyway, for the purpose of linuxulator.

But then the functionality of libc wrapper would have to be reimplemented in the kernel anyway, for the purpose of linuxulator.

There's only one kernel implementation, int eventfd_create(struct thread *td, uint32_t initval, int flags).
It is already used by linuxulator like: int linux_eventfd2(struct thread *td, struct linux_eventfd2_args *args) { ... return eventfd_create(...) }

The way we expose it in the native ABI does not affect linuxulator and does not change libc API. To sum up, the options are:

  • straightforward syscall, which is in the current version of this patch here: int sys_eventfd(struct thread *td, struct eventfd_args *args) { ... return eventfd_create(...) }
    • native libc has an auto generated syscall wrapper
  • special device (like /dev/eventfd on illumos) which would return the result of eventfd_create() on open
    • native libc would do int eventfd(..) { return open(that_dev_path) }
    • worst option because it's bad for Capsicum, of course
  • "multiplexed" syscall
    • libc would do: int eventfd(..) { struct eventfd_args args = {..}; return specialfd(SPECIAL_FD_EVENTFD, &args, sizeof(args)) }
    • the benefit: avoiding the "is this thing syscall worthy" debate for every new thing similar to eventfd (so timerfd etc), we could expose other existing things (e.g. SPECIAL_FD_NULL for /dev/null) for easy access in Capsicum
      • though an arbitrary new fd type might require other syscalls around it
        • though they could be turned into ioctls
    • the downside: complexity involved in processing structs from user memory safely

Note that the "functionality of libc wrapper" in options 2 and 3 above of course would not have to be "reimplemented in the kernel for linuxulator" because it would be purely plumbing for conforming the eventfd() API to the syscall ABI we choose.

So yeah, I thought the multiplexed syscall is a cool idea, but now the thought of actually processing passed structs in the kernel is making me doubt it :) So I'll leave this as-is. Well, addressing the latest couple nits.

val_packett.cool added inline comments.
lib/libc/gen/eventfd_rw.c
35 ↗(On Diff #78167)

All the linux manpage says is

The GNU C library defines an additional type, and two functions that attempt to abstract some of the details of reading and writing on an eventfd file descriptor:

typedef uint64_t eventfd_t;

int eventfd_read(int fd, eventfd_t *value);
int eventfd_write(int fd, eventfd_t value);

The functions perform the read and write operations on an eventfd file descriptor, returning 0 if the correct number of bytes was transferred, or -1 otherwise.

Just like the musl implementation I took, the glibc implementation does not care about errno:
https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/eventfd_read.c

sys/compat/linux/linux_event.c
662

It was like this.. maybe to keep the line under 80 characters :) but this file already has slightly longer lines so I guess that doesn't matter that much

sys/compat/linux/linux_event.h
59

With our EFD_SEMAPHORE being equivalent I thought it would be an unnecessary duplication, but sure, let's keep it for consistency

val_packett.cool marked an inline comment as done.

Address feedback

xtouqh_icloud.com added inline comments.
lib/libc/sys/eventfd.2
45 ↗(On Diff #78348)

How about:

The
.Fn eventfd
system call creates ...

59 ↗(On Diff #78348)

I would use the proper list instead of literal one:

.Bl -tag -width "EFD_SEMAPHORE" -compact
.It Dv EFD_CLOEXEC
set FD_CLOEXEC on the file descriptor
.It Dv EFD_NONBLOCK
do not block on read/write operations
.It Dv EFD_SEMAPHORE
use semaphore semantics
.El

75 ↗(On Diff #78348)

Use .Bl -bullet or .Bl -dash and .It without arguments instead of re-inventing one :)

176 ↗(On Diff #78348)

The
.Fa flags
argument ...

This revision is now accepted and ready to land.Nov 4 2020, 5:44 PM

So where we are WRT extending syscall to make it reusable for timerfd ?

In D26668#604707, @kib wrote:

So where we are WRT extending syscall to make it reusable for timerfd ?

As mentioned above, I'm not sure anymore whether it's worth writing "dangerous" user struct processing code just to save on syscall numbers, but I could try, sure.

Another thing. That syscall, would it have to be documented? Or would it be okay to pretend that we have the eventfd(2) "system call" when in reality it's a libc function calling into a private specialfd system call?

In D26668#605298, @greg_unrelenting.technology wrote:

Another thing. That syscall, would it have to be documented? Or would it be okay to pretend that we have the eventfd(2) "system call" when in reality it's a libc function calling into a private specialfd system call?

It is fine to have eventfd(2) man page as you already did. Eventually you might want to add specialfd(2) but I do not think this is a hard requirement.

For instance, for long time we only had man pages for pthread_mutex*(3) etc until I sit and wrote __umtx_op(2).

Added specialfd. Okay, that doesn't look too bad. I haven't tested intentionally causing problems from userspace, but other syscalls seem to be doing the same thing as here with user structs, I guess everything that could go wrong is handled inside copyin?

I'm not sure how well I'm following conventions regarding "private" syscalls, but rpctls_syscall seems to be in a similar position (normal auto-generated wrapper just put into FBSDprivate_1.0 in the symbol map; no underscore prefix in the name).

Also applied the latest manpage suggestions, except the first one (not sure how I feel about the phrase "the eventfd system call" now that eventfd is actually a libc wrapper around a different syscall, specialfd)

This revision now requires review to proceed.Nov 6 2020, 10:23 PM

Is specialfd supposed to be a public or private interface? If private I'd suggest __specialfd to make that clear.

sys/kern/sys_specialfd.c
46 ↗(On Diff #79286)
49 ↗(On Diff #79286)
sys/kern/syscalls.master
3248

While it has little effect in practice, please use size_t for things that should be initialized by sizeof().

sys/kern/sys_specialfd.c
48 ↗(On Diff #79286)

Remove all the blank lines inside the case.

53 ↗(On Diff #79286)

spaces around '|' as any other binary op.

sys/sys/_specialfd.h
41 ↗(On Diff #79286)

Again, no need for cdefs.h, it should come in my other means.

Name the header specialfd.h, without _.

44 ↗(On Diff #79286)

I think it is better to put the declaration for __sys_specialfd() to libc/include/libc_private.h instead.

Also do not export specialfd symbols in any way. It is enough that we generate syscall number symbol.

sys/sys/eventfd.h
49

This include is excessive. sys/types.h already includes cdefs.h (as well as almost all other important headers).

Now named with the __ convention and using libc_private (didn't know about it).

Interestingly, libc/include/libc_private.h did not contain any __sys___* before.

val_packett.cool retitled this revision from Add native system call for eventfd to Expose eventfd in the native API/ABI using a new __specialfd syscall.Nov 9 2020, 4:41 PM
kib edited reviewers, added: val_packett.cool; removed: kib.

Remove kern/sys_specialfd.c, it contained only short function, moved to sys_generic.c.
Do not depend on sys/fcntl.h for EFD_CLOEXEC and EFD_NONBLOCK, they are translated to O_ flags anyway. [Is this right, does Linux sources use e.g. EFD_NONBLOCK or O_NONBLOCK ?]
Move code to create file from eventfd_create() to sys___specialfd(), this way it is better structured for more special file types.
Properly split ABI, export kern_specialfd() helper used by both native and Linux syscalls.
Use falloc_noinstall() + finstall() in sys___specialfd().

lib/libc/sys/kqueue.2
48 ↗(On Diff #81090)

This is unrelated but the diff is generated from the branch that has split patch, so it is separate change in git.

In D26668#620034, @kib wrote:

Is this right, does Linux sources use e.g. EFD_NONBLOCK or O_NONBLOCK ?

Yeah, looks like no EFD_ in the kernel other than BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK) style assertions.

This revision is now accepted and ready to land.Dec 25 2020, 8:20 PM
In D26668#620675, @greg_unrelenting.technology wrote:
In D26668#620034, @kib wrote:

Is this right, does Linux sources use e.g. EFD_NONBLOCK or O_NONBLOCK ?

Yeah, looks like no EFD_ in the kernel other than BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK) style assertions.

No, I mean in consumers, userspace that calls eventf(). Do they use O_NONBLOCK or EFD_NONBLOCK ?

In D26668#620680, @kib wrote:
In D26668#620675, @greg_unrelenting.technology wrote:
In D26668#620034, @kib wrote:

Is this right, does Linux sources use e.g. EFD_NONBLOCK or O_NONBLOCK ?

Yeah, looks like no EFD_ in the kernel other than BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK) style assertions.

No, I mean in consumers, userspace that calls eventf(). Do they use O_NONBLOCK or EFD_NONBLOCK ?

Answering myself: https://codesearch.debian.net/search?q=eventfd.*O_NONBLOCK&literal=0
So unfortunately there are.

Restore match between EFD and O flags.

This revision now requires review to proceed.Dec 25 2020, 9:41 PM