Page MenuHomeFreeBSD

timerfd: Add native support for Linux's timerfd
AbandonedPublic

Authored by jfree on Feb 9 2023, 5:07 AM.
Referenced Files
Unknown Object (File)
Wed, Mar 20, 4:46 AM
Unknown Object (File)
Sun, Mar 17, 4:44 AM
Unknown Object (File)
Sun, Mar 10, 9:16 AM
Unknown Object (File)
Sun, Mar 10, 9:16 AM
Unknown Object (File)
Sun, Mar 10, 9:16 AM
Unknown Object (File)
Sun, Mar 10, 9:16 AM
Unknown Object (File)
Sun, Mar 10, 9:15 AM
Unknown Object (File)
Sun, Mar 10, 9:15 AM

Details

Summary

The timerfd interface is traditionally composed of non-standard Linux system calls that operate on interval timers.
Overhaul all compatibility code for Linux timerfd and integrate it into the FreeBSD base kernel.

Unlike per-process timers, timerfd timers are identified by special file descriptors and may be passed to other processes.

Test Plan

This timerfd implementation now successfully passes all of epoll-shim's timerfd tests.
https://github.com/jiixyj/epoll-shim/tree/master/test

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/kern/sys_timerfd.c
282–283

Also remove the other TFD_GETTIME32 case.

The _Static_assert isn't strictly necessary.

jfree added inline comments.
sys/sys/timerfd.h
58–59

I still get ENOTTY with _IOW. The timerfd functions aren't modifying the userspace addr directly. I believe that the kernel doesn't bother copying the modified data back to the userspace address unless read access is declared. There also appears to be some checking on the data to see if it was modified because I do not receive ENOTTY with _IOWR.

sys/sys/timerfd.h
58–59

Did you recompile userspace? ENOTTY means you didn't match a command. There is no check that anything changed. The code in sys_ioctl is literally:

	if (error == 0 && (com & IOC_OUT))
		error = copyout(data, uap->data, (u_int)size);
jfree added inline comments.
sys/sys/timerfd.h
58–59

I have to form the habit of sifting through the code when I encounter an issue like this. Yes, you are right. I had forgotten that sys header files were installed through world. I apologize for the silly distraction.

jfree added inline comments.
sys/kern/sys_timerfd.c
282–283

Could you explain why we CAN depend on the command differing for TFD_SETTIME? I understand that the 32/64 ioctl() variants have args of differing sizes, but wouldn't any machine just use the TFD_SETTIME case given that libc passes that in? I don't see the control flow saying "if 32-bit, then use the 32-bit variants".

  • Use SV_CURPROC_FLAG() to determine if TFD_GETTIME32 compatibility instructions should be executed.
sys/kern/sys_timerfd.c
287

How could it be that TFD_GETTIME32 is not defined?

289

I do not see much use for this assert, esp. because I believe this code is never compiled.

310

Are you sure that this is the right implementation for these FIO* iocontrols? FIOASYNC, if implemented, should change async mode, if FIONBIO reports that async is supported.

Also, what about FIONREAD?

502
503
sys/kern/sys_timerfd.c
56

BTW, could this define be static?

sys/kern/sys_timerfd.c
282–283

Could you explain why we CAN depend on the command differing for TFD_SETTIME? I understand that the 32/64 ioctl() variants have args of differing sizes, but wouldn't any machine just use the TFD_SETTIME case given that libc passes that in? I don't see the control flow saying "if 32-bit, then use the 32-bit variants".

TFD_SETTIME is defined with the size of struct timerfd_settime_args, but that varies with the compiled ABI. When a 32-bit program is compiled that size is for the version with 32-bit pointers which is equivalent to struct timerfd_settime_args32.

287

I think it should only be defined on amd64 because the size of the argument only differs between i386 and everything else. If we define it on non-amd64 we'll have a duplicate case value which is an error.

sys/sys/timerfd.h
58

BTW, something that will need testing is to verify that the timerfd_gettime and timerfd_settime pseudo syscalls work in capability mode.

jfree marked 10 inline comments as done.
  • M_TIMERFD malloc type is now static
  • style(9) fixes
  • TFD_GETTIME32 will only be defined on amd64 machines
sys/kern/sys_timerfd.c
282–283

TFD_SETTIME is defined with the size of struct timerfd_settime_args, but that varies with the compiled ABI. When a 32-bit program is compiled that size is for the version with 32-bit pointers which is equivalent to struct timerfd_settime_args32.

Perhaps you already answered this and I am misinterpreting, but I was trying to figure out why 32-bit binaries choose the TFD_GETTIME32 ioctl case when TFD_GETTIME (non-32) is passed in via libc?

310

Are you sure that this is the right implementation for these FIO* iocontrols? FIOASYNC, if implemented, should change async mode, if FIONBIO reports that async is supported.

Also, what about FIONREAD?

These are remnants from @dchagin's lkpi tfd implementation. I am not sure why they are here as the tfd Linux man page does not mention explicit support for them. I can't really see an easy way to implement these generic ioctls and don't think they're necessary. What're your thoughts?

sys/kern/sys_timerfd.c
91

And where is this unr initialized?

282–283

The ioctl command values are obtained by some combination of three attributes: read/write mask, the io command itself, and the length of the parameter' structure. The details are provided in sys/ioccom.h.

Now consider two compilation environments for userspace: ILP32 (32bit) and LP64 (64bit). The mask and the io command are same, but generally the parameter' structure sizes are different for 32 vs 64 bit. This means that the final value of the ioctl command passed to kernel should be typically different between 32 vs 64bit user space. Ultimately, this allows the kernel to distinguish the bitness of the ioctl caller without looking at ABI (AKA sysentvec).

But if it happens that the parameter's structure layout is same for 32 and 64 bit ABIs, the resulting ioctl commands would have exactly the same numeric values. Then kernel could not distinguish the mode of caller.

If parameters do not contain pointers, it is fine to leave only one handler for the command. But if there are pointers, you need to properly reconstruct usermode addresses. This is where ABI checks are needed.

310

I believe it is not hard to implement them. You just need to understand the semantic.

But having the dummy placeholders returning success is wrong any way.

jfree marked 2 inline comments as done.
  • Add generic ioctl() support for FIONBIO and FIONREAD
sys/kern/sys_timerfd.c
91

And where is this unr initialized?

new_unrhdr64() in timerfd_create_file().

282–283

Thank you for writing this. I understand now :)

jfree marked 2 inline comments as done.Mar 2 2023, 6:07 AM
sys/kern/sys_timerfd.c
91

This still does not make a sense. The unrhdr64 var needs to be initialized once, globally. Instead you re-initialize it on each open.

313

What if more than one thread do the ioctl in parallel? Isn't it racy?

316

The value should be returned only when read would not block, otherwise you should return 0.

jfree marked 3 inline comments as done.
  • Use SYSINIT() to initialize timerfd ino unit number
  • Use atomic bitwise operations for file flags in FIONBIO ioctl
  • Return 0 when FNONBLOCK is not set in FIONREAD ioctl

Fix minor logic error in FIONREAD ioctl case

sys/kern/sys_timerfd.c
31

Did that came back? I do not remember if I asked to remove the $FreeBSD$ from the new files.

122

If you put SYSINIT right after the timerfd_init(), you can avoid forward-declare the function.

324

This expression is never true.

325

You should only return non-zero *data when read would not block. It has nothing to do with non-blocked mode.

jfree marked 4 inline comments as done.
  • Remove __FBSDID tag
  • Move SYSINIT to avoid forward declaring timerfd_init()
  • Correctly check for read blocking in FIONREADioctl case
sys/kern/sys_timerfd.c
31

Did that came back? I do not remember if I asked to remove the $FreeBSD$ from the new files.

You asked on https://reviews.freebsd.org/D38460. I was unsure if it was still required for kernel files, so I just left it in here. I take it that __FSDID() is obsolete everywhere in the source tree?

324

This expression is never true.

Oops. Lapse in my logic late at night to blame. I meant != 0.

kib added inline comments.
sys/kern/sys_timerfd.c
31

Current policy is to not introduce these into new files, but also not to remove existing IDs. The idea was that a single sweep should get rid of them. Might be, some day.

BTW, something that will need testing is to verify that the timerfd_gettime and timerfd_settime pseudo syscalls work in capability mode.

Everything still appears to work correctly after doing a cap_enter() at the beginning of the test program in the Linux timerfd man page:

https://man7.org/linux/man-pages/man2/timerfd_create.2.html

What is the destiny of these patches? Why did you not committed them still?

In D38459#896885, @kib wrote:

What is the destiny of these patches? Why did you not committed them still?

While I was creating a manpage to accompany this patch, I noticed that the TFD_TIMER_CANCEL_ON_SET flag is ignored in this implementation. I'm in school right now, so I'm not working on this as fast as I'd like, but I have a new revision on the way that adds support for this flag, fixes a few other bugs, and cleans up the code.

I also wanted to run this code against some tests before submitting to src. I have accumulated a number of timerfd tests so this part shouldn't take long.

I definitely have not abandoned this, though.

Hey, couple random notes:

  • re: "Developers that wish to support FreeBSD should avoid using timerfd" in the quarterly… :/
    • file descriptor handle based APIs are actually kinda better because composability / fd-passing / capability mode friendliness
      • FreeBSD invented procdesc(4) soo it's strange that we're not yet striving to turn everything into a file descriptor and Linux has us beat on this…
    • also there's no explicit clock selection in EVFILT_TIMER so when we finally add a suspend-aware monotonic clock it would only be possible to explicitly choose suspend-awareness-or-not with timerfd :)
    • and also…
  • EVFILT_TIMER is currently subject to a system-wide and small-by-default kern.kq_calloutmax (kq_ncallouts) limit, which feels very unnerving: imagine an important daemon getting starved of timers by some random user app!! This code as-is is not, and I wouldn't like it to be, but then it would be kinda strange that EVFILT_TIMER would still be.
    • should we convert that to a per-{user,process,…} that both facilities would use? An rlimit sounds appropriate I think?
    • but do we need that limit at all? Maybe just abolish it from EVFILT_TIMER?

Quick update: I'm nearly finished with my school work for the year, so I've had more time to work on this. I've nearly re-engineered the entire patch and I'm passing ~95% of the epoll-shim timerfd testing suite. I should have a new patch out in the next week (hopefully).

Tests: (specifically timerfd, timerfd-root, and timerfd-mock)
https://github.com/jiixyj/epoll-shim/tree/master/test

Hey, couple random notes:

  • re: "Developers that wish to support FreeBSD should avoid using timerfd" in the quarterly… :/
    • file descriptor handle based APIs are actually kinda better because composability / fd-passing / capability mode friendliness
      • FreeBSD invented procdesc(4) soo it's strange that we're not yet striving to turn everything into a file descriptor and Linux has us beat on this…
    • also there's no explicit clock selection in EVFILT_TIMER so when we finally add a suspend-aware monotonic clock it would only be possible to explicitly choose suspend-awareness-or-not with timerfd :)

Good to know. I will adjust my language in the manpage so I'm not discouraging users against using timerfd.

    • and also…
  • EVFILT_TIMER is currently subject to a system-wide and small-by-default kern.kq_calloutmax (kq_ncallouts) limit, which feels very unnerving: imagine an important daemon getting starved of timers by some random user app!! This code as-is is not, and I wouldn't like it to be, but then it would be kinda strange that EVFILT_TIMER would still be.
    • should we convert that to a per-{user,process,…} that both facilities would use? An rlimit sounds appropriate I think?
    • but do we need that limit at all? Maybe just abolish it from EVFILT_TIMER?

A per-proc limit sounds appropriate through rlimit. I'm not sure about abolishing the limit altogether, though. I am guessing it was implemented for a reason.

    • and also…
  • EVFILT_TIMER is currently subject to a system-wide and small-by-default kern.kq_calloutmax (kq_ncallouts) limit, which feels very unnerving: imagine an important daemon getting starved of timers by some random user app!! This code as-is is not, and I wouldn't like it to be, but then it would be kinda strange that EVFILT_TIMER would still be.
    • should we convert that to a per-{user,process,…} that both facilities would use? An rlimit sounds appropriate I think?
    • but do we need that limit at all? Maybe just abolish it from EVFILT_TIMER?

A per-proc limit sounds appropriate through rlimit. I'm not sure about abolishing the limit altogether, though. I am guessing it was implemented for a reason.

The limit comes from commit 0217f5c71e87311c392911359ab8fc64260fa170, which states that the goal is to "globally limit the amount of memory that can be allocated by callouts". Implicitly we might also want to limit the number of pending callouts that can be placed in the callout wheel: if a large number of callouts fall into a single bucket, a softclock thread could end up spending lots of CPU time servicing user-supplied callouts, starving more important system callouts. The first is a pretty weak argument when you consider other vectors: a callout requires less memory than a vm_map_entry, for example, and there's no limit on the number of vm_map_entries that a user process can allocate (for example by mapping a giant region of the address space and using mprotect() to alternate protections on each virtual page).

The latter is a stronger argument, but the same problem can arise if you have a bazillion user threads all calling sleep(), and we limit the number of threads in the system.

I think I agree that a per-process limit (or perhaps a per-kqueue limit?) makes more sense. Or the global limit should be equal to the kern.maxthread limit instead of the rather anemic default of 4096.

I have contrary opinion about ioctl vs syscall.

I am working on code that automatically generate the code to cope with system calls for qemu's user-mode bsd-user. This works for the vast majority of system calls, except for sysctl, ioctl, and the weird things like semctl. Having this as a ioctl means that to implement them I have to write code by hand rather than let the generator generate the new code when syscalls.master is updated. The system call numbers aren't all that rare, and it seems a less bad thing to assign a few of them here than to force them through the ioctl interface.

jfree edited the summary of this revision. (Show Details)
jfree edited the test plan for this revision. (Show Details)
jfree added a reviewer: imp.
  • Support TFD_TIMER_CANCEL_ON_SET settime flag using timerfd_jumped().
  • Convert ioctl() interface to native syscalls.
  • Support TFD_TIMER_CANCEL_ON_SET settime flag using timerfd_jumped().
  • Convert ioctl() interface to native syscalls.

Why? I do not like it very much. It is not as if all (or most) other individual kernel ops where performed by dedicated syscall entries, so I do not see a good reason to do it there. Ioctl perfectly fits the intent of providing file-specific requests.

BTW, I do not quite understand how would making this syscalls would help with automatic generation of some kind of thunks, simply because thunks still need to know how to interpret the pointers.

Why? I do not like it very much. It is not as if all (or most) other individual kernel ops where performed by dedicated syscall entries, so I do not see a good reason to do it there. Ioctl perfectly fits the intent of providing file-specific requests.

The system call interface is standard and more intuitive for those that want to extend timerfd later.
The ioctl() implementation is confusing and messy for compat.

These special file descriptor functions are system calls in Linux and are widely used in userspace, so why not treat them with equivalent priority on FreeBSD?
A wrapper just makes things more rigid.

It's not like we're short on syscall numbers, either, I believe.

Why? I do not like it very much. It is not as if all (or most) other individual kernel ops where performed by dedicated syscall entries, so I do not see a good reason to do it there. Ioctl perfectly fits the intent of providing file-specific requests.

The system call interface is standard and more intuitive for those that want to extend timerfd later.
The ioctl() implementation is confusing and messy for compat.

Er? Can you explain? What is confusing for ioctl, and what is intuitive for the syscall interface?

These special file descriptor functions are system calls in Linux and are widely used in userspace, so why not treat them with equivalent priority on FreeBSD?

Same, I do not understand the point of 'priority'.

A wrapper just makes things more rigid.

It's not like we're short on syscall numbers, either, I believe.

Question is not in the shortage of the syscall numbers, but significant mismatch between existing tradition of providing syscalls for generic interfaces, vs having specialized operations treated by corresponding facilities. In the case of timerfd, there are specialized ops relevant only to the timerfd. Putting them in the common namespace with e.g. open(2) or read(2) is weird.

It also contradicts to the base approach I put with specialfd(2).

Er? Can you explain? What is confusing for ioctl, and what is intuitive for the syscall interface?

  • The sys_ and kern_ function prefixes are standard so their intention can be understood at a glance.
  • The syscall interface automatically handles argument translation for compat32, so we don't need to manually declare timerfd_settime_args and timerfd_settime_args32 in timerfd.h.
  • The tfd ioctl() fop is filled with compat32 preprocessor conditionals that confusingly check if TFD_GETTIME32 is defined, which only happens on amd64 because i386 is the only arch with 32-bit time_t. Then SV_CURPROC_FLAG(SV_ILP32) must be used to check if we're in a 32-bit executable. This distracts from what timerfd is actually doing and the compat cruft can be avoided by using system calls.

System calls are already well documented and are a direct gateway to the kernel. I don't see the point of jumping around an already-established interface to create exclusive wrapper, that only Linux specialfds use, in the name of being better.

I think it is negligent to ignore Linux conventions for the sake of being different. This mentality is what drives the elitist attitude that most of the Linux community sees in BSD. I'm not saying we need to become Linux, but I fear that the further we drift from their practices, the harder compatibility will get. Adding a few system calls is a fair tradeoff for adhering to Linux's implementation. If Linux's timerfd drastically changes in the future, at least we can rest knowing that we will need to jump through the same hurdles that they did to update it.

Question is not in the shortage of the syscall numbers, but significant mismatch between existing tradition of providing syscalls for generic interfaces, vs having specialized operations treated by corresponding facilities. In the case of timerfd, there are specialized ops relevant only to the timerfd. Putting them in the common namespace with e.g. open(2) or read(2) is weird.

What about timer_gettime() and timer_settime()? These don't look like they're exposed through a per-process timer wrapper.

jfree added a subscriber: mckusick.
  • tfd_jumped must be set to TFD_CANCELED, not TFD_JUMPED, for timerfd_settime to return ECANCELED.
  • style(9) line-break fixes courtesy of @mckusick.

I have code that automatically translates new system calls between host/target for bsd-user (though I'm not quite ready to commit it).
ioctl is possible, but has a lot more exceptional cases so is harder to do automatically. There's no automated annotation like there is for system calls.
The 'generic interfaces' require that I go and create annotations for each of the 'op codes' in that, because that doesn't exist in our current annotation.

So having a few extra system calls means they will be annotated and my job of getting it to work in bsd-user and others wanting to do automatic translation is easier.

In D38459#918467, @imp wrote:

I have code that automatically translates new system calls between host/target for bsd-user (though I'm not quite ready to commit it).
ioctl is possible, but has a lot more exceptional cases so is harder to do automatically. There's no automated annotation like there is for system calls.
The 'generic interfaces' require that I go and create annotations for each of the 'op codes' in that, because that doesn't exist in our current annotation.

So having a few extra system calls means they will be annotated and my job of getting it to work in bsd-user and others wanting to do automatic translation is easier.

For system calls you do need annotations, you cannot properly understand the purpose of any pointer passed to syscall otherwise. Is it in, out, in/out, or even just an abstract address like mmap/munmap/madvise arguments?

For ioctls, the meaning of the command/arg is quite formalized, at least in BSDs. You are well aware that we encode both sizes and directions for copyin and copyout. After I took some time thinking how to implement what you described, I was surprised by the statement that syscalls are easier than ioctls.

That said, I am quite dislike extending the syscall semantical coverage by adding one-off operations. In the case of special fds, having all ops grouped together (under ioctl umbrella) localizes the interfaces and make them more comprehensive. This is, of course, my opinion, but all my experience supports the claim.

For system calls you do need annotations, you cannot properly understand the purpose of any pointer passed to syscall otherwise. Is it in, out, in/out, or even just an abstract address like mmap/munmap/madvise arguments?

For ioctls, the meaning of the command/arg is quite formalized, at least in BSDs. You are well aware that we encode both sizes and directions for copyin and copyout. After I took some time thinking how to implement what you described, I was surprised by the statement that syscalls are easier than ioctls.

That said, I am quite dislike extending the syscall semantical coverage by adding one-off operations. In the case of special fds, having all ops grouped together (under ioctl umbrella) localizes the interfaces and make them more comprehensive. This is, of course, my opinion, but all my experience supports the claim.

Historically I shared your view of system calls being reserved for general interfaces. But that formality was lost long ago. I have always had a dim view of ioctl and have been unhappy that they have been extended way past operating special features in devices. So these days I would rather see new interfaces added via system calls rather than being shoehorned into ioctls.

System calls are all generated from sys/kern/syscalls.master which provides full type information as well as In/Out/InOut details for pointer arguments. System calls also make it easier to have multiple arguments versus ioctl where the programmer has to define and build a structure if multiple arguments need to be handled. And the elements of that structure likely do not provide In/Out/InOut details. Thus imp's automation is easier to do on system calls versus ioctls where one has to dig down into them and figure out what they are doing which is especially laborious if they define a structure to pass in/out parameters and data.

In D38459#918469, @kib wrote:
In D38459#918467, @imp wrote:

I have code that automatically translates new system calls between host/target for bsd-user (though I'm not quite ready to commit it).
ioctl is possible, but has a lot more exceptional cases so is harder to do automatically. There's no automated annotation like there is for system calls.
The 'generic interfaces' require that I go and create annotations for each of the 'op codes' in that, because that doesn't exist in our current annotation.

So having a few extra system calls means they will be annotated and my job of getting it to work in bsd-user and others wanting to do automatic translation is easier.

For system calls you do need annotations, you cannot properly understand the purpose of any pointer passed to syscall otherwise. Is it in, out, in/out, or even just an abstract address like mmap/munmap/madvise arguments?

For the most part, our current annotations suffice. mmap is a bit special with some flags, but the others are not. And yes, you can understand most of the pointers for the system calls. They point to structures, which are well defined, arrays of structures, strings, paths, etc. We have almost everything annotated to do what we want, including generating better truss output automatically (though the flags on many of the sysctls need more annotation). A lost of this is very mechanical, and I'm trying to generate it all for many different uses.

For ioctls, the meaning of the command/arg is quite formalized, at least in BSDs. You are well aware that we encode both sizes and directions for copyin and copyout. After I took some time thinking how to implement what you described, I was surprised by the statement that syscalls are easier than ioctls.

For ioctls require that I parse more C code to dig that data out. And while many are relatively well behaved, we're growing more and more that just copy in a pointer and then the ioctl does 'stuff' from there. And there's no central repo for ioctls, so it's hard to get a complete list (though some of the stuff truss does compensates a bit for that).

That said, I am quite dislike extending the syscall semantical coverage by adding one-off operations. In the case of special fds, having all ops grouped together (under ioctl umbrella) localizes the interfaces and make them more comprehensive. This is, of course, my opinion, but all my experience supports the claim.

I can only comment on the one aspect of this that's easier for me.

Add brief comments explaining each of the tfd_jumped macros.

I would like to get this patch committed before the 14.0 freeze in August. I am personally in favor of making these syscalls, but I'd like some notice if we're not going this route.

I have this staged and plan to commit once I confirm universe builds.

I'm struggling with how to slice and dice it for the commit, but I'll figure that out... There's a number of aux things that may be their own commit(s).

Are there any tests?

I note there's been some decent around the best approach for this commit. I talked to @markj and @jhb and we agree that there's a number of up sides to doing it as a system call (automatic generation of compat code is easier, for example), and it avoid the issues brooks raised with the ioctl interface. I know kib disagrees, but it seems like we won't come to a unanimous decision here.

In D38459#947334, @imp wrote:

Are there any tests?

I used the epoll-shim timerfd tests:
https://github.com/jiixyj/epoll-shim/tree/master/test

They're made using ATF, so it wouldn't be difficult to bring them in-tree.

mjg added inline comments.
sys/kern/sys_timerfd.c
426

malloc M_WAITOK does not fail, so there is no point null checking....

438

this *does* fail and leaks tfd.

the only thing which can fail in this func is falloc. just do it first.

apart from missing timerfd_head locking mentioned by kib, I have not looked at other stuff

jfree added inline comments.
sys/kern/sys_timerfd.c
438

the only thing which can fail in this func is falloc. just do it first.

Thanks for the corrections. I'll add these changes in a new patch.

jfree marked an inline comment as done.

This patch has been applied to src under commit af93fea710385b2b11f0cabd377e7ed6f3d97c34.

There doesn't seem to be a way to associate this review directly with that commit, so I'm just going to abandon it.

This patch has been applied to src under commit af93fea710385b2b11f0cabd377e7ed6f3d97c34.

There doesn't seem to be a way to associate this review directly with that commit, so I'm just going to abandon it.

For future reference, under "Edit related objects" in the menu at top-right there is a "Edit commits" option you can use to associate commits with a review (I've gone ahead and done it for this review). Once you've done that you can then close a review without having to abandon it.