Page MenuHomeFreeBSD

Manage thread signal mask using a shared word, instead of syscall.
ClosedPublic

Authored by kib on Oct 24 2017, 12:29 PM.

Details

Summary

A new syscall is added which registers a uint32_t variable as containing the count of blocks for signal delivery. Its content is read by kernel on each syscall entry and on AST processing, non-zero count of blocks is interpreted same as the signal mask blocking all signals.

Main purpose of the fast sigblock feature is to allow rtld to not issue two sigprocmask(2) syscalls for each symbol binding operation in single-threaded processes. Rtld needs to block signals as part of locking to ensure signal safety of the bind process, because signal handlers might need to lazily resolve symbol references. For multi-threaded processes, libthr intercepts all signals handlers which allows to delay signal delivery by manually raising the delivered signal upon unlock. This is not ideal, and I did not wanted to make libc intercept signal handlers too.

There is some rudimentary use of the fast sigblock in libthr, but it is not related to the critical sections, which are still use signal raising on exit. This is because critical sections have to handle more issues than only signals.

Benchmarks do not show a slow-down of the syscalls in micro-measurements, and macro benchmarks like buildworld do not demonstrate a difference. Part of the reason is that buildworld time is dominated by compiler, and clang already links to libthr. On the other hand, small utilities typically used by shell scripts have the total number of syscalls cut by half.

The biggest downside of the feature that I see is that memory corruption that affects the registered fast sigblock location, would cause quite strange application misbehavior. For instance, the process would be immune to ^C (but killable by SIGKILL).

Benchmark

stock
root@r-freeb43:/data/work # ( truss ./hello > /dev/null ) |& wc -l
      63
root@r-freeb43:/data/work # ./syscall_timing getppid
Clock resolution: 0.000000001
test    loop    time    iterations      periteration
getppid 0       1.016000291     2897051 0.000000350
getppid 1       1.009834382     2877363 0.000000350
getppid 2       1.000017563     2848788 0.000000351
getppid 3       1.027104944     2927389 0.000000350
getppid 4       1.017986489     2901363 0.000000350
getppid 5       1.010239487     2879487 0.000000350
getppid 6       1.004737854     2861549 0.000000351
getppid 7       1.016996867     2902481 0.000000350
getppid 8       1.000985560     2854247 0.000000350
getppid 9       1.023981476     2921559 0.000000350
buildworld -s -j 32
     2950.74 real     70371.54 user      2495.92 sys
     3033.48 real     70085.36 user      2482.85 sys
     2985.57 real     70240.54 user      2495.64 sys
     2927.52 real     70204.11 user      2486.19 sys
     3007.15 real     70140.09 user      2489.78 sys

fast sigblock
root@r-freeb43:/tmp # ( truss ./hello > /dev/null ) | & wc -l
      37
root@r-freeb43:/tmp # ./syscall_timing getppid
Clock resolution: 0.000000001
test    loop    time    iterations      periteration
getppid 0       1.000028797     2888175 0.000000346
getppid 1       1.050218490     3031970 0.000000346
getppid 2       1.027986764     2967578 0.000000346
getppid 3       1.046828208     3021258 0.000000346
getppid 4       1.000027060     2886588 0.000000346
getppid 5       1.027107249     2964878 0.000000346
getppid 6       1.021487553     2950501 0.000000346
getppid 7       1.001574229     2890893 0.000000346
getppid 8       1.009152278     2914871 0.000000346
getppid 9       1.042734749     3012841 0.000000346
buildworld -s -j 32
     2998.44 real     70147.80 user      2478.18 sys
     2935.23 real     70159.03 user      2490.75 sys
     2961.46 real     70072.53 user      2501.04 sys
     2939.97 real     70115.35 user      2504.80 sys
     2978.95 real     70006.44 user      2509.96 sys

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 28128

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kib added inline comments.Dec 17 2019, 4:06 PM
lib/libc/sys/fast_sigblock.2
67–74

If used properly (in other words, if there is no memory corruption), then it is. But if application wrote a garbage into the fast_sigblock word, it breaks of course.

I understand your point that this significantly increases the possible non-compliance for misbehaving programs, and this was one of the reasons why I did not rushed with the patch, which was written several years ago. But in the end, I think that it is not too bad, because the kernel integrity is maintained, and stuff like SIGKILL/SIGSTOP is not affected.

sys/kern/kern_sig.c
255–262

The list is obviously populated with signals that can be sent synchronously as result of fault. Ideally, we would only block async 'fake' signals sent with kill(2) of the same numbers.

Additional aspect is that we reset the signal disposition to default when delivering signal from fault, if the signal is blocked. So if the signals above are blocked, then the process is guaranteed to be killed.

I do not think this should be a fix in this review.

kib updated this revision to Diff 65752.Dec 17 2019, 4:07 PM

Use bool symbol for false.
Add assert before dec.

brooks added inline comments.Dec 17 2019, 10:00 PM
sys/kern/kern_proc.c
3031

I don't see how using kvaddr_t prevents that usage. I'd think that at most we'd talking about one cast in some OS dependent code.

This code also don't match the usage in procstat where a reference to a uintmax_t is used. It might work by accident on i386, but it's going to be nonsense on 32-bit BE systems.

kib added inline comments.Dec 17 2019, 10:20 PM
sys/kern/kern_proc.c
3031

kvaddr_t is 64bit on LP64, which means that compat32 code call would end up with kernel doing SYSCTL_OUT() 64bit value over 32bit length out argument. This should result in ENOMEM.

I will fix procstat, it arguably requires more substantial changes, it is not very useful to repeat the address for each signal number. I will add a new subcommand to procstat.

kib added inline comments.Dec 18 2019, 8:09 PM
sys/kern/kern_sig.c
255–262

I thought that the correct fix for this is to make it possible to only block async signals, while keeping synchronous delivery enabled, i.e. from faults. I even started implementing this with a flag adjustment to the sigprocmask(2) to enable/disable this quirk.

But in the process I realized that it is exactly the reverse behavior that is needed, faults should terminate the process if occured in our locked section, both because the handler installed is from application which cannot know internals of rtld, and due to recursion that you noted.

So the existing default behavior for blocked faulting signals is desired, and to get it we should stop removing the signals from the mask. The corresponding rtld change will be committed separately.

kib updated this revision to Diff 65790.Dec 18 2019, 8:18 PM

Stop blocking faulting signals.
trapsignal(): account for td_sigblock.

kib updated this revision to Diff 65834.Dec 19 2019, 11:24 PM

Add separate command fsigblock to procstat.

kib updated this revision to Diff 65835.Dec 20 2019, 12:47 AM

Fix auxv return value.

kib updated this revision to Diff 65838.Dec 20 2019, 1:26 AM

Print the name for AT_BSDFLAGS auxv.

brooks added inline comments.Dec 20 2019, 7:59 PM
usr.bin/procstat/procstat_sigs.c
215

I don't understand how this can work. uintmax_t is always __uint64_t, but for 32-bit ABIs the sysctl outputs 32-bits.

kib added inline comments.Dec 20 2019, 9:31 PM
usr.bin/procstat/procstat_sigs.c
215

Right, this should be uintptr_t. Thanks for noting.

kib updated this revision to Diff 65871.Dec 20 2019, 9:33 PM

Correct type for sysctl output.

kib edited the summary of this revision. (Show Details)Jan 6 2020, 4:42 PM
kib updated this revision to Diff 66538.Jan 9 2020, 8:11 AM

Rebase.
Add knob to disable fast_sigblock reporting, effectively disabling it.

kib edited the summary of this revision. (Show Details)Jan 13 2020, 4:07 PM

some minor comments, I would like to review the man page in detail still but please go ahead with commit if I don't get to it in time (and I'll post-commit review/edit as necessary)

lib/libc/gen/auxv.c
128–131

other than AT_OSRELDATE/AT_NCPUS these seem to be in alpha order before this change

lib/libc/sys/fast_sigblock.2
3

we can drop the All rights reserved.

libexec/rtld-elf/rtld_lock.c
304–314

shall we commit the whitespace fix first?

sys/kern/capabilities.conf
165–168

(despite commenting on alpha order elsewhere) here I think this should group with sigprocmask

sys/kern/imgact_elf.c
186–190

is this just for testing? i.e., would we remove this in the future?

kib marked 4 inline comments as done.Jan 13 2020, 5:04 PM
kib added inline comments.
libexec/rtld-elf/rtld_lock.c
304–314

The whole diff will be split for commit. As usual, there will be the kernel part, then rtld/libc/libthr parts, then utils. Trivial style changes should be also committed separately where possible.

sys/kern/imgact_elf.c
186–190

I revert the defaults in the update, but I want the control to stay. Even of only for emergency to allow people to boot if something goes wrong.

kib updated this revision to Diff 66690.Jan 13 2020, 5:04 PM
kib marked an inline comment as done.

Handle notes by Ed.

cem added a subscriber: cem.Jan 13 2020, 7:54 PM

It seems like this is really close to something that would be a generally useful feature (an optimization on sigprocmask(2)), and not only a hack for RTLD.

It seems like recursion needed by RTLD could be handled external to the kernel ABI (in fact, it looks like it already is handled in the existing sigprocmask code paths).

Then the userspace pointer could point to a bitset of signals to block, instead of a recursion counter. A 32-bit set is potentially sufficient, and preserves the nice property of being atomically writable on 32-bit arch. Signal 0 does not exist, so that bit could be repurposed as the pending flag. There are few signals defined in signal.2 beyond 31: SIGTHR, and SIGLIBRT. They could either be out of scope of this optimization, or remapped onto bits associated with signals that cannot be blocked: 9 (SIGKILL), 17 (SIGSTOP).

I would suggest the fast_sigblock syscall take a len or version parameter in addition to ptr (for FAST_SIGBLOCK_SETPTR). It can be EINVAL for len to be anything but 4 (or version 1) at this time, but it gives the syscall some flexibility towards wider signal masks or other future enhancements, and imposes no runtime cost. (In fact, I think doing this makes sense even if we just want to commit the RTLD hack today. It gives us a way to upgrade the syscall to the more general design in the future without adding more syscalls and auxvals.)

This patch is new to me and I am probably missing some subtleties you have already considered, so apologies in advance if this is a bad suggestion.

sys/kern/kern_sig.c
242

__read_mostly?

rpokala added inline comments.
lib/libc/sys/fast_sigblock.2
66

Did you mean count of blockers, or count of blocked threads, or something else? count of blocks doesn't scan as correct English to me.

sys/kern/imgact_elf.c
186–190

If this is a tunable, then report fast sigblock support should probably be something like enable fast sigblock support or fast sigblock support enabled or fast sigblock enabled; something other than report, in any case, since this is a control.

kib added a comment.Jan 14 2020, 2:57 PM
In D12773#507608, @cem wrote:

It seems like this is really close to something that would be a generally useful feature (an optimization on sigprocmask(2)), and not only a hack for RTLD.
It seems like recursion needed by RTLD could be handled external to the kernel ABI (in fact, it looks like it already is handled in the existing sigprocmask code paths).

If you propose to replace the counter with the bool (even per-signal) it would make the rtld locks use two very different code paths for sigprocmask(2) vs. fast_sigblock, I do not want it. Also I might need the recursion if libthr ever starts using this feature for critical sections, in addition to the signal masking for cancellation.

Then the userspace pointer could point to a bitset of signals to block, instead of a recursion counter. A 32-bit set is potentially sufficient, and preserves the nice property of being atomically writable on 32-bit arch. Signal 0 does not exist, so that bit could be repurposed as the pending flag. There are few signals defined in signal.2 beyond 31: SIGTHR, and SIGLIBRT. They could either be out of scope of this optimization, or remapped onto bits associated with signals that cannot be blocked: 9 (SIGKILL), 17 (SIGSTOP).

I tried to make a sane draft how would it interact with sigprocmask(2), esp. with the return of the existing mask, and cannot.
I propose to postpone this for later. If you or somebody can make a solid proposal, I will implement it.

I would suggest the fast_sigblock syscall take a len or version parameter in addition to ptr (for FAST_SIGBLOCK_SETPTR). It can be EINVAL for len to be anything but 4 (or version 1) at this time, but it gives the syscall some flexibility towards wider signal masks or other future enhancements, and imposes no runtime cost. (In fact, I think doing this makes sense even if we just want to commit the RTLD hack today. It gives us a way to upgrade the syscall to the more general design in the future without adding more syscalls and auxvals.)

fast_sigblock(2) already takes 'commands', so there is no problem with adding new set of operations and indicate it with the flags in AT_BSDFLAGS.

This patch is new to me and I am probably missing some subtleties you have already considered, so apologies in advance if this is a bad suggestion.

lib/libc/sys/fast_sigblock.2
66

This thing only works for current thread. Block is one increment of the variable.

I have to make this possible because both rtld and libthr might need to recurse into critical sections.

sys/kern/imgact_elf.c
186–190

It is not 'enable_fast_sigblock' or anything like that. The fast sigblock support is always enabled. The tunable governs the report of the presence of the feature in auxv.

jilles added inline comments.Jan 14 2020, 11:32 PM
lib/libc/sys/fast_sigblock.2
67–74

Hmm, my point was that in this scenario:

  • The process does not ignore SIGUSR1.
  • Thread 1 has fast signal blocking in effect.
  • Thread 2 does not have fast signal blocking in effect, and has SIGUSR1 unmasked.
  • Something sends SIGUSR1 to the process.
  • FOREACH_THREAD_IN_PROC visits thread 1 before thread 2.

the signal will not be handled until thread 1 disables fast signal blocking, or thread 2 does something that causes the process's signal queue to be checked. The same thing happens with libthr's deferred signals (but in that case only for signals that have a handler, not ones that use the default action).

When thread 1 masks SIGUSR1 normally, sigtd() will always pick thread 2, and the signal is handled without delay.

As a result, fast signal blocking is only suitable for short critical sections, not as a generic replacement for sigprocmask()/pthread_sigmask().

To @cem, making this work as a generic replacement for sigprocmask()/pthread_sigmask() would require a rather different design since accessing a different process's pageable memory while sending a signal (in sigtd()) seems a bad idea.

lib/libthr/thread/thr_sig.c
117

Right, libthr needs to wrap signal handlers for disabling cancellation and for something related to sigsuspend(2), but the postpone part can probably be replaced by fast sigblock (not in this review though).

sys/kern/imgact_elf.c
186–190

Perhaps change the description to "enable fast sigblock in new processes" since that is what it does in practice. Of course, code can be written that tries to use fast sigblock even if auxv does not indicate it is supported, but I think that can be ignored for such a short description.

cem added a comment.EditedJan 15 2020, 4:00 AM

@jilles,

making this work as a generic replacement for sigprocmask()/pthread_sigmask() would require a rather different design since accessing a different process's pageable memory while sending a signal (in sigtd()) seems a bad idea.

I don't think that's any different between the current design and the proposed design. I think your concern seems to boil down to: byzantine or misbehaving programs might be able to delay non-kill signal delivery to themselves. I'm having trouble seriously considering the idea that this scenario might arise by accident alone; most programs simply do not manipulate signal masks, and those that do usually do not manipulate them frequently. rtld is a special case. And even compute-heavy applications enter the kernel sometimes (or are preempted).

Yes, sigtd can't know anything other than some (potentially stale) cached value of remote process' threads' current block mask. That might be good enough. E.g., something like

@@ -2056,10 +2056,14 @@ sigtd(struct proc *p, int sig, int prop)
                return (curthread);
        signal_td = NULL;
        FOREACH_THREAD_IN_PROC(p, td) {
-               if (!SIGISMEMBER(td->td_sigmask, sig)) {
-                       signal_td = td;
-                       break;
-               }
+               if (SIGISMEMBER(td->td_sigmask, sig))
+                       continue;
+               if ((td->td_pflags & TDP_FAST_SIGBLOCK) &&   // Or perhaps td_sigblock_val is simply incorporated into td_sigmask
+                   (td->td_sigblock_val & (1u << sig)))
+                       continue;
+
+               signal_td = td;
+               break;
        }
        if (signal_td == NULL)
                signal_td = FIRST_THREAD_IN_PROC(p);

does not substantially change signal delivery latency to arbitrary threads in a process, in the scenario you've described. (The same racy solution would fix the specific scenario in the current design, too.)

Sure, maybe the cached mask is stale (only possible if the thread is currently running in userspace) and we pick a thread which results in your scenario. But that is possible under kib's design anyway. Consider your scenario, except magically sigtd chooses the "correct" Thread 2. Thread 2 is currently running in userspace and next enters the kernel to enable fast blocking (and then does not reenter the kernel). Thread 1 unblocks, but does not enter the kernel. Now the signal delivery is delayed indefinitely (or until thread 1 is de-scheduled, at least).

In fact, a very similar scenario is possible _today_ with sigprocmask:

  • The process does not ignore USR1
  • Thread 1 has sigprocmask masked USR1
  • Thread 2 has not masked USR1
  • Something sends USR1
  • sigtd picks Thread 2

Thread 2 next enters the kernel to sigprocmask block USR1. Now the signal will not be handled until thread 1 unblocks, or thread 2 unblocks. Sure, if thread 1 unblocks with sigprocmask, it is in the kernel and can reschedule_signals() immediately. But the requirement is still that some thread enters the kernel before the signal may be handled.

Fast-blocking processes can and should do something similar to reschedule_signals() on kernel entry/exit if process-scope signals are pending and blocked by the current thread. Threads in the same process can obviously access fastblock masks in the same process; the only additional delay is between any arbitrary thread in the process unblocking the signal, and any arbitrary thread in the process entering the kernel.

In D12773#508215, @cem wrote:

In fact, a very similar scenario is possible _today_ with sigprocmask:

  • The process does not ignore USR1
  • Thread 1 has sigprocmask masked USR1
  • Thread 2 has not masked USR1
  • Something sends USR1
  • sigtd picks Thread 2

Thread 2 next enters the kernel to sigprocmask block USR1. Now the signal will not be handled until thread 1 unblocks, or thread 2 unblocks. Sure, if thread 1 unblocks with sigprocmask, it is in the kernel and can reschedule_signals() immediately. But the requirement is still that some thread enters the kernel before the signal may be handled.

This is clearly expected behaviour and not a bug: if all threads in a process have a signal blocked, the signal will not be handled until at least one thread unblocks (which could take an arbitrary length of time).

If there are other threads that do not have USR1 masked, the sigprocmask call by thread 2 to block USR1 will wake up at least one (via reschedule_signals()).

In my scenarios, at least one thread does not have the signal blocked, but its handling can still be delayed indefinitely.

Fast-blocking processes can and should do something similar to reschedule_signals() on kernel entry/exit if process-scope signals are pending and blocked by the current thread. Threads in the same process can obviously access fastblock masks in the same process; the only additional delay is between any arbitrary thread in the process unblocking the signal, and any arbitrary thread in the process entering the kernel.

I guess this could work, but it adds a dependency on how often the kernel is entered that did not use to be there.

rpokala added inline comments.Jan 16 2020, 3:21 AM
lib/libc/sys/fast_sigblock.2
66

This thing only works for current thread. Block is one increment of the variable.

But what does that mean?

re-reads

Okay, it's incremented every time something says "block signals until I'm done", right? And its possible for multiple things to say that recursively, leading to a count greater than one?

If that's correct, then "blocks" means "requests to block signals"; in that case, it could be worded more clearly like this::

If the variable contains a non-zero count of requests to block signals (see below)

cem added inline comments.Jan 16 2020, 3:27 AM
lib/libc/sys/fast_sigblock.2
66

@rpokala It's just a recursion count, yeah. I might phrase the sentence as: "If the variable indicates that blocking is requested, then the kernel ..." (and just leave out the recursion detail, which isn't important to the description of the on/off behavior).

The recursion is already described later (sentence ending in l. 80).

kib added a comment.Jan 16 2020, 11:50 AM
In D12773#508215, @cem wrote:

Fast-blocking processes can and should do something similar to reschedule_signals() on kernel entry/exit if process-scope signals are pending and blocked by the current thread. Threads in the same process can obviously access fastblock masks in the same process; the only additional delay is between any arbitrary thread in the process unblocking the signal, and any arbitrary thread in the process entering the kernel.
I guess this could work, but it adds a dependency on how often the kernel is entered that did not use to be there.

I thought about the problem pointed out by Jilles for some time. I think that the best we can do is to check for fast_sigblock in AST and reschedule_signals from there if fast_sigblock masks signals. This is the place where we end up for the signal delivery which does not happen, and it should occur exactly when needed regardless of other activity.

Of course, it could occur that reschedule_signals() picks up the thread which also blocks the signal with fast_sigblock, but then it would round-robing the delivery until some thread unblocks.

kib updated this revision to Diff 66830.Jan 16 2020, 11:52 AM

reschedule_signals() from AST if fast_sigblock blocks.
Editorial changes to the man page as suggested by cem.

cem added a comment.Jan 16 2020, 8:09 PM

Thanks. I did not do a full pass over the manual page but just noticed one grammar suggestion in a nearby sentence to the change.

lib/libc/sys/fast_sigblock.2
67

nit: s/contained/containing/

sys/kern/capabilities.conf
673

I realize this is getting into stupid bikeshed territory, and feel free to ignore the suggestion, but perhaps fast_sigblock could be renamed sigfastblock or sig_fast_block such that the routine naturally sorts into the same pseudo-namespace as the rest of the signal manipulating routines?

sys/kern/imgact_elf.c
186–190

+1. I think jilles' proposed text is a reasonable proxy for actual functionality affected by the sysctl given both the constraints of the sysctl description length convention, and most common audience (non-developers).

sys/kern/kern_sig.c
2692

This version doesn't modify sigtd to be vaguely aware of fastblock threads, right? It seems like we may end up just burning CPU rescheduling signals onto the same thread. sigtd will by default pick curthread if it matches p and doesn't slow-block (td_sigmask) the signal.

If we do not modify sigtd, we probably want to wrap the call to reschedule_signals with <save td_sigmask>; SIGFILLSET(td->td_sigmask); and <restore td_sigmask> to prevent fastblock threads from scheduling masked signals onto themselves preferentially.

sys/kern/subr_trap.c
324–327

As mentioned above, it seems like this actually schedule all pending signals (not otherwise blocked by td_sigmask) onto td, completely ignoring td_sigblock_val.

kib updated this revision to Diff 66921.Jan 17 2020, 5:30 PM
  • rename to sigfastblock
  • teach reschedule_signals()/sigtd() about sigfastblock (prop arg to sigtd() is unused, it was removed)
  • some editing of the man page and knob description.
cem added a comment.Jan 17 2020, 8:43 PM

Thanks, looks good to me except for one concern below.

sys/kern/kern_sig.c
2063–2064

I think the last clause needs to be &&:

(!fast_sigblock || (td != curthread && td->td_sigblock_val == 0))

But I'm also not sure the !fast_sigblock logic is correct in the second use. In non-fastblocking callers, e.g., kill(1) -> tdsendsignal(), we will pass the parameter fast_sigblock = false and bypass checking td_sigblock_val. I think we should check td_sigblock_val if fast_sigblock is enabled on a thread, regardless of the caller.

I.e.,

((td->td_pflags & TDP_SIGFASTBLOCK) == 0 || (td != curthread && td->td_sigblock_val == 0).

Perhaps the large condition is more clearly expressed by inverting the sense:

FOREACH_THREAD_IN_PROC(...) {
    if (SIGISMEMBER(...))
        continue;
    if (td->td_pflags & TDP_SIGFASTBLOCK) {
        if (td == curthread || td->td_sigblock_val != 0)
            continue;
    }

    signal_td = td;
    break;
}

But I do not feel strongly about reversing the sense of the loop.

cem added inline comments.Jan 17 2020, 8:49 PM
sys/kern/kern_sig.c
2063–2064

In fact, the only reason td would be curthread is if the previous condition showed that curthread blocked the relevant signal. So I think this can be simplified to:

if (curproc == p && !SIGISMEMBER(curthread->td_sigmask, sig) &&
    (!fast_sigblock || curthread->td_sigblock_val == 0))
	return (curthread);
signal_td = NULL;
FOREACH_THREAD_IN_PROC(p, td) {
	/* Already preferentially handled above. */
	if (td == curthread)
		continue;
	if (SIGISMEMBER(td->td_sigmask, sig)
		continue;
	if ((td->td_pflags & TDP_SIGFASTBLOCK) && td->td_sigblock_val != 0)
		continue;

		signal_td = td;
		break;
	}
}
kib added inline comments.Jan 18 2020, 1:26 PM
sys/kern/kern_sig.c
2063–2064

Mmm, no, I did wanted it coded in the way I did.

First, td_sigblock_val should be only evaluated in the context of curthread. It could be stale because the value is only re-read on syscall entry and AST.

Second, I want sigfastblock to disrupt the existing signal delivery flow as little as possible. In particular, most significant change occurs at the AST processing signals (and in wakeup of sleeping thread, by necessity). I want signal delivery to act as if sigfastblock does not exist, and only reschedule at AST if it happens that the immediate delviery is not possible.

I agree that td == curthread check is somewhat redundant because the micro-optimization check at the start of function duplicates it, but I do not think it is critical to avoid that duplication.

mjg added a subscriber: mjg.Jan 18 2020, 4:38 PM

This fails to build for me on head:

ld: error: /usr/obj/usr/src/amd64.amd64/tmp/usr/lib/libpthread.so: undefined reference to __sys_sigfastblock
cc: error: linker command failed with exit code 1 (use -v to see invocation)

I'm not confident from just reading. Is the TDP_SIGFASTBLOCK flag getting cleared by rtld at some point for single-threaded case?

In general imo the kernel should automatically clear the flag after n syscalls without changes and let userspace know. Then should userspace continue using sigaction it will know whether it needs to call the kernel, all while not relying on it to eventually clear the flag to prevent avoidable calls to fuword.

kib added a comment.Jan 18 2020, 5:58 PM
In D12773#509808, @mjg wrote:

This fails to build for me on head:

ld: error: /usr/obj/usr/src/amd64.amd64/tmp/usr/lib/libpthread.so: undefined reference to __sys_sigfastblock
cc: error: linker command failed with exit code 1 (use -v to see invocation)

Did you do 'make sysent' before starting the build ? It works for me.

I'm not confident from just reading. Is the TDP_SIGFASTBLOCK flag getting cleared by rtld at some point for single-threaded case?

No, it is not. The flag is set when fast sigblock location is registered, and it is there until either libthr is loaded (which re-registers) or execve(2) is executed.

In general imo the kernel should automatically clear the flag after n syscalls without changes and let userspace know. Then should userspace continue using sigaction it will know whether it needs to call the kernel, all while not relying on it to eventually clear the flag to prevent avoidable calls to fuword.

I might do this for some future version.

mjg added a comment.EditedJan 19 2020, 4:25 AM
In D12773#509824, @kib wrote:
In D12773#509808, @mjg wrote:

This fails to build for me on head:

ld: error: /usr/obj/usr/src/amd64.amd64/tmp/usr/lib/libpthread.so: undefined reference to __sys_sigfastblock
cc: error: linker command failed with exit code 1 (use -v to see invocation)

Did you do 'make sysent' before starting the build ? It works for me.

That's my bad, works now.

I'm not confident from just reading. Is the TDP_SIGFASTBLOCK flag getting cleared by rtld at some point for single-threaded case?

No, it is not. The flag is set when fast sigblock location is registered, and it is there until either libthr is loaded (which re-registers) or execve(2) is executed.

In general imo the kernel should automatically clear the flag after n syscalls without changes and let userspace know. Then should userspace continue using sigaction it will know whether it needs to call the kernel, all while not relying on it to eventually clear the flag to prevent avoidable calls to fuword.

I might do this for some future version.

Is there a problem making the modification I suggested? It would also enable the optimization for all sigprocmask consumers, not only rtld. As it is this adds fuword call overhead for long running processes which I suspect can be avoided. If it happens later it may need some form of version checks to see if libc knows about disablement by the kernel.

I'm working on removal several branches from this area which includes the long broken process tracing support using procfs and reimplementatino of dynamic syscall handling which would synchronize with IPIs. The branches which remain will probably be possible to group into a flag field which can be tested at once to see if any extra processing is needed.

With the alteration proposed above fast signal blocking will not induce fuword calls for long running processes and thus will be an easy addition to the field (consequently making the entire mechanism come at no extra cost in the long run).

kib added a comment.Jan 19 2020, 7:17 PM
In D12773#509975, @mjg wrote:
In D12773#509824, @kib wrote:
In D12773#509808, @mjg wrote:

This fails to build for me on head:

ld: error: /usr/obj/usr/src/amd64.amd64/tmp/usr/lib/libpthread.so: undefined reference to __sys_sigfastblock
cc: error: linker command failed with exit code 1 (use -v to see invocation)

Did you do 'make sysent' before starting the build ? It works for me.

That's my bad, works now.

I'm not confident from just reading. Is the TDP_SIGFASTBLOCK flag getting cleared by rtld at some point for single-threaded case?

No, it is not. The flag is set when fast sigblock location is registered, and it is there until either libthr is loaded (which re-registers) or execve(2) is executed.

In general imo the kernel should automatically clear the flag after n syscalls without changes and let userspace know. Then should userspace continue using sigaction it will know whether it needs to call the kernel, all while not relying on it to eventually clear the flag to prevent avoidable calls to fuword.

I might do this for some future version.

Is there a problem making the modification I suggested? It would also enable the optimization for all sigprocmask consumers, not only rtld. As it is this adds fuword call overhead for long running processes which I suspect can be avoided. If it happens later it may need some form of version checks to see if libc knows about disablement by the kernel.

In fact yes, after some thinking about it I highly dislike that. This is not a kernel job to stop the mode requested by userspace. It would be similar e.g. for kernel to close some file descriptor because it decided that userspace executed some number of syscalls that did not touch them.

Although it is possible to do that without a version check, because sigfastblock(2) has commands and I can add a command 'activate, but deactivate after N syscalls', it would indeed require some way for kernel to inform userspace that it deactivated.

I think your proposal might be transformed into some optional LD_ env variable that would direct rtld to deactivate the fast sigblock mode after N binding were done. I do not have objections against this variant, esp. if combined with a knob to make libthr not use fast sigblock. But still, I do not see it as a requirement before the patch goes in. It can be done later if this architectural change appears to be sound and not cause unexpected effects.

I'm working on removal several branches from this area which includes the long broken process tracing support using procfs and reimplementatino of dynamic syscall handling which would synchronize with IPIs. The branches which remain will probably be possible to group into a flag field which can be tested at once to see if any extra processing is needed.
With the alteration proposed above fast signal blocking will not induce fuword calls for long running processes and thus will be an easy addition to the field (consequently making the entire mechanism come at no extra cost in the long run).

Your proposal requires to count syscalls (per-thread ?) and test for the limit. Anyway, I am not going to go nuts just to save one fuword.

mjg added a comment.EditedJan 20 2020, 2:56 PM
In D12773#510115, @kib wrote:
In D12773#509975, @mjg wrote:

In general imo the kernel should automatically clear the flag after n syscalls without changes and let userspace know. Then should userspace continue using sigaction it will know whether it needs to call the kernel, all while not relying on it to eventually clear the flag to prevent avoidable calls to fuword.

I might do this for some future version.

Is there a problem making the modification I suggested? It would also enable the optimization for all sigprocmask consumers, not only rtld. As it is this adds fuword call overhead for long running processes which I suspect can be avoided. If it happens later it may need some form of version checks to see if libc knows about disablement by the kernel.

In fact yes, after some thinking about it I highly dislike that. This is not a kernel job to stop the mode requested by userspace. It would be similar e.g. for kernel to close some file descriptor because it decided that userspace executed some number of syscalls that did not touch them.

I don't think the analogy holds. Should the dedicated "fastsigblock" syscall be removed and e.g., another argument added to sigprocmask syscall (to specify the area to read stuff from) it could be easily considered as just advisory. This would make the mechanism as something between libc and the kernel.

Although it is possible to do that without a version check, because sigfastblock(2) has commands and I can add a command 'activate, but deactivate after N syscalls', it would indeed require some way for kernel to inform userspace that it deactivated.
I think your proposal might be transformed into some optional LD_ env variable that would direct rtld to deactivate the fast sigblock mode after N binding were done. I do not have objections against this variant, esp. if combined with a knob to make libthr not use fast sigblock. But still, I do not see it as a requirement before the patch goes in. It can be done later if this architectural change appears to be sound and not cause unexpected effects.

Keeping this tied to rtld makes it harder to automatically use it for other code which calls sigprocmask. With my proposal there will be no need to modify of it. If they do it all the time they will stick to the fast variant (mostly), if they rarely do it it will clear itself for them and they wont pay for fuword.

I'm working on removal several branches from this area which includes the long broken process tracing support using procfs and reimplementatino of dynamic syscall handling which would synchronize with IPIs. The branches which remain will probably be possible to group into a flag field which can be tested at once to see if any extra processing is needed.
With the alteration proposed above fast signal blocking will not induce fuword calls for long running processes and thus will be an easy addition to the field (consequently making the entire mechanism come at no extra cost in the long run).

Your proposal requires to count syscalls (per-thread ?) and test for the limit. Anyway, I am not going to go nuts just to save one fuword.

It does not have to be perfect, it can just use the already existing per-cpu syscall count and just take it modulo % n == 0. Meaning sometimes it is going to disable it very soon and other times late, but it will even out in the long run.

version check would make it pessimal to go to fuword in the fallback since there would be code which will always need it

kib added a comment.Jan 20 2020, 5:14 PM
In D12773#510434, @mjg wrote:
In D12773#510115, @kib wrote:
In D12773#509975, @mjg wrote:

In general imo the kernel should automatically clear the flag after n syscalls without changes and let userspace know. Then should userspace continue using sigaction it will know whether it needs to call the kernel, all while not relying on it to eventually clear the flag to prevent avoidable calls to fuword.

I might do this for some future version.

Is there a problem making the modification I suggested? It would also enable the optimization for all sigprocmask consumers, not only rtld. As it is this adds fuword call overhead for long running processes which I suspect can be avoided. If it happens later it may need some form of version checks to see if libc knows about disablement by the kernel.

In fact yes, after some thinking about it I highly dislike that. This is not a kernel job to stop the mode requested by userspace. It would be similar e.g. for kernel to close some file descriptor because it decided that userspace executed some number of syscalls that did not touch them.

I don't think the analogy holds. Should the dedicated "fastsigblock" syscall be removed and e.g., another argument added to sigprocmask syscall (to specify the area to read stuff from) it could be easily considered as just advisory. This would make the mechanism as something between libc and the kernel.

We are not going to change sigprocmask(2). And if a new syscall is added, then it is not better or worse of sigfastblock(2).

Although it is possible to do that without a version check, because sigfastblock(2) has commands and I can add a command 'activate, but deactivate after N syscalls', it would indeed require some way for kernel to inform userspace that it deactivated.
I think your proposal might be transformed into some optional LD_ env variable that would direct rtld to deactivate the fast sigblock mode after N binding were done. I do not have objections against this variant, esp. if combined with a knob to make libthr not use fast sigblock. But still, I do not see it as a requirement before the patch goes in. It can be done later if this architectural change appears to be sound and not cause unexpected effects.

Keeping this tied to rtld makes it harder to automatically use it for other code which calls sigprocmask. With my proposal there will be no need to modify of it. If they do it all the time they will stick to the fast variant (mostly), if they rarely do it it will clear itself for them and they wont pay for fuword.

It cannot be blindly used instead of sigprocmask(2). The interfaces are very different: quickly mask everything vs. control each async signal, optionall obtaining the previous mask. And in fact the later is what makes it hard to integrate sigprocmask(2) and sigfastblock(2) into a sane user interface. It would require much design which is not obvious.

I'm working on removal several branches from this area which includes the long broken process tracing support using procfs and reimplementatino of dynamic syscall handling which would synchronize with IPIs. The branches which remain will probably be possible to group into a flag field which can be tested at once to see if any extra processing is needed.
With the alteration proposed above fast signal blocking will not induce fuword calls for long running processes and thus will be an easy addition to the field (consequently making the entire mechanism come at no extra cost in the long run).

Your proposal requires to count syscalls (per-thread ?) and test for the limit. Anyway, I am not going to go nuts just to save one fuword.

It does not have to be perfect, it can just use the already existing per-cpu syscall count and just take it modulo % n == 0. Meaning sometimes it is going to disable it very soon and other times late, but it will even out in the long run.

version check would make it pessimal to go to fuword in the fallback since there would be code which will always need it

There is no need for version check. If this ever going to be done, another command should be added, besides SIGFASTBLOCK_SETPTR.

This revision was not accepted when it landed; it landed in state Needs Review.Sun, Feb 9, 11:53 AM
This revision was automatically updated to reflect the committed changes.