Page MenuHomeFreeBSD

Synchronize the sigfastblock value on entry to sigsuspend
ClosedPublic

Authored by jtl on Mar 11 2021, 8:46 PM.

Details

Summary

We have seen several cases of processes which have become "stuck" in kern_sigsuspend(). They exhibit these symptoms:

The kernel shows that the user has blocked signals:

(kgdb) print $td->td_pflags
$1 = 0x80000101
(kgdb) print $td->td_sigblock_ptr
$2 = (void *) 0x801012038
(kgdb) print $td->td_sigblock_val
$3 = 0x10

The userspace side shows that is has unblocked signals:

#0  _sigsuspend () at _sigsuspend.S:4
4	_sigsuspend.S: No such file or directory.
(gdb) print *(int *)0x801012038
$1 = 0x0

The kernel sigblock value is out of sync with the userspace value. Normally, this would get resolved on the next syscall return when ast() would read the userspace sigblock value and do the right thing. However, when we sleep while waiting for signals, we instead deadlock. (Because the signals are blocked, cursig() says there are no signals. So, kern_sigsuspend() simply loops without ever returning.)

We should probably consider adding a similar check to kern_sigtimedwait().

It appears this condition became noticable sometime between main-c255535-gd189a74dfdcd and main-c256326-g24a8f6d36996; however, the condition happens infrequently enough that it will be very hard to bisect.

Test Plan

I can't reliably reproduce this condition, so I can't easily test it.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

jtl requested review of this revision.Mar 11 2021, 8:46 PM

It is strange indeed, and it sounds more like a self-inflicting action from userspace. Code in rtld or libthr should not leak sigfastblock block, but of course bugs are possible.
The values you show in the userspace/kernelspace block word make a random memory corruption unlikely.

I think that the proposed patch is a good protective measure, and yes kern_sigtimedwait() should be patched same way.

This revision is now accepted and ready to land.Mar 12 2021, 11:47 AM

This is the change I am planning to commit once the regression tests finish running.

This revision now requires review to proceed.Mar 12 2021, 5:35 PM
In D29225#654397, @kib wrote:

It is strange indeed, and it sounds more like a self-inflicting action from userspace. Code in rtld or libthr should not leak sigfastblock block, but of course bugs are possible.

I don't think the user space code is leaking a sigfastblock. It looks like libthr is modifying the block value up and down around the fork() call (and the code where we see this error does call fork() multiple times in fast succession). But, libthr must not see the SIGFASTBLOCK_PEND flag when it is unblocking signals, which means it doesn't run the sigfastblock(SIGFASTBLOCK_UNBLOCK) syscall. And, for some reason, we must have cached the old value of the sigfastblock word before it was decremented.

The values you show in the userspace/kernelspace block word make a random memory corruption unlikely.

Agreed. And, it seems to always be 0x10.

I think that the proposed patch is a good protective measure, and yes kern_sigtimedwait() should be patched same way.

Thanks! I've updated the diff to include the change for kern_sigtimedwait(), and to remove my musings from the comment in kern_sigsuspend(). I'm running the regression tests now. Once that is finished I plan to commit the change.

This revision is now accepted and ready to land.Mar 12 2021, 5:47 PM