Page MenuHomeFreeBSD

Move the _has_waiters flag in POSIX semaphores into _count.
ClosedPublic

Authored by jhb on Oct 17 2014, 4:36 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Apr 16, 4:42 AM
Unknown Object (File)
Tue, Apr 16, 4:42 AM
Unknown Object (File)
Mar 10 2024, 10:06 PM
Unknown Object (File)
Mar 10 2024, 9:52 PM
Unknown Object (File)
Mar 10 2024, 9:52 PM
Unknown Object (File)
Mar 10 2024, 9:45 PM
Unknown Object (File)
Mar 7 2024, 10:09 PM
Unknown Object (File)
Mar 1 2024, 3:16 AM
Subscribers
None

Details

Reviewers
kib
jilles
adrian
Summary

The current POSIX semaphore implementation stores the _has_waiters flag
in a separate word from the _count. This does not permit both items to
be updated atomically in a portable manner. As a result, sem_post()
must always perform a system call to safely clear _has_waiters.

This change removes the _has_waiters field and instead uses the high bit
of _count as the _has_waiters flag. A new umtx object type (_usem2) and
two new umtx operations are added (SEM_WAIT2 and SEM_WAKE2) to implement
these semantics. The older operations are still supported under newly
added COMPAT_FREEBSD9/10 options. The POSIX semaphore API in libc has
been updated to use the new implementation. Note that the new
implementation is not compatible with the previous implementation.
However, this only affects static binaries (which cannot be helped by
symbol versioning). Binaries using a dynamic libc will continue to work
fine. SEM_MAGIC has been bumped so that mismatched binaries will error
rather than corrupting a shared semaphore. In addition, a padding field
has been added to sem_t so that it remains the same size.

Test Plan
  • The tools/regression/posixsem2 test passes.
  • A simple test program that just does sem_init(.., 1), sem_wait/sem_post loop in a single thread works fine (and makes no system calls in the wait/post loop).
  • A slightly more complicated program that creates a process-shared semaphore in a MAP_SHARED | MAP_ANON region and forks a child process for each CPU to do a similar wait/post loop also works fine. It makes a few system calls during startup when the child processes block on the semaphore before the main thread finishes forking them so it can do the initial sem_post's to release them and handle the loop for CPU 0. This also works fine. I could make a variant of this that uses threads instead of child processes if there is interest.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

jhb retitled this revision from to Move the _has_waiters flag in POSIX semaphores into _count..
jhb updated this object.
jhb edited the test plan for this revision. (Show Details)
jhb added reviewers: kib, jilles, adrian.
lib/libc/gen/sem_destroy.3
70

I will probably commit these manpage updates separately. The cross-references have been obsolete since FreeBSD 9.0.

lib/libc/gen/sem_new.c
440

The #if 0 here requires more explanation and is something I'd like your (collective) opinion on. See the #if 0 code in kern_umtx.c as well to see the kernel half of this.

The #if 0 toggles between doing a signal wakeup in a contested sem_post() vs a broadcast wakeup. In the broadcast wakeup case, all the waiting threads are resumed. This can result in a thundering herd if they are all scheduled and just go back to sleep again. However, if there are more threads than CPUs and they execute sequentially, then it is actually cheaper since the subsequent sem_post() operations that would wake up each additional thread would no longer have to wake up a thread (they are already sitting on the run queue and the waiters flag is clear). In addition, the implementation of sem_post() itself is a bit cheaper since the atomic op that bumps the count also clears the flag, so the kernel doesn't later have to conditionally clear the flag after signalling a thread.

I'm not sure which strategy is best. In the kernel we use a broadcast wakeup for mtx_unlock() under the assumption that you are more likely to run the given threads in sequence in which case they all have cheap lock/unlock operations (instead of having the lock contested the entire time requiring all lock/unlock to be expensive). I did this after reading about the same decision made in the same way for Solaris' mutexes in Solaris Internals. OTOH, POSIX semaphores are not quite the same as an in-kernel mutex. I think it also means that as an orthogonal point we should think about whether pthread mutexes should use broadcast on wakeup (I believe they only use signal right now).

sys/amd64/conf/GENERIC
56

In terms of committing, I would add the COMPAT_FREEBSD9/10 options in a separate commit before the semaphore changes that used them.

sys/kern/kern_umtx.c
2735

casuword3() accepts a volatile, so it doesn't need the __DEVOLATILE() hack.

Hmm, USEM_HAS_WAITERS | USEM_MAX_COUNT compares equal to -1 and will be interpreted as a fuword32/casuword32 failure. However, this combination is extremely unlikely to happen.

Perhaps operations on an _usem2 should be UMTX_OP_SEM2_WAKE etc instead of UMTX_OP_SEM_WAKE2.

An additional assumption with sem_post waking all waiting threads is that each waiting thread will post soon after its wait completes. This is a reasonable assumption for kernel mutexes, but may not be for userland semaphores. At least, holding a kernel mutex for a "long" time is definitely wrong, but keeping a userland semaphore's count at 0 for a "long" time can be perfectly valid (e.g. when using a semaphore to wait for an event).

Looks good to me otherwise.

I hadn't realized the -1 value for fuword32/suword32. I don't think there's much we can do about that
without altering SEM_VALUE_MAX. I assume we can't do that without breaking the ABI (in theory). However, we could just drop it by one and that would fix that case. I agree it's extremely unlikely.

I can name the operations either way. I just copied MUTEX_WAKE2's approach, but I believe that doesn't operate on a new type, so SEM2 might be clearer that it accepts a different type.

As I said, I don't really know what the best strategy to use for sem_post() is. It is probably safest to preserve the existing strategy (signal) rather than adopting a new one, but I do want to see what you all think.

jhb edited edge metadata.

Rename umtx ops from SEM_FOO2 to SEM2_FOO.

In D961#5, @jhb wrote:

I hadn't realized the -1 value for fuword32/suword32. I don't think there's much we can do about that
without altering SEM_VALUE_MAX. I assume we can't do that without breaking the ABI (in theory). However, we could just drop it by one and that would fix that case. I agree it's extremely unlikely.

suword() is not a problem, fuword and casu are. I always use copyin() instead of fuword() for this reason., and patch could do the same. For cas, the only solution is to create another function with the proper interface. It is just the amount of testing of non-x86 arches which stops me from doing this.

On my previous $job, we did problems due to java process spinning forever with -1 value in the lock/count field.

In D961#8, @kostikbel wrote:

suword() is not a problem, fuword and casu are. I always use copyin() instead of fuword() for this reason., and patch could do the same. For cas, the only solution is to create another function with the proper interface. It is just the amount of testing of non-x86 arches which stops me from doing this.

On my previous $job, we did problems due to java process spinning forever with -1 value in the lock/count field.

Hmm, I don't think copyin() is guaranteed to be atomic? It could in theory consist of a byte loop on
a given architecture in which case you might mix bytes of the word at different times? I think I'd prefer to leave it as it is for now. The rest of umtx uses the *uword() functions. If in the future that API is adjusted to return errors differently then umtx as a whole can be updated.

jhb edited edge metadata.

Remove the broadcast code.

Apart from the -1 collision (which Konstantin is working on via the fueword/casueword changes) does this look ok?

kib edited edge metadata.
This revision is now accepted and ready to land.Oct 24 2014, 6:41 PM