Page MenuHomeFreeBSD

Fix several issues with process group orphanage.

Authored by kib on Aug 18 2020, 8:40 PM.



This started as exploration for the reasons for r361967/r362910 assertion failures. Then Peter Holm was able to narrow down the problem to very easy reproduction with timeout(1) which pointed to the big issue with reaping.

There seems to be a lot of problems with calculation of pg_jobs which directs SIGHUP/SIGCONT delivery for orphaned process group:

  1. Re-calculation of the orphaned status for children of exiting parent was wrong, but mostly unnoticed when all children were reparented to init(8). When child can be reparented to a different process which could affect the child' job control state, it was not properly accounted for in pg_jobc.
  2. Lockless check for exiting process' parent process group is racy because nothing prevents the parent from changing its group membership.
  3. Exited process is left in the process group, until waited. This affects other calculations of pg_jobc.

Split handling of job control status on process changing its process group, and process exiting. Calculate increments and decrements for pg_jobs by exact checking the orphanage instead of assuming process group membership for children and parent. Move the call to killjobc() later under the proctree_lock. Mark exiting process in killjobc() with a new flag P_TREE_GRPEXITED and skip it for all pg_jobc calculations after the flag is set.

Add checker that independently recalculates pg_jobc value and compares it with the memoized process group state. This is enabled under INVARIANTS.

Tested by: pho

Diff Detail

rS FreeBSD src repository - subversion
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kib requested review of this revision.Aug 18 2020, 8:40 PM
kevans added inline comments.
689 ↗(On Diff #75972)

These aren't defined for !INVARIANTS builds

kib marked an inline comment as done.

Hide checkers calls under INVARIANTS.

jilles added inline comments.
436 ↗(On Diff #75976)

This call may unlock and relock proctree_lock, which appears safe at this time but restricts changes that can be made here.

865 ↗(On Diff #75976)


This revision is now accepted and ready to land.Aug 22 2020, 4:31 PM
kib marked 2 inline comments as done.Aug 22 2020, 5:20 PM
kib added inline comments.
436 ↗(On Diff #75976)

Yes, with the current code there was a race when short-circuiting killjobc, since parent can change its group after or during the check. With this code organization, race should be gone.

I tried to avoid dropping the lock by postponing revoke of the ctty devfs vnode after proctree_lock is dropped, but then it comes in interlock with process lock and then process spinlock till the thread is finally removed from CPU. I do not think it is safe to postpone revoke after exit(2) is signalling children. So I do not see a better code organization.

But I added a comment there.

kib marked an inline comment as done.

Add comment about placement of killjobc().
Fix typo.

This revision now requires review to proceed.Aug 22 2020, 5:21 PM
This revision was not accepted when it landed; it landed in state Needs Review.Aug 22 2020, 9:32 PM
This revision was automatically updated to reflect the committed changes.