Page MenuHomeFreeBSD

jobc: correct check for ignored children.
AbandonedPublic

Authored by kib on Dec 29 2020, 7:23 PM.

Details

Reviewers
jhb
Summary

Process might be on the debugger children list without P_TREE_ORPHANED set if e.g.

  • process is grandchild of the debugger,
  • process parent (child of the debugger) exited.

Then reapper is recorded as o_pptr for the process, but the process itself is not put on the orhpan list of reapper, since it is killed anyway.

This situation is caught if instead of P_TREE_ORPHANED, we check for jobc_parent() equal to the debugger, and ignore the process for job control counting otherwise.

Test Plan

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

kib requested review of this revision.Dec 29 2020, 7:23 PM
kib created this revision.
kib edited the summary of this revision. (Show Details)
kib edited the test plan for this revision. (Show Details)

I don't fully appreciate the code change, but the log message makes sense, and this does resolve the test case I gave you.

This revision is now accepted and ready to land.Dec 30 2020, 12:35 AM

I got a panic with this patch applied (I believe) while building a kernel:

panic: pgrp 49288 0xfffff8006370e080 pg_jobc 1 cnt 0
cpuid = 2
time = 1609359203
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0093da67f0
vpanic() at vpanic+0x181/frame 0xfffffe0093da6840
panic() at panic+0x43/frame 0xfffffe0093da68a0
check_pgrp_jobc() at check_pgrp_jobc+0x1ba/frame 0xfffffe0093da68e0
killjobc() at killjobc+0x211/frame 0xfffffe0093da6940
exit1() at exit1+0x6af/frame 0xfffffe0093da69b0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe0093da69c0
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe0093da6af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0093da6af0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x8003b2f0a, rsp = 0x7fffffffb658, rbp = 0x7fffffffb670 ---
...
(kgdb) p *pgrp
$5 = {pg_hash = {le_next = 0x0, le_prev = 0xfffffe00042a2440}, pg_members = {lh_first = 0xfffff800696d6000}, 
  pg_session = 0xfffff80003f1bd00, pg_sigiolst = {slh_first = 0x0}, pg_id = 49288, pg_jobc = 1, pg_mtx = {
    lock_object = {lo_name = 0xffffffff811a1219 "process group", lo_flags = 21168128, lo_data = 0, 
      lo_witness = 0xfffff8007fd74380}, mtx_lock = 18446741877169321728}}
(kgdb) p pgrp->pg_members
$6 = {lh_first = 0xfffff800696d6000}
(kgdb) p pgrp->pg_members.lh_first
$7 = (struct proc *) 0xfffff800696d6000
(kgdb) p pgrp->pg_members.lh_first->p_comm
$8 = "sh", '\000' <repeats 17 times>
(kgdb) p pgrp->pg_members.lh_first->p_pid
$9 = 49288

The exiting process was a child of the first member of the process group:

(kgdb) p td->td_proc->p_pid
$12 = 49289
(kgdb) p td->td_proc->p_pptr
$13 = (struct proc *) 0xfffff800696d6000
(kgdb) p td->td_proc->p_pptr->p_pid
$14 = 49288
(kgdb) p td->td_proc->p_pgrp
$15 = (struct pgrp *) 0xfffff8006370e080

The exiting process was still in the pgrp at this point and the group had 2 members:

(kgdb) p td->td_proc->p_pgrp->pg_members.lh_first->p_pglist.le_next
$20 = (struct proc *) 0xfffff800694de000
(kgdb) p td->td_proc->p_pgrp->pg_members.lh_first->p_pglist.le_next->p_pglist
$21 = {le_next = 0x0, le_prev = 0xfffff800696d60d8}

The panic was in the check_pgrp_jobc() before getting to the code changed by this review, though perhaps an earlier fix_jobc_enterpgrp had neglected to bump the count when it should have?

Yes Peter reported panics after the change, which is why I postponed commit.

I probably do not want to analyze the reports because it is too much complicated, I want to drop the jobc at all, as discussed elsewhere.