Page MenuHomeFreeBSD

jail: Don't allow resurrection of dead jails
Needs ReviewPublic

Authored by jamie on Jan 14 2021, 5:32 AM.

Details

Reviewers
bz
Summary

A design error of my original jail work was to allow jails that weren't fully removed to be be brought back from the almost-dead, as if they had never been removed. This was a way to allow the other design error of allowing jails to specified by jid, without having to wait (usually a few minutes, often much longer) for the old jail to release all resources. This has always been a messy solution, as it means that a "created" jail may or may not have remnants of its previous existence.

The problem is users are now well-used to being able to specify jails by jid (even though it's not encouraged), and with "-d" or "allow.default" to re-create them without waiting. The solution is to continue to allow this at the user level, without ever re-using removed jails at the kernel level.

It's already the case that a new jail can be created with the same name as one that's waiting to let go of resources. With this patch, it's also possible to created a jail with the same jid as an existing dying jail. This is done by silently renumbering the old dying jail to a new dynamic jid, thus freeing it up for the new jail to use. This obsoletes the ALLOW_DYING flag to jail_set(8), though it remains allowed as a no-op (and is still used by jail_get(8)).

The upside is that newly created jails are always pristine, with only the parameters set that are given in their creation. The downside is that in this once specific instance, jail IDs are no longer immutable. This may confuse users that are specifically looking at dead jails by jid, and not expecting a new jid to show up in an old jail, but that's a small and presumably very uncommon price to pay.

There have been rumblings of security problems associated with jail resurrection, and this stops any of these problems from occurring.

The is the second of two related changes, the other being D27876. I though I'd split the work into roughly equal parts, but it turns out this is the much bigger chunk, with seemingly unrelated part of the jail code refactored to accommodate the main change.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

jamie requested review of this revision.Jan 14 2021, 5:32 AM
jamie created this revision.

Add full context to diffs

mjg added inline comments.
sys/kern/kern_fork.c
719 ↗(On Diff #82280)

This adds avoidable overhead. Combined with prison_proc_hold the lock is taken and unlocked twice.

Instead you can add a read mostly-sleepable lock to protect against fork. It would scale perfectly in the common case and only be a burden when someone is racing killing the jail.

pseudo-code wise it would be like this:

int
prison_fork_enter(struct prison *pr)
{
        int error;

        error = 0;
        rms_rlock(pr->pr_forklock);
        if (__predict_false(pr->pr_state == PRISON_STATE_DYING))
                error = EAGAIN;
        return (error);
}

fork:

if (!prison_fork_enter(cred->cr_prison)) {
        PROC_LOCK(p1);
        kern_psignal(p1, SIGKILL);
        PROC_UNLOCK(p1);
        goto fail0;
}

do_fork(td, fr, newproc, td2, vm2, fp_procdesc);
prison_fork_exit(cred->cr_prison);

then when you switch the state, you rms_wlock() around it.

As a result you have an invariant that by the time you get the lock for writing, any pending forks in the jail wait and anyone who completed managed to finish and can get a signal when you kill it.

This will also open a way to move pr_ref to a per-cpu scheme and avoid taking the prison mutex altogether.

sys/kern/kern_fork.c
719 ↗(On Diff #82280)

I have to stress rms_rlock and runlock don't do any atomics. which is the crux here.

You might also consider adding this, below the EXAMPLES section header:
".Sh COMPATABILITY
The
.Va allow.dying
parameter has been deprecated and doesn't work like it used to, but is still supported."

Perhaps with a little explanation of how it works now?

usr.sbin/jail/jail.8
207–209 ↗(On Diff #82280)

I would suggest changing this to
"Allow making changes to a dying jail. This is deprecated and is equivalent to the
.Va allow.dying
parameter, which is also deprecated."

That will make it match the verbiage for -l, -n, -s, and -u/-U, and is a little easier to read.

sys/kern/kern_fork.c
719 ↗(On Diff #82280)

I'm not familiar with read mostly-sleepable locks; I'll read up on them, to make sure I at least begin to understand.

jamie marked an inline comment as not done.Jan 16 2021, 6:50 AM
jamie added inline comments.
sys/kern/kern_fork.c
719 ↗(On Diff #82280)

True, the rms lock would remove one mtx overhead from do_fork. It would add a certain amount of complexity to kern_jail_set and prison_deref though (the places that set pr_state), because they will have to juggle the sleepable lock with the prison mutex.

Still, fork is so much a more common code path that the jail lifecycle, that it makes sense to keep that path clean.

719 ↗(On Diff #82280)

This will also open a way to move pr_ref to a per-cpu scheme and avoid taking the prison mutex altogether.

So something like a combination of refcount(9) (which is atomic-based and not per-cpu) and counter(9) (which has a caveat that a fetched value is "not guaranteed to reflect the real cumulative value for any moment"). Ideally both increment and "decrement and test for zero" would be atomic-less in the common case.

Is there such a system currently in the kernel? It seems most things either use refcount(9) or mtx-protected non-atomic counters like jail does.

sys/kern/kern_fork.c
719 ↗(On Diff #82280)

The kernel does not have a self-contained routine which does the trick, but I would argue it's not needed here.

General idea is to take the per-cpu lock for reading, verify per-cpu operation is still enabled and then modify the per-cpu state in whatever manner applicable. If it's not allowed, you fallback to mutex-protected case. Then you would have both per-cpu and a global counter, all of which can be modified in arbitrary manners in per-cpu mode. But if the mode gets disabled, you sum up the per-cpu state with the global counter and store it there. From that point on the only legal modification is to the global counter which provides an exact value.

I think moving refcounts around is not necessary for the purpose of this review though.

sys/kern/kern_fork.c
719 ↗(On Diff #82280)

pseudo-code wise it would be like this:

int
prison_fork_enter(struct prison *pr)
{
        int error;

        error = 0;
        rms_rlock(pr->pr_forklock);
        if (__predict_false(pr->pr_state == PRISON_STATE_DYING))
                error = EAGAIN;
        return (error);
}

fork:

if (!prison_fork_enter(cred->cr_prison)) {
        PROC_LOCK(p1);
        kern_psignal(p1, SIGKILL);
        PROC_UNLOCK(p1);
        goto fail0;
}

do_fork(td, fr, newproc, td2, vm2, fp_procdesc);
prison_fork_exit(cred->cr_prison);

then when you switch the state, you rms_wlock() around it.

This has a deadlock problem. allproc_lock has higher precedence than allprison_lock - I think for a number of reasons that predate my time here, but for certain that's built in to the racct system, which has many places that lock wrap allprison_lock around allproc_lock. And allprison_lock needs higher precedence than this pr_forklock, since I would need it inside e.g. prison_find to avoid invalid/new prisons. A sample deadlock is:

A: fork / do_fork
B: jail_set / prison_racct_modify
C: jail_remove / prison_deref

1: A rlock pr_forklock
2: B slock allproc_lock
3: C xlock allprison_lock
4: A xlock allproc_lock, wait on 2B
5: B xlock allprison_lock, wait on 3C
6: C wlock pr_forklock, wait on 1A

So I could still use this lock around pr_state, which I would check inside do_fork as I do now, which would still save a pt_mtx lock. But the race case would have to be creating and killing the new process rather than avoiding it with EAGAIN.

I still like the rm lock idea for it, not just in fork, but because I think I may be able to avoid pr_mtx in prison_find and other such allprison loops (after accounting for pr_ref which I discovered I need to move to lockless for other reasons).

But does it still need to be a sleepable lock? Or should it be even? You mentioned the lack of atomics as a selling point, but a quick read of regular _rm_rlock suggests its common case is also no-atomic (though I don't know just how common that common case is).

sys/kern/kern_fork.c
719 ↗(On Diff #82280)

It could be that by "higher precedence" I mean "lower precedence" but hopefully you get the idea.

sys/kern/kern_fork.c
719 ↗(On Diff #82280)

The deadlock at hand should not be present. Worst case, if code flow really induces it, you can play tricks like setting the flag at some point, dropping the locks, then rms_wlock/rms_wunlock to synchronize against fork.

Update for various changes recently checked in to kern_jail.c.

Also change the locking around pr_ref, pr_uref, and pr_state. Now they require both the prison mutex and allprison_lock to change them (in the ref case, that means to change whether they're zero). Among other things, this means that a shared lock on allprison_lock is sufficient for these to be valid, which gets rid of a number of cases of lock/check/unlock.

Another change to pr_ref is that instead of the prison not being removed until allprison_lock could be held, the reference count itself won't go to zero until then. This means that a prison with zero references will never be seen in a (properly locked) perusal of the allprison list.

Due to this, and a careful tracing of lock interplay, I've determined that the code added to do_fork() no longer needs to lock the prison when checking if it had died while the process was PRS_NEW. Also, the inforporated refcount change means the prison also isn't locked when do_fork() increased the prison process count.

I also refactored the prison_deref and prison_deref_kill code a little more. Aside from KASSERT, almost all reference checks are done with the return values of the refcount(9) functions, rather that a separate refcount_load() call.

On the man pages, I made something close to the recommended wording change to jail(8). I chose not to add the changes to a COMPATIBILITY section, since so much of jail(8) is content deserving to be in such a section (given the major changes from the old version of the command line which is still supported). I also added a similar change to the JAIL_DYING flag part of jail(2).

I am not sure I like the idea of making jid's a first-class citizen just because our mgmt interface once did and we kept it for historic purpose and backward compat and habits are hard to change.

I am also not entirely opposed to the idea.

I have a major concern expressed in the middle of the review (at the place I though that might happen)>

I cannot in its current state review this in any detail. I would have loved this to be a series of changes rather than a big blob. recounting, splitting out "is_.." functions, .. could probably all be done independently without any function changes intended.

That would have left the actual problem here a lot more contained and in the meantime the code cleaned up a lot more (on of the outcomes of all the rework here that I see).

I fear there is not enough automated test cases (especially for edge cases) to make sure we do actually not leak a ref anywhere on the changes and with that (and the change of behaviour) this is a bit tricky for 13 but certainly great to do for 14. So depending on what you want to target you need to make that judgement call (maybe with others, such as re@).

sys/kern/kern_jail.c
161–164

You could also just add PD_KILL at 0x20 and leave the other flags on the same bits?

1204

I am pondering if "dying jails" should be allocated from the end of the ID space;

imagine jid=2 is dying immediately, jid=3...7 are running, and jid=8 is not started yet; then you probably pick jid=8 (I assume from the function name) and when jid=8 tries to start jid 8 is already used and the confusion continues as this change kind-of even makes it worse rather than better.

As I am expecting that to become an allocator problem ... and lastid .. unless we split the ID space in half or do some other nasty thing...

In D28150#634852, @bz wrote:

I am not sure I like the idea of making jid's a first-class citizen just because our mgmt interface once did and we kept it for historic purpose and backward compat and habits are hard to change.

I am also not entirely opposed to the idea.

I would rather go back and not have made the decision to make jids first-class, but they have been for a while, and we're left with the choice of workarounds to keep them this way (which this is), or changing the policy and saying "you may no longer specify the JID of a new jail." My problem with the latter approach is POLA; I don't know how many people use fixed jids, but I'm sure there are some.

Back-compat habits are certainly hard to break: I tried to deprecate jail(2) and the old system-wide jail bahavior sysctls a couple years ago, and was told I couldn't do anything until I had buy-in from the one or two ports that were still using the old interface. Attempts to contact anyone regarding those ports turned out to be a black hole, and so the old bad interface remains. I wonder if removing fixed jids would find the same fate.

sys/kern/kern_jail.c
161–164

The reason I did it this way was to keep a bit of separation between the action flags and the current-state flags. In fact, if I'm changing those numbers, it might even be better to start PD_LOCKED at 0x10 or even 0x100.

1204

If by "jid=8 is not started yet" you mean that nothing at all has been done with jid=8, then yes it would be picked. If you mean it's in the process of creation and not ready yet, then jid=9 would be picked, so no problem.

get_next_prid is based on the code that used to be inside kern_jail_set, moved into its function because in one iteration of this code I called it in two different places. Even though it's now only called in one, I'm keeping it in its own function because I should have been better with my code organization all along.

1204

Currently, lastid sets a limit on the total number of jails, current or dying. With the potential renumbering of dying jails, lastid still sets a limit on the total number of jails, current and dying. So I don't expect any more of an allocator problem than we have now.

Generally, I see two scenarios:
1: well-behaved users who take whatever jid is given to them. get_next_prid will only ever be called for new jails, and at least for the first million jail creations will take the common case of not even looking through the list for a free jid.
2: users who want to specify things by jid, and may re-start jails that are in the dying state. get_next_prid will only ever be called to renumber the dying jails to an unused ID, and once the counter gets past the range of static jids that are being used, it too will usually just pick the next jid without having to traverse the list.
True, there could be a mixed case. But even then, it's little worse than the current case where adding a new jail will either check that a jid is available or look for a new one, only as it may have to do both for a dying jail replacement.

JID partitioning would either cut down the number of available active jails, limit the number of dying jails (which probably wouldn't be an issue), or convert dying jails to jids greater than JAIL_MAX.

sys/kern/kern_jail.c
1204

I mean if my config has 50 jails and I am starting to rely on jid because the guarantee is now basically coming in and while I am starting all 50 jails one of the earlier jails already dies then the 1..50 guarantee is not there either and I still cannot use "JID" for anything unless I dynamically keep it at the end of start and with that I am no better or worse of using "name", right?

So what does this change really buy me? The answer is probably nothing? The only thing is that I never started to use names and should have a long time ago?

Or am I getting something wrong here?

sys/kern/kern_jail.c
1204

This isn't about relying on the jid staying under a certain number when allocated dynamically; that works exactly the same as it has before.

This is about you having 50 jails with static-JID definitions like:
1 { ip4.addr 10.0.0.1; }
2 { ip4.addr 10.0.0.2; }

Currently, jail 1 dies with a TCP keepalive or other such thing that keeps it in dying state, and you can "create" it again with the same jid using "-d" or "allow.dying", which really just brings it back to life.

This doesn't change much from the user perspective: you can still re-create jail 1 when it's dying (with or without the now-unused allow.dying), but you'll get a new jail while the old one gets shoved off to jid 51 or whatever. Either way, the user sees "jail -c 1" working as expected.

The difference, and the reason for the patch, is that now the new jail 1 is always in a consistent newly-created state, without potential problems from something in the old jail state and jail.conf.

Part of this is to ease potential security problems (I don't have any examples cleaner is better). Part is with a guarantee that dead jails stay dead, more aggressive action can be taken to clean up dying jails. Notably, the potential advantages are the same for other scenarios of making dying jails totally invisible to userspace, or of not allowing static jids.

sys/kern/kern_jail.c
1204

To my understanding the problem you are solving is "everything is running and something gets ""restarted""".

Your above explanation has the implicit understanding that no jail dies while all jails are initially started. That is not guaranteed. If I re-created your jail 1 the "now dying one" may be on jid 42 which is < 50 and suddenly jail 42..50 are on 43..51.

My problem is that "during startup your change now can screw my expectations of I start <n> jails they'll get jid 1..<n>" as those jids are all unused, because the "dying jail" consumes an extra JID on "restart" given creation of <n> jails is not atomic and a "restart" can happen before all are running.

So one problem solved another one introduced; hence my suggestion of moving dead jails to the end of the jail ID space rather than to the next free ID initially to reduce the likelihood of collision. If we do something like that then the startup problem also goes away and I'd be happy I think.

usr.sbin/jail/jail.8
913 ↗(On Diff #82895)

This *is* deprecated

sys/kern/kern_jail.c
1204

OK, it took a while, but I think get it now :-).

Yes, that would happen if you started jails 1-41, and 1 died, and then you restarted it (explicitly setting it to 1), and then you started what you thought would be 42-50 which would turn out to be 43-51.

The dying jails don't get renumbered when they're dying. That only happens when a newly created jail wants to use the jid that a dying one currently occupies. So this problem scenario only happens when jails are implicitly numbered at first, and then explicitly numbered to their old jids on re-creation, and that re-creation can happen before all the jails are first created. Yes, in that scenario, the jids may not be as expected, and counting down the renumbered dying jails from the top would solve the issue.

This is now the culmination of D28419, D27876, D28458, D28473, and D28515. The only thing remaining in this patch is the part that doesn't resurrect dying jails via jail_set, and instead renumbers the dying jails as necessary, and the userspace/man changes from before.

This has the feature to use a different ordering scheme for the renumbered jails, counting down from the top of the JAIL_MAX range instead of up from 1. This adds a function get_next_deaded, which is a backwards version of get_next_prid with the loops in reverse order, yielding a prison to insert it after instead of before, and using a different static variable for the current dying-only JID.