Page MenuHomeFreeBSD

jail: allow root to implicitly widen its cpuset to attach
ClosedPublic

Authored by kevans on Feb 26 2021, 10:25 PM.

Details

Summary
The default behavior for attaching processes to jails is that the jail's
cpuset augments the attaching processes, so that it cannot be used to
escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to.

This is problematic when root needs to manage jails that have disjoint
sets with whatever process is attaching, as this would otherwise result
in a deadlock. Therefore, if we did not have an appropriate common
subset of cpus/domains for our new policy, we now allow the process to
simply take on the jail set *if* it has the privilege to widen its mask
anyways.

With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.

PR:             253724
Differential Revision:  https://reviews.freebsd.org/D28952

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Allow jailed root to also take its child jails' cpusets

Might PRIV_SCHED_CPUSET be sufficient for this? If a process has the ability to explicitly expand the current cpu list, it makes sense for it to be able to implicitly do so when attaching to a jail.

Might PRIV_SCHED_CPUSET be sufficient for this? If a process has the ability to explicitly expand the current cpu list, it makes sense for it to be able to implicitly do so when attaching to a jail.

Hmm... yeah, I think the semantics there make plenty of sense. I will switch to that one-

Might PRIV_SCHED_CPUSET be sufficient for this? If a process has the ability to explicitly expand the current cpu list, it makes sense for it to be able to implicitly do so when attaching to a jail.

how does this, or the current approach apply to jails in jails (which are under a specific CPU set?

how does this, or the current approach apply to jails in jails (which are under a specific CPU set?

They would remain restricted. Jail cpusets are applied to child jails as well, in that a child can only have the same CPUs as the parent (or fewer). So it's generally going to be the case that a process in a parent jail will already have CPUs available in the child jail.

The big that inspired this is a system where all non-jailed processes (from init on down) use different CPUs than the jails do. While something similar could be desirable in a jail, this patch wouldn't allow for it.

This might be a reason to keep PRIV_JAIL_CPUSET, and have it generally be available to virtual as well as real root.

This might be a reason to keep PRIV_JAIL_CPUSET, and have it generally be available to virtual as well as real root.

Reading what I wrote, I don't know that there's any point to that. That would put us right back where we were before we made the original change to cpusets around jexec (D27298).

But how about this: first at least try using the intersection of the current and jail sets (whether or not currently jailed), and only as a EDEADLK fallback punt to just using the jailed set.

This might be a reason to keep PRIV_JAIL_CPUSET, and have it generally be available to virtual as well as real root.

Reading what I wrote, I don't know that there's any point to that. That would put us right back where we were before we made the original change to cpusets around jexec (D27298).

Note that PRIV_SCHED_CPUSET is available for jailed root, too,

But how about this: first at least try using the intersection of the current and jail sets (whether or not currently jailed), and only as a EDEADLK fallback punt to just using the jailed set.

Yeah, I think that makes sense -- we're talking about EDEADLK fallback if the priv is also set, so that it naturally gets cpuset down if it can?

But how about this: first at least try using the intersection of the current and jail sets (whether or not currently jailed), and only as a EDEADLK fallback punt to just using the jailed set.

Yeah, I think that makes sense -- we're talking about EDEADLK fallback if the priv is also set, so that it naturally gets cpuset down if it can?

Right. It looks like it's just a matter of moving the priv check inside the cpuset_setproc_newbase error condition, except return cpu_setproc_newbases's error (presumably EDEADLK) instead of prison_check's EPERM.

Re-roll; instead of the previous approach, just take the jail's cpuset if we
have the PRIV_SCHED_CPUSET privilege. This allows jail root to do the same, but
it can only be attaching to jails with a subset of its own widest available
mask.

Adds a test for it as well, ensuring that root can both attach to a pre-existing
jail with a completely disjoint cpuset and that it can widen its cpuset back and
only the overlap is retained post-attach.

kevans retitled this revision from jail: add a privilege for taking a jail's cpuset to jail: allow root to implicitly widen it's cpuset to attach.Feb 27 2021, 4:41 AM
kevans edited the summary of this revision. (Show Details)
kevans retitled this revision from jail: allow root to implicitly widen it's cpuset to attach to jail: allow root to implicitly widen its cpuset to attach.Feb 27 2021, 5:18 AM
This revision is now accepted and ready to land.Feb 27 2021, 5:45 AM

I got the reproducer working on another machine, and discovered that this is in-fact still insufficient. What's happening is that it's actually tripping in cpuset_modify()/cpuset_testupdate() -- jail(8) spawns off processes inside that adopt the caller's mask, so we have a child that attached with our limited set.

This is still a necessary change, it's just not the only one we need; I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.

(edit) e.g., https://people.freebsd.org/~kevans//jail-cpuset.diff which accomplishes this for jail(8) on start/stop.

I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.

(edit) e.g., https://people.freebsd.org/~kevans//jail-cpuset.diff which accomplishes this for jail(8) on start/stop.

That snippet may go better in jail(8)'s command.c, just before the jail_attach(2), since there's only the one place where all in-jail commands are run. Or perhaps after the jail_attach(2). I don't suppose it makes much difference, but that just limits it to the in-jail part of the creation.

This is still a necessary change, it's just not the only one we need; I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.

Makes sense for jail(8), but I'm not sure about jexec(8). Sure it's administrative, but it's so general-use that I wouldn't say you always want the jail's main set. Perhaps it could be an option there, or perhaps just let users run their own cpuset wrapper if they care about it.

But for jail(8), creating a jail without regard to the current "temporary" cpuset restrictions definitely seems the expected behavior.

This is still a necessary change, it's just not the only one we need; I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.

Makes sense for jail(8), but I'm not sure about jexec(8). Sure it's administrative, but it's so general-use that I wouldn't say you always want the jail's main set. Perhaps it could be an option there, or perhaps just let users run their own cpuset wrapper if they care about it.

Yeah, in hindsight jexec(8) should definitely be exempt from that -- it's mainly that you want daemons (or, more accurately, exec.start) to be run with the jail's cpuset.

Another idea I tossed back and forth is whether jail_set(2) should just accept a cpu mask for its root cpuset anyways, so we have a built-in way to set the jail's cpuset upon creation. If we weren't fighting exec.start (as we are in the report), then we're potentially racing other processes attaching to the jail and we have a clean way of just avoiding it entirely.