Paths

Table of Contentst

Differential D28952

jail: allow root to implicitly widen its cpuset to attach
ClosedPublic
Actions

Authored by kevans on Feb 26 2021, 10:25 PM.

Details

Reviewers

jamie

Group Reviewers

Jails

Commits

rG60c4ec806dfd: jail: allow root to implicitly widen its cpuset to attach

Summary

The default behavior for attaching processes to jails is that the jail's
cpuset augments the attaching processes, so that it cannot be used to
escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to.

This is problematic when root needs to manage jails that have disjoint
sets with whatever process is attaching, as this would otherwise result
in a deadlock. Therefore, if we did not have an appropriate common
subset of cpus/domains for our new policy, we now allow the process to
simply take on the jail set *if* it has the privilege to widen its mask
anyways.

With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.

PR:             253724
Differential Revision:  https://reviews.freebsd.org/D28952

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kevans created this revision.Feb 26 2021, 10:25 PM

Herald added a subscriber: imp. · View Herald TranscriptFeb 26 2021, 10:25 PM

kevans requested review of this revision.Feb 26 2021, 10:25 PM

Harbormaster completed remote builds in B37421: Diff 84767.Feb 26 2021, 10:25 PM

Allow jailed root to also take its child jails' cpusets

Harbormaster completed remote builds in B37422: Diff 84768.Feb 26 2021, 10:36 PM

Might PRIV_SCHED_CPUSET be sufficient for this? If a process has the ability to explicitly expand the current cpu list, it makes sense for it to be able to implicitly do so when attaching to a jail.

In D28952#648255, @jamie wrote:

Might PRIV_SCHED_CPUSET be sufficient for this? If a process has the ability to explicitly expand the current cpu list, it makes sense for it to be able to implicitly do so when attaching to a jail.

Hmm... yeah, I think the semantics there make plenty of sense. I will switch to that one-

In D28952#648255, @jamie wrote:

Might PRIV_SCHED_CPUSET be sufficient for this? If a process has the ability to explicitly expand the current cpu list, it makes sense for it to be able to implicitly do so when attaching to a jail.

how does this, or the current approach apply to jails in jails (which are under a specific CPU set?

In D28952#648258, @me_igalic.co wrote:

how does this, or the current approach apply to jails in jails (which are under a specific CPU set?

They would remain restricted. Jail cpusets are applied to child jails as well, in that a child can only have the same CPUs as the parent (or fewer). So it's generally going to be the case that a process in a parent jail will already have CPUs available in the child jail.

The big that inspired this is a system where all non-jailed processes (from init on down) use different CPUs than the jails do. While something similar could be desirable in a jail, this patch wouldn't allow for it.

This might be a reason to keep PRIV_JAIL_CPUSET, and have it generally be available to virtual as well as real root.

In D28952#648262, @jamie wrote:

This might be a reason to keep PRIV_JAIL_CPUSET, and have it generally be available to virtual as well as real root.

Reading what I wrote, I don't know that there's any point to that. That would put us right back where we were before we made the original change to cpusets around jexec (D27298).

But how about this: first at least try using the intersection of the current and jail sets (whether or not currently jailed), and only as a EDEADLK fallback punt to just using the jailed set.

In D28952#648283, @jamie wrote:

In D28952#648262, @jamie wrote:

This might be a reason to keep PRIV_JAIL_CPUSET, and have it generally be available to virtual as well as real root.

Reading what I wrote, I don't know that there's any point to that. That would put us right back where we were before we made the original change to cpusets around jexec (D27298).

Note that PRIV_SCHED_CPUSET is available for jailed root, too,

But how about this: first at least try using the intersection of the current and jail sets (whether or not currently jailed), and only as a EDEADLK fallback punt to just using the jailed set.

Yeah, I think that makes sense -- we're talking about EDEADLK fallback if the priv is also set, so that it naturally gets cpuset down if it can?

In D28952#648302, @kevans wrote:

In D28952#648283, @jamie wrote:

But how about this: first at least try using the intersection of the current and jail sets (whether or not currently jailed), and only as a EDEADLK fallback punt to just using the jailed set.

Yeah, I think that makes sense -- we're talking about EDEADLK fallback if the priv is also set, so that it naturally gets cpuset down if it can?

Right. It looks like it's just a matter of moving the priv check inside the cpuset_setproc_newbase error condition, except return cpu_setproc_newbases's error (presumably EDEADLK) instead of prison_check's EPERM.

Re-roll; instead of the previous approach, just take the jail's cpuset if we
have the PRIV_SCHED_CPUSET privilege. This allows jail root to do the same, but
it can only be attaching to jails with a subset of its own widest available
mask.

Adds a test for it as well, ensuring that root can both attach to a pre-existing
jail with a completely disjoint cpuset and that it can widen its cpuset back and
only the overlap is retained post-attach.

Harbormaster completed remote builds in B37429: Diff 84782.Feb 27 2021, 4:40 AM

kevans retitled this revision from jail: add a privilege for taking a jail's cpuset to jail: allow root to implicitly widen it's cpuset to attach.Feb 27 2021, 4:41 AM

kevans edited the summary of this revision. (Show Details)

kevans retitled this revision from jail: allow root to implicitly widen it's cpuset to attach to jail: allow root to implicitly widen its cpuset to attach.Feb 27 2021, 5:18 AM

jamie accepted this revision as: jamie.Feb 27 2021, 5:45 AM

This revision is now accepted and ready to land.Feb 27 2021, 5:45 AM

I got the reproducer working on another machine, and discovered that this is in-fact still insufficient. What's happening is that it's actually tripping in cpuset_modify()/cpuset_testupdate() -- jail(8) spawns off processes inside that adopt the caller's mask, so we have a child that attached with our limited set.

This is still a necessary change, it's just not the only one we need; I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.

(edit) e.g., https://people.freebsd.org/~kevans//jail-cpuset.diff which accomplishes this for jail(8) on start/stop.

In D28952#648403, @kevans wrote:

I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.

(edit) e.g., https://people.freebsd.org/~kevans//jail-cpuset.diff which accomplishes this for jail(8) on start/stop.

That snippet may go better in jail(8)'s command.c, just before the jail_attach(2), since there's only the one place where all in-jail commands are run. Or perhaps after the jail_attach(2). I don't suppose it makes much difference, but that just limits it to the in-jail part of the creation.

In D28952#648403, @kevans wrote:

This is still a necessary change, it's just not the only one we need; I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.

Makes sense for jail(8), but I'm not sure about jexec(8). Sure it's administrative, but it's so general-use that I wouldn't say you always want the jail's main set. Perhaps it could be an option there, or perhaps just let users run their own cpuset wrapper if they care about it.

But for jail(8), creating a jail without regard to the current "temporary" cpuset restrictions definitely seems the expected behavior.

In D28952#648563, @jamie wrote:

In D28952#648403, @kevans wrote:

This is still a necessary change, it's just not the only one we need; I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.

Makes sense for jail(8), but I'm not sure about jexec(8). Sure it's administrative, but it's so general-use that I wouldn't say you always want the jail's main set. Perhaps it could be an option there, or perhaps just let users run their own cpuset wrapper if they care about it.

Yeah, in hindsight jexec(8) should definitely be exempt from that -- it's mainly that you want daemons (or, more accurately, exec.start) to be run with the jail's cpuset.

Another idea I tossed back and forth is whether jail_set(2) should just accept a cpu mask for its root cpuset anyways, so we have a built-in way to set the jail's cpuset upon creation. If we weren't fighting exec.start (as we are in the report), then we're potentially racing other processes attaching to the jail and we have a clean way of just avoiding it entirely.

Closed by commit rG60c4ec806dfd: jail: allow root to implicitly widen its cpuset to attach (authored by kevans). · Explain WhyMar 1 2021, 6:39 PM

This revision was automatically updated to reflect the committed changes.

kevans added a commit: rG60c4ec806dfd: jail: allow root to implicitly widen its cpuset to attach.