The default behavior for attaching processes to jails is that the jail's cpuset augments the attaching processes, so that it cannot be used to escalate a user's ability to take advantage of more CPUs than the administrator wanted them to. This is problematic when root needs to manage jails that have disjoint sets with whatever process is attaching, as this would otherwise result in a deadlock. Therefore, if we did not have an appropriate common subset of cpus/domains for our new policy, we now allow the process to simply take on the jail set *if* it has the privilege to widen its mask anyways. With the new logic, root can still usefully cpuset a process that attaches to a jail with the desire of maintaining the set it was given pre-attachment while still retaining the ability to manage child jails without jumping through hoops. PR: 253724 Differential Revision: https://reviews.freebsd.org/D28952
Details
- Reviewers
jamie - Group Reviewers
Jails - Commits
- rG60c4ec806dfd: jail: allow root to implicitly widen its cpuset to attach
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 37421 Build 34310: arc lint + arc unit
Event Timeline
Might PRIV_SCHED_CPUSET be sufficient for this? If a process has the ability to explicitly expand the current cpu list, it makes sense for it to be able to implicitly do so when attaching to a jail.
Hmm... yeah, I think the semantics there make plenty of sense. I will switch to that one-
how does this, or the current approach apply to jails in jails (which are under a specific CPU set?
They would remain restricted. Jail cpusets are applied to child jails as well, in that a child can only have the same CPUs as the parent (or fewer). So it's generally going to be the case that a process in a parent jail will already have CPUs available in the child jail.
The big that inspired this is a system where all non-jailed processes (from init on down) use different CPUs than the jails do. While something similar could be desirable in a jail, this patch wouldn't allow for it.
This might be a reason to keep PRIV_JAIL_CPUSET, and have it generally be available to virtual as well as real root.
Reading what I wrote, I don't know that there's any point to that. That would put us right back where we were before we made the original change to cpusets around jexec (D27298).
But how about this: first at least try using the intersection of the current and jail sets (whether or not currently jailed), and only as a EDEADLK fallback punt to just using the jailed set.
Note that PRIV_SCHED_CPUSET is available for jailed root, too,
But how about this: first at least try using the intersection of the current and jail sets (whether or not currently jailed), and only as a EDEADLK fallback punt to just using the jailed set.
Yeah, I think that makes sense -- we're talking about EDEADLK fallback if the priv is also set, so that it naturally gets cpuset down if it can?
Right. It looks like it's just a matter of moving the priv check inside the cpuset_setproc_newbase error condition, except return cpu_setproc_newbases's error (presumably EDEADLK) instead of prison_check's EPERM.
Re-roll; instead of the previous approach, just take the jail's cpuset if we
have the PRIV_SCHED_CPUSET privilege. This allows jail root to do the same, but
it can only be attaching to jails with a subset of its own widest available
mask.
Adds a test for it as well, ensuring that root can both attach to a pre-existing
jail with a completely disjoint cpuset and that it can widen its cpuset back and
only the overlap is retained post-attach.
I got the reproducer working on another machine, and discovered that this is in-fact still insufficient. What's happening is that it's actually tripping in cpuset_modify()/cpuset_testupdate() -- jail(8) spawns off processes inside that adopt the caller's mask, so we have a child that attached with our limited set.
This is still a necessary change, it's just not the only one we need; I suspect both jail(8) and jexec(8) should try to switch to their root's id before running commands in a jail, so that administratively spawned stuff ends up with the jail's full mask.
(edit) e.g., https://people.freebsd.org/~kevans//jail-cpuset.diff which accomplishes this for jail(8) on start/stop.
That snippet may go better in jail(8)'s command.c, just before the jail_attach(2), since there's only the one place where all in-jail commands are run. Or perhaps after the jail_attach(2). I don't suppose it makes much difference, but that just limits it to the in-jail part of the creation.
Makes sense for jail(8), but I'm not sure about jexec(8). Sure it's administrative, but it's so general-use that I wouldn't say you always want the jail's main set. Perhaps it could be an option there, or perhaps just let users run their own cpuset wrapper if they care about it.
But for jail(8), creating a jail without regard to the current "temporary" cpuset restrictions definitely seems the expected behavior.
Yeah, in hindsight jexec(8) should definitely be exempt from that -- it's mainly that you want daemons (or, more accurately, exec.start) to be run with the jail's cpuset.
Another idea I tossed back and forth is whether jail_set(2) should just accept a cpu mask for its root cpuset anyways, so we have a built-in way to set the jail's cpuset upon creation. If we weren't fighting exec.start (as we are in the report), then we're potentially racing other processes attaching to the jail and we have a clean way of just avoiding it entirely.