Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped - Build Status
Buildable 60251 Build 57135: arc lint + arc unit
Event Timeline
This is the followup of https://reviews.freebsd.org/D47107.
I intentionally omitted the CTLFLAG_PRISON for these two sysctl during their introduction due to the docs say that such sysctl can be written to by processes in jail(2). As long as this couple is read-only I did not add the flag.
For now, after a quick code overview, it looks that this flag is not about writable, it's about "value per jail". I think that we could tune the docs and actually add this flag for a bunch of existing sysctl. A very quick counting on my side yields 18 of them like .jailed, .vnet, etc.
This very short patch is a discussion starter, I can extend it to cover more sysctl and the docs as well. What do you think?
Another point is how to deal with sysctl which are "per jail" and vnet-related as well. I guess, they could end up with both flags?
sysctl: Add missing CTLFLAG_PRISON to security.jail.children.*
@igoro The subsystem should be jail, so jail: Add missing CTLFLAG_PRISON to security.jail.children.* . Also it is important to explain why.
Also it is better to add commit meta message
Fixes: ab0841bdbe84 jail: expose children.max and children.cur via sysctl
I guess no. Actually a VNET instance can be shared between multiple jail prisons, e.g., legacy jails those share vnet0.
I like the idea of flagging sysctls that hold values that can be read per jail. If jail(2) says otherwise, we can change that to match.
I agree with your code review.
I like the idea. The only inconvenience I see is that this flagging has to be done by hand, and there is no real automatic way to prevent wrong flagging with CTLFLAG_PRISON (e.g., conditional on the use of curthread->td_cred->cr_prison), but these are very minor, and certainly not worse than the current situation.
This hasn't been done before simply because the CTLFLAG_PRISON was never actually checked if not trying to write to the corresponding sysctl. Now that D47107 has been committed, this change would make a useful difference IMO.
I had to double check the things. I've delved into the details and discovered that the existing code does not allow to use CTLFLAG_PRISON as a flag which means that a variable varies per jail. I've stumbled upon the *allow* variables like these ones:
SYSCTL_PROC(_security_jail, OID_AUTO, set_hostname_allowed, CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, PR_ALLOW_SET_HOSTNAME, sysctl_jail_default_allow, "I", "Processes in jail can set their hostnames (deprecated)"); SYSCTL_PROC(_security_jail, OID_AUTO, mlock_allowed, CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, PR_ALLOW_MLOCK, sysctl_jail_default_allow, "I", "Processes in jail can lock/unlock physical pages in memory");
If we add CTLFLAG_PRISON to such sysctlS it will add some mess to the existing interface and behavior, and can add headache for future code. Having such flag added to these variables makes a feeling they can be changed but instead of a correct message that "Operation not permitted" we get something like this:
root@somejail:/ # sysctl security.jail.set_hostname_allowed=1 security.jail.set_hostname_allowed: 0 -> 0
Currently, I see two branches here:
Plan A. If sysctl -J flag was meant to be used to list only variables which can be fiddled by prisoned roots then we keep it as is and probably re-phrase the sysctl man page a bit to make it clear.
Plan B. If the idea of sysctl -J flag is more about listing all sysctl variables which are allocated/meant per jail and may vary per jail, whether they are read-only or writable, then it seems to be better to add a brand new flag for this work. There is a strong feeling that we must keep the existing one named as is due to a lot of mentioning around in the code, docs, articles, etc. And the new one could be something like CTLFLAG_PERJAIL. And, probably, that's a good idea to keep this possibility to have a sysctl variable which is allocated per jail but is not allowed to be changed by prisoned roots.
What do you think?
P.S. Sorry, I've missed the original discussion of the -J sysctl flag, it seems this discussion should have happened there.
So, the problem you have with these sysctl knobs is that they are supposed to be writable from the host but not from jails, while their value is per jail. That comes from the fact that they (unfortunately, IMHO) in fact serve two different purposes:
- They allow to set the default for their corresponding allow bit when creating top-level jails.
- They allow to retrieve the current setting for any jail.
In passing, these knobs are deprecated (in favor of jail parameters for 2; for 1 this is unclear to me), so I would not try too hard to cater to them (e.g., they could be omitted, if need be).
Do you know if there are other knobs with this exact situation?
As food for thought, in order to be more flexible in the future and easier on coders, I'd consider introducing new flags such as CTLFLAG_PRISON_RD and CTLFLAG_PRISON_RW, following the precedents of CTLFLAG_RD and CTLFLAG_RW, and CTLFLAG_CAPRD and CTLFLAG_CAPRW (except for the last underscore), with obvious meanings, and in addition reserve CTLFLAG_PRISON to only indicate knobs having per-jail value. I'm not sure if there will be actual cases where CTLFLAG_PRISON_RW without CTLFLAG_PRISON would make sense, but even if none materializes, I find it easier to reason about these clearly delineated flags than with conflated ones. Even a combination like CTLFLAG_PRISON alone, without neither CTLFLAG_PRISONRD nor CTLFLAG_PRISONRW, which doesn't make sense today, might be leveraged in the future.
Finally, whatever the outcome, and though I read you on compatibility, I'd still consider renaming CTLFLAG_PRISON to something more meaningful. I only see 18 occurences of it in the current tree, so its usage is low enough that changing it shouldn't be that much of a burden and only on a small amount of people. And I think it is especially desirable that we are not backwards compatible if the meaning of CTLFLAG_PRISON is to be changed (as in the above proposal).
It's worse than that - it actually changes prison0's view of that bit.
These particular sysctls are an ugly back-compat hack. In the old world, jail permissions were global; the host could set them and jails could read them. Now such permissions are per-jail, but for backward compatibility they do two different things: For prison0, they read or write the default value of the corresponding permission bit, but only for jails that are created via the old jail(2) interface. Inside a prison, they read that prison's corresponding permission bit. They really ought to be deprecated, along with jail(2) itself.
Aside from those, almost all sysctls that refer to the current prison are either CTLFLAG_RD or CTLFLAG_PRISON. A quick search of req->td->td_ucred->cr_prison yields only security.mac.do.rules. There may be others, since perhaps not all sysctl functions get cr_prison in a single statements like that. But those (and the back-compat ones) can be fixed to explicitly check for the prison in the write case, instead of implicitly as the comment in sys_jail_default_allow notes.
With such a fix, CTLFLAG_PRISON can be added to them (it can already be added to the CTLFLAG_RD ones) without needing a second or third CTLFLAG_PRISON_XXX flag.
I intend to have security.mac.do.rules as writable from both the host and jails, so this one will be no exception (CTLFLAG_RW|CTLFLAG_PRISON). In a series of not yet reviewed changes, I added CTLFLAG_PRISON to this knob. However, I ask that you don't do that now, as the current sysctl implementation is wrong for writes from jails (also fixed in the series to be published for review).
With such a fix, CTLFLAG_PRISON can be added to them (it can already be added to the CTLFLAG_RD ones) without needing a second or third CTLFLAG_PRISON_XXX flag.
The most common cases are allowing the same accesses from jails as from the host, or allowing just read access from jails and read-write from the host and they're indeed covered by the existing flags. However, using CTLFLAG_PRISON for read-only per-jail knobs changes that situation. I'd prefer to have separate flags as evoked above, but if the only offending sysctls are those we considered, I'm OK to defer that and have an explicit check in their implementation.