Each one is an arbitrary string associated with a jail. It can be set
upon jail creation or added/modified later:
    > jail -cm ... meta="tag1=value1 tag2=value2" env="configuration"
The values are not inherited from the parent jail.
A parent jail can read both metadata parameters, while a child jail can
read only env via security.jail.env sysctl.
The maximum size of meta or env per jail is controlled by the
global security.jail.meta_maxbufsize sysctl. Decreasing it does not
alter the existing meta information.
The set of allowed chars is controlled through the global
security.jail.meta_allowedchars sysctl.
Each metadata buffer can be handled as a set of key=value\n strings:
    > jail -cm ... meta="$(echo t1=v1; echo t2=v2)" env.1=one
    > jls meta.t2 env.1 meta.t1
While meta.t1= resets the value to an empty string, the meta.t1 without
the equal sign removes the given key.
Sponsored by:   SkunkWerks GmbHDetails
- Reviewers
- markj - kp - jamie 
- Group Reviewers
- Jails - security 
- Commits
- rG527027da391d: jail: Add meta and env parameters
 rG30e6e008bc06: jail: Add meta and env parameters
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
- Lint Not Applicable 
- Unit
- Tests Not Applicable 
Event Timeline
Right now I use the per-jail host.hostuuid to track jail identities and keep the associated state somewhere under /var/db/jails/${host_hostuuid}/ which is ugly.
This is the way to go IMHO. It would take more work to support it, but a lot less to use it.
If I understand the code correctly security.jail.meta_maxbufsize is the upper limit of the amount of additional kernel memory the jail can tie down with this feature. If a jail is allowed to create sub-jails should each of them be able to allocate metadata up to the global limit or should each of them get its own limit and the allocation be counted (recursively) against the parent limits?
I like the hierarchical limits.
I see a use case for both. Per-jail metadata just for the jail manager, and per-jail environment that the jail itself can see. If both exist, there's no need to set permissions.
There is a third usecase: Reporting *changes* to metadata from inside jail to the host (e.g. service health to a jail manager that ties into a load balancer). So there are (at least) three useful access permissions (host only, host write+jail read, host+jail write), but at some point it probably makes more sense to create a small tmpfs and nullfs mount it into the jail with the desired permissions.
Is this jail metadata visible inside the jail? For podman, read-only host-visible metadata can be added as annotations on the container but these are not visible inside the container's jail.
Yes, it is!
# podman run -it --rm ghcr.io/jlduran/freebsd:15.0 sh # sysctl security.jail.meta security.jail.meta: <do not exit>
Then inject the metadata:
jail -m jid=<jail-id> meta="www=v2"
Go back to the container, and check:
# sysctl security.jail.meta security.jail.meta: www=v2
There seems to be strong support for separating tags. I am
indifferent to either approach (I can do what I need either way)
but there is a strong tradeoff:
The current approach is simple but clean - (one writer updates all
tags  in a single "transaction"), and can be made compare-and-swap if
needed with minimal effort in future. Reading all tags again is
idempotent. This is sufficient for slow-moving / changing targets,
when most tags are applied at jail creation.
Using multiple tags mean its not possible to update all tags in a
single transaction, but it can be operated on more cleanly in a sysctl
style view. I don't know if this approach works well for string vs
integer data - are sysctls typed at all?
security.jail.meta.private.foo=bar
security.jail.meta.private.bar=baz
security.jail.meta.private.baz=1
Reading through https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
app.kubernetes.io/name
- The name of the application mysql string
app.kubernetes.io/instance
- A unique name identifying the instance of an application mysql-abcxyz string
How could we support this arrangement? While good OCI alignment was
not part of the brief, it would be desirable to support it.
How would the multi-tag view work if exposed via sysctl?
How do we incorporate "app.kubernetes.io/instance" as a single key, in a sysctl
style view?
What would happen when security.jail.meta.private.baz=1 changes type, from 1 to "bim" ?
What is the state of data if you read security.jail.meta.io.kubernetes.app,
returning 2 keys and values, if its not a single read transaction?
security.jail.meta.io.kubernetes.app.name
security.jail.meta.io.kubernetes.app.instance
They could have inconsistent values.
I'm not opposed to either implementation, or another one, I just think that
we introduce a lot of hidden complexity *for the kernel* rather than keep
internal simple and leave the mess to userland tools.
I would love to see future OCI/kubernetes needs accommodated. Comments welcomed
on how we could support this!
It seems there is an agreement to split it onto two buffers per jail: both are readable by parent jail, while only one is readable by a jail itself. The updated patch reflects this concept. For now metaext and metaint naming is used as external/internal concept. The naming is open for discussion.
Current state of the patch still does not insist on the structure of data in a buffer, it's up to users. As a demonstration of currently meant UI/UX, which obviously is not very straightforward, users could leverage the usual Unix-way tools for "tagging":
> jail -c name=tst persist metaext="$(echo role=www; echo version=20240824)" > jls -jtst metaext role=www version=20240824 > jls -jtst metaext | grep ^role= | cut -d= -f2 www > jls -jtst metaext | grep ^version= | cut -d= -f2 20240824 > jail -m name=tst metaext="$(jls -jtst metaext | sed 's/^version=.*$/version=20241120/')" > jls -jtst metaext role=www version=20241120
This is where we have the open space for the next ideas which would ease "tagging": sysctl-based, a separate program like jtags, and so on.
There is a third useful combination: (privileged) access host access only e.g. as a place to store API token or similar things without ever writing them to a file.
The problem having a fixed set of sysctls with with their designated access permissions is that all accesses have to go through a single writer and use the same data format because a single sysctl is just one string. This means anyone else is required closely coupled to that writer unless there is some standardised patch and query interface (e.g. JSON patch and JSON query) and I can't imagine anyone wanting to deal with that complexity. A flat key=value store avoids this by allowing multiple writers to lay exclusive claim to their part of the keyspace, but it doesn't absolve the need for access control.
Yeah, it seems that on the Jail Call of 26-Nov we came to a conclusion that for now we would keep it very simple like two buffers per jail managed from the user-land side, while keeping a wide spectrum of opportunities to extend it in the future having a more specific production need in mind. Thus, we can postpone thinking about extra complexity on the kernel side.
The buffer char set limiting is going to be added soon to this patch. As we discussed we could start with the better defaults for the sake of security and allow a very limited set of chars, while there is an opportunity to provide some configuration knobs in the future to widen the set if needed.
If I had managed to make that call, I have to say that at least the conclusion wouldn't have been unanimous. The single blob is indeed simpler to implement, but not simpler to use. Still, you mention extension, and that's a possibility. I'm thinking something like:
meta="foo=bar\0baz=bletch"
meta.foo="bar"
meta.baz="bletch"
Speaking of which, I like "meta" for per-jail private metadata, and "env" for environment readable inside the jail.
Yes, it's just a summarization of the meeting, the discussion continues.
The jails have a nice architecture, you know it :-), which allows adding a new parameter without even changing the user-land: jail(8), jls(8), etc. Of course, currently it's not meant to support such dynamic things like "meta.<anything>". I like the idea, especially when it's supported by others, to teach user-land to treat those meta buffers special way and introduce an extra interface like "meta.<key>=<value>". This is exactly one of the ways I envisioned for extending, and the string manipulations are kept out of the kernel. I would like to share my current understanding of how such UI/UX would look like to have a broader discussion:
# Users can still do whatever they want with a buffer: > jail -c meta="anything..." # Users can opt-in to use the structure proposed by new versions of jail(8) # and jls(8), what means to have a buffer sliced by a <delimiter> onto # "key=value" strings. # All the examples below assume that the <delimiter> is '\n' char. > jail -c persist meta.key1="value1" meta.key2="value2" > jls meta key1=value1 key2=value2 > jls meta.key2 value2 # Modify and append: > jail -m meta.key2="newvalue" meta.key3="v3" > jls meta key1=value1 key2=newvalue key3=v3 # We could support the existing syntax style for key=value pair removal: > jail -m nometa.key1 nometa.key3 > jls key2=newvalue
My current gut feeling is that \n newline char is the best delimiter to have the widest interoperability. And the kernel side, especially vfsopt subroutines, is like already based on a single NULL-terminated string. So that users can still work with a buffer as a human-readable text file and leverage key-based support of jail(8) and jls(8):
# Users may have metadata coming from an external source or a file: > cat a.conf k1=one k2=two > jail -m meta="$(cat a.conf)" > jls meta.k2 two
All the above seem to outline a possible next project/patch to extend the initial idea of "just a blob of text". What do you think?
Speaking of which, I like "meta" for per-jail private metadata, and "env" for environment readable inside the jail.
Probably, I'm missing something here, but have you meant it like "meta" param for parent jails and "env" param for the jail itself readable via sysctl security.jail.env? Like jail -c meta="tagging..." env="configuring...".
The latest update aggregates the recent discussions:
- Rename metaext/metaint to meta/env. meta is expected to be used for "tagging" and hidden from the jail. env is intended for "configuring" and readable by the jail through security.jail.env sysctl.
- The allowed chars for each buffer are very limited by default, it covers Base64, k=v\n format, and some extra bytes. It can be changed via security.jail.meta_allowedchars sysctl. For convenience (as it seems to me for now), setting it to an empty string allows everything.
- The tests and man page are upgraded respectively.
Why is this a kernel issue? Aside from NUL, because it preserves the C-string nature, allowed characters would seem only to be a concern on the user side.
Yes, that's a good point. Indeed, it feels like a user-land concern by default. If we agree to limit it by default for extra security, then having it in a single place covers all use cases and entry points: jail(8), flua, libjail, direct syscall, and so on -- it seems preferable to addressing it separately for each existing and future syscall consumer. What do you think? Does this approach offer additional benefits for us?
@jamie thanks for the comments. The thinking for restriction of characters is this:
- the most common expected use case is ascii character set, used in jails tools & shell scripts
- if the stored data can be binary, we introduce a risk of shell escapes in a place where root privileges are often used
- and every single program & tool needs to implement its own poorly secured, insufficiently tested, parsing,
- it is simpler & cleaner to restrict character set in 1 place, consistently
- that doesn't have to be in the kernel, it can be in userland, e.g. in a c library that can be used by jls(8) jail(8) jail(3lua) and libxo variants thereof.
There are currently a wide variety of strings in jail parameters, and for that matter many more elsewhere in the kernel. So far, we have gotten by with counting on administrators putting reasonable values in them.
while testing jls output with new tags, I noted that the manpage says " Each jail is represented by one row which contains space-separated values of the listed parameter". Which is not quite the case as many parameters could have an embedded new line, that's not trimmed out.
So for meta/env we could either:
- amend the manpage to match the existing reality (and then meta/env would just follow the existing inconsistent behaviour)
- swap all newline for spaces during jls line-based output (not in libxo mode)
I added a PR for this issue, https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283414
| sys/kern/kern_jailmeta.c | ||
|---|---|---|
| 426 | Who is the "someone else" who was quicker? This section is run with an exclusive lock held - you've even started the function by asserting that. You may unlock the prison, but you never unlock allprison_lock (which of course you shouldn't). | |
| sys/kern/kern_jailmeta.c | ||
|---|---|---|
| 426 | Thank you very much for your time. Indeed, it may seem non-obvious, the comment could be improved. Currently, the reasoning is as follows. Jail creation/update is under the giant allprison_lock, while the OSD framework only expects protection of its struct osd. In our case this is pr_osd which is protected by pr_mtx. Usually, we tend to avoid giant locks when there is an optimization demand, and there has been increased discussion about cases with a large number of jails. For example, new logic could be introduced for mass manipulation or mass reset of metadata, which is optimal to be done with per-jail locking (pr_mtx) without allprison_lock. It might be premature to address this now, but it's easier to implement it these days while the context is still fresh in my mind -- avoiding the potential need to analyze hard-to-diagnose defects caused by unexpectedly overridden metadata due to interference from new logic. And it's not too much complexity, a relatively simple repeat. What do you think? | |
| sys/kern/kern_jailmeta.c | ||
|---|---|---|
| 426 | Makes sense, if there are plans to go that direction. Yes, the comment could be improved, notably to mention the possible future use case. But I still don't see that use case. Surely, for something done with mass manipulation of data, the "mass" in question would be all jails or some subset, which would still require allprison_lock. I see perhaps a non-mass use case, where you would change the environment in some fashion other than jail_set, though I don't see offhand what that other method would be, or why you want it. I know that reading can happen outside the jail system (the sysctl to read the jail's own env), but not writing. | |
| sys/kern/kern_jailmeta.c | ||
|---|---|---|
| 426 | Thank you, the comment has been improved. | |
A few observations from actually running this:
If I look for an unknown key within meta or env (e.g. meta.bar when meta is "foo=x"), I get an empty string. I expected ENOENT like I would get for a regular jail parameter that doesn't exist.
There's no way to remove a key, only to add or set one. I can set it to an empty string, but they key is still there. I suppose that makes sense that setting a key to an empty value creates a key with an empty value, but I'd like a way to remove a key, short of reading the whole thing and writing back everything except the unwanted key. But I don't know a good way to specify that only working with a string value.
Would it be possible for the security.jail.env sysctl to expand the keys as sub-nodes, i.e. if a jail has env of "foo=x", then sysctl security.jail.env.foo would be "x".
Here's a possibility:
Saying meta.foo="" on the command line or jail.conf clears the contents of meta.foo, or adds it with empty contents. The jail_set(2) interface passes an empty string for the value. This already works.
Just saying meta.foo (no '=') on the command line or jail.conf removes the key if it exists. The jail_set(2) interface passes a null string for the value, which is typically an error.
Thank you very much for your time and consideration.
If I look for an unknown key within meta or env (e.g. meta.bar when meta is "foo=x"), I get an empty string. I expected ENOENT like I would get for a regular jail parameter that doesn't exist.
Technically speaking there are only two new parameters: meta and env. If meta2 is requested then ENOENT will be returned. While we work with the existing meta parameter but ask to parse its contents like meta.foo then we have, I guess, two options here: 1) terminate the whole process with ENOENT and return nothing as a result, even if there are other keys requested; or 2) return an empty string along with other keys requested and leave it to end users. From the use cases we know it seems that the second option (currently implemented) provides better UX. For example, some keys could be queried periodically and if one of them is missing then it may create inconvenience and more work on the end user side to deal with such case in order to retrieve at least those keys which are presented. What do you think?
Would it be possible for the security.jail.env sysctl to expand the keys as sub-nodes, i.e. if a jail has env of "foo=x", then sysctl security.jail.env.foo would be "x".
Yeah, dynamically created SYSCTL nodes could be used here, but it always reminds that the SYSCTL subsystem is not divided onto jails, i.e. if we alter the tree by adding new nodes it will be altered for everyone, and it leads to the dilemma of different security.jail.env.* trees per jail. For now, the security.jail.env is used as the only option to provide a child jail with its metadata. Ideally, it could be the same jls(8) interface using something like -j0, which we discussed in the past as a possibility: https://reviews.freebsd.org/D43476#993381. I would propose to invest into this way instead. What do you think about this?
Thank you for confirming this way, it seems more usable instead of meta.nofoo or nometa.foo. The latest patch version adds key removal using NULL value as a trigger.
I'm glad that turned out to be a workable option - thanks for adding it.
As to your comment that return an empty value for a nonexistent key instead of stopping the whole thing up with ENOENT, I see your point. But now I see another way: the NULL value, this time in the other direction. From the kernel interface side, it would be just the same as the setting direction, where receiving a NULL indicated attempted retrieval of a nonexistent key, while an empty string would only be for a key with an empty value.
It gets trickier on the user end though, where jls(8) only sometimes allows for a distinction. In particular, "jls -n meta.foo" or "jls -s meta.foo" could be made to show the same distinction between meta.foo="" and just meta.foo, but without the option there would be no way to tell. That's probably sufficient for those who care, though.
For sure, now it makes sense to communicate back respectively. Indeed, this part is a bit more tricky, and I had to cover jls and flua sides. Currently, the interface is as follows:
 # jail -c name=j persist meta="$(echo 1=one; echo 2=two)"
 # jls meta.1 meta.2
 one two
 # jls meta.3
 
 # jls meta.3 | hexdump -C
 00000000  0a                                                |.|
 00000001
 # jls -n meta.3 | hexdump -C
 00000000  6d 65 74 61 2e 33 0a                              |meta.3.|
 00000007
 # jls -s meta.3 | hexdump -C
 00000000  6d 65 74 61 2e 33 0a                              |meta.3.|
 00000007
 # jls --libxo json meta.3
 {"__version": "2", "jail-information": {"jail": [{}]}}
 # jls --libxo json -n meta.3
 {"__version": "2", "jail-information": {"jail": [{}]}}
 # jls --libxo json -s meta.3
 {"__version": "2", "jail-information": {"jail": [{}]}}
 # /usr/libexec/flua -ljail -e 'jid, r = jail.getparams("j", {"meta.3"}); print(r["meta.3"] == nil)'
 true
 # jls meta.1
 one
 # /usr/libexec/flua -ljail -e 'jail.setparams("j", {["meta.1"] = {}}, jail.UPDATE)'
 # jls meta.1 | hexdump -C
 00000000  0a                                                |.|
 00000001
 # jls meta.2
 two
 # /usr/libexec/flua -ljail -e 'jail.setparams("j", {["meta.2"] = false}, jail.UPDATE)'
 # jls meta.2 | hexdump -C
 00000000  0a                                                |.|
 00000001
# jail -m name=j meta.3= meta.4=four
# jls -s meta.missing meta.3 meta.4
meta.missing meta.3="" meta.4=fourEverything looks good, except that allowed chars part. That still looks like a kernel-level solution to a user-level non-problem.
Thanks for the confirmation.
I wanted to refresh the context around the char set limiting, at least for myself. The reasoning was that this new meta parameter is a bit exceptional among the existing ones. The existing parameters are usually in hands of an administrator/operator to provide initial configuration once, while the meta param is intended to be actively used by different orchestration-like and hosting-like solutions where its contents may easily come from user input through different web interfaces and it might be a potential attack vector to inject sequences which can be interpreted special way by shell scripts which are eventually run with root privileges in order to modify or create a jail.
It seems that a relatively simple alphanumeric-like content is expected to be used for the meta parameter. But solution designers or implementers might need to apply extra defense layers to make sure that unwanted sequences are not passed to the solution internals and do not impact its security side. It would require extra code or tools to be used to filter that considering that end user facing interfaces like web are usually not treated as the only defense layer and the other parts of a system are expected to verify inputs. So that, the first concern is that adopters might require extra work on their side, and the second one is that it might be left unnoticed until a security incident occurs.
The binary-like data such as keys or tokens can easily be mapped to the allowed by default base64 charset, what is a common practice for today’s kubernetes and container based solutions. Setting it to an empty string by $(sysctl security.jail.meta_allowedchars="") provides an easy way to opt-in for all chars allowed. And it would be considered as a conscious decision by users of the feature.
If it does not provide such limiting by default from the very beginning and a demand comes to add it later we will face the breaking change dilemma. And adding it being turned off by default is kind of the same like not adding it at all, having the expectation that majority of users might not take care of such detailed aspects. Having nice defaults is usually more practical than tons of best practice guides left aside without attention.
The code itself is not much about complexity, it's just the initialization part, the sysctl handler to read/modify it, and the simple “for” loop to check the meta parameter input. Having this context refreshed I wonder if it could be re-considered. What do you think, could this be given a second look?
Thanks once more for all your help and guidance so far.
I don't disagree that the way the character set is limited is efficient and doesn't add much to the code base. It's just the entire direction that I don't like. There are many places where sloppy text interpretation can cause problems. Environment variables come to mind - they too are passed between processes in the kernel. But user-level string injection has never been considered a kernel problem, and I don't see why it should start now. That's just not the kernel's job.
I you want to continue to push for this, I would suggest separating it from the main feature, adding another revision just for that. I like the general meta/env feature, but you'll just have to look elsewhere for buy-in to that one part.
I'm not sure what the point of the soft/hard size limits is. I imagine I could find it in the code, but could you explain it to me? Something more than the one comment about it, but less detail than the code itself :-).
Also regarding the size limits, how about making them jail parameters instead of global sysctl? Limits on what jails can do are generally a better fit per-jail than global.
Indeed, if we step back from the jailmeta context it is clear that throughout the history such cases are not covered, then it feels like an exclusive care just for the jailmeta. Please, consider the updated patch w/o this feature.
It was just thinking in advance and answering the question like what if we set meta_maxbufsize lower than existing meta buffers. And it's turned out that libjail users will not work then. The jailparam_get() prepares the receiving buffer space according to the information from sysctl, where security.jail.param.meta and security.jail.param.env provide the max size of meta and env jail parameters. As a result, if the actual meta/env buffer is bigger than the limit announced via those sysctlS then libjail users fail:
# sysctl security.jail.meta_maxbufsize=4 security.jail.meta_maxbufsize: 4096 -> 4 # jail -c name=j persist meta=123 # jls meta 123 # sysctl security.jail.meta_maxbufsize=2 security.jail.meta_maxbufsize: 4 -> 2 # jls meta jls: jail_get: Invalid argument # sysctl security.jail.meta_maxbufsize=4 security.jail.meta_maxbufsize: 2 -> 4 # jls meta 123
To help with that we report the soft limit, which should be treated like the biggest existing buffer length, but technically it is a little bit simpler. Providing only a part of a meta buffer which fits is not expected by end users. And changing the existing behavior of libjail is considered as a potential breaking change.
Also regarding the size limits, how about making them jail parameters instead of global sysctl? Limits on what jails can do are generally a better fit per-jail than global.
Yes, this idea was mentioned before, I guess it could be considered as a separate patch not to inflate and prolong the things. By the way, I've been considering it a bit more from other angles and a question pops up like "How could we change the limit for the prison0 without a sysctl?". Currently it can be changed (even upon boot) to set something bigger than a page-like by default (4K), or can be set to zero effectively disabling the meta feature. Without a sysctl or similar way it feels like that max number of jails which is hard-coded and not tunable, but tunable for child jails. Probably, that another patch could still use the security.jail.meta_maxbufsize sysctl so that prison0 can also be tuned as the starting biggest number, and newly added jail(8) param named the same way (meta_maxbufsize) could be used to control children downwards the jail tree. What do you think?
OK, I understand the problem with the maximum sizes. I think we need to pursue something in libjail that makes this seamless, but that can be later work.