Details

Reviewers

markj
kp

Group Reviewers

Jails
security

Summary

Each one is an arbitrary string associated with a jail. It can be set
upon jail creation or added/modified later:

    > jail -cm ... meta="tag1=value1 tag2=value2" env="configuration"

The values are not inherited from the parent jail.

A parent jail can read both metadata parameters, while a child jail can
read only env via security.jail.env sysctl.

The maximum size of meta or env per jail is controlled by the
global security.jail.meta_maxbufsize sysctl. Decreasing it does not
alter the existing meta information.

The set of allowed chars is controlled through the global
security.jail.meta_allowedchars sysctl.

Each metadata buffer can be handled as a set of key=value\n strings:

    > jail -cm ... meta="$(echo t1=v1; echo t2=v2)" env.1=one
    > jls meta.t2 env.1 meta.t1

Sponsored by:   SkunkWerks GmbH

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 61148
Build 58032: arc lint + arc unit

Event Timeline

igoro created this revision.Nov 19 2024, 10:40 AM

Herald added subscribers: olce, imp. · View Herald TranscriptNov 19 2024, 10:40 AM

igoro requested review of this revision.Nov 19 2024, 10:40 AM

Harbormaster completed remote builds in B60681: Diff 146682.Nov 19 2024, 10:40 AM

Currently, there is no accessible way to attach metadata to jails. This is commonly used elsewhere, for example in Kubernetes, to allow non-unique properties to enrich load balancers, schedulers, and volume provisioners to make more informed decisions.

The initial motivation was to be able to have an external load balancer to update its pool of endpoints based on jail metadata - for example, as a web application is updated from v1 to v2, the load balancer would query the “www” tags, and be able to grow its pool until there are 3 new v2 jails, and then remove the older v1 versions.

While there are already other mechanisms to make metadata available inside the jail, these rely on shared memory, or files, and are not directly tied to the jail’s lifecycle.

The Object Specific Data approach is already used with jails, to provide the zfs-specific dataset names, and jailed on/off properties. This implementation follows the same approach allowing generic metadata, accessible from outside, inside, and via the existing jail and flua tools.

From the technical perspective, there are the following open topics:

It seems that _security_jail sysctl node ref exposure could be extracted as a separate patch if confirmed.
There is a question of whether the kernel should limit the set of allowed characters within the meta buffer, or if it's better to keep such policies outside of the kernel business.
The current state of the patch allows reading meta by any user within a jail. Do we want to disable it by default and add something like allow.read_meta or allow.metadata parameter to control it per jail?

kp added a reviewer: Jails.Nov 19 2024, 10:44 AM

igoro edited the summary of this revision. (Show Details)Nov 19 2024, 10:54 AM

igoro edited the summary of this revision. (Show Details)

I can see the appeal of this to jail managers, but having just a single meta sysctl per jail implies that it has to be "owned" by a single writer. This requires everything wanting to make use of this new feature to be tightly coupled to the writer. A flat list of key=value pairs (instead of a single value) would allow multiple users of this feature per jail e.g. multiple helper commands preparing just one aspect of a jail e.g. dynamic devfs ruleset loading, network setup/teardown, storage provisioning.

If I understand the code correctly security.jail.meta_maxbufsize is the upper limit of the amount of additional kernel memory the jail can tie down with this feature. If a jail is allowed to create sub-jails should each of them be able to allocate metadata up to the global limit or should each of them get its own limit and the allocation be counted (recursively) against the parent limits?

igoro mentioned this in D47671: kyua: Support is_exclusive metadata coming from an ATF test case.Nov 19 2024, 1:16 PM

zlei added a subscriber: zlei.Nov 19 2024, 2:59 PM

Just tossing in a general +1, I haven't taken any time to review this at the moment; I've thought about a similar notion in the past (jail tags) to be able to do some role-based querying of jails.

jlduran added a subscriber: jlduran.Nov 19 2024, 5:02 PM

Another +1.
Testing here using podman to inject metadata into containers, something similar to the original motivation.
Regarding the restrictions for reading the sysctl from inside the jail, I have no strong opinions, and fully trust your judgement (based on current names, most likely allow.read_meta, as I would interpret allow.metadata as being able to set metadata from inside the jail).

In D47668#1087166, @kevans wrote:

Just tossing in a general +1, I haven't taken any time to review this at the moment; I've thought about a similar notion in the past (jail tags) to be able to do some role-based querying of jails.

Right now I use the per-jail host.hostuuid to track jail identities and keep the associated state somewhere under /var/db/jails/${host_hostuuid}/ which is ugly.

In D47668#1087018, @crest_freebsd_rlwinm.de wrote:

A flat list of key=value pairs (instead of a single value) would allow multiple users of this feature per jail e.g. multiple helper commands preparing just one aspect of a jail e.g. dynamic devfs ruleset loading, network setup/teardown, storage provisioning.

This is the way to go IMHO. It would take more work to support it, but a lot less to use it.

If I understand the code correctly security.jail.meta_maxbufsize is the upper limit of the amount of additional kernel memory the jail can tie down with this feature. If a jail is allowed to create sub-jails should each of them be able to allocate metadata up to the global limit or should each of them get its own limit and the allocation be counted (recursively) against the parent limits?

I like the hierarchical limits.

In D47668#1087001, @igoro wrote:

The current state of the patch allows reading meta by any user within a jail. Do we want to disable it by default and add something like allow.read_meta or allow.metadata parameter to control it per jail?

I see a use case for both. Per-jail metadata just for the jail manager, and per-jail environment that the jail itself can see. If both exist, there's no need to set permissions.

In D47668#1087217, @jamie wrote:

In D47668#1087001, @igoro wrote:

The current state of the patch allows reading meta by any user within a jail. Do we want to disable it by default and add something like allow.read_meta or allow.metadata parameter to control it per jail?

I see a use case for both. Per-jail metadata just for the jail manager, and per-jail environment that the jail itself can see. If both exist, there's no need to set permissions.

There is a third usecase: Reporting *changes* to metadata from inside jail to the host (e.g. service health to a jail manager that ties into a load balancer). So there are (at least) three useful access permissions (host only, host write+jail read, host+jail write), but at some point it probably makes more sense to create a small tmpfs and nullfs mount it into the jail with the desired permissions.

In D47668#1087193, @jlduran wrote:

Another +1.
Testing here using podman to inject metadata into containers, something similar to the original motivation.
Regarding the restrictions for reading the sysctl from inside the jail, I have no strong opinions, and fully trust your judgement (based on current names, most likely allow.read_meta, as I would interpret allow.metadata as being able to set metadata from inside the jail).

Is this jail metadata visible inside the jail? For podman, read-only host-visible metadata can be added as annotations on the container but these are not visible inside the container's jail.

In D47668#1087469, @dfr wrote:

In D47668#1087193, @jlduran wrote:

Another +1.
Testing here using podman to inject metadata into containers, something similar to the original motivation.
Regarding the restrictions for reading the sysctl from inside the jail, I have no strong opinions, and fully trust your judgement (based on current names, most likely allow.read_meta, as I would interpret allow.metadata as being able to set metadata from inside the jail).

Is this jail metadata visible inside the jail? For podman, read-only host-visible metadata can be added as annotations on the container but these are not visible inside the container's jail.

Yes, it is!

# podman run -it --rm ghcr.io/jlduran/freebsd:15.0 sh
# sysctl security.jail.meta
security.jail.meta:
<do not exit>

Then inject the metadata:

jail -m jid=<jail-id> meta="www=v2"

Go back to the container, and check:

# sysctl security.jail.meta
security.jail.meta: www=v2

There seems to be strong support for separating tags. I am
indifferent to either approach (I can do what I need either way)
but there is a strong tradeoff:

The current approach is simple but clean - (one writer updates all
tags in a single "transaction"), and can be made compare-and-swap if
needed with minimal effort in future. Reading all tags again is
idempotent. This is sufficient for slow-moving / changing targets,
when most tags are applied at jail creation.

Using multiple tags mean its not possible to update all tags in a
single transaction, but it can be operated on more cleanly in a sysctl
style view. I don't know if this approach works well for string vs
integer data - are sysctls typed at all?

security.jail.meta.private.foo=bar
security.jail.meta.private.bar=baz
security.jail.meta.private.baz=1

Reading through https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/

app.kubernetes.io/name

The name of the application mysql string

app.kubernetes.io/instance

A unique name identifying the instance of an application mysql-abcxyz string

How could we support this arrangement? While good OCI alignment was
not part of the brief, it would be desirable to support it.

How would the multi-tag view work if exposed via sysctl?

How do we incorporate "app.kubernetes.io/instance" as a single key, in a sysctl
style view?

What would happen when security.jail.meta.private.baz=1 changes type, from 1 to "bim" ?

What is the state of data if you read security.jail.meta.io.kubernetes.app,
returning 2 keys and values, if its not a single read transaction?

security.jail.meta.io.kubernetes.app.name
security.jail.meta.io.kubernetes.app.instance

They could have inconsistent values.

I'm not opposed to either implementation, or another one, I just think that
we introduce a lot of hidden complexity *for the kernel* rather than keep
internal simple and leave the mess to userland tools.

I would love to see future OCI/kubernetes needs accommodated. Comments welcomed
on how we could support this!

v2: Split onto two buffers per jail: external & internal

Harbormaster completed remote builds in B60762: Diff 146915.Nov 25 2024, 5:22 PM

It seems there is an agreement to split it onto two buffers per jail: both are readable by parent jail, while only one is readable by a jail itself. The updated patch reflects this concept. For now metaext and metaint naming is used as external/internal concept. The naming is open for discussion.

Current state of the patch still does not insist on the structure of data in a buffer, it's up to users. As a demonstration of currently meant UI/UX, which obviously is not very straightforward, users could leverage the usual Unix-way tools for "tagging":

> jail -c name=tst persist metaext="$(echo role=www; echo version=20240824)"

> jls -jtst metaext
role=www
version=20240824

> jls -jtst metaext | grep ^role= | cut -d= -f2
www

> jls -jtst metaext | grep ^version= | cut -d= -f2
20240824

> jail -m name=tst metaext="$(jls -jtst metaext | sed 's/^version=.*$/version=20241120/')"

> jls -jtst metaext
role=www
version=20241120

This is where we have the open space for the next ideas which would ease "tagging": sysctl-based, a separate program like jtags, and so on.

crest_freebsd_rlwinm.de added a comment.Nov 26 2024, 5:50 PM

This comment was removed by crest_freebsd_rlwinm.de.

igoro edited the summary of this revision. (Show Details)Nov 26 2024, 5:58 PM

In D47668#1088649, @igoro wrote:

It seems there is an agreement to split it onto two buffers per jail: both are readable by parent jail, while only one is readable by a jail itself. The updated patch reflects this concept. For now metaext and metaint naming is used as external/internal concept. The naming is open for discussion.

There is a third useful combination: (privileged) access host access only e.g. as a place to store API token or similar things without ever writing them to a file.

The problem having a fixed set of sysctls with with their designated access permissions is that all accesses have to go through a single writer and use the same data format because a single sysctl is just one string. This means anyone else is required closely coupled to that writer unless there is some standardised patch and query interface (e.g. JSON patch and JSON query) and I can't imagine anyone wanting to deal with that complexity. A flat key=value store avoids this by allowing multiple writers to lay exclusive claim to their part of the keyspace, but it doesn't absolve the need for access control.

In D47668#1089218, @crest_freebsd_rlwinm.de wrote:

In D47668#1088649, @igoro wrote:

It seems there is an agreement to split it onto two buffers per jail: both are readable by parent jail, while only one is readable by a jail itself. The updated patch reflects this concept. For now metaext and metaint naming is used as external/internal concept. The naming is open for discussion.

There is a third useful combination: (privileged) access host access only e.g. as a place to store API token or similar things without ever writing them to a file.

The problem having a fixed set of sysctls with with their designated access permissions is that all accesses have to go through a single writer and use the same data format because a single sysctl is just one string. This means anyone else is required closely coupled to that writer unless there is some standardised patch and query interface (e.g. JSON patch and JSON query) and I can't imagine anyone wanting to deal with that complexity. A flat key=value store avoids this by allowing multiple writers to lay exclusive claim to their part of the keyspace, but it doesn't absolve the need for access control.

Yeah, it seems that on the Jail Call of 26-Nov we came to a conclusion that for now we would keep it very simple like two buffers per jail managed from the user-land side, while keeping a wide spectrum of opportunities to extend it in the future having a more specific production need in mind. Thus, we can postpone thinking about extra complexity on the kernel side.

The buffer char set limiting is going to be added soon to this patch. As we discussed we could start with the better defaults for the sake of security and allow a very limited set of chars, while there is an opportunity to provide some configuration knobs in the future to widen the set if needed.

In D47668#1089529, @igoro wrote:

Yeah, it seems that on the Jail Call of 26-Nov we came to a conclusion that for now we would keep it very simple like two buffers per jail managed from the user-land side, while keeping a wide spectrum of opportunities to extend it in the future having a more specific production need in mind. Thus, we can postpone thinking about extra complexity on the kernel side.

If I had managed to make that call, I have to say that at least the conclusion wouldn't have been unanimous. The single blob is indeed simpler to implement, but not simpler to use. Still, you mention extension, and that's a possibility. I'm thinking something like:

meta="foo=bar\0baz=bletch"
meta.foo="bar"
meta.baz="bletch"

Speaking of which, I like "meta" for per-jail private metadata, and "env" for environment readable inside the jail.

In D47668#1090028, @jamie wrote:

In D47668#1089529, @igoro wrote:

Yeah, it seems that on the Jail Call of 26-Nov we came to a conclusion that for now we would keep it very simple like two buffers per jail managed from the user-land side, while keeping a wide spectrum of opportunities to extend it in the future having a more specific production need in mind. Thus, we can postpone thinking about extra complexity on the kernel side.

If I had managed to make that call, I have to say that at least the conclusion wouldn't have been unanimous. The single blob is indeed simpler to implement, but not simpler to use. Still, you mention extension, and that's a possibility. I'm thinking something like:

meta="foo=bar\0baz=bletch"
meta.foo="bar"
meta.baz="bletch"

Yes, it's just a summarization of the meeting, the discussion continues.

The jails have a nice architecture, you know it :-), which allows adding a new parameter without even changing the user-land: jail(8), jls(8), etc. Of course, currently it's not meant to support such dynamic things like "meta.<anything>". I like the idea, especially when it's supported by others, to teach user-land to treat those meta buffers special way and introduce an extra interface like "meta.<key>=<value>". This is exactly one of the ways I envisioned for extending, and the string manipulations are kept out of the kernel. I would like to share my current understanding of how such UI/UX would look like to have a broader discussion:

# Users can still do whatever they want with a buffer:
> jail -c meta="anything..."

# Users can opt-in to use the structure proposed by new versions of jail(8)
# and jls(8), what means to have a buffer sliced by a <delimiter> onto
# "key=value" strings.
# All the examples below assume that the <delimiter> is '\n' char.
> jail -c persist meta.key1="value1" meta.key2="value2"
> jls meta
key1=value1
key2=value2
> jls meta.key2
value2

# Modify and append:
> jail -m meta.key2="newvalue" meta.key3="v3"
> jls meta
key1=value1
key2=newvalue
key3=v3

# We could support the existing syntax style for key=value pair removal:
> jail -m nometa.key1 nometa.key3
> jls
key2=newvalue

My current gut feeling is that \n newline char is the best delimiter to have the widest interoperability. And the kernel side, especially vfsopt subroutines, is like already based on a single NULL-terminated string. So that users can still work with a buffer as a human-readable text file and leverage key-based support of jail(8) and jls(8):

# Users may have metadata coming from an external source or a file:
> cat a.conf
k1=one
k2=two
> jail -m meta="$(cat a.conf)"
> jls meta.k2
two

All the above seem to outline a possible next project/patch to extend the initial idea of "just a blob of text". What do you think?

Speaking of which, I like "meta" for per-jail private metadata, and "env" for environment readable inside the jail.

Probably, I'm missing something here, but have you meant it like "meta" param for parent jails and "env" param for the jail itself readable via sysctl security.jail.env? Like jail -c meta="tagging..." env="configuring...".

igoro retitled this revision from jail: Add meta parameter to jail: Add meta and env parameters.Fri, Dec 13, 5:41 PM

Rename to meta/env, add meta_allowedchars.

Harbormaster completed remote builds in B61148: Diff 147930.Fri, Dec 13, 5:58 PM

igoro added a reviewer: security.Fri, Dec 13, 6:00 PM

igoro edited the summary of this revision. (Show Details)

The latest update aggregates the recent discussions:

Rename metaext/metaint to meta/env. meta is expected to be used for "tagging" and hidden from the jail. env is intended for "configuring" and readable by the jail through security.jail.env sysctl.
The allowed chars for each buffer are very limited by default, it covers Base64, k=v\n format, and some extra bytes. It can be changed via security.jail.meta_allowedchars sysctl. For convenience (as it seems to me for now), setting it to an empty string allows everything.
The tests and man page are upgraded respectively.

In D47668#1095796, @igoro wrote:

The allowed chars for each buffer are very limited by default, it covers Base64, k=v\n format, and some extra bytes. It can be changed via security.jail.meta_allowedchars sysctl. For convenience (as it seems to me for now), setting it to an empty string allows everything.

Why is this a kernel issue? Aside from NUL, because it preserves the C-string nature, allowed characters would seem only to be a concern on the user side.

In D47668#1095800, @jamie wrote:

In D47668#1095796, @igoro wrote:

The allowed chars for each buffer are very limited by default, it covers Base64, k=v\n format, and some extra bytes. It can be changed via security.jail.meta_allowedchars sysctl. For convenience (as it seems to me for now), setting it to an empty string allows everything.

Why is this a kernel issue? Aside from NUL, because it preserves the C-string nature, allowed characters would seem only to be a concern on the user side.

Yes, that's a good point. Indeed, it feels like a user-land concern by default. If we agree to limit it by default for extra security, then having it in a single place covers all use cases and entry points: jail(8), flua, libjail, direct syscall, and so on -- it seems preferable to addressing it separately for each existing and future syscall consumer. What do you think? Does this approach offer additional benefits for us?

In D47668#1095802, @igoro wrote:

In D47668#1095800, @jamie wrote:

In D47668#1095796, @igoro wrote:

The allowed chars for each buffer are very limited by default, it covers Base64, k=v\n format, and some extra bytes. It can be changed via security.jail.meta_allowedchars sysctl. For convenience (as it seems to me for now), setting it to an empty string allows everything.

Why is this a kernel issue? Aside from NUL, because it preserves the C-string nature, allowed characters would seem only to be a concern on the user side.

Yes, that's a good point. Indeed, it feels like a user-land concern by default. If we agree to limit it by default for extra security, then having it in a single place covers all use cases and entry points: jail(8), flua, libjail, direct syscall, and so on -- it seems preferable to addressing it separately for each existing and future syscall consumer. What do you think? Does this approach offer additional benefits for us?

@jamie thanks for the comments. The thinking for restriction of characters is this:

the most common expected use case is ascii character set, used in jails tools & shell scripts
if the stored data can be binary, we introduce a risk of shell escapes in a place where root privileges are often used
and every single program & tool needs to implement its own poorly secured, insufficiently tested, parsing,
it is simpler & cleaner to restrict character set in 1 place, consistently
that doesn't have to be in the kernel, it can be in userland, e.g. in a c library that can be used by jls(8) jail(8) jail(3lua) and libxo variants thereof.

There are currently a wide variety of strings in jail parameters, and for that matter many more elsewhere in the kernel. So far, we have gotten by with counting on administrators putting reasonable values in them.

guest-patmaddox added a subscriber: guest-patmaddox.Wed, Dec 18, 10:10 AM

Each buffer can be handled as a set of key=value\n strings

Harbormaster completed remote builds in B61237: Diff 148200.Wed, Dec 18, 4:11 PM

igoro edited the summary of this revision. (Show Details)Wed, Dec 18, 4:14 PM

libjail: Correctly differentiate no<boolparam> from keyvalue-based ones

Harbormaster completed remote builds in B61242: Diff 148205.Wed, Dec 18, 5:22 PM

Fix jm_h_cut_occurrences() logic

Harbormaster completed remote builds in B61247: Diff 148213.Wed, Dec 18, 9:01 PM

while testing jls output with new tags, I noted that the manpage says " Each jail is represented by one row which contains space-separated values of the listed parameter". Which is not quite the case as many parameters could have an embedded new line, that's not trimmed out.

So for meta/env we could either:

amend the manpage to match the existing reality (and then meta/env would just follow the existing inconsistent behaviour)
swap all newline for spaces during jls line-based output (not in libxo mode)

I added a PR for this issue, https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283414

Add keyvalue_contention test case

Harbormaster completed remote builds in B61260: Diff 148240.Thu, Dec 19, 3:01 PM

Tiny code improvements, no functional change

Harbormaster completed remote builds in B61290: Diff 148298.Fri, Dec 20, 9:14 PM

Make keyvalue_contention test case more accurate

Harbormaster completed remote builds in B61291: Diff 148299.Fri, Dec 20, 9:24 PM

jail: Add meta and env parameters
Needs ReviewPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 147930

sys/conf/files

sys/kern/kern_jail.c

sys/kern/kern_jailmeta.c

sys/sys/jail.h

tests/sys/kern/Makefile

tests/sys/kern/jailmeta.sh

usr.sbin/jail/jail.8

jail: Add meta and env parametersNeeds ReviewPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 147930

sys/conf/files

sys/kern/kern_jail.c

sys/kern/kern_jailmeta.c

sys/sys/jail.h

tests/sys/kern/Makefile

tests/sys/kern/jailmeta.sh

usr.sbin/jail/jail.8

jail: Add meta and env parameters
Needs ReviewPublic
Actions

Revision Contents
Changeset List