Page MenuHomeFreeBSD

rtld-elf: Track allocated TCBs internally and use for distribute_static_tls
AcceptedPublic

Authored by jrtc27 on Wed, Jun 18, 9:31 PM.
Tags
None
Referenced Files
F122224408: D50920.id.diff
Thu, Jul 3, 12:37 PM
F122198073: D50920.id157277.diff
Thu, Jul 3, 6:00 AM
Unknown Object (File)
Tue, Jul 1, 5:49 PM
Unknown Object (File)
Mon, Jun 30, 12:13 PM
Unknown Object (File)
Sun, Jun 29, 3:43 PM
Unknown Object (File)
Sat, Jun 28, 5:35 PM
Unknown Object (File)
Fri, Jun 27, 8:49 PM
Unknown Object (File)
Fri, Jun 27, 2:54 AM
Subscribers

Details

Reviewers
kib
Summary

Currently rtld delegates to libc or libthr to initialise the TCBs for
all existing threads when dlopen is called for a library that is using
static TLS. This creates an odd split where rtld manages all of TLS for
dynamically-linked executables except for this specific case, and is
unnecessarily complex, including having to reason about the locking due
to dropping the bind lock so libthr can take the thread list lock
without deadlocking if any of the code run whilst that lock is held ends
up calling back into rtld (such as for lazy PLT resolution).

The only real reason we call out into libc / libthr is that we don't
have a list of threads in rtld and that's how we find the currently used
TCBs to initialise (and at the same time do the copy in the callee
rather than adding overhead with some kind of callback that provides the
TCB to rtld. If we instead keep a list of allocated TCBs in rtld itself
then we no longer need to do this, and can just copy the data in rtld.
How these TCBs are mapped to threads is irrelevant, rtld can just treat
all TCBs equally and ensure that each TCB's static TLS data block
remains in sync with the current set of loaded modules, just as how
_rtld_allocate_tls creates a fresh TCB and associated data without any
embedded threading model assumptions.

As an implementation detail, to avoid a separate allocation for the list
entry and having to find that allocation from the TCB to remove and free
it on deallocation, we allocate a fake TLS offset for it and embed the
list entry there in each TLS block.

This will also make it easier to add a new TLS ABI downstream in
CheriBSD, especially in the presence of library compartmentalisation.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 64959
Build 61842: arc lint + arc unit

Event Timeline

libexec/rtld-elf/rtld.c
6189

This doesn't handle offsetting by obj->tlspoffset & (obj->tlsalign - 1); like allocate_tls itself does, but that's because this is inlining the exact implementation used by libc and libthr. I don't actually know how to produce an object with such an offset so can't write a test reproducer for that bug. But now it's in rtld-elf we can actually fix that bug without having to change __pthread_distribute_static_tls's API (though maybe you could get away with it not being an ABI break to add the extra argument on all our supported architectures?).

libexec/rtld-elf/rtld.c
88

Note the existing naming conventions in rtld, ugly but consistent. I believe it would be

typedef struct Struct_TCB_List_Entry {
...
} TCB_List_Entry;
5435

I am not sure that dying there is the best action.
Basically the tcb_list_remove() is externally callable, and IMO it is more user-friendly to report an error and then do nothing both there and in the free_tls() caller.

5441

You used STAILQ to have the tcb_list_entry down to two pointers? I would not object against using TAILQ instead, but it is up to you.

libexec/rtld-elf/rtld.c
88

Yeah I can continue that if you'd rather, I just figured for adding new code I should do it a bit more normally, and thought I was more at risk of being told not to use the odd rtld convention for new code.

5435

I was in two minds about that. I did it this way to match how calling free(3) will abort in various bad input cases (jemalloc doesn't do that so much though, but other implementations do). Especially since only libthr is using this interface, we don't expect random user code to do this outside of very select circumstances. We generally take the view in CheriBSD that it's better to fail up front rather than squash errors like double free and continue blindly on, since very occasionally ignoring something like a double free can be insufficient to mitigate a vulnerability.

5441

Given the linear search to find the entry in the first place there's not really a benefit to using a TAILQ. It would remove this condition but it's O(1) either way, so we'd just be adding an extra pointer for the sake of it IMO.

libexec/rtld-elf/rtld.c
5427

(I'm assuming this linear search isn't a performance concern; if it is we could always switch to something from tree(3))

libexec/rtld-elf/rtld.c
88

Up to you.

5427

pho did experiments and we were able to run > 100k threads on not too large machines.
May be pre-allocate the pointer in tcb itself, e.g. one word before the self pointer?

5435

My personal opinion is that libraries should not terminate apps.

libexec/rtld-elf/rtld.c
5427

That would let the application follow and/or mess with them on CHERI which would likely break our compartmentalisation model. Keeping this wholly internal to rtld is a requirement at least downstream, and I'd rather the upstream code did the same rather than having to have that as a diff (or #ifdef if and when CHERI is upstreamed).

5435

I guess my response would be they can do that if they really want to via all kinds of means, whether aborting, calling exit, making direct syscalls, signalling / faulting, etc.

If you're hitting this case either something's gone very wrong inside libthr or you have an actively malicious library that could just call fprintf+exit instead to achieve the same effect. The latter case isn't interesting, and in the former case continuing on à la ON ERROR RESUME NEXT is likely to not leave you in a stable state.

libexec/rtld-elf/rtld.c
5427

Then define MI tcp_to_tcb_list_entry(). On linear architectures it would be __containerof() to subtract the offset of the tcb from its pointer, on cheri you would use tree or stailq or whatever, for external container.

Same would be needed for list_insert/list_remove.

libexec/rtld-elf/rtld.c
5427

s/MI/MD of course, sorry

libexec/rtld-elf/rtld.c
5427

I think it gets messier than you might first think. struct tcb needs to stay the actual architectural TCB, and that's what _rtld_allocate/free_tls return/take. So you'd need some wrapper struct that you actually allocate, or just put two different structs next to each other in the allication. And I don't know how that would interact with calling _rtld_allocate/free_tls with a TCB size greater than sizeof(struct tcb); is the expectation from the caller that their additional memory goes directly before the TCB? Presumably, otherwise how do they know where it is? But then how does rtld know where to find this new link pointer for a given TCB?

5427

s/architectural/ABI's/ I guess

libexec/rtld-elf/rtld.c
5427

We need to allocate more than just requested size, to ensure the alignment. Similarly, we can do thinks to ensure that there is a space for one more pointer, I do not see a difference. Easiest would be to put the pointer right before the tcb (I sure talk about variant II).

libexec/rtld-elf/rtld.c
5427

I'm generally talking about variant I here, where the extra padding is head padding, i.e. at the start of the allocation. The only thing the caller can reasonably do is assume its extra space is directly prior to the TCB; anything else and it's making assumptions about how much extra padding there is, which will depend on the static TLS data alignment. To quote the ASCII diagram from libc:

* +----------+--------------+--------------+-----------+------------------+
* | pre gap  | extended TCB |     TCB      | post gap  |    TLS segment   |
* | pre_size |  extra_size  | TLS_TCB_SIZE | post_size | tls_static_space |
* +----------+--------------+--------------+-----------+------------------+

In the variant II case you can't put it before the TCB because that's where the static TLS data is. You can't put it after the TCB because that is in the way of the extra_size region, and if you put it in struct tcb itself then that's effectively the same problem, the extra_size region moves.

libexec/rtld-elf/rtld.c
5427

Why in the variant II the link word cannot be put after the TCB? It is essentially a tcb_size (specified by caller) + sizeof(uintptr_t).

Similarly, why for the variant I, the word cannot be added before the TCB?

In both cases, I am looking at the diagrams from the Drepper' article/spec.

libexec/rtld-elf/rtld.c
5427

Because our API allows the caller to allocate additional space already, and the only way they can determine where that extra space is is by assuming it's directly next to the TCB. If the TCB itself is now variable length then there's no way for it to know where its extra space is.

libexec/rtld-elf/rtld.c
5427

Technically it would mean increasing the TCB size by the word. Then for variant I the real tcb is at the alloc address + sizeof(word), for variant II nothing changes. The formula for looking up the location of the word in variant I is trivial, tcbp - sizeof(uintptr_t), for variant II we have to trust the size arg passed by the caller, tcbp + tcb_size.

What I am missing?

libexec/rtld-elf/rtld.c
5427

For variant II, the pointer is after the extra space that was requested by the caller, see the formula above.

libexec/rtld-elf/rtld.c
5427

How does rtld know how much extra space there is? It can vary between TCBs (especially given the main thread's is allocated by rtld itself).

libexec/rtld-elf/rtld.c
5427

From the tcbsize argument, for non-initial threads.

For the initial thread, the rtld is the caller of allocate_tls(), so the size is statically known.

libexec/rtld-elf/rtld.c
5427

So we'd need to save that in every TCB_Entry? Which is the same memory overhead (on non-CHERI) as just allocating the separate data structure.

libexec/rtld-elf/rtld.c
5427

For non-cheri, the pointer (or two pointers) would allow the removal in fixed time, without iterating over all tcbs, or without the cost of maintaining the tree.

WRT the memory overhead, it would be slightly more optimal to have the link pointers co-allocated with tcb due to allocator overhead. But this is definitely not important.

Another, somewhat strange, approach for embedding the linkage data into the tcb/tls could be to allocate the index for the rtld object itself. Then we can have as much space as we want there, and use it for TAILQ.

jhb added inline comments.
libexec/rtld-elf/rtld.c
88

Hmm, downstream in CheriBSD we have also added a new Plt_Entry as we permit multiple PLTs (and associated GOTs) within a single shared library. While for that case I did follow rtld's existing conventions, I would be happy to switch it to more typical FreeBSD style if we think we would like to slowly convert rtld over to KNF. If we would like to converge on KNF, then perhaps this should also use a KNF-style type name?

5427

Hmm, I guess one assumption it might be worth clarifying is do we expect thread exit to be a critical path. That is, do we think applications tend to create many short-lived threads? My impression from the move away from M:N threading (and the fact that Linux never bothered with M:N threading for pthreads), is that applications that run on Linux (and FreeBSD) tend to create pools of long-lived threads on which tasks are scheduled using application-specific scheduling (that is, something like taskqueue(9)) rather than creating lots of short-lived threads that are dedicated to a single task.

(FWIW, in C++ I would just use a std::unordered_map<struct tcb *> which is more of a hash table than a tree(3) (which is akin to a std::map<>?) as presumably thread create/exit is more of a hot path than dlopen. Too bad we don't have a nice set of macros for a hash table, it's one of the reasons I like using C++ in userspace.)

Also, to understand the two proposals, I believe kib@ is suggesting to effectively add a pointer to the associated struct tcb_list_entry into the TCB allocation on the "opposite side" of the static TLS space, but for TLS II we'd have to know how much extra space there is? But if we store the size of the space in the struct tcb_list_entry I don't understand how you can find it? That is, you need the tls_size value to know the offset of the new pointer relative to struct tcb *, but you need to get tls_size by reading from the new pointer?

The current version uses an O(n) list traversal, and Jess's alternate suggestion is to replace the list with a tree keyed by the struct tcb * value?

libexec/rtld-elf/rtld.c
5427
  1. I think we must cope with all uses, and high-rate thread create/destroy, or even high rate thread-destroy after slow accumulating a lot of threads, is all allowed and would make us look bad if we do not handle that fast enough.
  2. As I said, both _rtld_allocate_tls() and _rtld_free_tls() take the tcbsize as the argument, so we do know presumably correct size of pcb on free. But see item 4.
  3. Yes, I believe the code in review is O(n), tree gives O(log n), and my proposal gives O(1), where n is the number of threads.
  4. Initially I did proposed to add a pointer into the TCB allocation. Then I realized that we need two pointers for TAILQ. Then I think I have an idea that makes it work without extending TCB allocation. We can allocate tls index for rtld itself, and keep the linkage in the tls module data for rtld.
libexec/rtld-elf/rtld.c
5427

So for point 4, if I understand correctly, the idea is to allocate a dummy TLS index so that the DTV array contains an extra entry that points back to the struct tcb_tls_entry and to make struct tcb_tls_entry use a TAILQ_ENTRY so you can do O(1) removal?

For CHERI this may need a bit more thought as we don't want to create more places where an attacker with arbitrary code execution (e.g. the xz exploit) can follow a valid chain of pointers to gain access to data it shouldn't. We already have to play some games to avoid the data pointer in the PLT GOT entry that points to Obj_Entry and this would add another one of those, though perhaps not quite as bad as you can "only" get to all of the TCB's (and thus TLS data) of all running threads vs getting access to all of the Obj_Entry objects via the PLT GOT entry which in turn effectively gives you access to everything mapped in the address space.

libexec/rtld-elf/rtld.c
5427

Yes, the rtld-owned tls index should contain enough info to allow the static tls data propagation, and to allow for fixed-time removal from the global list.

For cheri, you can do whatever is suitable to cheri, instead of this linkage. Make the ops MD. For normal (linear) arches it would be using TAILQ, for cheri it is whatever you find suitable.

Embed TAILQ_ENTRY in TLS block as if it were TLS data

Consider increasing RTLD_STATIC_TLS_EXTRA, we now use 16 bytes there unconditionally.

libexec/rtld-elf/rtld.c
941

Perhaps add some rtld_error() in the failure path, otherwise rtld would die silently and mysteriously.

This revision is now accepted and ready to land.Sat, Jun 28, 3:05 AM
In D50920#1165848, @kib wrote:

Consider increasing RTLD_STATIC_TLS_EXTRA, we now use 16 bytes there unconditionally.

This is done before allocate_initial_tls so before we fix the size of the static TLS block, like any initially loaded object. RTLD_STATIC_TLS_EXTRA remains entirely free for dlopened objects.

In D50920#1165848, @kib wrote:

Consider increasing RTLD_STATIC_TLS_EXTRA, we now use 16 bytes there unconditionally.

This is done before allocate_initial_tls so before we fix the size of the static TLS block, like any initially loaded object. RTLD_STATIC_TLS_EXTRA remains entirely free for dlopened objects.

Ok.

libexec/rtld-elf/rtld.c
941

And on success, I would also added dbg() trace with the value of the allocated offset.