Currently rtld delegates to libc or libthr to initialise the TCBs for
all existing threads when dlopen is called for a library that is using
static TLS. This creates an odd split where rtld manages all of TLS for
dynamically-linked executables except for this specific case. It is
also unnecessarily complex: we have to reason about the locking,
because rtld drops the bind lock so that libthr can take the thread list
lock without deadlocking if any code run whilst that lock is held ends
up calling back into rtld (such as for lazy PLT resolution).

The only real reason we call out into libc / libthr is that we don't
have a list of threads in rtld and that's how we find the currently used
TCBs to initialise (and at the same time do the copy in the callee
rather than adding overhead with some kind of callback that provides the
TCB to rtld). If we instead keep a list of allocated TCBs in rtld itself
then we no longer need to do this, and can just copy the data in rtld.

How these TCBs are mapped to threads is irrelevant; rtld can just treat
all TCBs equally and ensure that each TCB's static TLS data block
remains in sync with the current set of loaded modules, just as
_rtld_allocate_tls creates a fresh TCB and associated data without any
embedded threading model assumptions.

As an implementation detail, to avoid a separate allocation for the list
entry and having to find that allocation from the TCB to remove and free
it on deallocation, we allocate a fake TLS offset for it and embed the
list entry there in each TLS block.

This will also make it easier to add a new TLS ABI downstream in
CheriBSD, especially in the presence of library compartmentalisation.
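
As a rough illustration of the TCB list described above (not the actual
rtld code), the following self-contained C sketch models each TLS block
as a fixed-size buffer with a list entry embedded at a reserved offset.
All names in it (tls_block_alloc, tls_block_free, tls_add_static_module,
struct tcb_link, TLS_BLOCK_SIZE) are hypothetical. It ignores the real
TLS variants and TCB placement, takes no locks, and does not replay
already-loaded modules into newly allocated blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy model; none of these names are rtld's own. */
#define TLS_BLOCK_SIZE 256

/* List entry embedded in every TLS block, so no separate allocation. */
struct tcb_link {
	struct tcb_link *next;
	struct tcb_link **prevp;	/* address of the pointer pointing at us */
};

static struct tcb_link *tcb_list;	/* all live blocks */
static const size_t link_offset = 0;	/* "fake" offset reserved for the entry */
static size_t static_used = sizeof(struct tcb_link);

static struct tcb_link *
block_link(unsigned char *block)
{
	return ((struct tcb_link *)(void *)(block + link_offset));
}

static unsigned char *
link_block(struct tcb_link *l)
{
	return ((unsigned char *)l - link_offset);
}

/* Allocate a zeroed block and push its embedded entry onto the list. */
unsigned char *
tls_block_alloc(void)
{
	unsigned char *block = calloc(1, TLS_BLOCK_SIZE);
	struct tcb_link *l;

	if (block == NULL)
		return (NULL);
	l = block_link(block);
	l->next = tcb_list;
	l->prevp = &tcb_list;
	if (tcb_list != NULL)
		tcb_list->prevp = &l->next;
	tcb_list = l;
	return (block);
}

/* Unlink via the embedded entry; no lookup is needed to find it. */
void
tls_block_free(unsigned char *block)
{
	struct tcb_link *l = block_link(block);

	*l->prevp = l->next;
	if (l->next != NULL)
		l->next->prevp = l->prevp;
	free(block);
}

/*
 * Reserve space for a newly loaded static-TLS module and initialise it
 * in every existing block. Returns the module's offset within a block,
 * or (size_t)-1 if the request does not fit.
 */
size_t
tls_add_static_module(const void *init, size_t initsize, size_t size)
{
	size_t off = (static_used + 15) & ~(size_t)15;

	if (initsize > size || off > TLS_BLOCK_SIZE ||
	    size > TLS_BLOCK_SIZE - off)
		return ((size_t)-1);
	static_used = off + size;
	for (struct tcb_link *l = tcb_list; l != NULL; l = l->next) {
		unsigned char *dst = link_block(l) + off;

		if (initsize != 0)
			memcpy(dst, init, initsize);	/* initialised data */
		memset(dst + initsize, 0, size - initsize); /* zero-fill rest */
	}
	return (off);
}
```

In this model the loader never needs to know which thread owns a block:
it only walks the list, and freeing a block unlinks it through the entry
stored inside the block itself.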