
Implement __crt_aligned_alloc().
Closed, Public

Authored by kib on Jul 22 2023, 4:50 AM.

Details

Summary

To be used later in libthr. Note that malloc_aligned() in rtld/xmalloc.c has additional requirements (offset) and cannot easily be replaced by __crt_aligned_alloc().

rtld_malloc: remove outdated comments

The ovu_magic does not necessarily overlap with the low byte of ov_next on big-endian machines.

There is no range checking in the allocator.
rtld: remove dup __crt_malloc prototypes
rtld_malloc: only include internal rtld headers when building for rtld
rtld_malloc: add cp2op() helper

Convert a user allocation address into the corresponding overhead pointer.
rtld_malloc: increase overhead index to uint16

Reorder it with magic, to keep alignment.
rtld_malloc: add __crt_aligned_alloc()

It is modelled after aligned_alloc(3).  Most importantly, the allocation can be freed with __crt_free().  (A usage sketch follows at the end of this summary.)
thr_mutex.c: style

Reindent and re-fill the statement.
thr_malloc: add __thr_calloc_aligned_cacheline()
libthr: allocate mutexes aligned on the cache line
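
For illustration, a minimal usage sketch, assuming the prototype mirrors aligned_alloc(3), i.e. void *__crt_aligned_alloc(size_t align, size_t size); struct big_lock and the CACHE_LINE_SIZE constant (from <machine/param.h>) are stand-ins for this sketch, not names from the patch:

    /* Hedged sketch: allocate a cache-line-aligned object. */
    static struct big_lock *
    new_big_lock(void)
    {
            struct big_lock *lk;

            lk = __crt_aligned_alloc(CACHE_LINE_SIZE, sizeof(*lk));
            /* When the object is no longer needed: __crt_free(lk), per the summary. */
            return (lk);
    }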

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision. Jul 22 2023, 4:50 AM

Correct the order of address calculations.

kib edited the summary of this revision.
libthr: allocate mutexes aligned on the cache line
lib/libthr/thread/thr_malloc.c
151

I think you need to multiply nbytes by nitems in that call to memset().

lib/libthr/thread/thr_malloc.c
151

Also, I suspect we should perform the same overflow check here that is done in __crt_calloc().
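
A hedged sketch of the two suggestions above (the overflow check and zeroing nitems * nbytes); this is not the actual __thr_calloc_aligned_cacheline() body, and the call to __crt_aligned_alloc() with CACHE_LINE_SIZE is only an assumption about the underlying allocator:

    #include <stdint.h>     /* SIZE_MAX */
    #include <string.h>     /* memset */

    void *
    calloc_aligned_sketch(size_t nitems, size_t nbytes)
    {
            void *p;

            /* Same overflow-check idea as in __crt_calloc(). */
            if (nbytes != 0 && nitems > SIZE_MAX / nbytes)
                    return (NULL);
            p = __crt_aligned_alloc(CACHE_LINE_SIZE, nitems * nbytes);
            if (p != NULL)
                    memset(p, 0, nitems * nbytes);  /* clear the whole region */
            return (p);
    }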

General question: AFAICT, on amd64 __crt_malloc() seems to always return an address that is a power-of-two + 8 bytes. Why isn't it at least __alignof(max_align_t) aligned?
For example, it appears that malloc(3) always returns a suitably aligned allocation as long as the request is for more than 8 bytes.

libexec/rtld-elf/rtld_malloc.c
192

Wouldn't malloc(size + sizeof(*ov) + align) be sufficient? It would also future-proof you against the unlikely case that sizeof(union overhead) becomes larger than align...

libexec/rtld-elf/rtld_malloc.c
244

I may be wrong, but AFAICT on an ILP32 system __crt_malloc() will return a 4-byte aligned address, in which case the divide/multiply of ov_index by 8 will not get you back to the original allocation address. Am I missing something? Seems like LOW_BITS should be something like ilog2(__alignof(union overhead)).

But see my other comment about __crt_malloc() not returning a "suitably aligned address"...

libexec/rtld-elf/rtld_malloc.c
192

Hmm, seems like ov_index can overflow when align is large, right? Imagine align == 4096 and __crt_malloc() returns an address of 1048584 (i.e., 1 MiB + 8). After roundup, x is 1052672, and so we set ov_index to (1052672 / 8), which is 131584; this overflows ov_index such that it ends up being zero.

andrew added inline comments.
lib/libthr/thread/thr_malloc.c
148

Can we add a way to query the cache line size? e.g. on arm64 we make a guess for CACHE_LINE_SIZE to be 128, but it is more likely 64 on most HW.

We can read it from userspace with the ctr_el0 register.
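
For reference, a minimal arm64-only sketch of reading the D-cache line size from CTR_EL0 in userspace (not code from this review); DminLine (bits 19:16) holds log2 of the smallest data cache line size in 4-byte words:

    #include <stddef.h>
    #include <stdint.h>

    static size_t
    arm64_dcache_line_size(void)
    {
            uint64_t ctr;

            __asm __volatile("mrs %0, ctr_el0" : "=r" (ctr));
            return ((size_t)4 << ((ctr >> 16) & 0xf));
    }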

kib marked 5 inline comments as done. Jul 24 2023, 10:40 PM

General question: AFAICT, on amd64 __crt_malloc() seems to always return an address that is a power-of-two + 8 bytes. Why isn't it at least __alignof(max_align_t) aligned?
For example, it appears that malloc(3) always returns a suitably aligned allocation as long as the request is for more than 8 bytes.

There is no reason except that sizeof(void *) is the minimum alignment required for __crt_malloc() to be usable, and the allocator is exceptionally simple. It is not intended to come even close to regular malloc(3) in its features.

libexec/rtld-elf/rtld_malloc.c
192

ov_index is set to the difference between the user-visible address and the address obtained from __crt_malloc(). As such, it should be around (align >> 3) / 2. This gives us 256 * 8 == 2048 as the maximum value for the displacement. I can extend ov_index to short, with some pain due to mis-alignment, but I do not see it useful for the immediate application.

244

Right, this should be fixed. I changed ov_index to point to the original overhead instead of the user pointer. This should keep everything with the low 3 bits zero.

kib marked 2 inline comments as done. Jul 24 2023, 10:42 PM
kib added inline comments.
lib/libthr/thread/thr_malloc.c
148

Definitely, I can add this. Can we agree on the interface before I implement it?

Would, e.g., a libc symbol size_t cacheline_size(void) be enough? Especially for big.LITTLE configurations?

lib/libthr/thread/thr_malloc.c
147

Is there a reason to not use __builtin_mul_overflow to perform the multiplication and get the overflow check?
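
For comparison, the builtin-based form of the check inside such a function would be a one-liner (sketch only; nitems and nbytes as in the surrounding discussion):

    size_t total;

    /* Stores nitems * nbytes into total; returns true if the multiplication overflowed. */
    if (__builtin_mul_overflow(nitems, nbytes, &total))
            return (NULL);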

148

I thought it could be a macro in pthread_md.h with a default version in thr_private.h to define it as CACHE_LINE_SIZE.

It is ok to assume the cache line size on all CPUs is the same. As far as I know only one SoC has a different cache line size on different CPUs. This can be handled by the kernel by trapping the ctr_el0 register to provide a consistent value if FreeBSD is ever run there.
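
The arrangement being suggested would look roughly like this; THR_CACHE_LINE_SIZE is a hypothetical name used only for this sketch:

    /* In a per-arch pthread_md.h that can do better, e.g. via the CTR_EL0 reader above: */
    #define THR_CACHE_LINE_SIZE     (arm64_dcache_line_size())

    /* Generic fallback, e.g. in thr_private.h: */
    #ifndef THR_CACHE_LINE_SIZE
    #define THR_CACHE_LINE_SIZE     CACHE_LINE_SIZE
    #endif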

libexec/rtld-elf/rtld_malloc.c
190

Nicely done, I don't think that would have occurred to me. But I think that reduces the maximum for the align parameter to 1024 before ov_index overflows. I think it prudent that you check for and handle overflow to prevent __crt_free() from leaking these allocations (or worse, crashing due to an errant mismatch of ov_magic because the memory between mem and x isn't zeroed).

I think you could solve this for all values of align by storing the ilog2() of align in ov_index, e.g., something like ov_index = __builtin_popcount(align - 1); or maybe base it on __builtin_clzl().

Which reminds me, shouldn't we reject requests for which align is not a power of two? If it's not a power of two I think it will also cause ov_index to be incorrect as x won't have 8-byte alignment...
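
A hedged sketch of both points (rejecting a non-power-of-two align and deriving ilog2(align) via popcount); not code from the patch:

    static int
    align_shift(size_t align)
    {
            /* Reject zero and values that are not a power of two. */
            if (align == 0 || (align & (align - 1)) != 0)
                    return (-1);
            /* ilog2(align): the number of set bits below the single 1 bit. */
            return (__builtin_popcountl(align - 1));
    }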

kib marked an inline comment as done. Jul 25 2023, 12:59 PM
kib added inline comments.
lib/libthr/thread/thr_malloc.c
147

Mostly to avoid a compiler-specific feature when I can use standard C.

libexec/rtld-elf/rtld_malloc.c
190

I am not sure how storing align in the overhead would help; we need to restore the non-aligned allocation address.

I increased ov_index to short; it should be enough for everything.

Answering your question: yes, the alignment should be a power of two, but I see little value in checking this in an internal allocator. What would we do if the condition is not met?

kib marked an inline comment as done.
kib edited the summary of this revision.

Bump ovu_index to uint16_t

libexec/rtld-elf/rtld_malloc.c
192

Ah, you're right, never mind; that only works if align is greater than 16, and relies upon __crt_malloc() allocating from buckets that are a power of two. So it would be messy to deal with and brittle...

The increase in size of ov_index should accommodate all reasonable use cases :-)

General observation. Overall I like this new feature! But I have a suspicion that using it for pthread mutexes might increase the lock/unlock latency a wee bit.

Consider that the size of struct pthread_mutex_t is something like 96 bytes, so __crt_malloc() will allocate the space from a 128-byte integrally aligned bucket, and hence the lock will straddle two adjacent cachelines, which on amd64 will prefetch the sibling when we first touch either the first or second line. This is good because we will modify both cache lines when acquiring and releasing a lock. But then I suspect this is all just a happy accident.

However, if we use __crt_aligned_alloc() to allocate the lock, then the memory will come from a 256-byte bucket (so, 4 64-byte cachelines), and the lock will straddle the two inner cache lines (i.e., it'll cross a 128-byte boundary). The adjacent cacheline prefetcher will eventually load all four cachelines, of which we'll generally only need two, and we'll lose the benefit of the prefetcher loading the entire lock structure when we touch either the upper or lower part of the structure. So pthread mutex lock/unlock requests might wait on memory a bit longer than they do currently. Also, we'll get half as many locks per page (16 vs 32 at best), which might be noticeable for apps that allocate a lot of locks.

So overall, it'll be a bit more resource intensive, but probably not noticeable in situ.


Are you saying that instead of aligning pthread_mutex allocations, we just need to ensure that allocation is padded to the end of the cache line? Then, perhaps the same thing needs to be done to rwlocks and spinlocks?


No, I don't think any padding is required, but perhaps I'm not fully understanding your question. For pthread mutex I think the current allocation is ideal, but I suspect that's just an accident of its size (~96 bytes) and the way in which the allocator works (i.e., the structure fits well within an integrally-aligned 128-byte bucket, including the allocator's overhead, and for platforms that have 64-byte cachelines and adjacent cacheline prefetching we get fully usable prefetches and maximum density of locks per page).

But rwlocks are 40 bytes and spinlocks are 32 bytes, so they each wind up in 64-byte buckets as allocated by aligned_alloc(). Given two such locks that are adjacent within an integrally-aligned 128 bytes, heavily accessing them from different CPUs on a platform that performs adjacent cacheline prefetching causes a slight degradation. I've tried to measure this; it does impact the performance a little, but it's nowhere near as bad as false sharing of the same cache line.
Presuming you switch their allocator over to __crt_malloc(), then to maximize performance in this case all that's required is to round up the allocation size of the rw/spin lock to CACHE_LINE_SIZE, and the bucket allocator will put them into their own private 128-byte buckets (due to its addition of the tracking structure that resides within the allocation).
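
As an illustration of that idea only (assuming __crt_malloc() takes just the byte count, and using roundup() from <sys/param.h>; struct some_rwlock is a stand-in type):

    struct some_rwlock *lk;

    /* Pad the request to whole cache lines so no unrelated object can
       share the lock's cache line(s). */
    lk = __crt_malloc(roundup(sizeof(*lk), CACHE_LINE_SIZE));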

But then it gets messy. For a platform like s390x which I think has 256-byte cachelines and presumably does not perform adjacent cacheline prefetch, you would only want to round up the lock to half a cacheline, such that after the allocator adds its overhead it gives you back a 256-byte bucket.
For example, currently if we were to allocate 8 spinlocks or 6 rwlocks in a row on s390x, then on a freshly booted machine they likely all would reside in contiguous virtual memory (a series of 64-byte buckets). This would cause severe false sharing on that platform.

Doh, I see now FreeBSD doesn't support s390x, but the example still holds. It would be nice to have a table of cacheline sizes for all supported architectures, as well as know which ones support adjacent cacheline prefetch.

But all the above relies on the current behavior of __crt_malloc(), which makes me uneasy in that if the implementation changed it might throw off the layout calculus. That said, I don't think it would be terrible in that case, but it might quietly revert from reasonably optimal...

Does that make sense?

With tail pad, I mean the following. The goal of the exercise is not to get specific alignments, but ultimately to ensure that lock objects do not (false-)share cache lines with any other objects. Alignment ensures that there is no sharing with something before the lock, while padding would do the same for objects after the lock. Currently, details of the allocation implementation (buddy split) give that for free, but in principle we need both alignment and padding to guarantee the property regardless of the allocator.

I do not intend to switch rwlocks or spinlocks to __crt_malloc(); there is no point. For mutexes, it is sort of required because regular malloc(3) needs mutexes to operate, and I lost the fight with avoiding the recursion _and_ handling the special cases of early initialization/libthr dynamic loading and other small but practically important situations.

So now I am wondering if __crt_aligned_alloc() should be abandoned. Other cleanup from this review should go in, IMO.

In D41150#938121, @kib wrote:

With tail pad, I mean the following. The goal of the exercise is not to get specific alignments, but ultimately to ensure that lock objects do not (false-)share cache lines with any other objects. […]

So now I am wondering if __crt_aligned_alloc() should be abandoned.

Ah, right. So what I generally do in this case is append "__aligned(CACHE_LINE_SIZE)" (or whatever size) to the end of the structure definition, which then gets me the correct amount of unnamed tail padding and ensures that static definitions have the correct alignment and spacing (and works well for arrays of such items). But for dynamic allocations of course you must use something like aligned_alloc() to get the correct alignment. I think this approach eliminates the messiness in the code of having to maintain explicit tail padding.
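
For illustration, the attribute approach looks roughly like this; struct some_lock is a stand-in, not the real pthread_mutex layout:

    #include <stdint.h>
    #include <sys/param.h>  /* CACHE_LINE_SIZE, pulls in __aligned() */

    struct some_lock {
            uint32_t        sl_flags;
            uintptr_t       sl_owner;
            /* ... */
    } __aligned(CACHE_LINE_SIZE);   /* unnamed tail padding + alignment */

    /* Static definitions and arrays get the alignment and spacing for free. */
    static struct some_lock locks[4];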

Unfortunately, if we did that for struct pthread_mutex_t, then __crt_malloc(sizeof(struct pthread_mutex_t)) would use a 256-byte bucket and hence waste about 128 bytes. But the upside is that the lock would still reside only in the first two cachelines, like it does today. OTOH, statically defined struct pthread_mutex_t is hit or miss, as on amd64 they really should be 128-byte aligned to prevent the false sharing that would occur if the lock started on an odd multiple of the cache line size.

Agreed, I typically don't like to introduce new code that doesn't have an immediate use, so tabling it for now seems like the right course of action...

This revision was not accepted when it landed; it landed in state Needs Review. Jul 26 2023, 2:30 PM
This revision was automatically updated to reflect the committed changes.
kib updated this revision to Diff 125190.

Update patch after trivial fixes were committed.

Implement rtld malloc_aligned() with __crt_aligned_alloc(). Handle offset into align.

libthr bits are kept around for now.

libexec/rtld-elf/xmalloc.c
91 (On Diff #125335)

Nice! We now request (align * 2) fewer bytes per call, which could eliminate crossing into the next larger bucket in some cases, potentially wasting less memory and improving allocation page density.

Generally speaking, I think it would be cleaner if __crt_aligned_alloc() adhered to the same interface as aligned_alloc(3), and we created a new interface for rtld's use case (which appears to be a one-off). For example, you could simply rename __crt_aligned_alloc() (above) to __crt_aligned_alloc_offset(), and then create a new interface for __crt_aligned_alloc() which just calls __crt_aligned_alloc_offset(align, size, 0).

That way, new use cases of __crt_aligned_alloc() can simply use the well-known API, which should serve to reduce confusion among both implementors and general readers of the code.
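
A minimal sketch of the proposed split, using the names suggested above (the committed code may differ):

    /* The well-known aligned_alloc(3)-shaped entry point. */
    void *
    __crt_aligned_alloc(size_t align, size_t size)
    {

            return (__crt_aligned_alloc_offset(align, size, 0));
    }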

kib marked an inline comment as done. Jul 30 2023, 9:58 PM
kib added inline comments.
libexec/rtld-elf/xmalloc.c
91 (On Diff #125335)

I renamed __crt_aligned_alloc() to __crt_aligned_alloc_offset().
Currently there is no need for __crt_aligned_alloc() as such.

kib marked an inline comment as done.

Rename to __crt_aligned_alloc_offset.

Ping? Note that libthr changes are not planned for commit ATM.

markj added inline comments.
libexec/rtld-elf/rtld_malloc.c
187

roundup2()?

This revision is now accepted and ready to land. Aug 21 2023, 1:11 PM
kib marked an inline comment as done. Aug 21 2023, 1:22 PM
kib added inline comments.
libexec/rtld-elf/rtld_malloc.c
187

In principle, align is user-controlled. OTOH, we do depend on align being a power of two for a non-zero offset.
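
For reference, these are essentially the <sys/param.h> definitions, which shows why roundup2() is safe here only because align is a power of two:

    #define roundup(x, y)   ((((x) + ((y) - 1)) / (y)) * (y))   /* works for any y */
    #define roundup2(x, y)  (((x) + ((y) - 1)) & (~((y) - 1)))  /* y must be a power of two */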

kib marked an inline comment as done.

Use roundup2()

This revision now requires review to proceed. Aug 21 2023, 1:24 PM