Page MenuHomeFreeBSD

Needs ReviewPublic

Authored by kib on Mar 10 2016, 3:55 PM.



With this change, randomization is applied to all non-fixed mappings.
By randomization I mean the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours
the superpage attributes.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation
prevents entropy injection. It is trivial to implement a strong mode
where failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation.
The current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, the locality is implemented for
anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
the small AS architectures (funny that 32bits is considered small).
This is tied with the question of following an application's hint
about the mmap(2) base address. Testing shows that ignoring the hint
does not affect the function of common applications, but I would expect
more demanding code could break. By default sbrk is preserved and mmap
hints are satisfied, which can be changed by using the
kern.elf{32,64}.aslr_care_sbrk sysctl.

Stack gap, W^X, shared page randomization, KASLR and other techniques
are explicitely out of scope of this work.

The paxtest results for the run with the patch applied and aggresively
tuned can be seen at the .
For comparision, the run on Fedora 23 on the same machine is at .

ASLR is enabled on per-ABI basis, and currently it is only enabled on
native i386 and amd64 (including compat 32bit) ABIs. I expect to test
and enable ASLR for armv6 and arm64 as well, later.

The procctl(2) control for ASLR is implemented, by I have not provided
a userspace wrapper around the syscall. In fact, the most reasonable
control needed is per-image and not per-process, but we have no
tradition to put the kernel-read attributes into the extattrs of binary,
so I am still pondering that part and this also explains the non-written

Diff Detail

rS FreeBSD src repository
Lint Skipped
Unit Tests Skipped
Build Status
Buildable 21596

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kib updated this revision to Diff 49195.Oct 15 2018, 10:44 PM

Fix asserted issues on i386.

markj added inline comments.Nov 14 2018, 10:39 PM

I'm not sure that this makes sense after r327218. Consider the loop in vm_map_find() which causes do_aslr to decrement to 0. Subsequent iterations of the loop are not affected by this update to *addr. Really, we want to be updating min_addr.

kib updated this revision to Diff 50435.Nov 15 2018, 3:34 AM

Split curr_min_addr tracking out of *addr.

kib marked an inline comment as done.Nov 15 2018, 3:34 AM
markj added inline comments.Nov 15 2018, 7:07 PM

I think this is still problematic. Consider an application which dlopen()s many small libraries (so anon coalescing isn't relevant). We always start the search at the same point in the address space. vm_map_find() performs a first-fit search and will thus return successive regions corresponding to the gaps between already-mapped objects. However, these gaps may not be large enough to satisfy vm_map_find_aslr_adjust(), so we end up giving up the search.

kib marked an inline comment as done.Nov 16 2018, 9:38 AM
kib added inline comments.

The code should give up the search for the region that allows the randomization part, and fall to the do_aslr == 0 case (if block right under the again: label). There, we would do the vm_map_findspace() from the original min_addr base and return whatever vm_map_findspace() found.

So yes, this case would get less rnds than it could be, but the allocation should not fail. And the later part, not failing the allocation which can be satisfied at all in !aslr case, is really the only thing I care about.

If user is concerned that this boundary case does not get enough rnd into VAs, he can increase aslr_slopiness by the cost of increased allocation CPU overhead. This is the reason for slopiness to be tunable.

alc added a comment.Nov 16 2018, 5:55 PM

ASR has a measurable performance hit; ASLR does not.

I'd be quite interested in the detail here if you can provide a link or reference to this.

How expensive is pulling from the entropy pool at least once, at most five times, per call to mmap?

I would also expect that ASR is going to require the allocation of more page table pages and reduce the effectiveness of the MMU's page walk caches at caching mid-level page table entries because of the greater dispersal of mappings in the address space.

alc added a comment.Nov 16 2018, 6:32 PM

On a somewhat related note, the version of jemalloc that we use in HEAD and 12.x is frequently allocating small, unaligned chunks from the kernel. In contrast, older versions were (always?) allocating large, aligned chunks. You can see the effects here: P196. Clang reads source files using mmap(2), and between most mapped files there is a sliver of anonymous heap. For "buildworld", I estimate that this has reduced superpage promotions by 1/3. The explicit clustering of anonymous memory allocations by this patch likely cures this problem.

kib added a comment.Nov 16 2018, 10:58 PM
In D5603#385047, @alc wrote:

I would also expect that ASR is going to require the allocation of more page table pages and reduce the effectiveness of the MMU's page walk caches at caching mid-level page table entries because of the greater dispersal of mappings in the address space.

Yes, this is the point known from the beginning. But with the current scheme where I limit the applied entropy, the adjustment of otherwise selected address typically would allow to reuse the pde. You can look at the following sample debugging (pidx and preserve are from the updated patch, addr is the stock allocated range start, adj is the start adjusted by ASLR):

ASLR: pidx 1 preserve 16 xxx 0 addr 0x800227000 adj 0x800227000
ASLR: pidx 0 preserve 4096 xxx 0x38a addr 0x800228000 adj 0x8005b2000
ASLR: pidx 1 preserve 16 xxx 0x8 addr 0x8005ee000 adj 0x8015ee000
ASLR: pidx 0 preserve 4096 xxx 0x2bd addr 0x8015ef000 adj 0x8018ac000
ASLR: pidx 1 preserve 16 xxx 0xd addr 0x801cb8000 adj 0x8036b8000
ASLR: pidx 0 preserve 4096 xxx 0x6ba addr 0x8005ee000 adj 0x800ca8000

If aslr_pages_rnd[] arrays start allowing more entropy, this would be not that good, of course.

kib updated this revision to Diff 50504.Nov 16 2018, 11:01 PM

Implement Mark suggestion:
stop sloppiness, instead do at most two attempts of allocation (ignoring fallback for optimal->any):
first try to allocate required length + space needed for randomization;
second, if the first try failed, only allocate the requested length.


Is there really a necessity for separately enabling ASR for the PIE execution base? I'm thinking that either ASR is on for the entire executable image or it's off.

markj added inline comments.Nov 19 2018, 11:24 PM

I'm surprised that the default value for honor_sbrk is 0. The review description suggests that the default is intended to be 1.


I think "coalesce_anon" or "merge_anon" might be a better name, but I don't feel too strongly about it.

As Alan pointed out, this merging has benefits independent of ASR; perhaps we should make it orthogonal to any randomization that we may do?


Won't this case be true any time we attempt an anon mapping? This looks like it should just be if (en_aslr && !do_aslr).


I don't see how randomization gets applied if anon is true.


I think I missed earlier discussion of this, but it seems wrong that we don't support multiple -m values.

kib marked 6 inline comments as done.Nov 20 2018, 9:13 AM
kib added inline comments.

Ok, moved it out of pure alsr control. Now, combined with the retry change, it should be usable alone (I hope so).


I rewrote (or rather, reformulated) the retry logic to not depend on the combination of anon+do_aslr. Instead, there is the try number which explicitly guide the selection of the applicable choices.


I think I fixed that in the new loop construction, by applying the rnd logic for anon case when curr_min_address == 0.

Before the first non-coalesced anon mapping got the randomization because previous mappings were.


We do not need to, you can chain proccontrol, same as e.g. nice/nohup/sudo etc.

kib updated this revision to Diff 50628.Nov 20 2018, 9:15 AM
kib marked 2 inline comments as done.

Rewrite the retry loop to explicitly depend on the loop iteration (first vs. second) instead of outguess it by the combination of do_aslr/anon booleans.

kib marked an inline comment as done.Nov 20 2018, 9:17 AM
kib added inline comments.

I think at one time defaults were changed to be the most aggressive mode for testing. Also it was needed, AFAIR, to make the consumers happy which use whole 32bit address space, like pypy. Anyway, aslr will be disabled by default.

kib updated this revision to Diff 50870.Nov 21 2018, 11:58 AM

Add parsing of the feature control note both to the kernel loader and to the dynamic linker.
Allocate a bit in the feature control mask for disabling ASLR for the image.

kib updated this revision to Diff 51030.Nov 23 2018, 11:39 PM

Update after the feature control note parsing bits were merged.

markj added inline comments.Nov 24 2018, 10:30 PM

I think you need to initialize anon_loc to 0 in _vm_map_init() for this to work as intended?

Even then, when curr_min_addr == 0, the amount of randomization applied to the initial anon mapping is quite small. For PIEs, libraries are loaded after the (random) base load address, but otherwise, the set of possible initial addresses is quite small.

kib marked an inline comment as done.Nov 25 2018, 10:05 AM
kib added inline comments.

I fixed several more bugs with anon_loc, e.g. copying it on fork. Also I added explicit setting of anon_loc on execution of the ELF binary in 'hard' mode, similar to the interpreter base address selection.

I am not sure what do you mean by the amount of randomization. Either rnd is applied or not, if it is applied, then the amount of the entropy is guaranteed to be some.

kib updated this revision to Diff 51086.Nov 25 2018, 10:07 AM

Some fixes for anon coalescing.

  • Rework selection of coalescing base address after failure in vm_map_find(), trying to randomize it first.
  • Fix map init and fork by zeroing and copying anon_loc.
  • Apply same randomization to anon_loc as for the interpreter if ASLR is enabled.
markj added inline comments.Nov 25 2018, 7:00 PM

IMO it would be clearer if you rename "anon" to "coalesce".


I mean that if the curr_min_addr is not randomized, the amount of entropy added is quite small. In the latest version this is still a problem for non-anonymous mappings in non-PIE binaries: the starting min address is constant (vm_daddr + lim(RLIMIT_DATA)), so the load address of, for example, can be guessed without much work. I am not sure if this is really a significant problem when the executable's address is not randomized, however.


I guess you can just write anon = update_anon instead.


Does it make sense to update anon_loc if find_space == VMFS_NO_SPACE?

kib marked 4 inline comments as done.Nov 25 2018, 7:43 PM
kib added inline comments.

I am still not sure about this. Do you mean that the amount of entropy allowed by the aslr_pages_rnd_XXX arrays is too small ?

kib updated this revision to Diff 51092.Nov 25 2018, 7:44 PM

Rename anon to coalesce.
Simplify restart when coalesce failed, Inline the last helper.
goto done is never done for KERN_SUCCESS.
Do not update anon_loc for VMFS_NO_SPACE.

markj added inline comments.Nov 25 2018, 8:36 PM

Indeed, it does not provide nearly as much entropy as the initial randomization of et_dyn_addr for PIEs or anon_loc. Consider that is mapped with VMFS_OPTIMAL_SPACE, so we will set

*addr += (arc4random() % 0x10) * 0x200000;

For a non-PIE on amd64 this means that will get loaded somewhere in [0x800000000, 0x800200000], so the entropy added is quite minimal. PIEs do not have this problem.

markj added inline comments.Nov 25 2018, 10:04 PM

Don't we need to reset curr_min_addr here too?

kib marked 2 inline comments as done.Nov 25 2018, 11:09 PM
kib added inline comments.

Yes, this is how I want to keep it now, by disturbing the normal layout as minimal as possible for PoC. On the other hand, since PIE base, ld.elf load address, and now initial anon base are already 'hard' randomized, might be it is indeed does not make sense to keep that part of entropy low. In fact I think we will see after another exp run.


The intent is to make two normal passes without coalescing. Second pass resets curr_min_addr.

markj added inline comments.Nov 26 2018, 1:58 PM

I noticed that we are coalescing mappings in pipe_map. Is there any advantage to be gained by doing this? As a downside, I think the coalescing increases page table usage, especially since most pipe_map mappings are the same size.


mm, we reset curr_min_addr only if en_aslr is set though.

kib updated this revision to Diff 51226.Nov 27 2018, 6:37 PM

Disable coalescing on submaps.
Remove bogus try reset on retry with coalescing disabled, since curr_min_addr recalculation depends on try == 2.

markj added inline comments.Dec 3 2018, 8:39 PM

Sorry for being indecisive. Thinking some more, I think "clustering" actually makes more sense than "coalescing." Coalescing is the process of bringing together multiple entities that were previously separate, but in this case, the anonymous mappings are not separate to begin with.


"When creating an anonymous mapping, try clustering with an existing anonymous mapping first."


The text is a bit misleading since coalesce == false doesn't imply that coalescing failed. How about:

"We make up to two attempts to find address space for a given find_space value. The first attempt may apply randomization or may cluster with an existing anonymous mapping. If this first attempt fails, perform a first-fit search of the available address space."


Why is it necessary to set curr_min_addr here? We know try == 1, so after following the goto we will assign to curr_min_addr again.


MAP_IS_SUB_MAP, for consistency with MAP_ENTRY_IS_SUB_MAP?


Did you mean to clear the flag here?

kib updated this revision to Diff 51553.Dec 3 2018, 9:13 PM
kib marked 5 inline comments as done.

Reword comments.
Rename variable and symbol.
Remove dup assignment.

markj accepted this revision.Dec 3 2018, 9:43 PM

I suspect that the anon clustering will interact suboptimally with the jemalloc behaviour discussed in D16501 and elsewhere. In particular, jemalloc will unmap small regions of the address space, leaving holes. With clustering, those holes won't be reused since we no longer perform a first-fit search. IMO it would be worth reconsidering how anon_loc works; rather than advancing it after each successful clustering, maybe it should be constant after the initialization to a non-zero value, so that we attempt to fill holes with new mappings before extending the clustered region further. I do not think this needs to be done prior to commit though.


"... or to cluster with an existing mapping."


I'd consider calling this a "gap" instead, here and in the code (instead of "preserve").

kib updated this revision to Diff 51555.Dec 3 2018, 9:58 PM
kib marked 2 inline comments as done.

Comment update.

markj added inline comments.Dec 3 2018, 10:01 PM

I don't think we should update anon_loc if max_addr is specified (e.g., MAP_32BIT was passed).

kib updated this revision to Diff 51561.Dec 3 2018, 11:52 PM
kib marked an inline comment as done.

Disable clustering if map is limited by max_addr.

markj added inline comments.Dec 15 2018, 8:03 PM

"Decide whether to"


I think it would be worth adding a counter for vm_map_findspace() failures, at least for the en_aslr case.

kib marked 2 inline comments as done.Dec 16 2018, 2:23 AM
kib added inline comments.

I added the counter for try == 2 restarts.

IMO iti is of limited usefulness because it is global, but I do not think it is worth adding the per-vmspace counters and the whole required infrastructure for it.

kib updated this revision to Diff 52070.Dec 16 2018, 2:23 AM
kib marked an inline comment as done.

Add global restart counter.