Handle reference count overflow in non-INVARIANTS kernels.
Closed, Public

Authored by markj on Jul 27 2019, 7:01 PM.

Details

Summary

When a refcount(9) counter reaches INT_MAX+1 or underflows, maintain the
counter at a fixed value, causing refcount_release() to always return
false. This helps mitigate exploitation of refcount leaks. The trick
relies on reserving the range [INT_MIN, -1] of refcount values for
special purposes.
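
As a rough illustration of the reserved range (the committed header may use different names), the saturated state can be detected with a sign test on the counter, with the midpoint of [INT_MIN, 0] as the value saturated counters are parked at:

#define REFCOUNT_SATURATED(val)         (((int)(val)) < 0)      /* sign bit set */
#define REFCOUNT_SATURATION_VALUE       (3U << 30)              /* midpoint of [INT_MIN, 0] */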

This increases the kernel text size of amd64/GENERIC-NODEBUG by ~12KB.
Refcount acquisition now uses fetchadd instead of plain add. On amd64,
add can take an immediate operand, while xadd does not, so we pay five
bytes to load the immediate value into a register. The check of the
result uses test %eax,%eax followed by a js to the saturation
handler. Refcount release already uses fetchadd; now it checks for
underflow and saturation with a test %eax,%eax followed by a jle.
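
For reference, the shape of the new fast paths sketched in C, assuming the REFCOUNT_SATURATED/REFCOUNT_SATURATION_VALUE macros above and a refcount_update_saturated() helper (sketched further below); the committed header may differ in detail:

/* Assumes <sys/types.h> and <machine/atomic.h>. */

static __inline void
refcount_acquire(volatile u_int *count)
{
        u_int old;

        /* lock xadd; the sign test on the old value becomes test/js. */
        old = atomic_fetchadd_int(count, 1);
        if (__predict_false(REFCOUNT_SATURATED(old)))
                refcount_update_saturated(count);
}

static __inline bool
refcount_release(volatile u_int *count)
{
        u_int old;

        atomic_thread_fence_rel();
        old = atomic_fetchadd_int(count, -1);
        /* old <= 0 (signed) covers both underflow and saturation: test/jle. */
        if (__predict_false(old == 0 || REFCOUNT_SATURATED(old))) {
                refcount_update_saturated(count);
                return (false);
        }
        if (old > 1)
                return (false);
        atomic_thread_fence_acq();
        return (true);
}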

Once a refcount reaches the saturation point, subsequent operations on
the counter simply set the counter to the midpoint of the range
[INT_MIN, 0]. This is to help ensure that subsequent operations on the
counter keep it in the saturation range. There is a small race
however: if a thread increments a counter to INT_MAX+1, a subsequent
release will result in a call to refcount_update_saturated(). If the
releasing thread is preempted before setting the counter value to
(3U << 30), and 2^31 consecutive releases occur while the thread is
preempted, a use-after-free may occur. I think this possibility is
sufficiently small so as not to be a concern; we tolerate at least one
similar edge case in the seqc code.
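
The saturation handler itself can be as simple as a plain store of the midpoint value; the race described above exists precisely because that store is not atomic with respect to the surrounding fetchadds. A hedged sketch (an INVARIANTS kernel presumably asserts instead, per the title):

static __inline void
refcount_update_saturated(volatile u_int *count)
{
        /*
         * Park the counter at the midpoint of the saturation range.  A
         * plain store suffices (it compiles to the movl $0xc0000000 seen
         * in the disassembly below), but it can race with concurrent
         * fetchadds, which is the window discussed above.
         */
        *count = REFCOUNT_SATURATION_VALUE;
}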

Test Plan

An example function (crhold) compiled with and without the change. Note that the MI
implementation adds unnecessary overhead: the test instructions
are redundant.

With:

0xffffffff80bdef60 <+0>:     push   %rbp
0xffffffff80bdef61 <+1>:     mov    %rsp,%rbp
0xffffffff80bdef64 <+4>:     mov    %rdi,%rax
0xffffffff80bdef67 <+7>:     mov    $0x1,%ecx
0xffffffff80bdef6c <+12>:    lock xadd %ecx,(%rdi)
0xffffffff80bdef70 <+16>:    test   %ecx,%ecx
0xffffffff80bdef72 <+18>:    js     0xffffffff80bdef76 <crhold+22>
0xffffffff80bdef74 <+20>:    pop    %rbp
0xffffffff80bdef75 <+21>:    retq   
0xffffffff80bdef76 <+22>:    movl   $0xc0000000,(%rax)
0xffffffff80bdef7c <+28>:    pop    %rbp
0xffffffff80bdef7d <+29>:    retq

Without:

Dump of assembler code for function crhold:
   0xffffffff807d14b0 <+0>:     push   %rbp
   0xffffffff807d14b1 <+1>:     mov    %rsp,%rbp
   0xffffffff807d14b4 <+4>:     mov    %rdi,%rax
   0xffffffff807d14b7 <+7>:     lock addl $0x1,(%rdi)
   0xffffffff807d14bb <+11>:    pop    %rbp
   0xffffffff807d14bc <+12>:    retq

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

markj edited the test plan for this revision.
markj added a subscriber: emaste.

Why not change both acq and rel to use atomic_fcmpset? You would get rid of the ugly race and could declare a single value as the saturated case. I do not think that cmpxchg vs. xadd matters much.

In D21089#457692, @kib wrote:

Why not change both acq and rel to use atomic_fcmpset? You would get rid of the ugly race and could declare a single value as the saturated case. I do not think that cmpxchg vs. xadd matters much.

cmpxchg would bloat the text size quite a bit more in the hot path and would require an extra branch. The race is indeed ugly, but I don't think it's really a practical problem, just as in the seqc case.
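
For comparison, the fcmpset-based acquire being discussed would look roughly like the loop below (hypothetical sketch, not part of this patch); the initial read, the loop, and the extra comparison are the cost being weighed against a single lock xadd:

static __inline void
refcount_acquire_fcmpset(volatile u_int *count)
{
        u_int old;

        old = *count;
        do {
                /* A single sentinel value could serve as the saturated case here. */
                if (__predict_false(REFCOUNT_SATURATED(old)))
                        return;
        } while (atomic_fcmpset_int(count, &old, old + 1) == 0);
}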

This revision is now accepted and ready to land. Jul 27 2019, 8:33 PM

The entire thing can probably be hacked to be better, but I don't have good ideas as it stands.

One note, though: we can probably get better assembly from clang by redefining the hand-rolled atomic ops in terms of their C11 counterparts. Several ops, most notably fcmpset, emit a needless sete/setc after the actual operation. I had a patch somewhere just for this op and it cleaned up the generated assembly nicely, but it ran into a bug in clang (which may be fixed by now). Stuff of this sort:

static __inline int
atomic_fcmpset_ptr(volatile u_long *dst, u_long *expect, u_long src)
{

        return (__c11_atomic_compare_exchange_strong(
            (_Atomic volatile u_long *)dst, expect, src,
            __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
}

An equivalent change for fetchadd may convince clang to generate better asm for refcount_acquire as well.
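
For illustration, such a redefinition of fetchadd on top of the corresponding C11 builtin might look like this (untested sketch, same caveats as above):

static __inline u_int
atomic_fetchadd_int(volatile u_int *p, u_int v)
{

        return (__c11_atomic_fetch_add((_Atomic volatile u_int *)p, v,
            __ATOMIC_SEQ_CST));
}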

I don't think this is a blocker for the patch, but something to consider.

imho replacing fetchadd with an fcmpset loop in _acquire would measurably hurt performance: it would force a read before doing anything and would then be prone to failing the cmpxchg in the presence of other threads playing with the counter. I don't have numbers handy, but a multi-threaded loop bumping a counter with lock xadd is much faster than an equivalent loop doing cmpxchg.

This revision was automatically updated to reflect the committed changes.