Paths

Table of Contentst

Differential D22084

Fix atomic_*cmpset32 on riscv64 with clang.
ClosedPublic
Actions

Authored by jhb on Oct 18 2019, 9:51 PM.

Details

Reviewers

mhorne
markj
br
kp

Commits

rS353931: Fix atomic_*cmpset32 on riscv64 with clang.

Summary

The lr.w instruction used to read the value from memory sign-extends
the value read from memory. GCC sign-extends the 32-bit comparison
value passed in whereas clang currently does not. As a result, if the
value being compared has the MSB set, the comparison fails for matching
32-bit values when compiled with clang.

Use a cast to explicitly sign-extend the unsigned comparison value. This
works with both GCC and clang.

There is commentary in the RISC-V spec that suggests that GCC's approach
is more correct, but it is not clear if the commentary in the RISC-V spec is
binding.

Obtained from: Axiado

Test Plan

A single-cpu qemu instance compiled with clang was hanging trying to
start init. Attaching gdb to qemu I found it was spinning forever in
hardclock due to an atomic_fcmpset32 that kept spinning even though
the two values matched. The value of ticks is negative at the start
of boot which provoked this issue. I'm not sure why this didn't fail
before now as the value of ticks is deterministic (we intentionally
set it to a negative value to detect wrapover bugs soon after boot)
and so this should have always hung.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jhb created this revision.Oct 18 2019, 9:51 PM

Harbormaster completed remote builds in B27118: Diff 63463.Oct 18 2019, 9:52 PM

So I now wonder if this isn't actually a compiler bug in LLVM. Specifically, section 4.2 of the 2.2 spec when talking about RV64 has this "explanatory" comment in italics:

The compiler and calling convention maintain an invariant that all 32-bit values are held in a sign-extended format in 64-bit registers. Even 32-bit unsigned integers extend bit 31 into bits 63 through 32. Consequently, conversion between unsigned and signed 32-bit integers is a no-op, as is conversion from a signed 32-bit integer to a signed 64-bit integer. Existing 64-bit wide SLTU and unsigned branch compares still operate correctly on unsigned 32-bit integers under this invariant. Similarly, existing 64-bit wide logical operations on 32-bit sign-extended integers preserve the sign-extension property. A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition and shifts to ensure reasonable performance for 32-bit values.

If I'm reading this correctly, the compiler should be storing the unsigned 32-bit values as sign-extended instead of zero-extended and that clang is zero-extending the value that gets passed in to 'cmpval' incorrectly. It may be that it is only fcmpset32 that is affected due to cmpval being a pointer? In fact, I wonder if this change will in fact break GCC now as it might be passing in a 64-bit value in the cmpval register.

I used this test program:

#include <sys/types.h>

static __inline int
atomic_fcmpset_32(volatile uint32_t *p, uint32_t *cmpval, uint32_t newval)
{
	uint32_t tmp;
	int res;

	res = 0;

	__asm __volatile(
		"0:"
			"li   %1, 1\n"		/* Preset to fail */
			"lr.w %0, %2\n"		/* Load old value */
#if 0
		        "slli %0, %0, 0x20\n"
		        "srli %0, %0, 0x20\n"
#endif
			"bne  %0, %z4, 1f\n"	/* Compare */
			"sc.w %1, %z5, %2\n"	/* Try to store new value */
			"j 2f\n"
		"1:"
			"sw   %0, %3\n"		/* Save old value */
		"2:"
			: "=&r" (tmp), "=&r" (res), "+A" (*p), "+A" (*cmpval)
			: "rJ" (*cmpval), "rJ" (newval)
			: "memory");

	return (!res);
}


uint32_t test_fun(uint32_t *p, uint32_t x)
{
	uint32_t y;

	y = *p;
	while (!atomic_fcmpset_32(p, &y, y + x))
		;
	return (y + x);
}

I then compiled it with GCC -O2:

test_fcmpset_gcc.o:     file format elf64-littleriscv


Disassembly of section .text:

0000000000000000 <test_fun>:
   0:	411c                	lw	a5,0(a0)
   2:	1141                	addi	sp,sp,-16
   4:	c63e                	sw	a5,12(sp)
   6:	a011                	j	a <.L3>

0000000000000008 <.L6>:
   8:	47b2                	lw	a5,12(sp)

000000000000000a <.L3>:
   a:	00f5873b          	addw	a4,a1,a5
   e:	00c10813          	addi	a6,sp,12
  12:	4685                	li	a3,1
  14:	1005262f          	lr.w	a2,(a0)
  18:	00f61563          	bne	a2,a5,22 <.L1^B1>
  1c:	18e526af          	sc.w	a3,a4,(a0)
  20:	a019                	j	26 <.L2^B1>

0000000000000022 <.L1^B1>:
  22:	00c82023          	sw	a2,0(a6)

0000000000000026 <.L2^B1>:
  26:	0006879b          	sext.w	a5,a3
  2a:	fff9                	bnez	a5,8 <.L6>
  2c:	4532                	lw	a0,12(sp)
  2e:	0141                	addi	sp,sp,16
  30:	9d2d                	addw	a0,a0,a1
  32:	8082                	ret

and clang -O2:

test_fcmpset_clang.o:     file format elf64-littleriscv


Disassembly of section .text:

0000000000000000 <test_fun-0x2>:
   0:	0001                	nop

0000000000000002 <test_fun>:
   2:	1141                	addi	sp,sp,-16
   4:	00056603          	lwu	a2,0(a0)
   8:	c632                	sw	a2,12(sp)
   a:	00c10813          	addi	a6,sp,12

000000000000000e <.LBB0_1>:
   e:	00b60733          	add	a4,a2,a1
  12:	4685                	li	a3,1
  14:	100527af          	lr.w	a5,(a0)
  18:	00c79063          	bne	a5,a2,18 <.LBB0_1+0xa>
  1c:	18e526af          	sc.w	a3,a4,(a0)
  20:	a001                	j	20 <.LBB0_1+0x12>

0000000000000022 <.Ltmp1>:
  22:	00f82023          	sw	a5,0(a6)

0000000000000026 <.Ltmp2>:
  26:	00c16603          	lwu	a2,12(sp)
  2a:	1682                	slli	a3,a3,0x20
  2c:	9281                	srli	a3,a3,0x20
  2e:	e281                	bnez	a3,2e <.Ltmp2+0x8>
  30:	00b6053b          	addw	a0,a2,a1
  34:	0141                	addi	sp,sp,16
  36:	8082                	ret

Note that GCC uses 'lw' in instruction at offset 0 to read the $a5 that is eventually used in the bne instruction (and is thus sign-extended), while clang uses 'lwu' at offset 4.

In D22084#482546, @jhb wrote:
So I now wonder if this isn't actually a compiler bug in LLVM. Specifically, section 4.2 of the 2.2 spec when talking about RV64 has this "explanatory" comment in italics:
The compiler and calling convention maintain an invariant that all 32-bit values are held in a sign-extended format in 64-bit registers. Even 32-bit unsigned integers extend bit 31 into bits 63 through 32. Consequently, conversion between unsigned and signed 32-bit integers is a no-op, as is conversion from a signed 32-bit integer to a signed 64-bit integer. Existing 64-bit wide SLTU and unsigned branch compares still operate correctly on unsigned 32-bit integers under this invariant. Similarly, existing 64-bit wide logical operations on 32-bit sign-extended integers preserve the sign-extension property. A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition and shifts to ensure reasonable performance for 32-bit values.
If I'm reading this correctly, the compiler should be storing the unsigned 32-bit values as sign-extended instead of zero-extended and that clang is zero-extending the value that gets passed in to 'cmpval' incorrectly. It may be that it is only fcmpset32 that is affected due to cmpval being a pointer? In fact, I wonder if this change will in fact break GCC now as it might be passing in a 64-bit value in the cmpval register.

Nice investigation, it sounds like this is indeed an LLVM bug. I believe this patch would break a GCC-built kernel as is, but if you can sign-extend *cmpval instead of zero-ing tmp, we could commit it as a workaround.

In D22084#482595, @mhorne wrote:
In D22084#482546, @jhb wrote:
So I now wonder if this isn't actually a compiler bug in LLVM. Specifically, section 4.2 of the 2.2 spec when talking about RV64 has this "explanatory" comment in italics:
The compiler and calling convention maintain an invariant that all 32-bit values are held in a sign-extended format in 64-bit registers. Even 32-bit unsigned integers extend bit 31 into bits 63 through 32. Consequently, conversion between unsigned and signed 32-bit integers is a no-op, as is conversion from a signed 32-bit integer to a signed 64-bit integer. Existing 64-bit wide SLTU and unsigned branch compares still operate correctly on unsigned 32-bit integers under this invariant. Similarly, existing 64-bit wide logical operations on 32-bit sign-extended integers preserve the sign-extension property. A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition and shifts to ensure reasonable performance for 32-bit values.
If I'm reading this correctly, the compiler should be storing the unsigned 32-bit values as sign-extended instead of zero-extended and that clang is zero-extending the value that gets passed in to 'cmpval' incorrectly. It may be that it is only fcmpset32 that is affected due to cmpval being a pointer? In fact, I wonder if this change will in fact break GCC now as it might be passing in a 64-bit value in the cmpval register.
Nice investigation, it sounds like this is indeed an LLVM bug. I believe this patch would break a GCC-built kernel as is, but if you can sign-extend *cmpval instead of zero-ing tmp, we could commit it as a workaround.

It's easy to brand this as an LLVM bug, but I can't actually find anywhere that specifies how inline assembly register operands are supposed to be represented, regardless of the architecture. For the other architectures we care about, it just so happens to be that LLVM and GCC match in the cases we use, and this is the first time we see divergence for 32-bit integers. However, once you get down to 16-bit integers, things get weirder, and not just on RISC-V, but also MIPS and AArch64 (we will ignore X86 given it can just use the 16-bit registers, and I don't feel POWER adds anything to the mix). I'm compiling the following test code:

extern signed short gs;
extern unsigned short gu;

void asm_load(void) {
        signed short ls = gs+2;
        unsigned short lu = gu+2;
        __asm__ __volatile__("#%0" :: "r"(gs));
        __asm__ __volatile__("#%0" :: "r"(gu));
        __asm__ __volatile__("#%0" :: "r"(ls));
        __asm__ __volatile__("#%0" :: "r"(lu));
}

void asm_param(signed short ps, unsigned short pu) {
        signed short ls = ps+2;
        unsigned short lu = pu+2;
        __asm__ __volatile__("#%0" :: "r"(ps));
        __asm__ __volatile__("#%0" :: "r"(pu));
        __asm__ __volatile__("#%0" :: "r"(ls));
        __asm__ __volatile__("#%0" :: "r"(lu));
}

Let's look at AArch64, since one would expect it to be the most mature of the three for both LLVM and GCC. With GCC we get the following:

asm_load:
        adrp    x1, :got:gs
        adrp    x0, :got:gu
        ldr     x1, [x1, #:got_lo12:gs]
        ldr     x0, [x0, #:got_lo12:gu]
        ldrsh   w1, [x1]
        ldrh    w0, [x0]
        #x1
        #x0
        add     w1, w1, 2
        #x1
        add     w0, w0, 2
        #x0
        ret

asm_param:
        sxth    w0, w0
        and     w1, w1, 65535
        #x0
        #x1
        add     w0, w0, 2
        #x0
        add     w1, w1, 2
        #x1
        ret

Starting at asm_load, we see that values loaded for inline assembly are sign-extended according to their type. However, values resulting from the computation, despite the C clearly forcing them to be truncated, remain untruncated when passed to the inline assembly. This is fine for the signed variable, because it's sign-extended in the register so any integer overflow (which would notionally go from being a leading 0 to leading 1 in 16-bit and thus require more sign-extension) is undefined, and if we cross from negative to non-negative then the carries will propagate up. However, this is problematic for the unsigned variable, since 0xfff[ef] should wrap back around to [01], but we will instead go to 0x1000[01], which is different from how that value would be represented if it had gone out to memory and come back (an __asm__ __volatile__("" :: "r"(&lu) : "memory"); to force it to be in memory will cause it to be loaded back with an ldrh, thereby truncating it). For asm_param, things look very similar, just with explicit sign/zero-extension instructions since there are no loads to fold them into. So, GCC is not self-consistent with what the high bits are.

But with LLVM we get the following:

asm_load:
        adrp    x8, :got:gs
        ldr     x8, [x8, :got_lo12:gs]
        adrp    x9, :got:gu
        ldrh    w8, [x8]
        ldr     x9, [x9, :got_lo12:gu]
        ldrh    w10, [x9]
        #x8
        ldrh    w9, [x9]
        add     w8, w8, #2
        add     w10, w10, #2
        #x9
        #x8
        #x10
        ret

asm_param:
        add     w8, w0, #2
        add     w9, w1, #2
        #x0
        #x1
        #x8
        #x9
        ret

Ignoring the poor register allocation, there are two crucial differences: in asm_load, *everything* is zero-extended so the high bits differ for the signed variable (and also the negative-to-nonnegative case now has bit 16 becoming 1 just like the unsigned overflow), and in asm_param, no sign/zero-extension takes place, the registers are used as-is, but the PCS explicitly leaves the high bits completely undefined.

So, even for AArch64, we have inconsistency both within each compiler and between compilers. For both of them, they don't seem to impose strict rules. GCC just does whatever it would do if it were just a "normal" instruction consuming it, and LLVM uses the ANY_EXTEND mode, with ZERO_EXTEND being used if zero-extension is free (which in practice means "is my input a load instruction"), which is why the only explicit extending of any kind is when it's part of the loads, and is always zero-extending. MIPS turns out to be slightly *more* friendly, in that LLVM's behaviour is more similar to GCC's. It still generates the same kind of code as LLVM did for AArch64 (namely, all unsigned loads in asm_load, and no extensions in asm_param), but because the calling convention mandates zero/sign-extension according to the type, it can get away with this laxness in asm_param and end up behaving the same as GCC's asm_load and asm_param (which also omits the extensions since they're guaranteed), though still differently from its own asm_load due to the unsigned loads.

We generally don't see issues with i32 simply because, when using 32-bit types, there are normally instructions explicitly for 32-bit integers, and thus as far as most LLVM backends are concerned, i32 is a valid type and so it does not get promoted. However, for the RISC-V backend, it was decided to simplify the backend by only having a 64-bit GPR register class, so i32 needs to be promoted much earlier, and thus suffers from all this weirdness that i16 also suffers from on other architectures. This is not a new problem, it has just been made more widespread and apparent.

The good news is that there is a very simple way to avoid all this behaviour that's not specified anywhere and silently relied upon in code like this: put in explicit casts to either zero-extend or sign-extend your value to a register-sized quantity (where "register-sized" *can* be smaller than the native word size, but only if you know there are sub-registers of that size, such as on X86 (8, 16 and 32-bit) or AArch64 (just 32-bit), but for the latter you must make sure to use %wX rather than %X in the inline assembly so you use the sub-register rather than the full register; Clang actually has a warning about this). Obviously, if you don't actually care about the high bits and your code works regardless of what they are then you don't need to do anything and can continue to pass in sub-register-sized values. For example, for asm_load, if you want zero extension, do:

void zeroext_asm_load(void) {
        signed short ls = gs+2;
        unsigned short lu = gu+2;
        __asm__ __volatile__("#%0" :: "r"((unsigned long)(unsigned short)gs));
        __asm__ __volatile__("#%0" :: "r"((unsigned long)gu));
        __asm__ __volatile__("#%0" :: "r"((unsigned long)(unsigned short)ls));
        __asm__ __volatile__("#%0" :: "r"((unsigned long)lu));
}

Or, for asm_param, if you want sign extension, do:

void signext_asm_param(signed short ps, unsigned short pu) {
        signed short ls = ps+2;
        unsigned short lu = pu+2;
        __asm__ __volatile__("#%0" :: "r"((signed long)ps));
        __asm__ __volatile__("#%0" :: "r"((signed long)(signed short)pu));
        __asm__ __volatile__("#%0" :: "r"((signed long)ls));
        __asm__ __volatile__("#%0" :: "r"((signed long)(signed short)lu));
}

Both GCC and LLVM do what you would expect for these, and are (of course, though at this point perhaps that isn't a given) smart enough to know when the extensions are not necessary (including implicit ones to sanitise the input), and when they can be folded into the loads. This makes it strictly better than doing the sanitisation on the inputs inside the inline assembly, since they can't be folded into any of the generated extensions or loads outside the inline assembly (not to mention that clobbering inputs is in general a bad idea unless you put all the necessary constraints in; it might turn out to be harmless given it's only sign/zero-extending, but you never know what the generated code after the inline assembly might be relying on about those high bits that you've just violated).

The only reason I think it might be a bug is that the RISC-V spec seems to be fairly explicit about how unsigned 32-bit ints should be stored in registers. That said, I had also gotten a similar patch using the casts privately and will be updating the patch to use that once I test it.

Use a cast to sign extend the comparison value.

Harbormaster completed remote builds in B27142: Diff 63521.Oct 21 2019, 8:30 PM

A reference to a relevant LLVM review: https://reviews.llvm.org/D69212

To be clear, I booted both clang and GCC-compiled versions with this patch. I also compared the output of my test program for earlier and both clang and gcc now insert a 'sext.w' instruction to sign extend 'cmpval' before the 'li'.

mhorne accepted this revision.Oct 22 2019, 2:41 AM

mhorne added inline comments.