Page MenuHomeFreeBSD

iflib: remove convoluted custom zeroing code
ClosedPublic

Authored by brooks on Thu, Jan 8, 6:09 PM.
Tags
None
Referenced Files
F141910908: D54605.id169338.diff
Mon, Jan 12, 9:56 AM
F141909686: D54605.diff
Mon, Jan 12, 9:29 AM
F141875770: D54605.diff
Sun, Jan 11, 8:39 PM
Unknown Object (File)
Sat, Jan 10, 4:20 PM
Unknown Object (File)
Sat, Jan 10, 3:32 PM
Unknown Object (File)
Sat, Jan 10, 5:24 AM
Unknown Object (File)
Sat, Jan 10, 5:19 AM
Unknown Object (File)
Sat, Jan 10, 5:00 AM

Details

Summary

Replace a collection of aliasing violations and ifdefs with memset
(which now expands to __builtin_memset and should be quite reliably
inlined.) The old code is hard to maintain as evidenced by the most
recent change to if_pkt_info_t updating the defines, but not the zeroing
code.

Effort: CHERI upstreaming
Sponsored by: Innovate UK
Fixes: 43d7ee540efe ("iflib: support for transmit side nic KTLS offload")

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

brooks requested review of this revision.Thu, Jan 8, 6:09 PM

CHERI adds yet another branch to the ifdefs and makes this worse and harder to maintain so it's time for it to go. The compiler is also free to recognize the pattern here as memset and replace the code with a libcall so this optimization isn't reliable in any way.

Inline them? There are only 5 calls in total.

I would suggest to go deeper and call memset() directly. These empty proxy functions do not improve readability.

For additional context, I took a look at the disassembly on amd64 and the memsets are all expanded in place:

;       memset(pi, 0, sizeof(*pi));
    8461: 48 c7 85 78 ff ff ff 00 00 00 00      movq    $0x0, -0x88(%rbp)
    846c: 48 c7 85 70 ff ff ff 00 00 00 00      movq    $0x0, -0x90(%rbp)
    8477: 48 c7 85 68 ff ff ff 00 00 00 00      movq    $0x0, -0x98(%rbp)
    8482: 48 c7 85 60 ff ff ff 00 00 00 00      movq    $0x0, -0xa0(%rbp)
    848d: 48 c7 85 58 ff ff ff 00 00 00 00      movq    $0x0, -0xa8(%rbp)
    8498: 48 c7 85 50 ff ff ff 00 00 00 00      movq    $0x0, -0xb0(%rbp)
...
;       memset(ri, 0, sizeof(*ri));
    8b1d: 48 c7 85 78 ff ff ff 00 00 00 00      movq    $0x0, -0x88(%rbp)
    8b28: 48 c7 85 70 ff ff ff 00 00 00 00      movq    $0x0, -0x90(%rbp)
    8b33: 48 c7 85 60 ff ff ff 00 00 00 00      movq    $0x0, -0xa0(%rbp)
    8b3e: 48 c7 85 58 ff ff ff 00 00 00 00      movq    $0x0, -0xa8(%rbp)
    8b49: 48 8b 85 48 ff ff ff          movq    -0xb8(%rbp), %rax
...
;       memset(pi, 0, sizeof(*pi));
    a9aa: 48 c7 45 a0 00 00 00 00       movq    $0x0, -0x60(%rbp)
    a9b2: 48 c7 45 98 00 00 00 00       movq    $0x0, -0x68(%rbp)
    a9ba: 48 c7 45 90 00 00 00 00       movq    $0x0, -0x70(%rbp)
    a9c2: 48 c7 45 88 00 00 00 00       movq    $0x0, -0x78(%rbp)
    a9ca: 48 c7 45 80 00 00 00 00       movq    $0x0, -0x80(%rbp)
    a9d2: 48 c7 85 78 ff ff ff 00 00 00 00      movq    $0x0, -0x88(%rbp)
    a9dd: 48 c7 85 70 ff ff ff 00 00 00 00      movq    $0x0, -0x90(%rbp)
...
;       memset(ri, 0, sizeof(*ri));
    bc7c: 48 c7 45 b0 00 00 00 00       movq    $0x0, -0x50(%rbp)
    bc84: 48 c7 45 a8 00 00 00 00       movq    $0x0, -0x58(%rbp)
    bc8c: 48 c7 45 a0 00 00 00 00       movq    $0x0, -0x60(%rbp)
    bc94: 48 c7 45 98 00 00 00 00       movq    $0x0, -0x68(%rbp)
    bc9c: 48 c7 45 90 00 00 00 00       movq    $0x0, -0x70(%rbp)
...
;       memset(pi, 0, sizeof(*pi));
    dc56: 48 c7 45 c8 00 00 00 00       movq    $0x0, -0x38(%rbp)
    dc5e: 48 c7 45 c0 00 00 00 00       movq    $0x0, -0x40(%rbp)
    dc66: 48 c7 45 b8 00 00 00 00       movq    $0x0, -0x48(%rbp)
    dc6e: 48 c7 45 b0 00 00 00 00       movq    $0x0, -0x50(%rbp)
    dc76: 48 c7 45 a8 00 00 00 00       movq    $0x0, -0x58(%rbp)
    dc7e: 48 c7 45 a0 00 00 00 00       movq    $0x0, -0x60(%rbp)
    dc86: 48 c7 45 98 00 00 00 00       movq    $0x0, -0x68(%rbp)

Drop *_zero functions entierly

Bravo. Anything we can do to make iflib simpler is a good thing in my book!

This revision is now accepted and ready to land.Thu, Jan 8, 8:58 PM

For additional context, I took a look at the disassembly on amd64 and the memsets are all expanded in place:

Did you look at the assembly from before the change and did it look like the post-memset change contents? I'm curious about what that custom zeroing code looked like and what it achieved.

In D54605#1247657, @erj wrote:

For additional context, I took a look at the disassembly on amd64 and the memsets are all expanded in place:

Did you look at the assembly from before the change and did it look like the post-memset change contents? I'm curious about what that custom zeroing code looked like and what it achieved.

It's bit hard to tease out, but it looks like the compiler might be slightly better at dead store elimination with the open-coded version. In particular rxd_init_zero expands to:

;               ri_pad->rxd_val[i] = 0;
ffffffff80cf98ad: 48 c7 85 58 ff ff ff 00 00 00 00      movq    $0x0, -0xa8(%rbp)
;               ri_pad->rxd_val[i + 3] = 0;
ffffffff80cf98b8: 48 c7 85 70 ff ff ff 00 00 00 00      movq    $0x0, -0x90(%rbp)
;       ri_pad->rxd_val[RXD_INFO_SIZE - 1] = 0;  
ffffffff80cf98c3: 48 c7 85 78 ff ff ff 00 00 00 00      movq    $0x0, -0x88(%rbp)

where the i + 1 and i + 2 statements are eliminated.

If this really matters we should find a more maintainable solution.

This revision was automatically updated to reflect the committed changes.
In D54605#1247657, @erj wrote:

For additional context, I took a look at the disassembly on amd64 and the memsets are all expanded in place:

Did you look at the assembly from before the change and did it look like the post-memset change contents? I'm curious about what that custom zeroing code looked like and what it achieved.

It's bit hard to tease out, but it looks like the compiler might be slightly better at dead store elimination with the open-coded version. In particular rxd_init_zero expands to:

;               ri_pad->rxd_val[i] = 0;
ffffffff80cf98ad: 48 c7 85 58 ff ff ff 00 00 00 00      movq    $0x0, -0xa8(%rbp)
;               ri_pad->rxd_val[i + 3] = 0;
ffffffff80cf98b8: 48 c7 85 70 ff ff ff 00 00 00 00      movq    $0x0, -0x90(%rbp)
;       ri_pad->rxd_val[RXD_INFO_SIZE - 1] = 0;  
ffffffff80cf98c3: 48 c7 85 78 ff ff ff 00 00 00 00      movq    $0x0, -0x88(%rbp)

where the i + 1 and i + 2 statements are eliminated.

If this really matters we should find a more maintainable solution.

I'm assuming it probably doesn't matter, at least until someone profiles it. Thanks for looking at it!

In D54605#1248134, @erj wrote:

If this really matters we should find a more maintainable solution.

I'm assuming it probably doesn't matter, at least until someone profiles it. Thanks for looking at it!

The compiler could probably be fixed to do the optimization without too much pain if the right person looked at it. We probably suffer a bit from the flags used by the kernel in that LLVM head seems to use FPU instructions for zeroing by default which reduces pressure to fix this kind of thing.