Optimize zpcpu macros a little bit.
Closed, Public

Authored by mjg on Feb 7 2020, 3:23 PM.

Details

Summary

This eliminates the following:

  • the multiplication from zpcpu_get. zpcpu_get is used all the time, for example by malloc
  • the subtraction from counter_u64_add on amd64

Changes are simple:

  1. Instead of recomputing the offset for the current cpu every time, store it.
  2. Instead of subtracting __pcpu every time in the counter code, store the already-subtracted pointer. There is only one place that both allocates and frees them, and that place can convert as necessary.

The add routine should be removed from counter and instead reimplemented as zpcpu_add. Similarly, someone should add zpcpu_set. Perhaps I'll do it later.

I'm not fond of the names used for uma, but I think the concept is sound.
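
To make the shape of change (1) concrete, here is a minimal sketch of the macro before and after; the names zpcpu_get_old, zpcpu_get_new and the pcpu field zpcpu_offset are illustrative placeholders, not necessarily what the patch itself uses:

/* Sketch only -- illustrative names, not the patch itself. */

/* Before: recompute the offset from the cpu id on every access. */
#define zpcpu_get_old(base)						\
	((__typeof(base))((char *)(base) + UMA_PCPU_ALLOC_SIZE * curcpu))

/* After: the offset is computed once and stored in struct pcpu, so the
 * access becomes a single %gs-relative load with no multiplication. */
#define zpcpu_get_new(base)						\
	((__typeof(base))((char *)(base) + PCPU_GET(zpcpu_offset)))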

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 29232

Event Timeline

mjg retitled this revision from "Store offset into zpcpu allocations in the per-cpu area" to "Optimize zpcpu macros a little bit".
mjg edited the summary of this revision.
mjg added reviewers: glebius, markj.

According to Agner Fog's tables, a 64-bit mul takes 3-5 cycles and has a latency of 2 cycles.

Can you explain why a memory access is preferred over the mul instruction?

This does not add a memory reference. The current code already references memory in order to load the cpu id through %gs (needed for the multiplication). With the patch we load the offset instead of said id; the multiplication of course disappears.

In particular this:

*(int *)zpcpu_get(mp->mnt_thread_in_ops_pcpu) = 1;

Assembly-wise, it converts from:

mov    0xac8(%rdi),%rax
mov    %gs:0x3c,%ecx
shl    $0xc,%ecx
movl   $0x1,(%rax,%rcx,1)

to:

mov    0xac8(%rdi),%rax
mov    %gs:0xb0,%rcx
movl   $0x1,(%rax,%rcx,1)

Also note that saving the offset prevents pessimizing zpcpu_get users in the face of the pointer having __pcpu subtracted from it. With the offset precalculated the change is transparent; otherwise we would have to keep adding __pcpu back (trading the subtraction in the counter code for an addition here).

Then this is even stranger. All you are shaving off is one shift left by a constant in the direct dependency chain. Why bother with that?

That alone would be almost cosmetic. I noted that by storing the pointer with __pcpu already subtracted from it there is no need to subtract in the counter code. However, unless the offset is cached, we would have to keep adding it back in zpcpu_get instead. With the change there is no need to do anything.

A toy function which simply bumps a counter goes from:

push   %rbp
mov    %rsp,%rbp
mov    0x4f00ed(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
sub    0x808ff6(%rip),%rax        # 0xffffffff80f1a698 <__pcpu>   
addq   $0x1,%gs:(%rax)
pop    %rbp
retq

to

push   %rbp
mov    %rsp,%rbp
mov    0x4f02fd(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
addq   $0x1,%gs:(%rax)
pop    %rbp
retq

meaning this eliminates both a memory reference and a subtraction.
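
At the C level the difference looks roughly like the following sketch of the amd64 add path (simplified, assertions omitted; illustrative rather than the committed code):

/* Sketch only: simplified amd64 counter add, assertions omitted. */

/* Before: the counter holds a "real" address, so __pcpu has to be
 * subtracted on every add to form the %gs-relative offset. */
static inline void
counter_u64_add_old(counter_u64_t c, int64_t inc)
{
	__asm __volatile("addq\t%1,%%gs:(%0)"
	    : : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc)
	    : "memory", "cc");
}

/* After: the counter already stores the pointer with __pcpu subtracted
 * (converted once at alloc/free time), so the add needs no extra work. */
static inline void
counter_u64_add_new(counter_u64_t c, int64_t inc)
{
	__asm __volatile("addq\t%1,%%gs:(%0)"
	    : : "r" (c), "ri" (inc)
	    : "memory", "cc");
}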

Ping? I don't think there is much to discuss here. The change shortens both the counter and zpcpu code without adverse effects.

In D23570#518088, @mjg wrote:

Ping? I don't think there is much to discuss here. The change shortens both the counter and zpcpu code without adverse effects.

I do not like the fact that struct pcpu carries easily deducible data just to save a single one-cycle instruction. If anything, it adds cognitive load and causes more severe effects under memory corruption.

The patch removes a sub instruction with a memory operand from a very commonly used macro (counter_u64_add). According to Agner Fog's tables this is anything from 5 cycles upwards (and that's assuming __pcpu is cached). Storing the offset prevents shifting the burden to zpcpu and happens to have the side effect of also removing the shl from that code path. Note the zpcpu machinery is used by the per-cpu write suspension support.

In general I also don't think it's valuable to fuss around a lot to save an instruction; however, in this case I agree with mjg. If we can eliminate a memory reference and an indirection, that has value.

We would need to audit zpcpu consumers because they may sometimes assume the pointer goes directly to the first cpu and not use the accessor. Originally smr did this for read only data but I think I changed it.

sys/amd64/include/pcpu.h
243–246

Are there any archs that do this differently? Is there a reason not to make the shift everywhere?

sys/vm/uma_core.c
2961

It is cleaner to do this part in uz_import/uz_export. This way it's done once when they're placed in the bucket, not for every caller.

sys/amd64/include/pcpu.h
243–246

Looks like everyone but amd64 uses the real address here. Whether some of the other archs can be patched to do things differently I don't know.

sys/vm/uma_core.c
2961

I think that would be highly iffy. Note that right now the per-cpu stuff just wraps the standard uma_zalloc. To kib's debuggability point, I think it's much saner to keep "real" addresses within the confines of uma.

sys/vm/uma_core.c
2961

UMA's zone layer just deals with opaque pointers. It doesn't touch the memory. It would already be broken if it touched pcpu memory.
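
Concretely, the conversion can stay confined to the thin pcpu wrappers, roughly as in the sketch below (simplified, ignoring zeroing and the _arg variants; the helper names follow the base_to_offset/offset_to_base naming suggested later in this review):

/* Sketch only: conversion done once at the wrapper boundary, so UMA's
 * zone layer keeps seeing "real" addresses. */
void *
uma_zalloc_pcpu_sketch(uma_zone_t zone, int flags)
{
	void *p;

	p = uma_zalloc(zone, flags);		/* real address from UMA */
	if (p != NULL)
		p = zpcpu_base_to_offset(p);	/* consumer-visible form */
	return (p);
}

void
uma_zfree_pcpu_sketch(uma_zone_t zone, void *p)
{
	if (p != NULL)
		p = zpcpu_offset_to_base(p);	/* back to a real address */
	uma_zfree(zone, p);
}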

We would need to audit zpcpu consumers because they may sometimes assume the pointer goes directly to the first cpu and not use the accessor. Originally smr did this for read only data but I think I changed it.

So I just grepped for zpcpu_get(. There are very few consumers and all of them want curcpu.

As for uma_zalloc_pcpu, the only suspicious consumer is ipsec, which does not seem to use the pointer for anything.

My point is different.

Some consumers may assume the pointer is a normal pointer and look at the first data element and omit zpcpu_get(). This is really a bug but it will work today and break after the change.

Looks like I managed to edit before you responded again. Would be good to have it typed somehow to prevent such misuse at compilation time, but that would get in the way of current type checking.
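
Purely as a hypothetical illustration of the "typed somehow" idea (not part of this change), a distinct wrapper struct would make a raw dereference fail to compile and force consumers through the accessor:

/* Hypothetical only: wrap the base pointer so it cannot be
 * dereferenced directly; access must go through the macro. */
struct zpcpu_ref {
	void	*zr_base;	/* offset-form pointer, opaque to consumers */
};

#define ZPCPU_GET(ref, type)	((type *)zpcpu_get((ref).zr_base))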

So ipsec has:

sav->lft_c = uma_zalloc_pcpu(V_key_lft_zone, M_NOWAIT);

counter_u64_t lft_c;            /* CURRENT lifetime */

#define lft_c_allocations lft_c
#define lft_c_bytes lft_c + 1

which are then always used as counter(9).
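
That is, the consumers only ever touch them through the counter(9) API, along these lines (illustrative, not exact ipsec code):

/* Illustrative only: the lifetime counters are bumped via counter(9),
 * never dereferenced as plain pointers. */
counter_u64_add(sav->lft_c_allocations, 1);
counter_u64_add(sav->lft_c_bytes, m->m_pkthdr.len);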

  • sed 's/real_to_pcpu/base_to_offset/; s/pcpu_to_real/offset_to_base/'
This revision is now accepted and ready to land. Feb 12 2020, 3:16 AM