
Optimize zpcpu macros a little bit.
Closed, Public

Authored by mjg on Feb 7 2020, 3:23 PM.

Details

Summary

This eliminates the following:

  • the multiplication in zpcpu_get; zpcpu_get is used all the time, for example by malloc
  • the subtraction in counter_u64_add on amd64

Changes are simple:

  1. instead of recomputing the offset for the current cpu every time, store it (see the sketch below)
  2. instead of subtracting __pcpu every time in the counter code, store the already-subtracted pointer. There is only one place that both allocates and frees these, and it can convert as necessary.
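
In pcpu terms change 1 amounts to roughly the following (a sketch of the idea rather than the literal diff; the zpcpu_offset field name is illustrative):

/* Before: recompute the offset from the cpu id on every access. */
#define zpcpu_get(base) \
    ((__typeof(base))((char *)(base) + UMA_PCPU_ALLOC_SIZE * curcpu))

/* After: the per-cpu area caches the already-computed offset. */
#define zpcpu_get(base) \
    ((__typeof(base))((char *)(base) + PCPU_GET(zpcpu_offset)))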

The add routine should be removed from counter and instead reimplemented as zpcpu_add; similarly someone should add zpcpu_set (a possible shape is sketched below). Perhaps I'll do it later.
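
A possible shape for those helpers, assuming the offset-caching zpcpu_get sketched above (illustration only, not part of this diff):

/* Sketch: modify the current cpu's slot of a zpcpu allocation. */
#define zpcpu_add(base, n) do {              \
    __typeof(base) _p = zpcpu_get(base);     \
    *_p += (n);                              \
} while (0)

#define zpcpu_set(base, n) do {              \
    __typeof(base) _p = zpcpu_get(base);     \
    *_p = (n);                               \
} while (0)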

I'm not fond of the names used for uma, but I think the concept is sound.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mjg created this revision.Feb 7 2020, 3:23 PM
mjg updated this revision to Diff 67944.Feb 7 2020, 4:27 PM
mjg retitled this revision from "Store offset into zpcpu allocations in the per-cpu area." to "Optimize zpcpu macros a little bit.".
mjg edited the summary of this revision. (Show Details)
mjg added reviewers: glebius, markj.
kib added a comment.Feb 7 2020, 6:03 PM

According to Agner Fog's tables, a 64-bit mul takes 3-5 cycles and has a latency of 2 cycles.

Can you explain why a memory access is preferred over the mul instruction?

mjg added a comment (edited). Feb 7 2020, 6:30 PM

This does not add a memory reference. The current code already references memory in order to load the cpu id through %gs (needed for the multiplication). With the patch, instead of loading said id we load the offset; the multiplication of course disappears.

In particular this:

*(int *)zpcpu_get(mp->mnt_thread_in_ops_pcpu) = 1;

Assembly-wise this converts from:

mov    0xac8(%rdi),%rax
mov    %gs:0x3c,%ecx
shl    $0xc,%ecx
movl   $0x1,(%rax,%rcx,1)

to:

mov    0xac8(%rdi),%rax
mov    %gs:0xb0,%rcx
movl   $0x1,(%rax,%rcx,1)

Also note that saving the offset prevents pessimizing zpcpu_get users in the face of the pointer having __pcpu subtracted from it. With the offset precalculated the change is transparent; otherwise we would have to keep adding it back (trading the subtraction in the counter code for an addition here).

kib added a comment.Feb 7 2020, 6:56 PM

Then this is even stranger. All you are shaving off is one shift left by a constant in the direct dependency chain. Why bother with that?

mjg added a comment (edited). Feb 7 2020, 7:16 PM

That alone would be almost cosmetic. As noted, with the pointer stored with __pcpu already subtracted there is no need to subtract in the counter code. However, unless the offset is cached, zpcpu_get would have to keep adding it back instead. With the change there is no need to do anything.

Toy function which simply bumps a counter goes from:

push   %rbp
mov    %rsp,%rbp
mov    0x4f00ed(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
sub    0x808ff6(%rip),%rax        # 0xffffffff80f1a698 <__pcpu>   
addq   $0x1,%gs:(%rax)
pop    %rbp
retq

to

push   %rbp
mov    %rsp,%rbp
mov    0x4f02fd(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
addq   $0x1,%gs:(%rax)
pop    %rbp
retq

meaning this eliminates both a memory reference and a subtraction.

mjg added a comment.Feb 10 2020, 6:49 PM

Ping? I don't think there is much to discuss here. The change shortens both the counter and zpcpu code without adverse effects.

kib added a comment.Feb 10 2020, 7:18 PM
In D23570#518088, @mjg wrote:

Ping? I don't think there is much to discuss here. The change shortens both the counter and zpcpu code without adverse effects.

I do not like the fact that struct pcpu carries easily deducible data just to save a single one-cycle instruction. If anything, it adds cognitive load and causes more severe effects under memory corruption.

mjg added a comment (edited). Feb 10 2020, 7:32 PM

The patch removes a sub instruction with a memory operand from a very commonly used macro (counter_u64_add). According to Agner Fog's tables that costs anything from 5 cycles upwards (and that's assuming __pcpu is cached). Storing the offset prevents shifting the burden to zpcpu and happens to have the side effect of also removing the shl from that codepath. Note the zpcpu machinery is used by the per-cpu write suspension support.
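
For concreteness, the amd64 counter_u64_add() in question has roughly this shape (assertions omitted; a sketch, not the literal source):

/* Before: the stored pointer is a real address; subtract __pcpu on every add. */
static inline void
counter_u64_add(counter_u64_t c, int64_t inc)
{

    __asm __volatile("addq\t%1,%%gs:(%0)"
        :
        : "r" ((char *)c - (char *)__pcpu), "ri" (inc)
        : "memory", "cc");
}

/* After: the pointer already holds the offset form and is used directly. */
static inline void
counter_u64_add(counter_u64_t c, int64_t inc)
{

    __asm __volatile("addq\t%1,%%gs:(%0)"
        :
        : "r" (c), "ri" (inc)
        : "memory", "cc");
}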

jeff added a subscriber: jeff.Feb 10 2020, 7:56 PM

In general I also don't think it's valuable to fuss around a lot to save an instruction; however, in this case I agree with mjg. If we can eliminate a memory reference and an indirection, that has value.

We would need to audit zpcpu consumers because they may sometimes assume the pointer points directly at the first cpu's data and not use the accessor. Originally smr did this for read-only data but I think I changed it.

sys/amd64/include/pcpu.h
243–246 ↗(On Diff #67944)

Are there any archs that do this differently? Is there a reason not to make the shift everywhere?

sys/vm/uma_core.c
2961 ↗(On Diff #67944)

It is cleaner to do this part in uz_import/uz_export. This way it's done once when they're placed in the bucket, not for every caller.

mjg added inline comments.Feb 10 2020, 8:08 PM
sys/amd64/include/pcpu.h
243–246 ↗(On Diff #67944)

Looks like everyone but amd64 uses the real address here. Whether some of the other archs can be patched to do things differently I don't know.

sys/vm/uma_core.c
2961 ↗(On Diff #67944)

I think that would be highly iffy. Note that right now the per-cpu stuff just wraps the standard uma_zalloc. To kib's debug-ability point, I think it's much saner to keep "real" addresses within the confines of uma.
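
Schematically, the conversion then lives only in the pcpu wrappers (a sketch with zeroing and assertions elided; helper names as in the final revision of this diff):

void *
uma_zalloc_pcpu_arg(uma_zone_t zone, void *udata, int flags)
{
    void *item;

    /* uma proper keeps seeing real addresses... */
    item = uma_zalloc_arg(zone, udata, flags);
    if (item == NULL)
        return (NULL);
    /* ...only the boundary hands out the offset form. */
    return (zpcpu_base_to_offset(item));
}

void
uma_zfree_pcpu_arg(uma_zone_t zone, void *pcpu_item, void *udata)
{

    /* Convert back before handing the address to uma proper. */
    uma_zfree_arg(zone,
        pcpu_item == NULL ? NULL : zpcpu_offset_to_base(pcpu_item), udata);
}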

jeff added inline comments.Feb 10 2020, 8:15 PM
sys/vm/uma_core.c
2961 ↗(On Diff #67944)

UMA's zone layer just deals with opaque pointers. It doesn't touch the memory. It would already be broken if it touched pcpu memory.

mjg added a comment (edited). Feb 10 2020, 9:01 PM

We would need to audit zpcpu consumers because they may sometimes assume the pointer points directly at the first cpu's data and not use the accessor. Originally smr did this for read-only data but I think I changed it.

So I just grepped for zpcpu_get(. There are very few consumers and all of them want curcpu.

As for uma_zalloc_pcpu, the only suspicious consumer is ipsec which seems to not use the pointer for anything.

jeff added a comment.Feb 10 2020, 9:03 PM

My point is different.

Some consumers may assume the pointer is a normal pointer, look at the first data element, and omit zpcpu_get(). This is really a bug, but it will work today and break after the change.
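
The problematic pattern would look something like this (hypothetical consumer code for illustration):

uint64_t *p, val;

p = uma_zalloc_pcpu(zone, M_WAITOK | M_ZERO);
/* Buggy but working today: treats the returned pointer as cpu 0's slot. */
val = *p;
/* Correct, and still working after the change: */
val = *(uint64_t *)zpcpu_get_cpu(p, 0);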

mjg added a comment (edited). Feb 10 2020, 9:05 PM

Looks like I managed to edit before you responded again. It would be good to have it typed somehow to prevent such misuse at compilation time, but that would get in the way of current type checking.
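
One conceivable direction (purely hypothetical, not part of this diff) would be to wrap the offset form in a one-member struct so a bare dereference fails to compile and only the accessor works:

/* Hypothetical opaque handle; the field is only touched by the accessor. */
struct zpcpu_handle {
    void *zh_offset;
};

#define zpcpu_get(h) \
    ((void *)((char *)(h).zh_offset + PCPU_GET(zpcpu_offset)))

The accessor could then no longer return a correctly typed pointer without extra plumbing, which is presumably the type-checking conflict mentioned above.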

mjg added a comment.Feb 12 2020, 2:47 AM

So ipsec has:

sav->lft_c = uma_zalloc_pcpu(V_key_lft_zone, M_NOWAIT);

counter_u64_t lft_c;            /* CURRENT lifetime */

#define lft_c_allocations lft_c
#define lft_c_bytes lft_c + 1

Which are then always used as counter(9) counters.
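
For reference, such a counter is then bumped via counter(9) in the usual way, e.g. (illustrative call, not a literal ipsec line):

counter_u64_add(sav->lft_c_bytes, m->m_pkthdr.len);
counter_u64_add(sav->lft_c_allocations, 1);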

mjg updated this revision to Diff 68159.Feb 12 2020, 3:15 AM
  • sed 's/real_to_pcpu/base_to_offset/; s/pcpu_to_real/offset_to_base/'
jeff accepted this revision.Feb 12 2020, 3:16 AM
This revision is now accepted and ready to land.Feb 12 2020, 3:16 AM