
Optimize zpcpu macros a little bit.
Closed, Public

Authored by mjg on Feb 7 2020, 3:23 PM.

Details

Summary

This eliminates the following:

  • the multiplication in zpcpu_get; zpcpu_get is used all the time, for example by malloc
  • the subtraction in counter_u64_add on amd64

Changes are simple:

  1. instead of recomputing the offset for the current cpu every time, store it (see the sketch below)
  2. instead of subtracting __pcpu every time in the counter code, store the already-subtracted pointer. There is only one place that both allocates and frees these, and it can convert as necessary.
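
In pcpu terms change 1 amounts to roughly the following (a sketch of the idea rather than the literal diff; the zpcpu_offset field name is illustrative):

/* Before: recompute the offset from the cpu id on every access. */
#define zpcpu_get(base) \
    ((__typeof(base))((char *)(base) + UMA_PCPU_ALLOC_SIZE * curcpu))

/* After: the per-cpu area caches the already-computed offset. */
#define zpcpu_get(base) \
    ((__typeof(base))((char *)(base) + PCPU_GET(zpcpu_offset)))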

The add routine should be removed from counter and instead reimplemented as zpcpu_add; similarly someone should add zpcpu_set (a possible shape is sketched below). Perhaps I'll do it later.
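
A possible shape for those helpers, assuming the offset-caching zpcpu_get sketched above (illustration only, not part of this diff):

/* Sketch: modify the current cpu's slot of a zpcpu allocation. */
#define zpcpu_add(base, n) do {              \
    __typeof(base) _p = zpcpu_get(base);     \
    *_p += (n);                              \
} while (0)

#define zpcpu_set(base, n) do {              \
    __typeof(base) _p = zpcpu_get(base);     \
    *_p = (n);                               \
} while (0)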

I'm not fond of the names used for uma, but I think the concept is sound.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mjg created this revision.Feb 7 2020, 3:23 PM
mjg updated this revision to Diff 67944.Feb 7 2020, 4:27 PM
mjg retitled this revision from "Store offset into zpcpu allocations in the per-cpu area." to "Optimize zpcpu macros a little bit.".
mjg edited the summary of this revision. (Show Details)
mjg added reviewers: glebius, markj.
kib added a comment.Feb 7 2020, 6:03 PM

According to Agner Fog's tables, a 64-bit mul takes 3-5 cycles and has a latency of 2 cycles.

Can you explain why a memory access is preferred over the mul instruction?

mjg added a comment (edited). Feb 7 2020, 6:30 PM

This does not add a memory reference. The current code already references memory in order to load the cpu id through %gs (needed for the multiplication). With the patch, instead of loading said id we load the offset; the multiplication of course disappears.

In particular this:

*(int *)zpcpu_get(mp->mnt_thread_in_ops_pcpu) = 1;

Assembly-wise this converts from:

mov    0xac8(%rdi),%rax
mov    %gs:0x3c,%ecx
shl    $0xc,%ecx
movl   $0x1,(%rax,%rcx,1)

to:

mov    0xac8(%rdi),%rax
mov    %gs:0xb0,%rcx
movl   $0x1,(%rax,%rcx,1)

Also note that saving the offset prevents pessimizing zpcpu_get users in the face of the pointer having __pcpu subtracted from it. With the offset precalculated the change is transparent; otherwise we would have to keep adding it back (trading the subtraction in the counter code for an addition here).

kib added a comment.Feb 7 2020, 6:56 PM

Then this is even stranger. All you are shaving off is one shift left by a constant in the direct dependency chain. Why bother with that?

mjg added a comment (edited). Feb 7 2020, 7:16 PM

That alone would be almost cosmetic. As noted, with the pointer stored with __pcpu already subtracted there is no need to subtract in the counter code. However, unless the offset is cached, zpcpu_get would have to keep adding it back instead. With the change there is no need to do anything.

Toy function which simply bumps a counter goes from:

push   %rbp
mov    %rsp,%rbp
mov    0x4f00ed(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
sub    0x808ff6(%rip),%rax        # 0xffffffff80f1a698 <__pcpu>   
addq   $0x1,%gs:(%rax)
pop    %rbp
retq

to

push   %rbp
mov    %rsp,%rbp
mov    0x4f02fd(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
addq   $0x1,%gs:(%rax)
pop    %rbp
retq

meaning this eliminates both a memory reference and a subtraction.

mjg added a comment.Feb 10 2020, 6:49 PM

Ping? I don't think there is much to discuss here. The change shortens both the counter and zpcpu code without adverse effects.

kib added a comment.Feb 10 2020, 7:18 PM
In D23570#518088, @mjg wrote:

Ping? I don't think there is much to discuss here. The change shortens both the counter and zpcpu code without adverse effects.

I do not like the fact that struct pcpu carries easily deducible data just to save a single one-cycle instruction. If anything, it adds cognitive load and causes more severe effects under memory corruption.

mjg added a comment (edited). Feb 10 2020, 7:32 PM

The patch removes a sub instruction with a memory operand from a very commonly used macro (counter_u64_add). According to Agner Fog's tables that costs anything from 5 cycles upwards (and that's assuming __pcpu is cached). Storing the offset prevents shifting the burden to zpcpu and happens to have the side effect of also removing the shl from that codepath. Note the zpcpu machinery is used by the per-cpu write suspension support.
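
For concreteness, the amd64 counter_u64_add() in question has roughly this shape (assertions omitted; a sketch, not the literal source):

/* Before: the stored pointer is a real address; subtract __pcpu on every add. */
static inline void
counter_u64_add(counter_u64_t c, int64_t inc)
{

    __asm __volatile("addq\t%1,%%gs:(%0)"
        :
        : "r" ((char *)c - (char *)__pcpu), "ri" (inc)
        : "memory", "cc");
}

/* After: the pointer already holds the offset form and is used directly. */
static inline void
counter_u64_add(counter_u64_t c, int64_t inc)
{

    __asm __volatile("addq\t%1,%%gs:(%0)"
        :
        : "r" (c), "ri" (inc)
        : "memory", "cc");
}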

jeff added a subscriber: jeff.Feb 10 2020, 7:56 PM

In general I also don't think it's valuable to fuss around a lot to save an instruction; however, in this case I agree with mjg. If we can eliminate a memory reference and an indirection, that has value.

We would need to audit zpcpu consumers because they may sometimes assume the pointer points directly at the first cpu's data and not use the accessor. Originally smr did this for read-only data but I think I changed it.

sys/amd64/include/pcpu.h
243–246 ↗(On Diff #67944)

Are there any archs that do this differently? Is there a reason not to make the shift everywhere?

sys/vm/uma_core.c
2961 ↗(On Diff #67944)

It is cleaner to do this part in uz_import/uz_export. This way it's done once when they're placed in the bucket, not for every caller.

mjg added inline comments.Feb 10 2020, 8:08 PM
sys/amd64/include/pcpu.h
243–246 ↗(On Diff #67944)

Looks like everyone but amd64 uses the real address here. Whether some of the other archs can be patched to do things differently I don't know.

sys/vm/uma_core.c
2961 ↗(On Diff #67944)

I think that would be highly iffy. Note that right now the per-cpu stuff just wraps the standard uma_zalloc. To kib's debug-ability point, I think it's much saner to keep "real" addresses within the confines of uma.
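
Schematically, the conversion then lives only in the pcpu wrappers (a sketch with zeroing and assertions elided; helper names as in the final revision of this diff):

void *
uma_zalloc_pcpu_arg(uma_zone_t zone, void *udata, int flags)
{
    void *item;

    /* uma proper keeps seeing real addresses... */
    item = uma_zalloc_arg(zone, udata, flags);
    if (item == NULL)
        return (NULL);
    /* ...only the boundary hands out the offset form. */
    return (zpcpu_base_to_offset(item));
}

void
uma_zfree_pcpu_arg(uma_zone_t zone, void *pcpu_item, void *udata)
{

    /* Convert back before handing the address to uma proper. */
    uma_zfree_arg(zone,
        pcpu_item == NULL ? NULL : zpcpu_offset_to_base(pcpu_item), udata);
}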

jeff added inline comments.Feb 10 2020, 8:15 PM
sys/vm/uma_core.c
2961 ↗(On Diff #67944)

UMA's zone layer just deals with opaque pointers. It doesn't touch the memory. It would already be broken if it touched pcpu memory.

mjg added a comment (edited). Feb 10 2020, 9:01 PM

We would need to audit zpcpu consumers because they may sometimes assume the pointer points directly at the first cpu's data and not use the accessor. Originally smr did this for read-only data but I think I changed it.

So I just grepped for zpcpu_get(. There are very few consumers and all of them want curcpu.

As for uma_zalloc_pcpu, the only suspicious consumer is ipsec which seems to not use the pointer for anything.

jeff added a comment.Feb 10 2020, 9:03 PM

My point is different.

Some consumers may assume the pointer is a normal pointer, look at the first data element, and omit zpcpu_get(). This is really a bug, but it will work today and break after the change.
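
The problematic pattern would look something like this (hypothetical consumer code for illustration):

uint64_t *p, val;

p = uma_zalloc_pcpu(zone, M_WAITOK | M_ZERO);
/* Buggy but working today: treats the returned pointer as cpu 0's slot. */
val = *p;
/* Correct, and still working after the change: */
val = *(uint64_t *)zpcpu_get_cpu(p, 0);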

mjg added a comment (edited). Feb 10 2020, 9:05 PM

Looks like I managed to edit before you responded again. It would be good to have it typed somehow to prevent such misuse at compilation time, but that would get in the way of current type checking.
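
One conceivable direction (purely hypothetical, not part of this diff) would be to wrap the offset form in a one-member struct so a bare dereference fails to compile and only the accessor works:

/* Hypothetical opaque handle; the field is only touched by the accessor. */
struct zpcpu_handle {
    void *zh_offset;
};

#define zpcpu_get(h) \
    ((void *)((char *)(h).zh_offset + PCPU_GET(zpcpu_offset)))

The accessor could then no longer return a correctly typed pointer without extra plumbing, which is presumably the type-checking conflict mentioned above.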

mjg added a comment.Feb 12 2020, 2:47 AM

So ipsec has:

sav->lft_c = uma_zalloc_pcpu(V_key_lft_zone, M_NOWAIT);

counter_u64_t lft_c;            /* CURRENT lifetime */

#define lft_c_allocations lft_c
#define lft_c_bytes lft_c + 1

Which are then always used as counter(9) counters.
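
For reference, such a counter is then bumped via counter(9) in the usual way, e.g. (illustrative call, not a literal ipsec line):

counter_u64_add(sav->lft_c_bytes, m->m_pkthdr.len);
counter_u64_add(sav->lft_c_allocations, 1);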

mjg updated this revision to Diff 68159.Feb 12 2020, 3:15 AM
  • sed 's/real_to_pcpu/base_to_offset/; s/pcpu_to_real/offset_to_base/'
jeff accepted this revision.Feb 12 2020, 3:16 AM
This revision is now accepted and ready to land.Feb 12 2020, 3:16 AM