Page MenuHomeFreeBSD

NTB Perf
Needs ReviewPublic

Authored by shreyankamartya229_gmail.com on Jun 19 2019, 11:32 AM.
Tags
None
Referenced Files
Unknown Object (File)
Apr 10 2024, 8:15 PM
Unknown Object (File)
Jan 19 2024, 5:40 PM
Unknown Object (File)
Dec 20 2023, 3:16 AM
Unknown Object (File)
Nov 9 2023, 3:30 PM
Unknown Object (File)
Nov 9 2023, 9:12 AM
Unknown Object (File)
Oct 31 2023, 3:04 AM
Unknown Object (File)
Oct 29 2023, 3:56 AM
Unknown Object (File)
Oct 8 2023, 8:01 AM

Details

Reviewers
emaste
cem
Summary

Diff between linux ntb_perf tool and the freebsd ported version. Created for @emaste.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

./sys/dev/ntb/test/ntb_perf.c
291–295

So this is a bit orthogonal to this change, but:

Do we ever want to allow the compiler to reorder NTB SPAD or DB writes?

We don't want the CPU to reorder some things but don't care about others; but that isn't a concern because the MMIO memory for SPAD and DB is mapped UC anyway, right?

What I'm trying to ask is, would it be reasonable to just add barrier() (compiler barrier) annotations directly to ntb_*_{spad,db}_{read,write}() directly instead?

./sys/dev/ntb/test/ntb_perf.c
291–295

Okay, so marking a region as UC would also prevent compiler reordering? Or, does the memory barrier prevent the CPU from reordering instructions as well? I ask because I'm not very clear on the effects of changing the memory attributes of a region. In ntb_perf, when I changed the memory window region attribute to WC, there was a huge jump in performance (30x increase). I was not able to determine why that happened.

Yes, adding the barrier to spad read/write commands will add more fences, so to speak and I don't see a problem with that.

./sys/dev/ntb/test/ntb_perf.c
291–295

No, compiler isn't aware of UC memory; compiler reordering is still a concern.

Both mmiowb() (Linux) and barrier() (FreeBSD) are just compiler barriers. The other barrier primitives like wmb(), mb(), etc, are both compiler and CPU barriers.

UC mode ("uncacheable") is really slow. All accesses are synchronous and cannot be combined or buffered at all. In contrast, all other modes are more relaxed and faster. WC is less synchronized than UC. I don't recall exact semantics but at least adjacent writes are allowed to be combined by the CPU before being sent out on the PCIe bus to hardware (and they can be asynchronous w.r.t. the instruction stream).

(The barrier() primitive does not add any actual {m,l,s}fence instructions to generated code.)

./sys/dev/ntb/test/ntb_perf.c
291–295

Understood. Adding fence instruction would have prevented CPU reordering which is already being done by UC.

Will make this change in the next review along with @emaste 's suggestions. Any other changes you envisage?

./sys/dev/ntb/test/ntb_perf.c
291–295

That's the only thing that came to mind. I haven't looked at the patch thoroughly.

320

ntb_spad_read returns int, not uint32_t. And there's not much point storing ret if you aren't going to check it. I'd probably just assert it. The only real error condition is EINVAL due to a bad index value, so I think you're ok just KASSERT()ing ret == 0.