Page MenuHomeFreeBSD

ARM, ARM64: Workaround for buf_ring reordering
AbandonedPublic

Authored by wma on Jun 27 2016, 1:11 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Mar 13, 4:07 PM
Unknown Object (File)
Feb 21 2024, 3:13 AM
Unknown Object (File)
Jan 27 2024, 9:19 AM
Unknown Object (File)
Jan 23 2024, 6:38 PM
Unknown Object (File)
Dec 20 2023, 12:17 AM
Unknown Object (File)
Dec 5 2023, 11:55 AM
Unknown Object (File)
Dec 3 2023, 10:07 AM
Unknown Object (File)
Nov 13 2023, 1:33 PM
Subscribers
None

Details

Summary

Because of both previous proposed changes were not integrated, I'd like to submit a workaround. Without this patch, network drivers on ARM and ARM64 are most likely unusable if using drbr_* functions.
I want to put it to 11.0 in the following shape (it's well tested and confirmed to be working) and then wait for redefining buf_ring internals in a more generic manner.

Here are mentioned reviews
https://reviews.freebsd.org/D1833
https://reviews.freebsd.org/D1945

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

wma retitled this revision from to ARM, ARM64: Workaround for buf_ring reordering.
wma updated this object.
wma edited the test plan for this revision. (Show Details)
wma added reviewers: kib, emaste, imp, kmacy.
wma set the repository for this revision to rS FreeBSD src repository - subversion.
kib edited edge metadata.

Alan once updated buf_ring with the proper barrier semantic. I believe that the revision was not committed because it was not tested.

It might make sense to ressurect that work and check whether it helps arm64 case.

A member of the Semihalf team tested my patch on arm and reported that it resolved their problems. However, around the same time someone else added a followup to D1945 pointing out a couple other problems that aren't addressed by my patch that can occur under PowerPC's memory ordering model. Take another look at D1945.

Kostik, I've emailed you the patch that was tested by Semihalf.

As I recall, code from https://reviews.freebsd.org/D1945 helps in arm/arm64 case. However, I'm not sure it would be a safe approach to commit this in 11.0. It changes the bufring behaviour significantly for all known architectures, which should be thoroughly tested first.

My main goal is to have buf_ring working in arm64, because we just found that ThunderX is also prone to this kind of reordering (very less likely though, so this is wy we found it so late).

I propose to commit this workaround for 11.0 and remove it very early in 12.0-CURRENT as soon as D1945 is well-tested. What do you think?

In D6986#146386, @wma wrote:

I propose to commit this workaround for 11.0 and remove it very early in 12.0-CURRENT as soon as D1945 is well-tested. What do you think?

IMO this is a reasonable plan.

In D6986#146386, @wma wrote:

I propose to commit this workaround for 11.0 and remove it very early in 12.0-CURRENT as soon as D1945 is well-tested. What do you think?

I think that you misunderstand the status of D1945. Lack of testing is not the issue. First, D1945 does not use acquires and releases correctly. See my comments in D1945. Second, another problem in buf_ring.h on a particular weak memory ordering model was pointed out last September. This newer problem is not addressed by D1945 (or my patch to correct the use of acquires and releases).

I think that you misunderstand the status of D1945.

To be clear, in my comment above I mean I think it's a reasonable plan to proceed with this D6986 change for 11, and replace it with something else (whether that's D1945 or not) later on after stable/11 branches.

Thanks for comments. It seems I was not aware of the present status of D1945.

In that case please let me know if you have anything against committing this patch. If not, I will prepare a re@ request tomorrow.

It's clear to me why the load of br->br_prod_tail should be an acquire. It establishes a synchronizes-with relationship with the release on the same location in buf_ring_enqueue(), and this relationship directly addresses the problem described in the comment. However, I don't understand why the earlier (in program order) load of br->br_cons_head is also being made an acquire. Can you explain why?

I believe this is an effect of D1833 discussions. I wanted to place memory barrier after both reads on armv7 (where load_acq is emulated as atomic_read and them dmb() ). The change to prod_tail is to match store_rel used in other places in the code. Agreed, on armv8 it makes little sense because we have an appropriate memory model, but this change was tested in Semihalf for over a year and confirmed is working fine in the current shape.
Of course I can remove the change to cons_head, but therefore I coudln't guarantee any proper testing on armv8 (the issue reproduces extremely rarely on ThunderX). This is why I'd like to leave this line intact, even if most likely is not necessary. I expect this whole commit will be reverted very early in 12-current anyway and replaced with something more robust.

The intrinsic race between transmitting and yielding with buf_ring and the frequent misuse of it and drbr by driver writers has led me to abandon it in favor of mp_ring.

After mulling this over, I have two suggestions.

  1. Make the acquire on br->br_prod_tail unconditional on architecture. In principle, this access *should* be an acquire, and changing it is also going to benefit powerpc and riscv. Moreover, in practice, the risk that you're going to break anything by making it an acquire (on all architectures) is negligible. For example, on both amd64 and i386, all you would be doing is introducing a compiler-level memory barrier. At the instruction level, an acquire is just an ordinary load on both amd64 and i386. In short, nothing is going break.
  1. I would ask that you add a comment explicitly stating that there is uncertainty around the need for the acquire on br->br_cons_head, but that is what you tested.

Thanks for making my suggested changes! In addition, could you please try to reformat the comment so that it isn't quite so wide. I think that I needed a window that was about 140 characters wide to view the comment, and style(9) requires that lines not exceed 80 characters.

It's already in the tree, r302292. I have a plan to check which atomic_load_acq is necessary, but first I want to upstream support for the board which is actually able to reorder these pointers.

Commit link: rS302292
@wma can you close this review?