Short of creating a separate buf_ring per package, there is no way to avoid a steady stream of coherence traffic on br_prod updates: by definition, many threads are simultaneously trying to acquire an index by updating it. However, once a producer has a unique index, there is no intrinsic cache line sharing with other producers. With the current implementation, if thread A is on package 1 and thread B is on package 2 and both are producing a steady stream of updates, each cache line of br_ring[] can change ownership up to CACHE_LINE_SIZE/sizeof(void *) times as it is filled, because that many adjacent entries share one line. If we instead pad each entry out to CACHE_LINE_SIZE, this ping-ponging between producers can be avoided entirely.
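To make the layout difference concrete, here is a minimal sketch of what a padded entry could look like. It is not the actual buf_ring code: struct br_padded_entry, bpe_ptr, packed_line_of() and padded_line_of() are hypothetical names, and the 64-byte CACHE_LINE_SIZE is an assumption (the kernel defines it per architecture).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. The real value is per-architecture. */
#define CACHE_LINE_SIZE 64

/*
 * Hypothetical padded slot: a single pointer aligned to, and therefore
 * occupying, a full cache line. Not the layout buf_ring uses today.
 */
struct br_padded_entry {
	alignas(CACHE_LINE_SIZE) void *bpe_ptr;
};

_Static_assert(sizeof(struct br_padded_entry) == CACHE_LINE_SIZE,
    "padded entry must fill exactly one cache line");

/* Cache line (relative to the array base) holding slot idx of a void *[] ring. */
static inline size_t
packed_line_of(size_t idx)
{
	return ((idx * sizeof(void *)) / CACHE_LINE_SIZE);
}

/* Cache line (relative to the array base) holding slot idx of a padded ring. */
static inline size_t
padded_line_of(size_t idx)
{
	return ((idx * sizeof(struct br_padded_entry)) / CACHE_LINE_SIZE);
}
```

With the packed layout, producers holding indices 0 and 1 write to the same line (packed_line_of(0) == packed_line_of(1)); with the padded layout every index maps to its own line, so two producers never store into the same line of the ring.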
The motivation for this change is that, at least on some interconnects such as AMD's HyperTransport, the number of coherence messages that can be handled per second is lower than the raw speed of the link might imply. In other words, although latency is very low, the actual bandwidth available for coherence traffic is not that high.
Because this clearly explodes the size of the ring by a factor of CACHE_LINE_SIZE/sizeof(void *), I'm merely putting this out there and am not (currently) championing it. I seek (informed) commentary.
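For a rough sense of the cost, the arithmetic can be sketched as below. The helper names are hypothetical and the 64-byte CACHE_LINE_SIZE is again an assumption; only the entry array is counted, not the ring header.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. The real value is per-architecture. */
#define CACHE_LINE_SIZE 64

/* Bytes used by the entry array when slots are bare pointers. */
static inline size_t
ring_bytes_packed(size_t nslots)
{
	return (nslots * sizeof(void *));
}

/* Bytes used by the entry array when each slot fills a cache line. */
static inline size_t
ring_bytes_padded(size_t nslots)
{
	return (nslots * CACHE_LINE_SIZE);
}

/* Growth factor from padding: CACHE_LINE_SIZE / sizeof(void *). */
static inline size_t
padding_factor(void)
{
	return (CACHE_LINE_SIZE / sizeof(void *));
}
```

Under these assumptions, with 8-byte pointers the factor is 8, so a 4096-entry ring would grow from 32 KiB to 256 KiB of entry storage.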