This is a memory/latency vs. CPU tradeoff. The less frequently you advance wr_seq, the longer it takes to reclaim memory, but the shared cacheline is written less often and the read sections become cheaper. I do see a small positive effect on read-section time in my tests.
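To make the batching concrete, here is a minimal userspace sketch of the idea; the names and the C11 atomics are illustrative only and not the kernel implementation:

#include <stdatomic.h>

/*
 * Illustrative sketch, not the real SMR API.  Each writer keeps a
 * private 'pending' count and publishes to the shared wr_seq only
 * once per ADVANCE_BATCH frees, so the shared cacheline is dirtied
 * less often and readers that load wr_seq take fewer misses.
 */
#define ADVANCE_BATCH	64	/* frees per shared-cacheline write */

static _Atomic unsigned int wr_seq;	/* shared, contended cacheline */

unsigned int
wr_seq_advance_batched(unsigned int *pending)
{
	/* Common case: touch only the caller's private counter. */
	if (++*pending < ADVANCE_BATCH)
		return (atomic_load_explicit(&wr_seq,
		    memory_order_acquire));
	*pending = 0;
	/* One shared write amortized over the whole batch. */
	return (atomic_fetch_add_explicit(&wr_seq, 1,
	    memory_order_release) + 1);
}

The larger ADVANCE_BATCH is, the longer freed memory lingers before it can be reclaimed, which is exactly the memory/latency side of the tradeoff above.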
In my experiments so far, UMA does not benefit from this: the buckets amortize the global cacheline writes enough that it's irrelevant. In fact, I see a couple thousand alloc/free pairs before poll() is called. I suspect that at very high CPU counts this may start to change, so one strategy may be limit = ncpus/factor.
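Roughly what that scaling might look like; the divisor is a placeholder, not a measured value:

/*
 * Hypothetical sketch of the limit = ncpus/factor idea: scale the
 * batching limit with the CPU count so the policy adapts as core
 * counts grow.  'factor' is an illustrative tunable.
 */
static int
smr_batch_limit(int ncpus)
{
	const int factor = 4;		/* placeholder divisor */
	int limit = ncpus / factor;

	return (limit > 0 ? limit : 1);	/* never batch below one */
}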
However, if we are to integrate this with subr_epoch.c, a nice strategy would be to avoid calling smr_advance() for every callback and instead moderate it by some value. If we advance the epoch when the callout is scheduled, the memory will most likely be available to free by the time the timer fires. This should give us one clock tick of latency instead of the two we have today, even if we batch advancement.
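A rough sketch of what I mean; the epoch_sketch structure and the epoch_* helpers are illustrative stand-ins rather than the current subr_epoch.c code, while smr_advance(), smr_poll(), callout_pending() and callout_reset() are the real interfaces:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/callout.h>
#include <sys/smr.h>

struct epoch_sketch {
	struct callout	 e_callout;	/* one-tick reclamation timer */
	smr_t		 e_smr;		/* backing SMR state */
	smr_seq_t	 e_goal;	/* sequence this batch waits for */
};

/* Illustrative queue operations, not the real epoch(9) internals. */
static void epoch_enqueue(struct epoch_sketch *, void (*)(void *), void *);
static void epoch_run_callbacks(struct epoch_sketch *);

static void
epoch_reclaim(void *arg)
{
	struct epoch_sketch *ep = arg;

	/* By the time the tick fires e_goal has almost always been
	 * reached, so this normally succeeds without waiting. */
	if (smr_poll(ep->e_smr, ep->e_goal, false))
		epoch_run_callbacks(ep);
	else
		callout_reset(&ep->e_callout, 1, epoch_reclaim, ep);
}

static void
epoch_defer_sketch(struct epoch_sketch *ep, void (*cb)(void *), void *arg)
{
	epoch_enqueue(ep, cb, arg);	/* queue the deferred free */
	if (!callout_pending(&ep->e_callout)) {
		/* One advance covers every callback queued this tick. */
		ep->e_goal = smr_advance(ep->e_smr);
		callout_reset(&ep->e_callout, 1, epoch_reclaim, ep);
	}
}

Advancing at scheduling time is what buys the single tick: the grace period elapses concurrently with the callout delay instead of starting only when the timer fires.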