
Deferred write seq advancement
Closed, Public

Authored by jeff on Feb 2 2020, 12:19 AM.

Details

Summary

This is a memory/latency vs. CPU tradeoff. The less frequently you advance wr_seq, the longer it takes to reclaim memory, but the shared cacheline is written less often and the read sections become cheaper. I do see a small positive effect on read section time in my tests.
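
Roughly, the idea looks like the sketch below. This is not the committed code: the c_deferred counter and limit come from the review comments further down, and the function name and zpcpu access here are just illustrative.

/* Sketch only: defer the shared wr_seq write until 'limit' calls. */
static smr_seq_t
smr_advance_deferred(smr_t smr, int limit)
{
	smr_t csmr;

	critical_enter();
	csmr = zpcpu_get(smr);
	if (++csmr->c_deferred < limit) {
		/* Defer: report the goal the next real advance will set. */
		critical_exit();
		return (smr_shared_current(csmr->c_shared) + SMR_SEQ_INCR);
	}
	csmr->c_deferred = 0;
	critical_exit();
	/* Dirty the shared wr_seq cacheline only once per 'limit' calls. */
	return (smr_advance(smr));
}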

In my experiments so far, UMA does not benefit from this. The buckets amortize the global cacheline writes enough that it's irrelevant; in fact, I see a couple thousand alloc/free pairs before poll() is called. I suspect this may start to change at very high CPU counts, so one strategy may be limit = ncpus/factor.

However, if we are to integrate this with subr_epoch.c, a nice strategy would be to avoid calling smr_advance() for every callback and instead moderate it by some value. If we advance the epoch when the callout is scheduled, the memory will most likely be safe to free by the time the timer fires. This should give us one clock tick of latency instead of today's two, even if we batch advancement.
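
A hedged sketch of that strategy; the reclaim_batch structure, its fields, and the timer function are hypothetical stand-ins, not types from subr_epoch.c.

/* Hypothetical batching structure; subr_epoch.c's real types differ. */
static void
reclaim_batch_schedule(struct reclaim_batch *rb)
{
	/*
	 * Advance once for the whole batch when the callout is armed;
	 * by the time the one-tick timer fires the goal has normally
	 * been reached, so the batch can be freed immediately, giving
	 * about one tick of latency instead of two.
	 */
	rb->rb_goal = smr_advance(rb->rb_smr);
	callout_reset(&rb->rb_callout, 1, reclaim_batch_timer, rb);
}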

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

jeff added reviewers: rlibby, markj.
jeff set the repository for this revision to rS FreeBSD src repository - subversion.
sys/kern/subr_smr.c
285 ↗(On Diff #67645)

I think this now suffers from the same problem that required the acq barrier above.

Perhaps I can do an unfenced load of wr_seq and then a thread_fence_acq() after this check.
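
Something along these lines (a sketch only; wr_seq_check_passed() is a placeholder for the actual check at this point in the diff):

s_wr_seq = atomic_load_int(&s->s_wr_seq);	/* plain, unfenced load */
if (!wr_seq_check_passed(s_wr_seq, goal))	/* hypothetical check */
	return (false);
/*
 * Pay for the acquire fence only once the check has passed, so the
 * loads that follow are still ordered after the wr_seq read.
 */
atomic_thread_fence_acq();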

sys/kern/subr_smr.c
220–226 ↗(On Diff #67645)

Micro-opt: maybe consider this, in order to have one well-predicted branch? I think the compiler can't reduce the double branch otherwise, due to the compiler barrier in critical_exit().

if (__predict_true(++csmr->c_deferred < limit)) {
	critical_exit();
	return (smr_shared_current(csmr->c_shared) + SMR_SEQ_INCR);
} else {
	csmr->c_deferred = 0;
	critical_exit();
	return (smr_advance(smr));
}
280 ↗(On Diff #67645)

I think this would read more clearly with s/This/The goal/.

285 ↗(On Diff #67645)

You've also already done a fetch above, so I think you could just atomic_add_acq() here and then increment the local s_wr_seq. A slightly stale s_wr_seq must not be a problem, because you can be racing with another CPU anyway.

However, I think this might be a little better as atomic_cmpset_acq, because you don't really care to add here; you really want atomic_max(). But since you know that the clock always ticks forward, cmpset yields a max.
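
i.e., roughly this shape (a sketch; the local and field names are assumed from the diff):

s_wr_seq = atomic_load_int(&s->s_wr_seq);	/* the fetch already done above */
/*
 * If the cmpset fails, another CPU already stored a value at least
 * this new, so the result is the same as an atomic_max().
 */
atomic_cmpset_acq_int(&s->s_wr_seq, s_wr_seq, s_wr_seq + SMR_SEQ_INCR);
s_wr_seq += SMR_SEQ_INCR;	/* a slightly stale local copy is fine */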

This revision was not accepted when it landed; it landed in state Needs Review (Feb 4 2020, 2:45 AM).
This revision was automatically updated to reflect the committed changes.
head/sys/kern/subr_smr.c
220

I think we are missing an initialization of c_deferred in smr_create()?
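
Presumably something like this in the per-CPU setup loop (paraphrased, not the exact smr_create() body):

CPU_FOREACH(i) {
	c = zpcpu_get_cpu(smr, i);
	c->c_seq = SMR_SEQ_INVALID;
	c->c_shared = s;
	c->c_deferred = 0;		/* the initialization in question */
}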