
nvme: Use adaptive spinning when polling for completion or state change
ClosedPublic

Authored by imp on Oct 1 2021, 5:33 PM.
Tags
None
Subscribers

Details

Reviewers
cperciva
mav
chuck
Summary

Users of nvme_completion_poll are all in the initialization path. Most
of the commands they queue and wait for finish quickly, as they involve
no I/O to the drive's media. These commands finish much faster than a
single tick, on the order of a few microseconds. Adaptively polling
during the first tick lets us return much earlier than we otherwise
would. The cumulative effect of not waiting until the next tick to
re-poll the condition is substantial (~80 of the ~100ms saved).

Use the same technique when waiting for RDY state transitions. Those
also transition quickly, and we have to wait for a couple of them. This
saves the rest (~20ms).

This eliminates almost 100ms of delay on boot in cperciva's EC2 test
harness and makes nvme disappear from the flame graph of boot times.

Tested by: cperciva
Sponsored by: Netflix
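
For illustration, a minimal sketch of the adaptive-polling idea described in the summary (the function and flag names are hypothetical, not the driver's actual symbols): spin with short, growing DELAY() calls for roughly the first tick, then fall back to tick-granularity polling.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>

/*
 * Hypothetical sketch: busy-wait with short, doubling delays for about
 * one tick's worth of time, then fall back to sleeping a tick at a time.
 * "done" stands in for the command's completion flag.
 */
static void
adaptive_poll(volatile u_int *done)
{
	int us = 1;

	/* Spin for up to roughly one tick (1000000 / hz microseconds). */
	while (*done == 0 && us < 1000000 / hz) {
		DELAY(us);
		us *= 2;
	}

	/* Still not done: poll at tick granularity as before. */
	while (*done == 0)
		pause("nvmepl", 1);
}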

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 41896
Build 38784: arc lint + arc unit

Event Timeline

imp requested review of this revision. Oct 1 2021, 5:33 PM
imp retitled this revision from adaptive spin to nvme: Use adaptive spinning when polling for completion or state change. Oct 1 2021, 8:18 PM
imp edited the summary of this revision. (Show Details)
imp added reviewers: cperciva, mav, chuck.
sys/dev/nvme/nvme_ctrlr.c
266

TICKS_2_US here too

sys/dev/nvme/nvme_private.h
472

TICKS_2_US here
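
Presumably this refers to a ticks-to-microseconds conversion helper, something along the lines of the macro below (the name comes from the comment; the definition is an assumption, not taken from the diff):

/* Assumed definition: convert scheduler ticks to microseconds. */
#define	TICKS_2_US(t)	((t) * 1000000 / hz)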

It looks OK; just in the first chunk, instead of sanity2 I'd call it timeout2 or something similar.

But I think it could look better with pause_sbt(), with the interval doubled on every call, instead of DELAY(). I haven't benchmarked pause_sbt() recently, but when it was developed years ago it was able to sleep for as little as a few microseconds.
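
A sketch of what that suggestion might look like (assumed shape only, not the committed diff; the completion flag and wmesg are placeholders):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/callout.h>
#include <sys/time.h>
#include <machine/atomic.h>

/*
 * Sketch: sleep with pause_sbt() and double the interval on each pass
 * instead of busy-waiting with DELAY().
 */
static void
poll_with_backoff(volatile u_int *done)
{
	sbintime_t delta = SBT_1US;

	while (atomic_load_acq_int(done) == 0) {
		pause_sbt("nvme", delta, 0, C_PREL(1));
		if (delta < SBT_1MS)
			delta *= 2;
	}
}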

Implement mav@'s suggestion. Add some basic profiling; used it to determine that
1.5 is a better scale factor than 2. Most of these commands take between 30us
and 200us, and there's little variance within a drive, but from vendor to vendor
the variance is much larger.
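
In terms of the sketch above, the growth step would become roughly the following, with the 2x doubling replaced by a 1.5x increase and a cap (again an assumption using integer sbintime_t arithmetic; the actual constants are in the diff):

		/* Grow the sleep interval by 1.5x per pass, capped at 1ms. */
		delta += delta / 2;
		if (delta > SBT_1MS)
			delta = SBT_1MS;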

It looks good to me. I'd definitely remove the NVME_MEASURE_WAIT blocks after initial testing so as not to pollute the code. Also, SBT_1US in the precision field is not really needed, since C_PREL(1) will be bigger after the first couple of iterations, so 0 would work just fine too.

In D32259#728245, @mav wrote:

It looks good to me. I'd definitely remove the NVME_MEASURE_WAIT blocks after initial testing so as not to pollute the code. Also, SBT_1US in the precision field is not really needed, since C_PREL(1) will be bigger after the first couple of iterations, so 0 would work just fine too.

Crazy question: why not just do 5 or 10 microsecond sleeps and not worry about scaling? What does the scaling buy us?

A data point: most commands take between 20 and 200 microseconds. Waiting for RDY takes between 5 microseconds and 1.3 seconds (the latter on what I hope is pre-release hardware I got from a vendor).

In D32259#728249, @imp wrote:

Crazy question: why not just do 5 or 10 microsecond sleeps and not worry about scaling? What does the scaling buy us?

If we have to wait for a full 10 seconds with 5 microsecond sleeps, it may end up not much different from the original tight loop in terms of CPU usage and power consumption. This looks like a good compromise to me.

This revision is now accepted and ready to land. Oct 2 2021, 12:37 AM