Page MenuHomeFreeBSD

iSCSI: Fast and per-session timeouts
AcceptedPublic

Authored by rscheff on Oct 17 2021, 10:50 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Dec 2, 7:33 AM
Unknown Object (File)
Mon, Dec 2, 7:28 AM
Unknown Object (File)
Mon, Dec 2, 3:58 AM
Unknown Object (File)
Sun, Dec 1, 11:58 PM
Unknown Object (File)
Sun, Dec 1, 11:58 PM
Unknown Object (File)
Sun, Dec 1, 11:58 PM
Unknown Object (File)
Sun, Dec 1, 11:56 PM
Unknown Object (File)
Sun, Dec 1, 11:55 PM

Details

Summary

In order to support latency critical devices, relying on TCP retransmissions is not giving a timely indication, once the connecitivity is restored.
TCP retransmissions are using an exponential backoff interval between each retransmission, thus after the first few initial tries, connectivity is probed with multi-second gaps.
This in turn can easily exceed the application timeout even when network services got restored in time, but simply not probed fast enough.

MPIO is not always possible, and would only address this effect when sessions are using truly indendent network paths.
Each individual session is still exposed to the same effect of TCP.
Timing differences and possible some random out-of-phase processing of SCSI IOs may help somewhat,

This patch has a number of enhancements:

Per-Session configurable ping (SCSI NOP) and login timeouts.
These are configures similar to what is supported in the Linux iSCSI initiator configuration file.

Reduced iscsi timer granularity to 0.1 sec.
For backwards compatibility, when all timers are selected on integer numbers, all status messages are unchanged.
Further, the schedule will also be run only once per second (instead of 10 times per second) to minimize overhead, where this is unnecessary.

The sysctl(3) interface has changed from pure int to string, in order to support deciseconds in a human-readable, dotted (fractional) notation.

Test Plan

set kern.iscsi.ping_timeout to e.g. "4.5" and verify that on connectivity loss with an iSCSI target, the new session will be attempted at the shorter interval.

Add "PingTimeout" or "LoginTimeout" to a target in /etc/iscsi.conf, and verify the same.

iscsictl -Lv shows individualized timeouts

Note: per-session timeouts are initialized from the global variables on session setup, and then kept. Changing system global sysctls will no longer affect existing sessions.
Also, a timeout of 1 second (if both ping and login make use of full seconds) or 0.1 sec will likely result in frequent, spurious session teardowns/reestablishing due to how
tracking of the timeout is performed. (avoidig unnecessary CPU overhead in the data path).

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 53941
Build 50831: arc lint + arc unit

Event Timeline

  • reduce TCP_KEEPIDLE to match login timeout
  • move keepinit output to debug level
sys/dev/iscsi/iscsi.c
223

I like it! One question though, this part here - why not strtod(3)?

share/man/man4/iscsi.4
77

I find such documented misbehavior disrespectful to user.

sys/dev/iscsi/iscsi.c
598

I find this logic bad and the cause of the mentioned problems of 1s timeout. I would switch to higher rate when timeout is set to anything less then 2s. It makes no sense to burn at 10Hz rate when timeout is set to 30.5s.

616

I find this excessively complicated just to keep messages textually unchanged.

629

Same.

651

Same.

2065

See above.

  • use SO_LINGER to abruptly close session to be reconnected
rscheff marked 2 inline comments as done.
  • address comments for the man page
  • use fast timer only when timeout is short
sys/dev/iscsi/iscsi.c
223

We are in the kernel, and I don't want to deal with any FPU / float shannanigans while here...

Also, strtoX don't give "nice" error handling - this hand-coded parser only deals with deci-secs, but not any NaN / INF / SI-prefix things which are unnecessary in this context.

I do not like the " while ((c=s*++)) " though (but s*++ at the bottom of the while block makes the compiler complain...

616

I wanted to keep the message unchanged in the common (default) case, for tools that expect an integer here to not break with a "float" (deci-sec). If no such tool-compatibility is needed, I'm happy to do a "%d.%d" here.

usr.bin/iscsictl/parse.y
444

Just a minor nit: can you make the message match the user-visible option name? As in/s/ping timeout/PingTimeout/, and the same thing for login timeout below?

rscheff added a subscriber: jhb.

@jhb - any thoughts on making the timeout of iscsi sessions configurable in deci-seconds? This is to facilitate more rapid tcp reconnections during network disruptions.

I believe I have addressed all the concerns - but keeping the old log format, such that any tools currently interpreting log files do not get confused.

sys/dev/iscsi/iscsi.c
114

FWIW, if we wanted to support sub-second intervals I would probably make the values be in milliseconds rather than deciseconds. We have several other sysctls in TCP for example that already operate in milliseconds. I would be tempted to say that you can just change these sysctls (as iSCSI use on FreeBSD is a bit more of an obscure thing) to be in milliseconds and keep them as simple integers.

Alternatively, if you are really worried about compat I would add new nodes that take the times in milliseconds and have these nodes truncate the exported value to seconds (and if set, set the underlying real value to x * 1000).

600

Why not use callout_schedule_sbt and avoid dealing with hz? There are existing helpers to convert milliseconds values to valid sbt values for example.

In short: I would like to commit this first go with decidec granularity, mostly unchanged external (script) interfaces (.conf, sysctl - well, changes from INT to STR, but the sysctl would not care in a script context). Unless there are strong objections against this current approach.

I am open to discuss to change the internal representation to milliseconds proper, and revampt how iscsi ping get scheduled and run, to reduce overhead in a subsequent Diff.

sys/dev/iscsi/iscsi.c
114

In this context, a (perceived) granulaity of milliseconds is certainly well above reasonable expectations; I would make the argument, that for all but some particular use cases, the current granularity of seconds is fine.

One of the reasons for me to make this change is, that iscsi timeout does not work independently from the tcp session timers - thus while someone configuring a iscsi ping or login timeout of 1 or 2 sec may believe, that every 1 sec a new packet is sent to probe for restored network connectivity, TCP will actually simply queue up these iscsi pings in the socket buffer, and may rather quickly end up sending a retransmission of an old iscsi ping, only once ever 60 sec (with an exponential backoff, typically starting around 1/4 sec in a local network - thus the probing becomes very coarse grained very quickly.

Part of this change is also the immediate removal of the old tcp session - without a lengthy wait time (the SO_LINGER setsockopt just prior to closing a session that timed out).

In short, a granularity of 100ms is quite ok - simultaneously, I know that in the area where this will become relevant (iscsi quorum device, where rapid network probing after loss of connectivity is essential), we have a lot of scripts and tools.

That is the major driver I wanted to keep as much as possible in place - up to the output and input into the sysctl utility, and logging facilities.

600

I guess that should be part of a larger discussion around how the timeout / hello pings are processed in general; a typical approach in other protocol is to sent out probes at 1/3 the timeout interval, evenly spaced. - instead of a hello at fixed 1 sec intervals, and then count if none were received during the last timeout period. For longer timeouts, this would decrease the workload significantly over the current design. (Especially considering, that TCP may still have the old ping stuck in the send socket, and retrys independently from the iscsi hello timer anyway).

deciseconds are too weird. We should really specify them in terms of ms, even if internally we round that to 100ms increments.I'd hate to see that get established as the user interface and not be able to change it later.

In D32540#773132, @imp wrote:

deciseconds are too weird. We should really specify them in terms of ms, even if internally we round that to 100ms increments.I'd hate to see that get established as the user interface and not be able to change it later.

All the external visible formatting is in full seconds.

And unless specifying them at an odd fraction of a second, the sysctl input / output with regular (string) integers will not be different.

Only if one chooses to configure the iscsid.conf with a timeout of e.g. PingTimeout = 0.3 LoginTimeout = 1.5 there is a difference. Just the internal representation is in (decimal) left shifted integers which are deciseconds.

If everything (internal representation and external conf and sysctl interfaces) moves to milliseconds, there needs to be a translation, and two different (thus handling of conflicts) tokens in the conf file, and new (additional / changed) sysctl OIDs...

Does that help?

(BTW, sysctl does support DeciKelvin notation natively - but without the dot, with using sysctl_handle_int "IK" flag).

In D32540#773132, @imp wrote:

deciseconds are too weird. We should really specify them in terms of ms, even if internally we round that to 100ms increments.I'd hate to see that get established as the user interface and not be able to change it later.

Or did I misunderstand, and you only want to have the internal representation to be in proper milliseconds rather than deciseconds, but maintain backwards compatibility by retaining full (fractional) second strings in the conf / sysctl?

pauamma_gundo.com added inline comments.
usr.bin/iscsictl/iscsi.conf.5
167

"starting with" != "with resolution" (consistency with rest of change as it stands now)

177

Same as above.

gbe added a subscriber: gbe.

LGTM for the man page part.

usr.bin/iscsictl/iscsi.conf.5
167

I would think that this is fine, since is more a delta added to the timeout and has nothing to do with a resolution.

In D32540#773132, @imp wrote:

deciseconds are too weird. We should really specify them in terms of ms, even if internally we round that to 100ms increments.I'd hate to see that get established as the user interface and not be able to change it later.

The point - at least for the config files - is to maintain backwards compatibility; If we want to change this on the sysctl interface to ms, that is fine with me - but the config files should retain compatibility (with decimal sub-second numbers rather than changing the unit there to milliseconds...

Would that be acceptable?

In D32540#773132, @imp wrote:

deciseconds are too weird. We should really specify them in terms of ms, even if internally we round that to 100ms increments.I'd hate to see that get established as the user interface and not be able to change it later.

The point - at least for the config files - is to maintain backwards compatibility; If we want to change this on the sysctl interface to ms, that is fine with me - but the config files should retain compatibility (with decimal sub-second numbers rather than changing the unit there to milliseconds...

Would that be acceptable?

Yes.

  • change internal representation to milliseconds
  • round sub-second iscsid timeouts to 100ms granularity

The config and log file formats retain the use of seconds (accepting fractions or printing fractions respectively, when these are not set to any full second. No specifc checks are made, to prevent the use of the sysctl variables set to too small values (the timer fires either once per 100 ms or once per 1000 ms; configuring timeouts below 200 ms would likely result in iscsi sessions getting churned on every firing of the timer.

D34320 would be one final update in this space, to send out iscsi pings (NOP) only 3 times within a timeout interval, to address the use of very short (e.g. <100 ms) timeouts and also reduce the number of callouts at higher timeouts from once per sec / once every 100ms.

  • manage internal change to milliseconds properly in various places
  • nicify dmesg/log output for backwards compatibility

Manual pages English looks good to me.

This revision is now accepted and ready to land.Dec 10 2023, 3:00 AM