Page MenuHomeFreeBSD

mlx5(4) rx timestamps.
ClosedPublic

Authored by kib on Oct 11 2017, 4:04 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Apr 17, 3:39 AM
Unknown Object (File)
Wed, Apr 17, 3:39 AM
Unknown Object (File)
Wed, Apr 17, 3:39 AM
Unknown Object (File)
Wed, Apr 17, 3:39 AM
Unknown Object (File)
Wed, Apr 17, 3:39 AM
Unknown Object (File)
Wed, Apr 17, 3:39 AM
Unknown Object (File)
Wed, Apr 17, 3:21 AM
Unknown Object (File)
Thu, Apr 11, 12:35 PM
Subscribers

Details

Summary

The patch adds rx timestamps in nanoseconds from boot for the received packets. The basic pkthdr rearrangement change in sys/mbuf.h was provided by Andrew. There are two accompanying M_ flags: M_TSTMP means that there is the timestamp and it was generated by hardware.

Another flag M_TSTMP_HPREC indicates that the timestamp is high-precision. Practically M_TSTMP_HPREC means that hardware guaranteed additional precision. E.g., for ConntectX all packets are stamped by hardware when PCIe transaction to write out the completion descriptor is performed, but PTP packet are stamped on ingress. For intel cards, when PTP assist is enabled, only PTP packets are stamped in the limited number of registers, so if intel cards ever start support this mechanism, they would always set M_TSTMP | M_TSTMP_HPREC if hw timestamp is ever available for the given packet.

I changed the SCM_TIMESTAMP and similar control messages to use hardware timestamp instead of the time of the packet ip v4 and ip v6 input processing, if M_TSTMP is set. It would be nice to add similar facility to the raw socket used to receive raw ethernet PTP packets, but it is out of scope right now, I believe.

Driver support is only implemented for ConnectX4/5. System timestamp is calculated based on the free-running counter timestamp provided by hardware. We periodically sample the counter to calibrate it against the system clock. The calculation of the system timestamp is somewhat delicate because all values are 64bit and overflow the naive formula for linear interpolation. I adjusted the calculation by dropping the least signifcant bits in advance, see the PREC shift in mlx5_mbuf_tstmp().

Test Plan

SO_TIMESTAMP is used even by ping(8), pinging a remote machine on a host which network card is capable of rx timestamps exercises the capability. Also, I wrote and used the following program to examine the generated timestamps in more details: https://github.com/kostikbel/timestamp. Perhaps the program needs some more polishing before being usable for general public.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

What's the quality of the free-running clock?

In D12638#262502, @imp wrote:

What's the quality of the free-running clock?

I cannot answer authoritative. From what I saw, I believe that it is a crystal-driven clock, but the card is able to sync to the master clock.

In D12638#262504, @kib wrote:
In D12638#262502, @imp wrote:

What's the quality of the free-running clock?

I cannot answer authoritative. From what I saw, I believe that it is a crystal-driven clock, but the card is able to sync to the master clock.

Hmmm... Either it will be a crystal driven clock, or it will be a synthesized signal that's steered to the master clock via slight frequency offsets. It sounds more like the latter, which is what I needed to know to evaluate the calibration you're doing with it. Does Mellinox have a published spec for its accuracy/Allan deviation that can be shared? I guess that last bit doesn't matter too much. If it is steered, though, it would help to know the time constant used so your calibration isn't an exact multiple of that.

I have not tested it yet, but this looks fantastic. Thank you for this!

This revision is now accepted and ready to land.Oct 11 2017, 8:56 PM
sys/dev/mlx5/mlx5_en/mlx5_en_main.c
652 ↗(On Diff #33887)

Beware these callouts will always run on CPU#0. Maybe use the curcpu variant of callout_reset?

671 ↗(On Diff #33887)

You could optimise the pointer pointer pointer thing!

sys/sys/mbuf.h
300 ↗(On Diff #33887)

How about using M_UNUSED_8 ?

1368 ↗(On Diff #33887)

is ULL suffix required here?

kib marked 2 inline comments as done.Oct 11 2017, 9:43 PM
kib added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_main.c
652 ↗(On Diff #33887)

I do not think it matter. BSP is not any different than curcpu, the callout is still bound to the same CPU during the whole system lifetime, and BSP does not have any significant usage screw. E.g. the watchdog callout in mlx5 is also on CPU0.

I changed it according to your liking, but there we need something different, which is not handled by callout KPI at all. Ideally, we should be able to specify the affinity of a callout using cpuset.

sys/sys/mbuf.h
300 ↗(On Diff #33887)

I like that the two flags definitions are textually side by side. If you insist strongly, I will change one to _8, otherwise I prefer the current arrangement.

1368 ↗(On Diff #33887)

No, compiler selects the integer type of the least integer convertion rang that can represent the constant without truncation. Such suffix is only needed when manually guiding promotion, typically for other part of an expression.

In all uses of the 1000000000 constant from the patch, other operands are of the exact type uint64_t, which makes any guidance for promotions superfluous.

kib marked an inline comment as done.

hselasky notes.

Use _curcpu for rearming calibration callout.
Simplify syntax for accesses to the initialization segment.

This revision now requires review to proceed.Oct 11 2017, 9:45 PM

I patched this into our netflix tree. I confirmed that, even with rx timestamps enabled, the new functionality does not cause a significant performance loss in our case. There seems to be roughly one or two cache misses per mlx5e_poll_rx_cq() to check the p->priv->clbr_done > 2 condition, and roughly 8x as many for the timestamp handling itself. This pushes the cost of mlx5e_build_rx_mbuf() higher, but not horribly so.

See the attached vtune screencap.

vtune_tstamp.jpg (1×2 px, 566 KB)

I patched this into our netflix tree. I confirmed that, even with rx timestamps enabled, the new functionality does not cause a significant performance loss in our case. There seems to be roughly one or two cache misses per mlx5e_poll_rx_cq() to check the p->priv->clbr_done > 2 condition, and roughly 8x as many for the timestamp handling itself. This pushes the cost of mlx5e_build_rx_mbuf() higher, but not horribly so.

See the attached vtune screencap.

vtune_tstamp.jpg (1×2 px, 566 KB)

So are you fine with the patch as is, or do you still want to add the timestamp enable attribute to rq ?

Did you tried to use timestamps for your original purpose ?

In D12638#263636, @kib wrote:

So are you fine with the patch as is, or do you still want to add the timestamp enable attribute to rq ?

Did you tried to use timestamps for your original purpose ?

I'm fine with this as-is. I have not had time to use this as we originally intended (to modify LRO to track the timestamps of each coalesced packet). It may be several weeks before I can do this, so I'm fine with you committing it pending others approval.

I'm fine with this as-is. I have not had time to use this as we originally intended (to modify LRO to track the timestamps of each coalesced packet). It may be several weeks before I can do this, so I'm fine with you committing it pending others approval.

Thank you. I want to answer @imp questions about the clock before considering the commit. I have no ETA for this.

In D12638#262512, @imp wrote:
In D12638#262504, @kib wrote:
In D12638#262502, @imp wrote:

What's the quality of the free-running clock?

I cannot answer authoritative. From what I saw, I believe that it is a crystal-driven clock, but the card is able to sync to the master clock.

Hmmm... Either it will be a crystal driven clock, or it will be a synthesized signal that's steered to the master clock via slight frequency offsets. It sounds more like the latter, which is what I needed to know to evaluate the calibration you're doing with it. Does Mellinox have a published spec for its accuracy/Allan deviation that can be shared? I guess that last bit doesn't matter too much. If it is steered, though, it would help to know the time constant used so your calibration isn't an exact multiple of that.

I was told that this is a crystal driven. The characteristics are that the freq tolerance of the crystal is +-50 ppm at the operational temperature.

This is all the information I was able to gather at the moment. I can continue the quest if you still need the answer about the Allan deviation, although I am not sure that I will be able to obtain it.

In D12638#265613, @kib wrote:
In D12638#262512, @imp wrote:
In D12638#262504, @kib wrote:
In D12638#262502, @imp wrote:

What's the quality of the free-running clock?

I cannot answer authoritative. From what I saw, I believe that it is a crystal-driven clock, but the card is able to sync to the master clock.

Hmmm... Either it will be a crystal driven clock, or it will be a synthesized signal that's steered to the master clock via slight frequency offsets. It sounds more like the latter, which is what I needed to know to evaluate the calibration you're doing with it. Does Mellinox have a published spec for its accuracy/Allan deviation that can be shared? I guess that last bit doesn't matter too much. If it is steered, though, it would help to know the time constant used so your calibration isn't an exact multiple of that.

I was told that this is a crystal driven. The characteristics are that the freq tolerance of the crystal is +-50 ppm at the operational temperature.

This is all the information I was able to gather at the moment. I can continue the quest if you still need the answer about the Allan deviation, although I am not sure that I will be able to obtain it.

50ppm tells me that many of the issues I'm worried about are unlikely to appear.

sys/dev/mlx5/mlx5_en/mlx5_en_main.c
643 ↗(On Diff #33894)

Should these be exposed as tunable SYSCTLS ?

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
195 ↗(On Diff #33894)

Are you doing signed or unsigned shift here? Isn't there a compiler rule that when "-" minus is involved in an expressions, the result becomes signed?

196 ↗(On Diff #33894)

Is this multiplication guarded against numerical overflow?

197 ↗(On Diff #33894)

Please add a check for division by zero!

kib marked 4 inline comments as done.Oct 31 2017, 10:11 AM
kib added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
195 ↗(On Diff #33894)

There is no such 'rule'. Arithmetic between unsigned data is performed as unsigned and the result is unsigned as well.

More, the implicit conversion rules result in operations involving at least one unsigned value also performed as unsigned most of the time (but not always, and I am not going to cite the legalese there).

196 ↗(On Diff #33894)

Yes, this is exactly why the PREC stuff is there. It reduces the precision but also avoids loosing the highest significant bits of the result.

197 ↗(On Diff #33894)

It is already guarded, the calibration callback compares clbr_hw_curr and clbr_hw_prev and stops timestamping if they are equal.

After your question, I think that this can be improved by also shifting the difference between the clock readings to simulate the timestamping process more closely. See the new version of the mlx5e_calibration_callout().

kib marked 3 inline comments as done.

Shift the difference between clock stamps when checking for malfunctioning hardware.
Add some sysctls.

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
197 ↗(On Diff #33894)

Right, but remember that checking "if (c->priv->clbr_done >= 2)" and reading "(cp->clbr_hw_curr - cp->clbr_hw_prev)" and setting "clbr_done" is not atomic!

The callout callback is running unprotected and possibly concurrently with the other piece of code which is accessing these fields. That means clbr_done could initially be >= 2, but later on might be set to zero. Then if the build_rx_mbuf() is running concurrently, this can be an issue.

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
197 ↗(On Diff #33894)

The algorithm for update is lock-less but it guarantees practically that the calibration data is stable. Calibration callback only updates other clbr_point, the current clbr_point is stable for the whole duration of the calibration interval. After we read current calibration index, we know that it is correct for >= mlx5e_fast_calibration seconds.

So even if clbr_done goes from 2 to 0 due to the driver anxiety over the clock conditions, the current values are still safe.

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
197 ↗(On Diff #33894)

If thread 1 enters, mlx5e_calibration_callout(), while thread 2 enters mlx5e_build_rx_mbuf(), then you have no control really how many times mlx5e_calibration_callout() can be called.

  1. if (c->priv->clbr_done >= 2) - thread 2
  1. c->priv->clbr_done = 0 - thread 1
  1. res /= (cp->clbr_hw_curr - cp->clbr_hw_prev) => panic() - thread 2

Further I see that the order of updating "clbr_hw_curr" and "clbr_hw_prev" is not fixed, because volatile is not used in mlx5e_calibration_callout() when setting priv->clbr_curr.

sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
197 ↗(On Diff #33894)

I do not understand what do you mean by 'how many times'.

The order of updates to clbr_hw_prev/clbr_hw_curr among same clbr_point does not matter because both updates must be visible if update to the cblr_curr is visible.

You might miss that there is more than one calibration points, and switch of the previous clbr_point to the next one, as perceived by the rx routine, is atomic due to the coordination, and switch keeps previous calibration point stable for calibration interval.

hselasky added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_rx.c
197 ↗(On Diff #33894)

I see. No more comments.

This revision is now accepted and ready to land.Oct 31 2017, 2:49 PM

Add comment noting that calibration callback ensures that the division is safe.
Minor editing of the messages printed.

This revision now requires review to proceed.Oct 31 2017, 3:25 PM
This revision was automatically updated to reflect the committed changes.