Page MenuHomeFreeBSD

Fix few issues in ioat(4) driver.
ClosedPublic

Authored by mav on Feb 18 2019, 1:25 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Dec 7, 6:35 AM
Unknown Object (File)
Dec 1 2024, 2:39 PM
Unknown Object (File)
Nov 21 2024, 8:47 AM
Unknown Object (File)
Oct 15 2024, 4:08 AM
Unknown Object (File)
Sep 6 2024, 1:47 AM
Unknown Object (File)
Aug 19 2024, 2:52 AM
Unknown Object (File)
Aug 18 2024, 6:50 AM
Unknown Object (File)
Aug 18 2024, 5:30 AM
Subscribers

Details

Summary

It all started when I found that due to use of single counter for both
IOAT_DMAENGINE_REF and IOAT_ACTIVE_DESCR_REF, ioat_reset_hw() always
hangs if there is any consumer exist, not even doing anything. I first
thought to split the counters, but then dropped the active descriptor
counter completely, comparing queue head and tail instead. Aside of
fixing reset it allowed to remove some atomics/locking per request.

Next issue was a large excessive overhead from starting/stopping callout
each time queue gets active/idle. Since you said it is a workaround for
hardware bug, I left on in, but I changed its logic, so that callout is
not stopped every time. Under high request rate it is cheaper to let it
run and fire sometimes rather then stop and restart each time. Plus it
allowed to removed complicated locking dance used to update
is_completion_pending, which is not needed any more.

As part of previous, I also completely decoupled cleanup_lock and
submit_lock, since previous relation caused LORs in some usage patterns.
I still left Wittness hint on attach, but changed its direction, since
it is more likely to happen if consumer code try to send new request
based on previous completion.

Also I closed some race conditions and issues around device attach/detach.

Also I added some trick to not call ioat_get_chansts() on each
ioat_process_events() call, since it appeared pretty heavy in profiler.
I kind of replicated whan Linux driver does -- polling it only on much
more rare time events. Please let me know if you think this is wrong,
since I still have no hardware documentation (trying to get one).

Also, while being there I found several unneeded variables, which I
removed. hw_head variable seemed to be full copy of head, so I replaced
unified them. Let me know if I missed something.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

+CC jimharris@, who might have some feedback or ability to help facilitate access to NDA'd register documents

Thanks, the locking around the global ioat list seems more robust. How much testing have you been able to perform with this patch? Do you have a test that can induce an error condition?

sys/dev/ioat/ioat.c
303–305 ↗(On Diff #54028)

Maybe ENFILE? Or print something to console/log if we ever encounter this condition (probably unlikely).

310–311 ↗(On Diff #54028)

With this change, I think ioat_channel_index can go away; it doesn't serve much purpose anymore.

334–341 ↗(On Diff #54028)

It seems like we should take submit_lock under list_mtx in order to set destroying in a way that will not race with the logic below in ioat_get_dmaengine(). We can drop it and pick it up again around test_detach/taskqueue_drain if we need to.

738 ↗(On Diff #54028)

This can be expensive to do all the time. Maybe waiters can set a flag that wakeup is requested, and we can wake conditionally?

745–747 ↗(On Diff #54028)

I'm not sure this is safe to remove. Edit: I see you moved it outside of this routine to only reset when the callout fires (or !pending and new work is added). That's probably ok.

750–751 ↗(On Diff #54028)

The motivation here is to avoid the MMIO read during the interrupt handler? If so, please add a sentence in the comment mentioning that.

766 ↗(On Diff #54028)

This sentence is now stale -- we don't have both locks

831 ↗(On Diff #54028)

This isn't accurate after arbitrary detach (it wasn't before, either, but it might as well be fixed or probably just removed).

882 ↗(On Diff #54028)

nitpick: this is a style(9) regression. declarations should be separate from initialization, as it was before.

985 ↗(On Diff #54028)

A real value here is closer to hz/100000; 1 was fine for all values of hz that I've heard of people using. You could do something more specific with callout_reset_sbt if you want.

1526 ↗(On Diff #54028)

I don't think it's safe to remove this ioat_drain_locked(), or something similar to this.

1659–1661 ↗(On Diff #54028)

This change probably leaves the nop started in ioat_start_channel() incomplete indefinitely.

sys/dev/ioat/ioat_internal.h
489–490 ↗(On Diff #54028)

What is the motivation for placing the cleanup lock adjacent to the submit lock? Doesn't that somewhat defeat the point of having multiple locks? It's also unclear to me why the members below were reordered.

In D19231#412056, @cem wrote:

How much testing have you been able to perform with this patch?

I have a block storage driver using this to copy data with this engine. I am able to get ~600K write per second through it.

Do you have a test that can induce an error condition?

Not really. Only think I tested is disbaling NTB link under traffic, but not sure it counts for error here. If you have ideas -- I am open to try.

sys/dev/ioat/ioat.c
303–305 ↗(On Diff #54028)

I thought about it, but word "file" in error message confused me.

310–311 ↗(On Diff #54028)

Yes, it can. I'll do it.

334–341 ↗(On Diff #54028)

I don't think it is important here. Device can not disappear here, since we are is what destroying it. And just below we wait for last reference to disappear. I just think taskqueue_drain could be moved later, if you say.

738 ↗(On Diff #54028)

I thought about it, but first I don't see it in my CPU profiles, and second, it was there, just dozen lines later out of the lock, that opened possible race conditions.

750–751 ↗(On Diff #54028)

Yes, I'll add the comment.

766 ↗(On Diff #54028)

OK

831 ↗(On Diff #54028)

It is accurate in sense of reporting maximal value, though yes, there can be holes in the list. Not sure how to better handle that.

882 ↗(On Diff #54028)

While formally you are right, I find code much more compact and as result more readable when I don't need to add separate lines for trivial assignment of function arguments.

985 ↗(On Diff #54028)

Real value for what? I don't want here 100KHz of interrupts for polling, I'd prefer 100Hz, as written. At most I agree on 1KHz. Is there hardware where interrupts not functioning at all and where polling is not a workaround for sometimes lost interrupts, but a main event source?

1526 ↗(On Diff #54028)

Oops! Thank you for spotting it. Seems like I moved the loop from here to detach instead of copying. I'll fix this.

1659–1661 ↗(On Diff #54028)

Why? ioat_release() called by ioat_start_channel() should reset the callout.

sys/dev/ioat/ioat_internal.h
489–490 ↗(On Diff #54028)

Use of mtx_padalign instead of mtx puts them on different cache lines, so no conflict. My motivation for reorder was to move all (mostly) static elements into separate cache lines from heavily congested. I was thinking about also separating below fields too, but unfortunately both submission and completion access both head and tail.

sys/dev/ioat/ioat.c
310–311 ↗(On Diff #54028)

Ah, I recalled why I kept it -- to not bother writing SYSCTL_PROC handler.

Fix few things noticed by cem@.

sys/dev/ioat/ioat.c
303–305 ↗(On Diff #54028)

ENXIO is the proper, traditional error for a unit out of the supported range IMHO. ENFILE means something else. This is an attach routine, not an open routine.

Any more comments?

sys/dev/ioat/ioat.c
1526 ↗(On Diff #54028)

With this fixed I successfully triggered resets many times with sysctl under heavy load without any problems.

Thanks! I really appreciate additional eyes/interest and improvement for this driver.

In D19231#412091, @mav wrote:
In D19231#412056, @cem wrote:

How much testing have you been able to perform with this patch?

I have a block storage driver using this to copy data with this engine. I am able to get ~600K write per second through it.

Sure; high throughput without channel errors is the happy condition. We also push about 650k x 8kB copies through it in the happy state. Once you've cycled the ring once, though, you're not really testing anything new.

Do you have a test that can induce an error condition?

Not really. Only think I tested is disbaling NTB link under traffic, but not sure it counts for error here. If you have ideas -- I am open to try.

Repeatedly disabling NTB link (sysctl ...admin_up=0) (when ioat DMA destination is in the NTB window) is a good test of error recovery (or better than nothing). I think we see IOAT_CHANERR_XDADDERR in that case.

sys/dev/ioat/ioat.c
303–305 ↗(On Diff #54028)

*shrug*

Adding a printf instead is fine.

310–311 ↗(On Diff #54028)

Sure, but why not just remove that SYSCTL entirely?

334–341 ↗(On Diff #54028)

I think you're right.

738 ↗(On Diff #54028)

Ok

750–751 ↗(On Diff #54028)

Thanks!

831 ↗(On Diff #54028)

Sure, but from that standpoint, constant IOAT_MAX_CHANNELS is equally valid :-). I think it is safe to remove, and anyone who cares can enumerate newbus and check unit numbers.

882 ↗(On Diff #54028)

Sure, personal preferences vary. But part of having a style guide is that a project ends up with consistent style in spite of having many authors. Personally, I find it surprising when declarations have initializations associated with them. So I'd prefer leaving these two lines as-is.

985 ↗(On Diff #54028)

Real value for "if this much time has elapsed and we haven't gotten an interrupt-driven completion, the operation has probably timed out."

I think 1 is probably fine — it will line up with the native hardclock frequency no matter what, and we already poll the system at hardclock on at least one CPU. Forcing 100 Hz on a 1kHz system introduces an extraneous +9 ms delay on a ~1.5µs operation.

When this hardware is running full-out, it can generate a tremendous number of interrupts (it can do ~650k ops/s, and every single one can produce an interrupt). From that perspective, it may not make sense to enable interrupts all the time and instead poll.

We don't use the poll-only mode at ISLN, but it is not an unreasonable option for some usecases. [We also do not use interrupt coalescing due to the impact on latency.]

1526 ↗(On Diff #54028)

Thanks! That's good to hear. As a stress test, it might be good to randomly introduce reset repeatedly for "a long time" and make sure it does not fall over. But this is a good early indication.

1659–1661 ↗(On Diff #54028)

Nvm, I think you're right.

sys/dev/ioat/ioat_internal.h
489–490 ↗(On Diff #54028)

At least some amd64 models will prefetch adjacent cachelines. Might it make more sense to put another cache line in between the two lock lines? We have plenty of filler material in the earlier part of the softc. I don't feel strongly about this; it's fine as you have it.

To summarize: LGTM

Here are outstanding comments:

  1. Suggest adding printf() in IOAT_MAX_CHANNELS overrun case (future work: just remove the ioat_channel registration system entirely. out of scope here, though.)
  2. Suggest removing ioat_channel_index and ioat_get_nchannels() entirely.
  3. Style regression nitpick (ioat_put_dmaengine)
  4. Poll callout time — hz/100 isn't super-objectionable, but I prefer just 1 tick. Not a big deal.
  5. Consider moving locks away from adjacent cachelines. Again, I have no evidence to suggest this change is material, so it is fine to leave as-is.

The rest looks good. Thank you.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 21 2019, 4:47 PM
This revision was automatically updated to reflect the committed changes.