geom_slice: do not destroy softc until providers are gone
ClosedPublic

Authored by avg on Oct 27 2017, 1:20 PM.

Details

Summary

At present, g_slice_orphan and g_slice_spoiled destroy the softc
(struct g_slicer) even before calling g_wither_geom, so there can
be active and incoming io requests at that time and g_slice_start
can access the softc.

There are several ways to fix this problem.
I decided to use providergone callback to destroy the softc only after
all providers have detached. It would have been sufficient to wait
until the providers leave the doorstep (upon nend reaching nstart),
but there is no notification for that.

Test Plan

This change has been stress-tested by Ben RUBSON <ben.rubson@gmail.com>.
That was done by destroying iSCSI disks with GEOM labels on top of them
and intensive I/O through the label providers (using ZFS).

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
avg created this revision.Oct 27 2017, 1:20 PM
smh accepted this revision.Oct 28 2017, 6:21 PM
smh added a subscriber: smh.

LGTM

This revision is now accepted and ready to land.Oct 28 2017, 6:21 PM
imp added a comment.Oct 28 2017, 9:50 PM

Unsure what makes this protocol safe... Some better comments might help, or else there seems to be an issue with the code.

sys/geom/geom_slice.c
480 ↗(On Diff #34393)

Doesn't this SET WITHER?

483 ↗(On Diff #34393)

and then you free here.

517 ↗(On Diff #34393)

why do you need to check wither here?
How do you keep a double free from happening here, since the above path sets wither too...

avg added inline comments.Oct 29 2017, 7:37 AM
sys/geom/geom_slice.c
480 ↗(On Diff #34393)

It does, of course.

483 ↗(On Diff #34393)

But only if nprovider is zero. This is important.

517 ↗(On Diff #34393)
  1. I do not want to destroy softc only because all providers got unconfigured. For example, if all partitions are deleted. I want to destroy softc only when the geom is really being removed.
  1. The above path frees softc only if there are no providers. In that case this callback is never called.
mav added a comment.EditedOct 29 2017, 8:49 AM

It would have been sufficient to wait until the providers leave the doorstep (upon nend reaching nstart), but there is no notification for that.

Because that is not a normal operation logic, but a dirty hacky workaround for broken classes, that do not wait for I/O completion before closing provider (there were some, most IIRC are fixed now). This code just should not be there, and that is why it has no notification.

I have no specific objections about this patch, but I am not sure it really needs to use providergone method, that was not in original GEOM design and was added as workaround. softc could be freed on the last close. In case of gmirror, for example, providergone has to be used, since it uses custom locks and require topology lock to be dropped during g_mirror_access() call, that created a race window. This class seems less complicated, there is no any lock magic, so it could work in original GEOM logic, tracking opens instead.

avg added a comment.Oct 29 2017, 8:55 AM

@mav thank you for the suggestion! Just to clarify, do you mean adding tracking of the open provider count into g_slice_access ?

mav added a comment.Oct 29 2017, 9:07 AM
In D12809#266148, @avg wrote:

Just to clarify, do you mean adding tracking of the open provider count into g_slice_access ?

Yes. You may see, for example, how gmultipath uses sc->sc_opened.

avg updated this revision to Diff 34432.Oct 29 2017, 9:29 AM

Track opens instead of depending on providergone, per mav's suggestion.

This is more lightweight and it should be just as good.

This revision now requires review to proceed.Oct 29 2017, 9:29 AM
avg added a comment.Oct 29 2017, 9:32 AM
In D12809#266149, @mav wrote:
In D12809#266148, @avg wrote:

Just to clarify, do you mean adding tracking of the open provider count into g_slice_access ?

Yes. You may see, for example, how gmultipath uses sc->sc_opened.

Thank you! How does the new version look?

mav added a comment.Oct 29 2017, 10:42 AM

The idea looks fine, but logic of gsp->nopen and d_nopen in g_slice_access() looks over-engineered. I suppose gsp->nopen will never be above 1 here. I think you could even use just (cp->acr == 0 && cp->acw == 0 && cp->ace == 0) expression instead of additional open counter.

avg added a comment.Oct 29 2017, 8:11 PM
In D12809#266162, @mav wrote:

The idea looks fine, but logic of gsp->nopen and d_nopen in g_slice_access() looks over-engineered. I suppose gsp->nopen will never be above 1 here.

I think that gsp->nopen can be as high as slices parameter passed to g_slice_new that is greater than one for several geom classes.
There is some confusing naming, but gsp does not correspond to a single slice, it represents a "slicer" with potentially many slices.

I think you could even use just (cp->acr == 0 && cp->acw == 0 && cp->ace == 0) expression instead of additional open counter.

Do you mean cp as the consumer used by g_slice to access the underlying provider?
Looks like that can work, let me think about it.

Thanks!

mav added a comment.Oct 29 2017, 8:16 PM
In D12809#266212, @avg wrote:
In D12809#266162, @mav wrote:

The idea looks fine, but logic of gsp->nopen and d_nopen in g_slice_access() looks over-engineered. I suppose gsp->nopen will never be above 1 here.

I think that gsp->nopen can be as high as slices parameter passed to g_slice_new that is greater than one for several geom classes.
There is some confusing naming, but gsp does not correspond to a single slice, it represents a "slicer" with potentially many slices.

But (cp->acr == 0 && cp->acw == 0 && cp->ace == 0) in g_slice_access() will be true only once until all slices will be closed, since cp here is a lower level consumer and it is opened by every slice.

I think you could even use just (cp->acr == 0 && cp->acw == 0 && cp->ace == 0) expression instead of additional open counter.

Do you mean cp as the consumer used by g_slice to access the underlying provider?

Yes, I am talking about line 127 as an example.

avg updated this revision to Diff 34448.Oct 30 2017, 7:41 AM

Further simplify based on suggestions from mav; no need for nopen.

mav accepted this revision.Oct 30 2017, 7:49 AM

Looks good to me. Thank you.

This revision is now accepted and ready to land.Oct 30 2017, 7:49 AM
smh added a comment.Oct 30 2017, 9:16 AM

Nice improvements, I learned some interesting details while following these improvements :-)

This revision was automatically updated to reflect the committed changes.