Page MenuHomeFreeBSD

g_access: deal with races created by geoms that drop the topology lock
ClosedPublic

Authored by avg on Feb 27 2018, 3:41 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Nov 25, 5:32 AM
Unknown Object (File)
Fri, Nov 22, 10:09 AM
Unknown Object (File)
Mon, Nov 18, 10:38 PM
Unknown Object (File)
Tue, Nov 5, 1:34 PM
Unknown Object (File)
Oct 3 2024, 2:03 PM
Unknown Object (File)
Oct 1 2024, 7:25 PM
Unknown Object (File)
Oct 1 2024, 1:31 AM
Unknown Object (File)
Sep 30 2024, 8:19 PM
Subscribers

Details

Summary

The problem is that g_access() must be called with the GEOM topology lock held.
And that gives a false impression that the lock is indeed held across the call.
But this isn't always true because many classes, ZVOL being one of the many,
need to drop the lock. It's either to perform an I/O on the first open or to
acquire a different lock (like in g_mirror_access).

That, of course, can break many assumptions. For example, g_slice_access()
adds an extra exclusive count on the first open. As described above, an
underlying geom may drop the topology lock and that would open a race with
another thread that would also request another extra exclusive count.
In general, two consumers may be granted incompatible accesses.

To avoid this problem the code is changed to mark a geom with special flag before
calling its access method and clear the flag afterwards.
If another thread sees that flag, then it means that the topology lock has been dropped
(either by the geom in question or downstream from it), so it is not safe to make another access call.
So, the second thread would use g_topology_sleep() to wait until the flag is cleared
and only then would it proceed with the access.

P.S.
This is a more general solution to bug 225960.
The first attempt was D14458.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/geom/geom.h
165 ↗(On Diff #39792)

Perhaps change these to hex values while you're here?

sys/geom/geom_subr.c
982 ↗(On Diff #39792)

I would consider clearing ACCESS_WAIT here rather than in the IN_ACCESS loop. Multiple threads may block on IN_ACCESS, and it looks wrong for the first awoken thread to clear ACCESS_WAIT on behalf of all threads. I don't think this affects correctness, though.

sys/geom/geom.h
165 ↗(On Diff #39792)

Will do as the flags are too close to 0x10 now.

sys/geom/geom_subr.c
982 ↗(On Diff #39792)

Yes, the code should be clearer with the suggested change.

Improvements suggested by Mark.

avg marked 4 inline comments as done.Mar 4 2018, 1:58 PM
This revision is now accepted and ready to land.Mar 4 2018, 5:18 PM

This is not an unreasonable workaround to serialize the access routines.
Since they can be de-facto open calls, one wonders if we need to separate out the 'access' bit from the 'open' bit somehow. I don't have any good ideas for doing that since that separation will either be cumbersome, invoke indirect calls, or otherwise introduce races with I/O that will want to happen once access() returns.

Looks viable to me. But I agree with Warner that it would possibly be good to have access call asynchronous, so that they could scale, and single misbehaving device would not block all the system. But that may be no so trivial.

This revision was automatically updated to reflect the committed changes.