Page MenuHomeFreeBSD

Create a union GEOM module
ClosedPublic

Authored by mckusick on Oct 27 2021, 5:56 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Mar 11, 6:03 AM
Unknown Object (File)
Mon, Mar 11, 6:03 AM
Unknown Object (File)
Mon, Mar 11, 6:03 AM
Unknown Object (File)
Mon, Mar 11, 6:03 AM
Unknown Object (File)
Mon, Mar 11, 6:03 AM
Unknown Object (File)
Mon, Mar 11, 6:03 AM
Unknown Object (File)
Thu, Mar 7, 10:12 PM
Unknown Object (File)
Feb 15 2024, 7:10 AM
Subscribers

Details

Summary

The gunion utility is used to track changes to a read-only disk on a writable disk. Logically, a writable disk is placed over a read-only disk. Write requests are intercepted and stored on the writable disk. Read requests are first checked to see if they have been written on the top (writable disk) and if found are returned. If they have not been written on the top disk, then they are read from the lower disk.

The gunion utility can be especially useful if you have a large disk with a corrupted filesystem that you are unsure of how to repair. You can use gunion to place another disk over the corrupted disk and then attempt to repair the filesystem. If the repair fails, you can revert all the changes in the upper disk and be back to the unchanged state of the lower disk thus allowing you to try another approach to repairing it. If the repair is successful you can request that all the writes recorded on the top disk be written to the lower disk.

Another use of the gunion utility is to try out upgrades to your system. Place the upper disk over the disk holding your filesystem that is to be upgraded and then run the upgrade on it. If it works, commit it; if it fails, revert the upgrade.

Further details can be found in the gunion.8 manual page included in the diffs for this review.

Test Plan

Ask Peter Holm to run the relevant part of his test suite on gunion disks.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

lib/geom/union/geom_union.c
30

I think we no longer add the tag to new files.

38

Extra blank line.

sys/geom/union/g_union.c
176

Don't you want to set an error in this case as well?

252

physpath != NULL
strcmp() != 0

253

Why strndup? Can you handle non-null terminated string there?

274

This looks very strange and is not style(9) compliant.

341

It might be better to use M_NOWAIT and error out if there is not enough memory.

364

This return; is not needed.

I have responded to your individual comments. I will upload a new set of diffs when I have finished a few other changes. Notably, I noticed that I failed to put rwlock mutexes around the writemap.

lib/geom/union/geom_union.c
30

Dropped the tag.

38

Dropped the blank line.

sys/geom/union/g_union.c
176

The error message is provided by gctl_get_provider() so we just need to return. I added a comment to note this fact.

252

Fixed.

253

From the strndup manual page: ``The strndup() function copies at most len characters from the string str always NUL terminating the copied string.'' So the pathname may be shortened but will always be null terminated.

The string being saved with strndup is used in BIO_GETATTR. I just cut/pasted this from the nop GEOM. And looking at other GEOM providers it appears that nop is doing BIO_GETATTR differently than all the rest. So I have switched to using a more standard approach which gets rid of the need for strndup.

274

Left over from debugging, now cleaned up.

341

I tried that initially, but on 2Tb disks it was consistently failing. The system just needed some time to get together the needed memory. Is there some way I can give the system some time to try to pull together the needed memory but avoid waiting forever or worse panic'ing?

364

Deleted.

mckusick marked 7 inline comments as done.

Corrections suggested by kib. Add locking to access the writemap. Prevent access to or mount of union while a commit is being done.

sys/geom/union/g_union.c
615

Why is this needed?
Isn't the fact that there is already a writer on the lower provider provides sufficient serialization?

738

Imagine that something gets writelock right after we rel-ed it. Then this check is basically nop.

867

There, you set the bit indicating that the upper layer has the copy of the block, and then drop the sc lock. Now, assume that a read comes in on other thread (it seems you allow this with G_PF_DIRECT flags) which is reading this block. Then, since it sees the bit set in bitmap, it reads garbage or old content from the upper layer.

Am I missing something?

I have added comments / questions to your three latest (helpful) comments. I would also like your feedback on my question to your earlier comment at line 340 of g_union.c.

sys/geom/union/g_union.c
615

I put it in to give a more specific error message. You are correct that the access checks below will catch the error but give a less useful error message.

738

The goal is to block the opening of a device while a commit is in progress. So once a commit (write lock) starts we EBUSY anything other than a close. Can I just check the write flag without doing a full atomic lock/unlock here?

867

Here I assume that the client of the disk is managing the coordination of the reads and writes. For example, if UFS is running on this disk, then if there are two users of the filesystem with one writing a block and another reading it, then there will be a buffer associated with the block. That buffer will be exclusively locked while the write is in progress, so the reader will block until the write completes and the buffer is unlocked.
The filesystem write bio will not complete until our bio completes and our bio completion is propagated to the filesystem bio, The reader will then get the buffer (and presumably skip the read since it should have current data). Do you see a scenario where this is incorrect?

I do not see any way to fix this if the above analysis is incorrect. I could delay setting the map until the I/O completes, but that just means the read could get stale data from before the write completed.

Peter has gotten a crash that may be caused by this lack of coordination:

https://people.freebsd.org/~pho/stress/log/log0198.txt

But I have not been able to figure out what caused the failure in that crash.

sys/geom/union/g_union.c
341

2TB means that you need to pre-allocate 512MB of bitsets. This is a lot, and probably makes it reasonable to re-evaluate the decision to do eager allocation of the leaf bitset pages. This is probably the best way around the problem.

I think the only change would be to stop requiring direct send, but for g_union purposes it should be fine IMO.

738

I do not see any additional protection that would the locked read get over unlocked. Basically the race is same, the condition we checked might become false right after unlock.

867

With current settings, reader gets so called out-of-thin-air data: the content of upper block was never written neither to lower nor to gunion geoms.

I think returning lower block there is fine, the reader is racing with the writer, so returning lower block means that from our point of view, reader won the race. Or you could allocate two bits per block in the bitmap, instead of one, and use the other bit as an indicator of write in progress. Reader would sleep waiting for the in progress bit to clear.

I agree with kib that there is no need to serialize read requests (that need to choose whether to read from lower or upper) with write requests (that change the target of future reads from lower to upper)... allowing a pending read from the lower layer to complete after a write changes the target of future reads of that sector to be the upper layer is fine. The application must be prepared for the read to return either the old data or the new data in this case because the order of processing of these parallel async operations is not guaranteed.

I also agree that allocating all of the bitmap memory up front means allocating a lot of memory for large devices, but allocating the memory during write means that mounting a gunion as a read/write file system would be deadlock-prone, since such a configuration would often need to allocate memory in order to reclaim memory, and this memory would need to stay allocated indefinitely rather than just for the duration of the write operation. Probably best to keep the up-front bitmap allocation for now and do something fancier later if optimizing the memory usage is needed.

There was also some concern about races in g_union_access(), but this method is always called with the global g_topology_lock() held, so additional MP locking should not be necessary in g_union_access().

I also noticed that the entire commit operation, including all of the copying of data from upper to lower, is also done while holding the global g_topology_lock(), and that seems like it can be too much work to do while holding this global lock that blocks all geom configuration changes. Hopefully it would be straightforward to drop the g_topology_lock() around the copy-back I/O loop in g_union_ctl_commit(), but I didn't look into that in detail.

These changes address comments from all three reviewers.

Fix race condition and a bug turned up by Peter Holm.

Serialize overlapping reads and writes. Because a read may be split into several pieces spread between the upper and lower disks they must be done without partial writes being done part way through.

The global g_topology_lock() is now dropped while copying from the upper to the lower layer during a commit operation.

This update reflects some minor cleanups that were reported by Peter Holm during his testing.

This version has passed Peter's tests so should be ready to commit.

This revision is now accepted and ready to land.Feb 28 2022, 6:07 PM
This revision was automatically updated to reflect the committed changes.