Page MenuHomeFreeBSD

Create a union GEOM module
Needs ReviewPublic

Authored by mckusick on Oct 27 2021, 5:56 PM.

Details

Reviewers
kib
chs
pho
Summary

The gunion utility is used to track changes to a read-only disk on a writable disk. Logically, a writable disk is placed over a read-only disk. Write requests are intercepted and stored on the writable disk. Read requests are first checked to see if they have been written on the top (writable disk) and if found are returned. If they have not been written on the top disk, then they are read from the lower disk.

The gunion utility can be especially useful if you have a large disk with a corrupted filesystem that you are unsure of how to repair. You can use gunion to place another disk over the corrupted disk and then attempt to repair the filesystem. If the repair fails, you can revert all the changes in the upper disk and be back to the unchanged state of the lower disk thus allowing you to try another approach to repairing it. If the repair is successful you can request that all the writes recorded on the top disk be written to the lower disk.

Another use of the gunion utility is to try out upgrades to your system. Place the upper disk over the disk holding your filesystem that is to be upgraded and then run the upgrade on it. If it works, commit it; if it fails, revert the upgrade.

Further details can be found in the gunion.8 manual page included in the diffs for this review.

Test Plan

Ask Peter Holm to run the relevant part of his test suite on gunion disks.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

lib/geom/union/geom_union.c
29

I think we no longer add the tag to new files.

37

Extra blank line.

sys/geom/union/g_union.c
175

Don't you want to set an error in this case as well?

251

physpath != NULL
strcmp() != 0

252

Why strndup? Can you handle non-null terminated string there?

273

This looks very strange and is not style(9) compliant.

340

It might be better to use M_NOWAIT and error out if there is not enough memory.

363

This return; is not needed.

I have responded to your individual comments. I will upload a new set of diffs when I have finished a few other changes. Notably, I noticed that I failed to put rwlock mutexes around the writemap.

lib/geom/union/geom_union.c
29

Dropped the tag.

37

Dropped the blank line.

sys/geom/union/g_union.c
175

The error message is provided by gctl_get_provider() so we just need to return. I added a comment to note this fact.

251

Fixed.

252

From the strndup manual page: ``The strndup() function copies at most len characters from the string str always NUL terminating the copied string.'' So the pathname may be shortened but will always be null terminated.

The string being saved with strndup is used in BIO_GETATTR. I just cut/pasted this from the nop GEOM. And looking at other GEOM providers it appears that nop is doing BIO_GETATTR differently than all the rest. So I have switched to using a more standard approach which gets rid of the need for strndup.

273

Left over from debugging, now cleaned up.

340

I tried that initially, but on 2Tb disks it was consistently failing. The system just needed some time to get together the needed memory. Is there some way I can give the system some time to try to pull together the needed memory but avoid waiting forever or worse panic'ing?

363

Deleted.

mckusick marked 7 inline comments as done.

Corrections suggested by kib. Add locking to access the writemap. Prevent access to or mount of union while a commit is being done.

sys/geom/union/g_union.c
615

Why is this needed?
Isn't the fact that there is already a writer on the lower provider provides sufficient serialization?

738

Imagine that something gets writelock right after we rel-ed it. Then this check is basically nop.

867

There, you set the bit indicating that the upper layer has the copy of the block, and then drop the sc lock. Now, assume that a read comes in on other thread (it seems you allow this with G_PF_DIRECT flags) which is reading this block. Then, since it sees the bit set in bitmap, it reads garbage or old content from the upper layer.

Am I missing something?

I have added comments / questions to your three latest (helpful) comments. I would also like your feedback on my question to your earlier comment at line 340 of g_union.c.

sys/geom/union/g_union.c
615

I put it in to give a more specific error message. You are correct that the access checks below will catch the error but give a less useful error message.

738

The goal is to block the opening of a device while a commit is in progress. So once a commit (write lock) starts we EBUSY anything other than a close. Can I just check the write flag without doing a full atomic lock/unlock here?

867

Here I assume that the client of the disk is managing the coordination of the reads and writes. For example, if UFS is running on this disk, then if there are two users of the filesystem with one writing a block and another reading it, then there will be a buffer associated with the block. That buffer will be exclusively locked while the write is in progress, so the reader will block until the write completes and the buffer is unlocked.
The filesystem write bio will not complete until our bio completes and our bio completion is propagated to the filesystem bio, The reader will then get the buffer (and presumably skip the read since it should have current data). Do you see a scenario where this is incorrect?

I do not see any way to fix this if the above analysis is incorrect. I could delay setting the map until the I/O completes, but that just means the read could get stale data from before the write completed.

Peter has gotten a crash that may be caused by this lack of coordination:

https://people.freebsd.org/~pho/stress/log/log0198.txt

But I have not been able to figure out what caused the failure in that crash.

sys/geom/union/g_union.c
340

2TB means that you need to pre-allocate 512MB of bitsets. This is a lot, and probably makes it reasonable to re-evaluate the decision to do eager allocation of the leaf bitset pages. This is probably the best way around the problem.

I think the only change would be to stop requiring direct send, but for g_union purposes it should be fine IMO.

738

I do not see any additional protection that would the locked read get over unlocked. Basically the race is same, the condition we checked might become false right after unlock.

867

With current settings, reader gets so called out-of-thin-air data: the content of upper block was never written neither to lower nor to gunion geoms.

I think returning lower block there is fine, the reader is racing with the writer, so returning lower block means that from our point of view, reader won the race. Or you could allocate two bits per block in the bitmap, instead of one, and use the other bit as an indicator of write in progress. Reader would sleep waiting for the in progress bit to clear.