Page MenuHomeFreeBSD

Do not leak B_BARRIER.
ClosedPublic

Authored by kib on Sep 21 2020, 3:46 PM.

Details

Summary

Normally when a buffer with B_BARRIER is written, the flag is cleared by g_vfs_strategy() when creating bio. But in some cases FFS buffer might not reach g_vfs_strategy(), for instance when copy-on-write reports an error like ENOSPC. In this case buffer is returned to dirty queue and might be written later by other means. Among then bdwrite() reasonably asserts that B_BARRIER is not set.

See https://people.freebsd.org/~pho/stress/log/kostik1316.txt which was the base for my analysis.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kib requested review of this revision.Sep 21 2020, 3:46 PM
This revision is now accepted and ready to land.Sep 21 2020, 3:53 PM

As long as the buffer remains locked until it is successfully written, this should be fine.

Would it make sense to put this clearing of B_BARRIER someplace central like in bufdone() rather than in these specific error paths?
If we don't clear this flag in bufdone() then it would be good to assert in bufdone() that the flag is not set, to catch such mistakes earlier.

As long as the buffer remains locked until it is successfully written, this should be fine.

Sorry, I do not fully understand you note. Do you mean that it is fine to have B_BARRIER set as far as buffer is not unlocked ? If not, could you please clarify.

In D26511#590188, @chs wrote:

Would it make sense to put this clearing of B_BARRIER someplace central like in bufdone() rather than in these specific error paths?
If we don't clear this flag in bufdone() then it would be good to assert in bufdone() that the flag is not set, to catch such mistakes earlier.

I do not like doing unconditional clearing in bufdone(), the flag is too specific for fs that use it, so it should be managed by fs. I added the assert.
It might be that brelse/bqrelse should also assert that the flag is cleared, but there I am not that sure.

Add assert in bufdone().

This revision now requires review to proceed.Tue, Sep 22, 8:49 AM

It might be that brelse/bqrelse should also assert that the flag is cleared, but there I am not that sure.

I was thinking about asserting in brelse/bqrelse too... this flag is currently only ever set in the process of writing the buffer, so the buffer should always go through bufdone before getting to brelse, but if someone started setting the flag without calling bwrite then an assertion in brelse would hopefully be a hint that that isn't how it's supposed to be used.

Add asserts in brelse/bqrelse.

In D26511#590229, @kib wrote:

As long as the buffer remains locked until it is successfully written, this should be fine.

Sorry, I do not fully understand you note. Do you mean that it is fine to have B_BARRIER set as far as buffer is not unlocked ? If not, could you please clarify.

We are depending on this being written before it can be used. If it were unlocked, then some other thread could get it and make use of it. See comment above use of babarrierwrite in sys/ufs/ffs/ffs_alloc.c.

In D26511#590229, @kib wrote:

As long as the buffer remains locked until it is successfully written, this should be fine.

Sorry, I do not fully understand you note. Do you mean that it is fine to have B_BARRIER set as far as buffer is not unlocked ? If not, could you please clarify.

We are depending on this being written before it can be used. If it were unlocked, then some other thread could get it and make use of it. See comment above use of babarrierwrite in sys/ufs/ffs/ffs_alloc.c.

I still do not understand what do you mean by 'use'. In the kostik1316 dump, most probable scenario was that babarrierwrite() for the inode block failed with ENOSPC, and then bufdone() does brelse() on the buffer. So it is unlocked, but this is somewhat unrelated to the issue of leaking B_BARRIER.

In D26511#590951, @kib wrote:
In D26511#590229, @kib wrote:

As long as the buffer remains locked until it is successfully written, this should be fine.

Sorry, I do not fully understand you note. Do you mean that it is fine to have B_BARRIER set as far as buffer is not unlocked ? If not, could you please clarify.

We are depending on this being written before it can be used. If it were unlocked, then some other thread could get it and make use of it. See comment above use of babarrierwrite in sys/ufs/ffs/ffs_alloc.c.

I still do not understand what do you mean by 'use'. In the kostik1316 dump, most probable scenario was that babarrierwrite() for the inode block failed with ENOSPC, and then bufdone() does brelse() on the buffer. So it is unlocked, but this is somewhat unrelated to the issue of leaking B_BARRIER.

The block being written with the barrier is a newly allocated block of inodes. The write is done asynchronously and the cylinder group is then updated to reflect that the additional inodes are available. The reason for the barrier is so that the cylinder group buffer with the newly expanded inode map cannot be written before the newly allocated set of inodes.

Since the incore cylinder group includes the newly created inodes, some other thread can come along and try to use one of those newly allocated inodes. but, it will block on the inode buffer until its write has completed.

The bug in this instance is that there is an assumption that the write cannot fail. Clearly this is a bad assumption. So, the correct fix is to not depend on the barrier write, but rather to create a callback that updates the cylinder group once the write of the new inodes has completed successfully.

I completed a test of D26511.77362.diff without seeing any problems.

In D26511#590951, @kib wrote:
In D26511#590229, @kib wrote:

As long as the buffer remains locked until it is successfully written, this should be fine.

Sorry, I do not fully understand you note. Do you mean that it is fine to have B_BARRIER set as far as buffer is not unlocked ? If not, could you please clarify.

We are depending on this being written before it can be used. If it were unlocked, then some other thread could get it and make use of it. See comment above use of babarrierwrite in sys/ufs/ffs/ffs_alloc.c.

I still do not understand what do you mean by 'use'. In the kostik1316 dump, most probable scenario was that babarrierwrite() for the inode block failed with ENOSPC, and then bufdone() does brelse() on the buffer. So it is unlocked, but this is somewhat unrelated to the issue of leaking B_BARRIER.

The block being written with the barrier is a newly allocated block of inodes. The write is done asynchronously and the cylinder group is then updated to reflect that the additional inodes are available. The reason for the barrier is so that the cylinder group buffer with the newly expanded inode map cannot be written before the newly allocated set of inodes.

Since the incore cylinder group includes the newly created inodes, some other thread can come along and try to use one of those newly allocated inodes. but, it will block on the inode buffer until its write has completed.

The bug in this instance is that there is an assumption that the write cannot fail. Clearly this is a bad assumption. So, the correct fix is to not depend on the barrier write, but rather to create a callback that updates the cylinder group once the write of the new inodes has completed successfully.

In the specific case of kostik1316, doing that would deadlock the machine. Problem is that CoW write returned ENOSPC, which means that there is no way to correctly free space on the volume. I am not sure what do to there. Most likely any other write would also return ENOSPC, so in fact consistency of the volume is not too badly broken if we just fail there. If we start tracking the write as a dependency for cg write, then all that buffers add to the dirty space which eventually hang buffer subsystem.

I believe (based on Peter testing) that just erroring write allows us to unmount.

In D26511#591360, @kib wrote:

The block being written with the barrier is a newly allocated block of inodes. The write is done asynchronously and the cylinder group is then updated to reflect that the additional inodes are available. The reason for the barrier is so that the cylinder group buffer with the newly expanded inode map cannot be written before the newly allocated set of inodes.

Since the incore cylinder group includes the newly created inodes, some other thread can come along and try to use one of those newly allocated inodes. but, it will block on the inode buffer until its write has completed.

The bug in this instance is that there is an assumption that the write cannot fail. Clearly this is a bad assumption. So, the correct fix is to not depend on the barrier write, but rather to create a callback that updates the cylinder group once the write of the new inodes has completed successfully.

In the specific case of kostik1316, doing that would deadlock the machine. Problem is that CoW write returned ENOSPC, which means that there is no way to correctly free space on the volume. I am not sure what do to there. Most likely any other write would also return ENOSPC, so in fact consistency of the volume is not too badly broken if we just fail there. If we start tracking the write as a dependency for cg write, then all that buffers add to the dirty space which eventually hang buffer subsystem.

I believe (based on Peter testing) that just erroring write allows us to unmount.

What we are trying to do where the barrier write is being used in UFS is expand the number of available inodes. It is OK to abandon the write of the zero'ed out inode block as the expansion can be put off to later. But, we must not update the cylinder group to say that these new inodes are available if we have not been able to zero them out. As it is currently written, the cylinder group is updated but the on-disk inodes are not zero'ed out. If the unmount succeeds in writing out the dirty buffers we will end up with a corrupted filesystem because fsck_ffs will try to check the non-zero'ed out inodes and raise numerous errors trying to correct the inconsistencies that arise from the random data in the uninitialzed inode block.

So, for filesystem consistency the cylinder group cannot be updated until and unless the zero'ed out inodes have been successfully written. That update should be done as a callback attached to the zeroed out inode buffer. That fix should be done separately from this bug fix.

This revision was not accepted when it landed; it landed in state Needs Review.Thu, Oct 8, 10:41 PM
Closed by commit rS366551: Do not leak B_BARRIER. (authored by kib). · Explain Why
This revision was automatically updated to reflect the committed changes.