
Fix interface between nfsclient and vnode pager.
Closed, Public

Authored by kib on Thu, Oct 3, 8:18 AM.

Details

Summary

This patch has three changes squashed together for the review.

  1. In vm_fault(), perform the check for the object size under the vnode lock, if the object has vnode type. The object lock is not enough to guarantee that the internal fs data is consistent, so when we relock the vnode and try to page-in the faulted page, it is possible to get surprising results. I do not believe that this is an issue for the local filesystems, due to the way writes are usually structured.
  2. Ensure that nfsclient always calls vnode_pager_setsize() with the vnode exclusively locked. This ensures that a page fault can always find the backing page if the object size check succeeded.

The main offender breaking the interface in nfsclient is nfs_loadattrcache(), which is called whenever the server responds with updated attributes, which can happen on non-changing operations as well. Also, iod threads only have buffers locked (and even that is LK_KERNPROC), but they still may call nfs_loadattrcache() on an RPC response.

Instead of immediately calling vnode_pager_setsize() when a server response indicates a changed file size but the vnode is not exclusively locked, set a new node flag, NVNSETSZSKIP. When the vnode is exclusively locked, or when we can temporarily upgrade the lock to exclusive, call vnode_pager_setsize(); this is done by providing an nfsclient VOP_LOCK() implementation.

  3. Uncomment the assert in vnode_pager_setsize() and require that all calls are made with the vnode exclusively locked.

The previous version of the patched kernel was used on an NFS-booted machine and survived buildworld/buildkernel and a nightly periodic run.
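As a rough sketch of the deferred-resize scheme described above, the following userspace model may help; it is not the kernel code — the vnode lock state is reduced to a boolean, and all names besides NVNSETSZSKIP are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct nfsnode_model {
	bool	excl_locked;	/* stands in for the vnode lock state */
	bool	nvnsetszskip;	/* deferred-resize flag from the patch */
	uint64_t n_size;	/* file size reported by the server */
	uint64_t obj_size;	/* what vnode_pager_setsize() last applied */
};

/* Called when a server reply carries new attributes. */
static void
loadattrcache_model(struct nfsnode_model *np, uint64_t new_size)
{
	np->n_size = new_size;
	if (np->excl_locked)
		np->obj_size = new_size;	/* safe to resize now */
	else
		np->nvnsetszskip = true;	/* defer to next VOP_LOCK() */
}

/* Called from the VOP_LOCK() implementation on exclusive acquisition. */
static void
lock_excl_model(struct nfsnode_model *np)
{
	np->excl_locked = true;
	if (np->nvnsetszskip) {
		np->obj_size = np->n_size;	/* apply deferred resize */
		np->nvnsetszskip = false;
	}
}
```

The point is that the resize is never lost: either it is applied immediately under the exclusive lock, or the flag makes the next exclusive lock acquisition apply it before a page fault can run its size check.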

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kib created this revision.Thu, Oct 3, 8:18 AM
peterj added a subscriber: peterj.Fri, Oct 4, 4:09 AM

This patch fixes a reproducible NFS deadlock (deadlres_td_sleep_q panic) I was getting on my diskless Rock64 with recent kernels.

markj added a comment.Fri, Oct 4, 10:10 PM

I have a tangential question regarding the vnode size: in vn_lseek()'s L_XTND case, we use VOP_GETATTR to get the file size. Why is it not sufficient to use the size from the pager field, under the shared vnode lock?

kib added a comment.Sat, Oct 5, 10:01 AM

> I have a tangential question regarding the vnode size: in vn_lseek()'s L_XTND case, we use VOP_GETATTR to get the file size. Why is it not sufficient to use the size from the pager field, under the shared vnode lock?

This would only work if v_object != NULL, otherwise you still have to fall back to VOP_GETATTR().

Other than the one inline comment, it looks good to me. However, I don't know enough about the locking/vm stuff to say whether nfs_lock() is correct.
Delaying doing the vnode_pager_setsize() until a lock operation on the vnode sounds reasonable to me.

I have not found a problem with this in testing so far.

I'm probably not the correct guy to do the actual review, given my lack of knowledge w.r.t. the lock/vm code,
but if no one else wants to click "reviewed", I will do so.

There are still a couple of places in ncl_write() that call vnode_pager_setsize() while holding the NFS node lock.
(Those also hold the LK_EXCLUSIVE vnode lock.) They should always be growing the size of the file, but if
there is a problem with making those calls while the NFS node lock is held, the lock can easily be changed to an "sx" lock.

Thanks for doing this. I would have never figured it out, rick

sys/fs/nfsclient/nfs_clvnops.c
319 ↗(On Diff #62859)

lktype != LK_UPGRADE is checked twice. Did you mean one of them to be
lktype != LK_DOWNGRADE?

kib added inline comments.Sun, Oct 6, 8:42 AM
sys/fs/nfsclient/nfs_clvnops.c
319 ↗(On Diff #62859)

It should have been LK_TRYUPGRADE. I am checking for the ops that leave the vnode exclusively locked on success.

Hmm, I might handle LK_DOWNGRADE as well, in fact. But this is all theoretical right now, because I think upgrade/downgrade are not issued for NFS vnodes.
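The classification being discussed — which lock operations leave the vnode exclusively locked on success — can be modeled as a simple predicate. The enum values below are illustrative stand-ins, not the real lockmgr(9) flag encodings:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in values for illustration only; see sys/lock.h for the real ones. */
enum lkop { LK_SHARED, LK_EXCLUSIVE, LK_UPGRADE, LK_TRYUPGRADE, LK_DOWNGRADE };

/*
 * Does a successful lock operation of this kind leave the vnode
 * exclusively locked?  These are the cases where a deferred
 * vnode_pager_setsize() could safely be applied afterward.
 */
static bool
op_yields_exclusive(enum lkop op)
{
	return (op == LK_EXCLUSIVE || op == LK_UPGRADE ||
	    op == LK_TRYUPGRADE);
}
```

The bug under discussion amounts to testing LK_UPGRADE twice in such a predicate and never testing LK_TRYUPGRADE, so try-upgrades that succeeded would skip the deferred resize.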

kib updated this revision to Diff 62958.Sun, Oct 6, 8:42 AM

Second LK_UPGRADE should have been LK_TRYUPGRADE.

Added a reply to the inline comment.

sys/fs/nfsclient/nfs_clvnops.c
319 ↗(On Diff #62859)

There is one case of LK_UPGRADE and no LK_DOWNGRADEs or LK_TRYUPGRADEs (except the one your
nfs_lock code adds) in the NFS client.

I now realize that the LK_DOWNGRADE case would imply that there is already
a LK_EXCLUSIVE lock held, so the handling of that would be at the beginning,
before doing the lock op, I think?
But, as you say, this is theoretical at this time, since there aren't any LK_DOWNGRADEs.

kib added inline comments.Sun, Oct 6, 6:43 PM
sys/fs/nfsclient/nfs_clvnops.c
319 ↗(On Diff #62859)

LK_UPGRADE/DOWNGRADE come mostly from namei(9) when handling the last path element and locking request from the caller.

LK_DOWNGRADE means that the current lock is indeed exclusive, but NVNSETSZSKIP is not synchronized by the vnode lock, so it could be set precisely because another thread did not lock the vnode. And since we own the lock exclusively, it might be good to call vnode_pager_setsize().

But in fact, the only critical moment where we must not miss the call is when vget() is called from vm_fault(), see onfault detection below. It is that de-synchronization which caused the reported SIGSEGV, I believe.

emaste added a subscriber: emaste.Tue, Oct 8, 2:56 PM
markj added a comment.Tue, Oct 8, 10:02 PM

I think the fault handler change is ok. I read through the NFS changes but am not very confident in that area.

I was somewhat surprised to see that vm_page_alloc() and its callees do not verify that the requested pindex is < object->size. Indeed, vm_reserv_alloc_page() even handles the case where pindex >= object->size.

sys/vm/vm_fault.c
842 ↗(On Diff #62958)

I would note here that the vnode case is handled later, after the vnode is locked and just before pagein.

kib marked an inline comment as done.Wed, Oct 9, 1:20 AM

> I think the fault handler change is ok. I read through the NFS changes but am not very confident in that area.

Are you fine with committing both vm changes with 'Reviewed by: you'?

> I was somewhat surprised to see that vm_page_alloc() and its callees do not verify that the requested pindex is < object->size. Indeed, vm_reserv_alloc_page() even handles the case where pindex >= object->size.

This is because we actually have pages beyond EOF on the object's queue. The main offender is UFS, where negative logical block numbers are used for indirect blocks and extended-data blocks. The negative block indexes are converted into very large pindexes, which cannot equal the pindex of any data page, due to sign extension.
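The sign-extension effect described above can be demonstrated in isolation. This is a standalone model; the helper name and the pages-per-block factor are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a (possibly negative) logical block number to a page index
 * using unsigned arithmetic, as the kernel effectively does.  A negative
 * lbn sign-extends into a value near UINT64_MAX, far beyond any pindex
 * a real data page could occupy.
 */
static uint64_t
lbn_to_pindex(int64_t lbn, uint64_t pages_per_block)
{
	return ((uint64_t)lbn * pages_per_block);
}
```

For example, with 4 pages per block, lbn -1 maps to UINT64_MAX - 3, which can never collide with the pindex of a data page.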

kib updated this revision to Diff 63070.Wed, Oct 9, 1:20 AM
kib added a subscriber: pho.

Add comment in vm_fault.c.

kib added a comment.Wed, Oct 9, 1:21 AM

Peter, could you please test this? A whole test suite run is required; just the NFS part is not enough.

markj added a comment.Wed, Oct 9, 1:28 AM
In D21883#479546, @kib wrote:

>> I think the fault handler change is ok. I read through the NFS changes but am not very confident in that area.

> Are you fine with committing both vm changes with 'Reviewed by: you'?

Yes.

>> I was somewhat surprised to see that vm_page_alloc() and its callees do not verify that the requested pindex is < object->size. Indeed, vm_reserv_alloc_page() even handles the case where pindex >= object->size.

> This is because we actually have pages beyond EOF on the object's queue. The main offender is UFS, where negative logical block numbers are used for indirect blocks and extended-data blocks. The negative block indexes are converted into very large pindexes, which cannot equal the pindex of any data page, due to sign extension.

I see, thanks. I knew about the negative block numbers but did not remember that they are used in VMIO.

alc added inline comments.Wed, Oct 9, 3:11 AM
sys/vm/vm_fault.c
1039–1043 ↗(On Diff #63070)

Can you please elaborate on how this delayed size check helps? The object's size field is only changed with the object lock held, yes? And, I believe that the object lock is held continuously from the earlier size check until this new one. In other words, how does the size field change between the earlier check and this new one?

kib added a comment.Wed, Oct 9, 8:29 AM
> In D21883#479546, @kib wrote:

>>> I think the fault handler change is ok. I read through the NFS changes but am not very confident in that area.

>> Are you fine with committing both vm changes with 'Reviewed by: you'?

> Yes.

>>> I was somewhat surprised to see that vm_page_alloc() and its callees do not verify that the requested pindex is < object->size. Indeed, vm_reserv_alloc_page() even handles the case where pindex >= object->size.

>> This is because we actually have pages beyond EOF on the object's queue. The main offender is UFS, where negative logical block numbers are used for indirect blocks and extended-data blocks. The negative block indexes are converted into very large pindexes, which cannot equal the pindex of any data page, due to sign extension.

> I see, thanks. I knew about the negative block numbers but did not remember that they are used in VMIO.

After thinking some more, there are even more obvious cases where vm_page_alloc() is performed beyond the object size. Most implementations of VOP_WRITE() call vnode_pager_setsize() only after the buffer for the write is created. Sometimes vnode_pager_setsize() is only called after a successful bwrite() of that buffer.

sys/vm/vm_fault.c
1039–1043 ↗(On Diff #63070)

Yes, of course there are no issues with consistency for vm_fault() itself. The problem is the inconsistency of the user-visible file vs. the mapping state. Since, for instance, iods finish I/O without taking the NFS vnode lock, it is possible for userspace to see the larger (extended) file while the vnode_pager_setsize() call from the iod is blocked on the object lock, so vm_fault() effectively sees a shorter file. The result is the spurious SIGSEGV that was reported.
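The inconsistency described here can be illustrated with a sequential model of the size check; the names are hypothetical, and the real check compares a page index against the vm object's size:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Two views of the file size, in pages: what the user-visible file
 * attributes say, and what the vm object still records because
 * vnode_pager_setsize() has not run yet.
 */
struct sizes {
	uint64_t file_pages;	/* user-visible file size */
	uint64_t object_pages;	/* stale vm object size */
};

/* vm_fault() rejects page indexes at or beyond the object size. */
static bool
fault_size_check_model(const struct sizes *s, uint64_t pindex)
{
	return (pindex < s->object_pages);
}
```

With file_pages = 2 but object_pages still 1, a fault on page index 1 fails even though the file legitimately contains that page; once the pending vnode_pager_setsize() brings the object size up to date, the same fault succeeds.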

kib updated this revision to Diff 63169.Fri, Oct 11, 9:32 PM

Instead of unconditionally moving the object size check for vnodes, ensure that the vnode is locked at the existing check place. Only do the locking for filesystems that require it, by a flag on the object.

Apparently skipping the check for resident pages exposes pages outside the EOF which should not be touched.
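The revised approach — lock the vnode before the existing size check, but only for objects whose filesystem opts in via a flag — might be sketched like this. The flag name is invented for the sketch, and the real patch's locking details differ:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	OBJ_SIZE_VNLOCK_MODEL	0x1	/* made-up name for the opt-in flag */

struct obj_model {
	unsigned flags;
	uint64_t size_pages;
	bool	 vnode_locked;
};

/*
 * Fault-path ordering: for flagged (e.g. NFS-backed) objects, take the
 * vnode lock before the size check, so the check cannot race with a
 * concurrent vnode_pager_setsize().  Unflagged objects keep the old,
 * cheaper behavior.
 */
static bool
fault_size_check(struct obj_model *obj, uint64_t pindex)
{
	if ((obj->flags & OBJ_SIZE_VNLOCK_MODEL) != 0)
		obj->vnode_locked = true;	/* lock vnode first */
	return (pindex < obj->size_pages);
}
```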

The new revision makes changes in the vm part that I don't understand, so I am afraid I can't
review them.

sys/fs/nfsclient/nfs_clport.c
570 ↗(On Diff #63169)

It might be nice to have a comment here noting that the function
unlocks the NFS node, but only when nsizep == NULL.
(I find I have to look at the function closely to notice this.)
But it is just a suggestion.

kib updated this revision to Diff 63181.Sat, Oct 12, 4:20 PM
kib marked an inline comment as done.

Add comment trying to explain convoluted ncl_pager_setsize() interface.

kib updated this revision to Diff 63208.Sun, Oct 13, 7:13 AM

Rebase.

Thanks for the comment.

I've run this version of the patch through a test cycle and have not seen problems.
(Of course, the previous patch worked for my testing, as well.)

kib updated this revision to Diff 63285.Tue, Oct 15, 7:29 AM

Rebase.
Disable the assert for ZFS; I do not intend to fix it now.

kib added a comment.Tue, Oct 15, 7:29 AM

This version passes stress2, according to Peter's report.

kib added a comment.Wed, Oct 16, 7:18 AM

Alan, Mark, could you please look at the sys/vm part of the patch? It was changed, and I think it was simplified compared to the original version. The change is now that the vm_fault() path can take the vnode lock earlier, compared with the previous, more delicate (and apparently buggy) move of the size check.

markj accepted this revision.Wed, Oct 16, 7:33 PM
In D21883#481629, @kib wrote:

> Alan, Mark, could you please look at the sys/vm part of the patch? It was changed, and I think it was simplified compared to the original version. The change is now that the vm_fault() path can take the vnode lock earlier, compared with the previous, more delicate (and apparently buggy) move of the size check.

The VM bits look good to me.

This revision is now accepted and ready to land.Wed, Oct 16, 7:33 PM
kib added a comment.Sat, Oct 19, 7:07 PM

I intend to commit this patch set tomorrow or on Monday.