Page MenuHomeFreeBSD

NFS remove: Send the remove RPC when a vnode is in-use with more than 1 link.
Needs ReviewPublic

Authored by bdrewery on May 6 2016, 9:00 PM.

Details

Summary

Put another way, only do the silly rename on a file in-use if it has 1 link
count and hasn't been renamed yet.

Before this if a vnode had more than 1 link and the directory it was in was
removed with 'rm -rf', then the silly rename would rename the file and the
'rm' would fail since FTS didn't know about the newly renamed file and would
not remove the .nfs file before trying the rmdir(). This is still the case
when there is only 1 link on an in-use vnode, but now if there is more than 1
link then the file is not renamed and the remove RPC is sent, thus the 'rm
-rf' can succeed.

This manifested by hardlinking an executable from a temporary directory into
/, and then running the / version while removing the temporary version.

Sponsored by: EMC / Isilon Storage Division

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 3608
Build 3648: arc lint + arc unit

Event Timeline

bdrewery updated this revision to Diff 15999.May 6 2016, 9:00 PM
bdrewery retitled this revision from to NFS remove: Send the remove RPC when a vnode is in-use with more than 1 link..
bdrewery updated this object.
bdrewery edited the test plan for this revision. (Show Details)
bdrewery added reviewers: rmacklem, kib.
bdrewery added a subscriber: cem.
kib edited edge metadata.May 7 2016, 8:16 PM

I really do not understand the patch. Putting aside any description of the situation, look at the conditions:
if left-most part of the if() condition does not hold, i.e. vrefcnt(vp) != 1, I claim that vrefcnt(vp) > 1. This is because VOP_REMOVE() is always preceeded by vget(), typically done by namei().

As the consequence, IMO the patch is no-op.

kib added a comment.May 7 2016, 8:18 PM
In D6254#133087, @kib wrote:

As the consequence, IMO the patch is no-op.

Sorry, my bad, of course it is not nop, it ignores the n_sillyrename. I do not understand this even more.

rmacklem edited edge metadata.May 7 2016, 9:21 PM

The code has been this way (without this patch) for over 20 years.
(I just looked in the old client and the same code is in FreeBSD4.)
It basically allows the remove to happen if the file's link cnt is
greater than one and it has already been silly renamed.

Since the link cnt is > 1, it would seem that the data won't go away
and, as such, the remove should happen. So why does the code
only do this if it has already been silly renamed?

  • I honestly don't know, but I am concerned that there might have been some application that ran across multiple clients (removing the different hard links) that broke when the name is unlinked instead of silly renamed.
    • The NFS remove takes the file handle for the directory and a name, so it won't remove the silly renamed file name unless it specifies that one.
    • Does this break any real application people care about? I have no idea. When I did the new NFS client, I carefully retained these semantics, so that the new NFS client was "bug compatible with the old one".

The VOP_GETATTR() could get stale cached attributes, so I would at least recommend
that the cached attributes be invalidated before the VOP_GETATTR() call so it
gets an up to date link cnt from the server.
Maybe:

if (vp->v_type == VDIR)
     return (EPERM);
if (vrefcnt(vp) > 1) {
     mtx_lock(&np->n_mtx);    *** You can just crib these lines from near the end of nfs_remove().
     np->n_attrstamp = 0;
     mtx_unlock(&np->n_mtx);
}
if (vrefcnt(vp) == 1 || .....

Sorry, I have no idea how to modify the patch within phabricator, rick.

Maybe these guys have more insight into why a file with multiple
hard links still gets renamed in the unpatched code?

kib added a comment.May 7 2016, 9:39 PM

I looked at the CSRG repository, and the latest revision of nfs/nfs_vnops.c already has the va_nlink > 1 code.

The change was apparently introduced by the rev. 68653 by mckusick which has the commit message "massive update to incorporate version 3 protocol from Rick Macklem", see https://kib.kiev.ua/viewvc/csrg/sys/nfs/nfs_vnops.c?r1=68539&r2=68653 .

kib added a comment.May 7 2016, 9:44 PM

Well, assume that we do allow to remove a file which has nlink == 1 and sillyrenamed. Then, since it is sillyrenamed (and v_usecnt > 1), there are open file descriptors or mappings for that file on the client. On the other hand, server does not know this and removal of the file would destroy the data, and io starts getting ESTALE.

If nlink > 1, then removal of the sillyrenamed name entry does not really removes file on server, so the NFS filehandle is still actual and the client io rpcs for open file or mappings would be still satisfied.

And do you expect me to remember why I coded it the way I did 30years ago?

If I understood the proposed patch, the case of va_nlink == 1 wasn't in dispute (ie. silly rename it).

The case this patch changes is va_nlink > 1 && vrefcnt(vp) > 1
--> without this patch, the code still silly renames it
however
--> with this patch, the remove goes ahead (assuming the file won't go away, since va_nlink > 1)

The potential problems I saw with this is that va_nlink may not still be > 1 when the server receives
the remove.
1 - the va_nlink attribute is a stale cached value
This is fixed by the variant of the patch I suggested.
2 - another client removes the other hard link(s).
This one is very oddball case and may not actually be an issue. If I was to guess why I coded it the
way I did 30years ago, it might be a concern that this could happen and the principal that doing
the silly rename retains the data (and I considered retaining the data the safe alternative?).

dfr edited edge metadata.May 8 2016, 8:23 AM

I think the patch is probably ok. As you point out, it does open a race where some other client can remove the file while we still have it open but its not clear how likely that is in the real world.

Off topic, we should implement the NFSv4 OPEN4_RESULT_PRESERVE_UNLINKED flag which would allow us to safely turn off sillyrename entirely.

kib added a comment.May 8 2016, 8:42 AM
In D6254#133138, @dfr wrote:

I think the patch is probably ok.

I disagree.

As I noted previously, the patched condition is equivalent to

else if (vrefcnt(vp) == 1 || vattr.va_nlink > 1) {

i.e. just disabling the sillyrename check.

Are we fine with it ? I am not, since it breaks the bit of POSIX semantic for the single-opened file which has the only link left, provided by the previous sillyrename.

And if the majority do think that the patch is right and can be committed, the simplified form above should be used instead.

I don't know whether or not the semantic change is a good idea,
but I do agree with kib@ that the simpler form of the if conditional
is more readable and semantically equivalent to the proposed patch:

.. if (vrefcnt(vp) == 1 || (VOP_GETATTR(vp, &vattr, cnp->cn_cred) == 0 && vattr.va_nlink > 1)) {

I do think it would be a good idea to flush the attribute cache before the VOP_GETATTR()
call for the vrefcnt(vp) > 1 case, as I noted before.

Re: OPEN4_RESULT_PRESERVE_UNLINKED
This is probably not the place to discuss this, so if anyone wants to
discuss it further, I'd suggest an email thread on freebsd-fs@.

Imagine the scenario:

  • client opens file getting OPEN4_RESULT_PRESERVE_UNLINKED in reply
  • client removes file
  • server crashes/reboots
  • client recovers Open during grace period after reboot

--> Now the file data must still be accessible via the file handle.

  • client closes file, server must delete data

As such, for FreeBSD, the only way I can see of implementing this would
be close to what sillyrename is:

  • create a hard link to the file in a "hidden" location on the file system upon open (Maybe some directory with a name that starts with .nfs in the root dir of the FS?)
  • on close, delete the hard link

OR

  • at end of grace period after reboot, delete the hard link for opens not reclaimed.

--> Making this work might take some trick like creating the hard link with the file

 handle encoded in the file's name. (A file handle can be up to 128 bytes, so the
encoding can't just be hex digits in ascii, but it can be done. I did this in the packrat
stuff that never went into head, but still lives in a projects svn tree.)

I suppose doing the above in the server is cleaner than sillyrename, but it isn't much different
and it is a fair amount of work.

In D6254#133139, @kib wrote:
In D6254#133138, @dfr wrote:

I think the patch is probably ok.

I disagree.
As I noted previously, the patched condition is equivalent to

else if (vrefcnt(vp) == 1 || vattr.va_nlink > 1) {

i.e. just disabling the sillyrename check.

True. That was an oversight on my part.

jhb edited edge metadata.May 9 2016, 6:36 PM

I suspect that the issue is something like this:

  • a file has two hard links on the NFS server 'a' and 'b'
  • A process has 'a' open
  • This client tries to rm one of 'a' and 'b'
  • Right now this always does a sillyrename until the fd for 'a' is closed keeping the data valid
  • In the proposed change, it would not do a sillyrename.
  • If this same client tries to rm 'b', then the old code would just rm 'b' (already a sillyrename). The new code would now do a sillyrename of 'b'.
  • If a different client removes 'b', then the old code would have sillyrenamed 'a' and the file data would still be valid, but the new code would just remove 'b' and the open fd for 'a' would get I/O errors.

On the one hand, this is not too different from having another client remove both 'a' and 'b' (you always lose in this case). If we were fine with that though for local removes then we wouldn't bother implementing sillyrename at all.

I think the difference is that sillyrename on all removes (current code) means that _any_ remove on the same client will still provide POSIX-like semantics, but that with the patch, only if the client happens to remove the last link will we guarantee that the data is preserved. Thus, sillyrename is only reliably useful in the current code. The applied patch makes sillyrename unreliable, and I think you are better off either removing it entirely or having it work reliably, not an inconsistent state in the middle.

One thing that you might want to consider is doing this optionally,
based on a sysctl. That way, if anyone runs into difficulties due to
the subtle change in semantics, they can disable it.

You can also decide if you want to enable it by default or not?

This is what I do when I'm concerned that the change in semantics
might result in a POLA violation/regression.

I didn't realize that the .nfs file was restricted from being deleted itself but I see that now. It means there is no workaround possible for my problem (such as in FTS) except to not do it. Sure a sysctl could work but I don't like adding in a hack for something that is seen as broken.

There's a lot of talk of multiple clients. Perhaps I don't understand the context here but with the current code a client on a different system could remove the .nfs file as well and cause the client with the open reference to go stale. So given that, I don't see why multiple clients should be in the discussion. It's outside of our control. Same for someone removing the file from the server. Related to the OPEN4_RESULT_PRESERVE_UNLINKED concern, it's outside of our control as a client.

Next, I've been told multiple times "don't do that" for parallel removal of the same files over NFS. Doing so can result in ESTALE errors in rm(1) that causes rm(1) to return non-zero status even with -f. We should pick a policy and stick with it - parallel removals OK or NOT OK.

A stateless network FS is going to be racy, if we only sillyrename on the last link, it opens some races with *other clients* but I don't think it does so in a real world scenario or in a case that we can control. There's already the server and other client problem of them removing the .nfs file with the current code, so my change won't really harm anything further IMO. My proposed change would also retain the .nfs file while it is in-use on our client.

kib added a comment.May 25 2016, 11:23 AM

I didn't realize that the .nfs file was restricted from being deleted itself but I see that now. It means there is no workaround possible for my problem (such as in FTS) except to not do it. Sure a sysctl could work but I don't like adding in a hack for something that is seen as broken.

In fact it is not, not .nfs is restricted. Note that n_sillyrename is the NFS node attribute, it is attached to node and affects removal of any name of the corresponding server node. It might be that idea of only restricting removal to the .nfs placeholder is a good idea.

But even that does not help in your case, and implementing this isn't that easy since you need to store reference to dvp (might be NFS handle would be enough) and track renames. It is doable.