Page MenuHomeFreeBSD

d_off support for multiple filesystems
ClosedPublic

Authored by jack_gandi.net on Nov 9 2018, 9:55 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Sep 17, 6:12 PM
Unknown Object (File)
Tue, Sep 17, 7:04 AM
Unknown Object (File)
Tue, Sep 17, 7:04 AM
Unknown Object (File)
Tue, Sep 17, 7:00 AM
Unknown Object (File)
Tue, Sep 17, 6:36 AM
Unknown Object (File)
Tue, Sep 17, 6:36 AM
Unknown Object (File)
Tue, Sep 17, 6:36 AM
Unknown Object (File)
Tue, Sep 17, 6:36 AM

Details

Summary

The d_off field has been added to the dirent structure recently. Currently filesystems don't support this feature. Support has been added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs. A stub implementation is available for cd9660, nandfs, udf and pseudofs but hasn't been tested.

Motivation for this feature: our usecase is for a userspace nfs server (nfs-ganesha) with zfs. At the moment we cache direntry offsets by calling lseek once per entry, with this patch we can get the offset directly from getdirentries calls which provides a significant speedup.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

this is a rough check I used for testing. Basically we want the d_off value to be the lseek value *after* reading a given entry. To check for lseek we must only iterate over getdirentries one entry at a time, otherwise the lseek offset will be too far off.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
2546

This probably needs a #ifdef FreeBSD or a #ifndef illumos (same for all zfs changes)

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
2546

The same statement exists "as is" in the other branch of the if statement (14 lines above) though

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
2546

You are right, never mind

overlooked a zfsctl_snapdir_readdir vfs call used for zfs snapshots.

Maybe a sweep with opengrok is a good idea to see where we should be using this.

Can you please check lib/libc/sys/getdirentries.c line 60:

dst->d_off = 0;	/* nothing uses it yet, so safe for now */

The main feasibility test is in-kernel nfs server.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
2547

() are excessive.

2572

Write comments as the full sentences, i.e. upper-case the first word, end with dot.

In D17917#382778, @pfg wrote:

Maybe a sweep with opengrok is a good idea to see where we should be using this.

I don't know about opengrok, but this is a new feature only affecting a small part of the ABI so it seems pretty well contained.

Can you please check lib/libc/sys/getdirentries.c line 60:

dst->d_off = 0;	/* nothing uses it yet, so safe for now */

This is compatibility code called in __cvt_dirents_from11, it only called for kernels previous to the 64bit inode switch (see below). In this case there's no way userspace can expect anything meaningful in d_off.

lib/libc/sys/getdirentries.c line 87:

if (__getosreldate() >= INO64_FIRST)
    return (__sys_getdirentries( ... ))

.
.
.

__cvt_dirents_from11( ... )
In D17917#382778, @pfg wrote:

Maybe a sweep with opengrok is a good idea to see where we should be using this.

I don't know about opengrok, but this is a new feature only affecting a small part of the ABI so it seems pretty well contained.

opengrok is a web based grep, see src.illumos.org, just for curiosity ;).

Can you please check lib/libc/sys/getdirentries.c line 60:

dst->d_off = 0;	/* nothing uses it yet, so safe for now */

This is compatibility code called in __cvt_dirents_from11, it only called for kernels previous to the 64bit inode switch (see below). In this case there's no way userspace can expect anything meaningful in d_off.

lib/libc/sys/getdirentries.c line 87:

if (__getosreldate() >= INO64_FIRST)
    return (__sys_getdirentries( ... ))

.
.
.

__cvt_dirents_from11( ... )

Ahh, OK. I only read the comment in a quick sweep., thanks for checking.

jack_gandi.net edited the summary of this revision. (Show Details)
jack_gandi.net marked 3 inline comments as done.
jack_gandi.net marked an inline comment as not done.
jack_gandi.net added inline comments.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
2547

I agree, for the sake of symmetry I also changed the above statement.

It looks fine for me implementation-wise, but I never know about the intended semantic for d_off.

Also note that e.g. devfs/fdescfs/procfs are never going to be exported by nfs server, so the addition of d_off handling there is not strictly necessary. Although, since it is already done, there is no reason to undo.

This revision is now accepted and ready to land.Nov 9 2018, 5:35 PM

While you're at it, would you mind documenting d_off in getdirentries.2? The field is included in the structure definition, but its semantics are not defined there.

I can't seem to remember how to do inline comments (if I ever knew), so I'll stick it here...

I doubt it will matter, but for udf, the last argument to udf_uiodir() is the "offset cookie".
The code sets it to 1 and 2 for "." and "..", since it notes that the real offset is zero and NFS doesn't like that.
Then, for subsequent entries it sets it to "ds->this_off". If you look in udf_getfid(),
it sets "ds->this_off" to "ds->offset + ds->off", so I think your "else" case is correct,
but might be more consistent if you used "ds->this_off".
(SInce I don't even know what UDF is, I suspect no one exports it via NFS and it doesn't matter anyhow?)

The other changes look pretty much the same as what I did a couple of years ago for a patch that never
made it into head (I never tested the UDF case, so I can't say I got it right using "1", "2" and "ds->this_off"?).

Once this goes in, I can look at using d_off in the kernel nfsd. (I have old patches for that stuff too.)

Added to the getdirentries.2 man page. Also make udf_readdir more consistent as requested. Apparently udf is used for DVDs and optical disks so it shouldn't matter anyway.

This revision now requires review to proceed.Nov 12 2018, 9:49 AM

Nice change. Can you bump the .Dd of the man page as this a content change. Thanks!

lib/libc/sys/getdirentries.2
94 ↗(On Diff #50309)

What does the distance mean ? If I take it literally as the number of bytes which should be read to get from current dirent to the next one, it is false, see UDF. If the interpretation should be as the value passed to lseek(2) to position the read at the next dirent (I dare to say 'cookie'), then it has not much to do with any byte count.

jack_gandi.net added inline comments.
lib/libc/sys/getdirentries.2
94 ↗(On Diff #50309)

You are correct in most cases it isn't equivalent to d_reclen and is entirely calculated by the underlying filesystem. I took the 'distance' term from the linux manpage and I feel it is correct here because it fits the mathematical meaning of 'distance' (see https://en.wikipedia.org/wiki/Metric_(mathematics) ). Cleary it shouldn't be confused for byte count because it's not, if you like I can change the term to 'offset' or 'cookie'

lib/libc/sys/getdirentries.2
94 ↗(On Diff #50309)

No, what is missing is not the juggling of the terms. What can user do with the value from d_off ? Man page must answer this question to be useful.

If the only answer is for d_off to serve as the argument to lseek(2), just state it. I wondered for long time if there are any other uses or restrictions.

Assuming you resolve the man page issue that kib@ has commented on, this looks ok to me.

This revision is now accepted and ready to land.Nov 12 2018, 3:49 PM

Added an inline suggestion related to the man page update.

lib/libc/sys/getdirentries.2
94 ↗(On Diff #50309)

You might also mention that d_off is meant to be compatible with the
directory_offset_cookie in an NFS Readdir reply, since that is what you
are using it for. (Just a possible suggestion. I'll leave it up to you.)

jack_gandi.net marked an inline comment as done.
This revision now requires review to proceed.Nov 12 2018, 4:44 PM

Also please add an implementation note to mention readdir plus NFS RPC response, as suggested by Rick.

lib/libc/sys/getdirentries.2
92 ↗(On Diff #50321)

I still do not like/cannot agree with this terminology. What are the units for this metric ?

Traditionally, we use the terms 'handle' when describing some abstract value which allows to manipulate an object. The handle from d_off allows lseek(2) to position the directory descriptor to the location for returning the next dirent, is this correct ? I do not see a reason why the term handle would be not appropriate there.

My intent when adding the d_off field was that it be used for this purpose, so I am glad to see someone actually making it happen.

Some filesystems use a B-tree or other non-linear structure for their directories, so my thought was that d_off be treated as a cookie that would have to be passed to seekdir (rather than lseek) to get back to the next entry. So I am not entirely sure that describing it as an lseek offset is the correct semantics. Passing it as a cookie to seekdir rather than as an offset to lseek would allow us to support a wider range of filesystems. Of course, at the moment seekdir is just using it as an offset to lseek, so if that is what Linux does, then we should probably not diverge from that.

Some filesystems use a B-tree or other non-linear structure for their directories, so my thought was that d_off be treated as a cookie that would have to be passed to seekdir (rather than lseek) to get back to the next entry.

I agree that d_off should be an arbitrary cookie, but don't understand the suggestion for seekdir vs lseek.

So I am not entirely sure that describing it as an lseek offset is the correct semantics. Passing it as a cookie to seekdir rather than as an offset to lseek would allow us to support a wider range of filesystems.

I believe for OneFS we would be fine interpreting defining d_off as a valid lseek offset.

Of course, at the moment seekdir is just using it as an offset to lseek, so if that is what Linux does, then we should probably not diverge from that.

Right — seekdir is just some additional complexity and seatbelt around lseek() to validate that inputs match outputs from telldir(3).

OK, keeping it as currently specified is fine. I was concerned that using it as an lseek parameter might be an issue for ZFS,and OneFS, but given that it appears that these filesystems are fine with that tells me that the above implementation works fine.

This revision is now accepted and ready to land.Nov 12 2018, 10:49 PM
In D17917#383719, @kib wrote:

Also please add an implementation note to mention readdir plus NFS RPC response, as suggested by Rick.

I'm not sure what you mean by this, should I add a new section to the manpage?

lib/libc/sys/getdirentries.2
94 ↗(On Diff #50309)

@kib It's stated that it can be used with lseek(2) in the next chunk (lines 149-151) already, along with the other lseek compatible values. It's probably not a good idea to mention nfs in the manpage as the d_off field can potentially have other uses and nfs is just one of many applications.

In D17917#383719, @kib wrote:

Also please add an implementation note to mention readdir plus NFS RPC response, as suggested by Rick.

I'm not sure what you mean by this, should I add a new section to the manpage?

Yes, this is most suitable for 'IMPLEMENTATION NOTES section.

lib/libc/sys/getdirentries.2
94 ↗(On Diff #50309)

Formally, the next chunk does not state what you get when you lseek(2) to the d_off.

I do not see a problem with reformulating this chunk like this:
The d_off entry returns a handle which can be used with lseek(2) to position the directory descriptor to the start of the entry next to the current.

"Cookie" or "handle" are definitely good words here. The part "to position the directory descriptor to the start of the entry next to the current" could be simplified a bit to "to position the directory descriptor to the next entry" since the position cannot be in the middle of an entry.

The man page share/man/man5/dir.5 incorrectly states that d_off is /* directory offset of entry */; this must be "of next entry".

This revision now requires review to proceed.Nov 14 2018, 10:42 AM
In D17917#383965, @kib wrote:
In D17917#383719, @kib wrote:

Also please add an implementation note to mention readdir plus NFS RPC response, as suggested by Rick.

I'm not sure what you mean by this, should I add a new section to the manpage?

Yes, this is most suitable for 'IMPLEMENTATION NOTES section.

I've added an example of the NFS usage, although after looking at other man (see pipe(2)) it looks like IMPLEMENTATION NOTES is used for details about the implementation rather than how the feature is used.

"Cookie" or "handle" are definitely good words here. The part "to position the directory descriptor to the start of the entry next to the current" could be simplified a bit to "to position the directory descriptor to the next entry" since the position cannot be in the middle of an entry.

The man page share/man/man5/dir.5 incorrectly states that d_off is /* directory offset of entry */; this must be "of next entry".

Finally settled for cookie. sorry for the double upload. I felt like handle sounded like a ressource/allocated which it is not.

Also please provide the full attributions required in the commit message. 'Submitted by: <your name and email>', 'Sponsored by: <name>' (if needed).

lib/libc/sys/getdirentries.2
156 ↗(On Diff #50397)

Start a new sentence from the new line.

This revision is now accepted and ready to land.Nov 14 2018, 11:07 AM
This revision now requires review to proceed.Nov 14 2018, 11:29 AM

Note the request for attributes/metadata for the change. I will commit after you provide the info.

In D17917#384287, @kib wrote:

Note the request for attributes/metadata for the change. I will commit after you provide the info.

Uploaded in the latest diff in the commit message, if you can't see it here it is also:

Submitted by:  Jack Halford <jack@gandi.net>
Sponsored by:  Gandi.net

I was trying to svn commit on my own as described on the wiki, no wonder it didn't work.

This revision was not accepted when it landed; it landed in state Needs Review.Nov 14 2018, 2:18 PM
This revision was automatically updated to reflect the committed changes.