Page MenuHomeFreeBSD

Export extattr_*vp() functions to allow to use it in the linuxulator in future
Needs ReviewPublic

Authored by fsu on Nov 27 2017, 11:45 AM.

Details

Reviewers
cem
pfg
kib
rwatson
Summary

These functions will be used when linuxulator xattr syscalls support will be added.

Test Plan

None.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

fsu created this revision.Nov 27 2017, 11:45 AM
mjg added a subscriber: mjg.Nov 27 2017, 9:04 PM

I think the 'td' argument should be dropped first. It is presumably always curthread. Part of it is that this change pushes the arg num to 7, which is more than what is passable in registers in sysv abi on amd64.

Add kib and rwatson: this is not area really my area.

cem accepted this revision.Nov 28 2017, 3:24 AM

Patch is missing context, but I've just been looking at this file so it's fairly fresh in my mind.

Change looks good to me.

In D13267#276812, @mjg wrote:

I think the 'td' argument should be dropped first. It is presumably always curthread. Part of it is that this change pushes the arg num to 7, which is more than what is passable in registers in sysv abi on amd64.

Is it more expensive to spill one argument to the stack, or to re-read td from curthread later? It is used to check td_ucred.

sys/sys/extattr.h
82 ↗(On Diff #35848)

Whatever this #else is, the Linux extattr compat code will need to depend on it too.

This revision is now accepted and ready to land.Nov 28 2017, 3:24 AM
cem added a comment.Nov 28 2017, 3:25 AM

@kib , the more interesting change is D13209.

kib added a comment.Nov 28 2017, 9:56 AM

Addition of the uio_seg to the functions interfaces is fine.

I suspect that the current implementation of the syscalls is deadlock-prone. For instance, if the caller specifies a buffer for the attrs backed by the file mapping, and buffer is not yet faulted in (or pages were paged out by the page daemon), then uiomove() call from ffs_*extattr* can fault and try to get the vnode lock for that file. This results in potentially wrong order or recursion of the vnode locks.

Since the current ffs implementation limits extattr to two file blocks, it might be simplest to pre-allocate the memory and copy userspace buffers before/after the VOP calls. In other words, the functions should always call the VOPs with UIO_SYSSPACE. Otherwise, vn_io_fault approach must be taken.

kib added a comment.Nov 28 2017, 9:59 AM
In D13267#276922, @cem wrote:

Is it more expensive to spill one argument to the stack, or to re-read td from curthread later? It is used to check td_ucred.

First, the amd64 calling conventions do not force the spill, the first integer argument is always passed in GPR. If td is consistently passed around for all calls as the first arg, %rdi de-facto becomes the TLS pointer.

Second, curthread access is `movq %gs:0, <something>` which requires the segment register prefix. On some older CPUs this caused one more cycle in the frontend unconditionally. I am not sure about latest CPUs.

cem added a comment.EditedNov 28 2017, 4:37 PM
In D13267#276992, @kib wrote:

First, the amd64 calling conventions do not force the spill, the first integer argument is always passed in GPR. If td is consistently passed around for all calls as the first arg, %rdi de-facto becomes the TLS pointer.

AMD64 calling convention only has six GPR arguments and this function now takes seven parameters. The additional argument forces one to the stack. mjg suggested getting rid of td in order to prevent that.

fsu updated this revision to Diff 36236.Dec 5 2017, 12:47 PM

Updated after discussion with @rwatson in the:
https://reviews.freebsd.org/D13209

Changes:

  • The ability of vnode locking control was added.
This revision now requires review to proceed.Dec 5 2017, 12:47 PM
kib added a comment.Dec 5 2017, 1:39 PM

As I already noted, this change is around the architecturally wrong code. uiomove(9) with UIO_USERSPACE must not be performed while a vnode is locked.

kern/vfs_extattr.c
172 ↗(On Diff #36236)

Please extract this into separate change.

210 ↗(On Diff #36236)

Why did you reordered the lines ?

373 ↗(On Diff #36236)

This assert is not needed. VOP stub does the check several lines below.

fsu updated this revision to Diff 36292.Dec 6 2017, 10:42 AM
fsu marked an inline comment as done.

Switch extattr_*_vp() to use input buffers from kernelspace.

kern/vfs_extattr.c
172 ↗(On Diff #36236)

Sorry, did not got what you want, that I should to do with this if().
Do this change should be moved to another phabricator review?

210 ↗(On Diff #36236)

It was copy-paste from vfs_vnops.c:vn_extattr_set().
So changed.

kib added inline comments.Dec 6 2017, 3:33 PM
kern/vfs_extattr.c
172 ↗(On Diff #36236)

Provide separate revision and commit it.

237 ↗(On Diff #36292)

This approach is not going to work, either. If user directly controls the size of allocation, it is trivial to deadlock or panic the kernel.

It would be nice to ditch the requirement for exclusive vnode lock acquisition in VFS for extended attribute writes. But I'm not sure the filesystems would approve of that..?

kib added a comment.Dec 6 2017, 4:18 PM

It would be nice to ditch the requirement for exclusive vnode lock acquisition in VFS for extended attribute writes. But I'm not sure the filesystems would approve of that..?

UFS relies on the vnode lock in exclusive mode to ensure inode modifications consistency, including the extattr state.

fsu added a comment.Dec 7 2017, 7:57 AM

Ok, here the buffer size checks in the extattr_*_vp() are moved out from vnode locks:
https://reviews.freebsd.org/D13405

fsu updated this revision to Diff 36335.Dec 7 2017, 8:20 AM
fsu set the repository for this revision to rS FreeBSD src repository.

It is difficult to find a value to restrict extattr value size. Because, as I know, in case of ntfs and hfs, the extattr sizes are "infinity".
So I decided to use the same check for uio size, so we will have two checks one on extattr_*vp() layer and one on upper layer on system call pass.
My opinio is that, the check in the extattr_*vp() layer is needed because this function will be exported as part of kernel interface.

kib added a comment.Dec 7 2017, 9:52 AM
In D13267#280049, @thisisadrgreenthumb_gmail.com wrote:

It is difficult to find a value to restrict extattr value size. Because, as I know, in case of ntfs and hfs, the extattr sizes are "infinity".

It is not difficult, it is impossible. In other words, no size checks would allow to avoid the issue with the userspace access under the vnode (and/or buffer) lock by preallocating malloced buffer and copying into it.

So I decided to use the same check for uio size, so we will have two checks one on extattr_*vp() layer and one on upper layer on system call pass.
My opinio is that, the check in the extattr_*vp() layer is needed because this function will be exported as part of kernel interface.

No. I already pointed you to the vn_io_fault() code. Read the long description in the comment before vn_io_fault_doio() in vfs_vnops.c.

fsu added a comment.Dec 15 2017, 8:27 AM

It is too difficult to me to implement it now, so, seems like, I have a questions to @kib.
As I currently understand, what we want to achive to is, not allow page faults to happen during extattr operations, is insert some sort of logic, like happen between vn_io_fault() and vn_read/vn_write or VOP_READ/VOP_WRITE.
It should be inserted in the extattr_*vp() between uio initialization and VOP_*EXTATTR call.

Difficulties:

  • The code part is not so compact for path vn_io_fault() -> vn_io_fault1() -> VOP_*, may be it is possible to consider implement more generic version of vn_io_fault() not just for read()/write(), but for another operations where uio is required?
  • The logic depends of sysctl variables, ok, here I just need find the way, how to read it in the kernel.
  • As, I told I want to add vnode locking control to extattr_*_vp() functions, I am not clearly understand will be it possible to do vnode lock in case of do_vn_io_fault() returns true. Should be vnode locked before vn_rangelock_*() call or after?

So, @kib, could you please comment it somehow, and it would be great, if you can provide some basic description about how to implement it, or entry point for starting.

kib added a comment.Dec 17 2017, 1:48 PM
In D13267#282203, @fsu wrote:

It is too difficult to me to implement it now, so, seems like, I have a questions to @kib.
As I currently understand, what we want to achive to is, not allow page faults to happen during extattr operations, is insert some sort of logic, like happen between vn_io_fault() and vn_read/vn_write or VOP_READ/VOP_WRITE.

No, not quite. We cannot avoid pagefaults during uiomove(), but we might opt-out and postpone handling of them. vn_io_fault() works by setting the thread state so that page faults always results in EFAULT, even if it can be handled, and prefaulting pages for the next io attempt. This way, we do not enter the real pagefault handler and do not recurse on VFS/VM while in uiomove().

It should be inserted in the extattr_*vp() between uio initialization and VOP_*EXTATTR call.
Difficulties:

  • The code part is not so compact for path vn_io_fault() -> vn_io_fault1() -> VOP_*, may be it is possible to consider implement more generic version of vn_io_fault() not just for read()/write(), but for another operations where uio is required?

Look at vn_io_fault_doio() which implements the actual call to the VOP() after all necessary setup was made. What is needed is mostly to assign new kind value and handle it before the call to vn_io_fault1() and inside vn_io_fault_doio() by using the corresponding VOP.

Extattr VOPs would need to replace uiomove(9) with vn_io_fault_uiomove(9), which is quite straightforward.

  • The logic depends of sysctl variables, ok, here I just need find the way, how to read it in the kernel.

I do not understand this question.

  • As, I told I want to add vnode locking control to extattr_*_vp() functions, I am not clearly understand will be it possible to do vnode lock in case of do_vn_io_fault() returns true. Should be vnode locked before vn_rangelock_*() call or after?

Rangelocks are before vnode locks. Do you intend to use rangelocks to guarantee the extattr consistency on write ? You would need to use separate ranglelock or negative range to reuse existing rangelock.

So, @kib, could you please comment it somehow, and it would be great, if you can provide some basic description about how to implement it, or entry point for starting.

fsu updated this revision to Diff 37456.Jan 3 2018, 12:06 PM

Ok, let's try to proceed...
Here is very initial version of vfs_extattr.c, where the io faults are handled.
It even somehow does not works with ufs, but works with ext2fs, I did not dig in this problem for now, because it is needed to figure out the direction and next steps for extattrs topic, the testing procedure will include all available filesystems with extattrs support and will be developed later.
So, @kib, could you please review it and provide the comments.