These functions will be used when linuxulator xattr syscalls support will be added.
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
I think the 'td' argument should be dropped first. It is presumably always curthread. Part of it is that this change pushes the arg num to 7, which is more than what is passable in registers in sysv abi on amd64.
Patch is missing context, but I've just been looking at this file so it's fairly fresh in my mind.
Change looks good to me.
Is it more expensive to spill one argument to the stack, or to re-read td from curthread later? It is used to check td_ucred.
sys/sys/extattr.h | ||
---|---|---|
82 ↗ | (On Diff #35848) | Whatever this #else is, the Linux extattr compat code will need to depend on it too. |
Addition of the uio_seg to the functions interfaces is fine.
I suspect that the current implementation of the syscalls is deadlock-prone. For instance, if the caller specifies a buffer for the attrs backed by the file mapping, and buffer is not yet faulted in (or pages were paged out by the page daemon), then uiomove() call from ffs_*extattr* can fault and try to get the vnode lock for that file. This results in potentially wrong order or recursion of the vnode locks.
Since the current ffs implementation limits extattr to two file blocks, it might be simplest to pre-allocate the memory and copy userspace buffers before/after the VOP calls. In other words, the functions should always call the VOPs with UIO_SYSSPACE. Otherwise, vn_io_fault approach must be taken.
First, the amd64 calling conventions do not force the spill, the first integer argument is always passed in GPR. If td is consistently passed around for all calls as the first arg, %rdi de-facto becomes the TLS pointer.
Second, curthread access is `movq %gs:0, <something>` which requires the segment register prefix. On some older CPUs this caused one more cycle in the frontend unconditionally. I am not sure about latest CPUs.
AMD64 calling convention only has six GPR arguments and this function now takes seven parameters. The additional argument forces one to the stack. mjg suggested getting rid of td in order to prevent that.
Updated after discussion with @rwatson in the:
https://reviews.freebsd.org/D13209
Changes:
- The ability of vnode locking control was added.
As I already noted, this change is around the architecturally wrong code. uiomove(9) with UIO_USERSPACE must not be performed while a vnode is locked.
kern/vfs_extattr.c | ||
---|---|---|
172 ↗ | (On Diff #36236) | Please extract this into separate change. |
210 ↗ | (On Diff #36236) | Why did you reordered the lines ? |
373 ↗ | (On Diff #36236) | This assert is not needed. VOP stub does the check several lines below. |
Switch extattr_*_vp() to use input buffers from kernelspace.
kern/vfs_extattr.c | ||
---|---|---|
172 ↗ | (On Diff #36236) | Sorry, did not got what you want, that I should to do with this if(). |
210 ↗ | (On Diff #36236) | It was copy-paste from vfs_vnops.c:vn_extattr_set(). |
It would be nice to ditch the requirement for exclusive vnode lock acquisition in VFS for extended attribute writes. But I'm not sure the filesystems would approve of that..?
UFS relies on the vnode lock in exclusive mode to ensure inode modifications consistency, including the extattr state.
Ok, here the buffer size checks in the extattr_*_vp() are moved out from vnode locks:
https://reviews.freebsd.org/D13405
It is difficult to find a value to restrict extattr value size. Because, as I know, in case of ntfs and hfs, the extattr sizes are "infinity".
So I decided to use the same check for uio size, so we will have two checks one on extattr_*vp() layer and one on upper layer on system call pass.
My opinio is that, the check in the extattr_*vp() layer is needed because this function will be exported as part of kernel interface.
It is not difficult, it is impossible. In other words, no size checks would allow to avoid the issue with the userspace access under the vnode (and/or buffer) lock by preallocating malloced buffer and copying into it.
So I decided to use the same check for uio size, so we will have two checks one on extattr_*vp() layer and one on upper layer on system call pass.
My opinio is that, the check in the extattr_*vp() layer is needed because this function will be exported as part of kernel interface.
No. I already pointed you to the vn_io_fault() code. Read the long description in the comment before vn_io_fault_doio() in vfs_vnops.c.
It is too difficult to me to implement it now, so, seems like, I have a questions to @kib.
As I currently understand, what we want to achive to is, not allow page faults to happen during extattr operations, is insert some sort of logic, like happen between vn_io_fault() and vn_read/vn_write or VOP_READ/VOP_WRITE.
It should be inserted in the extattr_*vp() between uio initialization and VOP_*EXTATTR call.
Difficulties:
- The code part is not so compact for path vn_io_fault() -> vn_io_fault1() -> VOP_*, may be it is possible to consider implement more generic version of vn_io_fault() not just for read()/write(), but for another operations where uio is required?
- The logic depends of sysctl variables, ok, here I just need find the way, how to read it in the kernel.
- As, I told I want to add vnode locking control to extattr_*_vp() functions, I am not clearly understand will be it possible to do vnode lock in case of do_vn_io_fault() returns true. Should be vnode locked before vn_rangelock_*() call or after?
So, @kib, could you please comment it somehow, and it would be great, if you can provide some basic description about how to implement it, or entry point for starting.
No, not quite. We cannot avoid pagefaults during uiomove(), but we might opt-out and postpone handling of them. vn_io_fault() works by setting the thread state so that page faults always results in EFAULT, even if it can be handled, and prefaulting pages for the next io attempt. This way, we do not enter the real pagefault handler and do not recurse on VFS/VM while in uiomove().
It should be inserted in the extattr_*vp() between uio initialization and VOP_*EXTATTR call.
Difficulties:
- The code part is not so compact for path vn_io_fault() -> vn_io_fault1() -> VOP_*, may be it is possible to consider implement more generic version of vn_io_fault() not just for read()/write(), but for another operations where uio is required?
Look at vn_io_fault_doio() which implements the actual call to the VOP() after all necessary setup was made. What is needed is mostly to assign new kind value and handle it before the call to vn_io_fault1() and inside vn_io_fault_doio() by using the corresponding VOP.
Extattr VOPs would need to replace uiomove(9) with vn_io_fault_uiomove(9), which is quite straightforward.
- The logic depends of sysctl variables, ok, here I just need find the way, how to read it in the kernel.
I do not understand this question.
- As, I told I want to add vnode locking control to extattr_*_vp() functions, I am not clearly understand will be it possible to do vnode lock in case of do_vn_io_fault() returns true. Should be vnode locked before vn_rangelock_*() call or after?
Rangelocks are before vnode locks. Do you intend to use rangelocks to guarantee the extattr consistency on write ? You would need to use separate ranglelock or negative range to reuse existing rangelock.
So, @kib, could you please comment it somehow, and it would be great, if you can provide some basic description about how to implement it, or entry point for starting.
Ok, let's try to proceed...
Here is very initial version of vfs_extattr.c, where the io faults are handled.
It even somehow does not works with ufs, but works with ext2fs, I did not dig in this problem for now, because it is needed to figure out the direction and next steps for extattrs topic, the testing procedure will include all available filesystems with extattrs support and will be developed later.
So, @kib, could you please review it and provide the comments.