Details

Reviewers

rmacklem
kib
imp

Group Reviewers

manpages

Commits

rG1eaa36523cb9: fspacectl(2): Clarifies the return values

Summary

rmacklem@ spotted two things in the system call:

Upon returning from a successful operation, vop_stddeallocate can update rmsr.r_offset to a value greater than file size. This behavior, although being harmless, can be confusing.
The EINVAL return value for rqsr.r_offset + rqsr.r_len > OFF_MAX is undocumented.

This commit has the following changes:

vop_stddeallocate and shm_deallocate to bound the the affected area further by the file size.
The EINVAL case for rqsr.r_offset + rqsr.r_len > OFF_MAX is documented.
The fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9)'s return len is explicitly documented the be the value 0, and the return offset is restricted to be the smallest of off + len and current file size suggested by kib@. This semantic allows callers to interact better with potential file size growth after the call.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

khng created this revision.Aug 18 2021, 6:34 PM

Herald added a reviewer: manpages. · View Herald TranscriptAug 18 2021, 6:34 PM

Herald added a subscriber: imp. · View Herald Transcript

khng requested review of this revision.Aug 18 2021, 6:34 PM

Harbormaster completed remote builds in B41115: Diff 93898.Aug 18 2021, 6:34 PM

I do not understand the code change. IMO you should copy out offset always. It is the lower layer that must provide right offset matching len.

imp accepted this revision.Aug 18 2021, 6:50 PM

This revision is now accepted and ready to land.Aug 18 2021, 6:50 PM

In D31604#712546, @kib wrote:

I do not understand the code change. IMO you should copy out offset always. It is the lower layer that must provide right offset matching len.

To clarify this, here was a discussion between me and rmacklem@: https://lists.freebsd.org/archives/freebsd-fs/2021-August/000343.html .

In D31604#712579, @khng wrote:

In D31604#712546, @kib wrote:

I do not understand the code change. IMO you should copy out offset always. It is the lower layer that must provide right offset matching len.

To clarify this, here was a discussion between me and rmacklem@: https://lists.freebsd.org/archives/freebsd-fs/2021-August/000343.html .

I read fs@ of course. My point is that there is a bug manifesting as r_offset advancing too much. You are covering it instead of fixing, also making interface unnecessarily ugly.

In vop_stddeallocate, bound the affected area to be within file size.

This revision now requires review to proceed.Aug 18 2021, 7:57 PM

Harbormaster completed remote builds in B41117: Diff 93900.Aug 18 2021, 7:57 PM

khng planned changes to this revision.Aug 18 2021, 7:57 PM

khng edited the summary of this revision. (Show Details)

vop_stddeallocate: Bound the incoming length to be within the current file size.

Harbormaster completed remote builds in B41120: Diff 93904.Aug 18 2021, 8:17 PM

kib added inline comments.Aug 18 2021, 9:18 PM

sys/kern/vfs_default.c
1186–1191	Is it possible that you get len < 0 during the loop execition? I think not, but anyway IMO it is better to change the len calculation as you did but not add the len <= check, and keep loop condition alone. Then do if (len < 0) len = 0; right after the loop, at out: label. Perhaps add one-line comment that this handles the case of offset behind EOF.

Suggestion from kib@.

Harbormaster completed remote builds in B41126: Diff 93911.Aug 18 2021, 9:32 PM

khng marked an inline comment as done.Aug 18 2021, 9:32 PM

Oops.

Harbormaster completed remote builds in B41127: Diff 93912.Aug 18 2021, 9:34 PM

kib accepted this revision.Aug 18 2021, 10:16 PM

kib added inline comments.

sys/kern/vfs_default.c
1184	Unrelated to the review. I do not think this check and break under it does anything useful for the case when vn_deallocate_impl() is called without IO_NODELOCKED. It is probably reverse, the check and premature return would eat more CPU then.

This revision is now accepted and ready to land.Aug 18 2021, 10:16 PM

rmacklem added inline comments.Aug 19 2021, 12:38 AM

sys/kern/vfs_default.c
1147	Won't this break when offset > va.va_size? Don't you need something like if (va.va_size > (uint64_t)offset) len = omin(va.va_size - offset, *ap->a_len); else len = 0;

I was hoping to have a discussion about the semantics of returned
r_offset when done (r_len == 0) over on freebsd-fs@, but since we
are here.

I don't have a strong opinion on what the correct reply is.

I did note that returning with r_offset > file size was a little weird.
I also think that setting r_offset = file size when the input r_offset > file size is also a little weird.

Another possible semantic would be to return r_offset == 0
when returned r_len == 0 (ie. when done)?

khng marked an inline comment as done.Aug 19 2021, 6:53 AM

khng added inline comments.

sys/kern/vfs_default.c
1147	when offset > va.va_size, len goes negative and would be handled after the loop.

rmacklem added inline comments.Aug 19 2021, 2:18 PM

sys/kern/vfs_default.c
1147	Righto. I missed the change you have at line#1187-1189.

In D31604#712705, @rmacklem wrote:

I was hoping to have a discussion about the semantics of returned
r_offset when done (r_len == 0) over on freebsd-fs@, but since we
are here.

I don't have a strong opinion on what the correct reply is.

I did note that returning with r_offset > file size was a little weird.

I also think that setting r_offset = file size when the input r_offset > file size is also a little weird.

Another possible semantic would be to return r_offset == 0
when returned r_len == 0 (ie. when done)?

IMO the main priority there is to make consumer able to non-ambiguously determine what parts of the file changed. In particular, if nothing changed, it should be clearly indicated as well.

Right now the result range is defined in the man page as 'is updated to contain the unprocessed operation range after the system call returns'. From this interpretation, keeping offset intact but len set to zero is fine.

Regardless of the outcome of this conversation, I believe that result range should be explicitly documented, with precise explanation of the returned values.

I think Kostik makes a good point, in that the results should
indicate what range was zeroed.

The case where r_offset >= file size when the
call is done is essentially a "no-op" and just returns
with r_len == 0.
--> I think the current code in main gets it right by just

leaving r_offset unchanged, which indicates no bytes
were zeroed.

The case where r_offset < file size but r_offset + r_len > file size
zeros to EOF, setting r_len == 0.
--> I think the current code in main get this right too, by

returning r_offset == file size indicating how many bytes were
zeroed.

For NFS, the file's size can change at any time (by another client
doing a write/truncate), but the RPC will reply with the file's size
immediately after the Deallocate, which can be used to set r_offset
upon success to be compatible with the above.

Maybe you could post on freebsd-fs@ with your preferred
semantics and see if others have an opinion.
At this point my preferred semantics would be the above,
which is what is already in main.

The man page changes look ok to me, but I agree with Kostik
that what is returned needs to be documented.

In D31604#713068, @rmacklem wrote:

Maybe you could post on freebsd-fs@ with your preferred
semantics and see if others have an opinion.
At this point my preferred semantics would be the above,
which is what is already in main.

The man page changes look ok to me, but I agree with Kostik
that what is returned needs to be documented.

Just posted it on freebsd-fs@. My opinion is to document the case that for r_len == 0 (i.e. a complete and successful operation), if rqsr.r_offset is greater than EOF then it is left intact and copied into rmsr.r_offset as well; if rqsr.r_offset is within EOF then it is not going to be set to a value beyond EOF, so that it caters the OpenZFS case as well. (For OpenZFS, zfs_freesp works by deallocating from the ending part all the way to the starting part of the specified free range.). In that case all implementations so far is going to fit into the bracket.

EDIT: The another approach I talked about is to define the result of rmsr.r_offset in a way that allows callers to know the effective range that the system call processed in case rmsr.r_len == 0. Callers can exactly know the number of bytes being handled by comparing rqsr.r_offset and rmsr.r_offset. By subtracting rqsr.r_offset from rmsr.r_offset, callers know what part of the processed range is within EOF boundaries. If rmsr.r_offset - rqsr.r_offset is zero that indicates the rqsr range is totally beyond EOF, otherwise if rmsr.r_offset - rqsr.r_offset != rqsr.r_len parts of the operation falls outside of EOF. Both approaches are compatible with the current interpretation regarding unprocessed range in the manual page.

Adopt the semantic suggested by kib@, which is compatible with the current documentation for callers.
- The semantic is documented in fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9) explicitly.
shm_deallocate is fixed in the same way as vop_stddeallocate, plus return the off and len in all cases.

This revision now requires review to proceed.Aug 21 2021, 4:57 PM

Harbormaster completed remote builds in B41152: Diff 94002.Aug 21 2021, 4:58 PM

khng edited the summary of this revision. (Show Details)Aug 21 2021, 5:01 PM

Spotted some fspacectl.2 inconsistencies.

Harbormaster completed remote builds in B41153: Diff 94004.Aug 21 2021, 5:04 PM

offset -= len -> offset += len in vop_stddeallocate.

Harbormaster completed remote builds in B41154: Diff 94008.Aug 21 2021, 10:52 PM

offset -= len -> offset += len in shm_deallocate.

Harbormaster completed remote builds in B41155: Diff 94009.Aug 21 2021, 10:54 PM

kib added inline comments.Aug 21 2021, 11:10 PM

sys/kern/uipc_shm.c
1952	Is it possible that len < 0 there? I am more about the correction of off in the line below, I do not quite understand why is it needed.
sys/kern/vfs_default.c
1147	I strongly suggest casting va_size to off_t.

Move the SHM size checking block to beginning
suggested va_size casting

Harbormaster completed remote builds in B41158: Diff 94015.Aug 22 2021, 6:44 AM

khng added inline comments.Aug 22 2021, 6:44 AM

sys/kern/uipc_shm.c
1952	I moved the block to above to make it clearer. The purpose is to convert the operation to a no-op if offset passed to the call is beyond SHM size.

Oops I picked up a stale patch smh.

Harbormaster completed remote builds in B41159: Diff 94016.Aug 22 2021, 6:46 AM

kib added inline comments.Aug 23 2021, 9:41 AM

sys/kern/uipc_shm.c
1907–1909	off_t is signed, so overflow is undefined. You cannot check/assert for overflow this way

uipc_shm: Make the checking more obvious.

sys/kern/uipc_shm.c
1907–1909	`off` is vm_ooffset_t type which is unsigned, only `len` is signed. So the addition arithmetic promotes len to unsigned, which is not a problem here. Regardless, I changed `len` to unsigned and cast it to off_t when checking if the operation happens completely beyond the EOF.

Harbormaster completed remote builds in B41165: Diff 94041.Aug 23 2021, 9:59 AM

I suggest you to push the code change now but work some more on the man page clarifications.

lib/libc/sys/fspacectl.2
71	I am not sure that I understand what the 'successful completion without an unprocessed part' is. I want to see a clear explanation of the following: how to find which part of file was zeroed for corner cases, explanation how they are handled. For instance, is it true that r_offset is always in range of file size? If yes, then together with the specification which part of file is cleared ([rmsr->r_offset; rqsr->r_offset]? is this true) it would be much earier to understand.

This revision is now accepted and ready to land.Aug 24 2021, 1:27 AM

khng added inline comments.Aug 24 2021, 5:50 AM

lib/libc/sys/fspacectl.2
71	`rqsr.r_offset` need not to be in range of file size. while for `rmsr.r_offset` it is `omin(rqsr.r_offset + rqsr.r_len, file_sz)`, so the returned r_offset is never beyond the EOF offset. So yes, the range to be cleared is `[rmsr->r_offset; rqsr->r_offset)`. For unprocessed part what I mean is if the call is interrupted due to internal error/pending signal.

Closed by commit rG1eaa36523cb9: fspacectl(2): Clarifies the return values (authored by khng). · Explain WhyAug 24 2021, 9:19 AM

This revision was automatically updated to reflect the committed changes.

khng added a commit: rG1eaa36523cb9: fspacectl(2): Clarifies the return values.

fspacectl(2): Clarifies the return values
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 94093

lib/libc/sys/fspacectl.2

share/man/man9/VOP_DEALLOCATE.9

share/man/man9/vn_deallocate.9

sys/kern/uipc_shm.c

sys/kern/vfs_default.c

fspacectl(2): Clarifies the return valuesClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 94093

lib/libc/sys/fspacectl.2

share/man/man9/VOP_DEALLOCATE.9

share/man/man9/vn_deallocate.9

sys/kern/uipc_shm.c

sys/kern/vfs_default.c

fspacectl(2): Clarifies the return values
ClosedPublic
Actions

Revision Contents
Changeset List