Paths

Table of Contentst

Differential D431

sed(1): If input's last line doesn't have newline, don't add one
ClosedPublic
Actions

Authored by dumbbell on Jul 17 2014, 9:13 PM.

Details

Reviewers

jilles

Summary

This is reported in PR 160745. This patch fixes the problem and the behavior matches GNU sed's one.

Test Plan

Check with GNU sed: printf "Hello" | gsed -e ''
Output unmodified compared to input.
Check with FreeBSD's sed: printf "Hello" | sed -e ''
Newline added to the output.

A more complete testcase is available in the PR.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

dumbbell updated this revision to Diff 749.Jul 17 2014, 9:13 PM

dumbbell retitled this revision from to sed(1): If input's last line doesn't have newline, don't add one.

dumbbell updated this object.

dumbbell edited the test plan for this revision. (Show Details)

I asked for an exp-run with this patch:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191933

Although there is a change to tests/, I see new breakage in tests/regress.multitest.out/7.8. This test prints the last line of a non-empty file followed by /dev/null; this used to output an empty line but now outputs nothing. Perhaps this is correct, but if so the expected result should be changed.

I think the #defines like ps are confusing but I don't immediately see a way to improve it.

Is it expected that 'x' and similar functions preserve the "append newline" flag from the old pattern space?

This revision now requires changes to proceed.Jul 20 2014, 2:35 PM

Another thing, it may be useful to document this extension to POSIX (processing binary data) in the man page. Per POSIX, sed input files must be text files (zero or more lines ending in '\n' of at most {LINE_MAX} and not containing '\0').

In D431#4, @jilles wrote:

Although there is a change to tests/, I see new breakage in tests/regress.multitest.out/7.8. This test prints the last line of a non-empty file followed by /dev/null; this used to output an empty line but now outputs nothing. Perhaps this is correct, but if so the expected result should be changed.

This test is already broken without my change. I studied this a bit, but I'm not sure what to think:

The sed documentation from the Opengroup website speaks about "the last line of input", without talking about file boundaries.
Our 7.8 testcase expects this behavior.
In GNU sed, its documentation tells that the '$' address "matches the last line of the last file of input, or the last line of each file when the -i or -s options are specified". However the real behavior (in GNU sed 4.4.2 from ports) seems to match the Opengroup description.
Our sed(1) implements what's said in GNU sed's documentation.

I think the #defines like ps are confusing but I don't immediately see a way to improve it.

You mean, the "#define psanl PS.append_newline" line? I don't particularily like that kind of shortcut either, but I reproduced what's already there, without trying to improve the situation.

Is it expected that 'x' and similar functions preserve the "append newline" flag from the old pattern space?

I didn't bring much thought in this case, because I admit I never used these commands. I just checked what GNU sed does here and his behavior matches our current sed(1). The patch keeps it.

In D431#7, @jilles wrote:

Another thing, it may be useful to document this extension to POSIX (processing binary data) in the man page. Per POSIX, sed input files must be text files (zero or more lines ending in '\n' of at most {LINE_MAX} and not containing '\0').

Your comments made me look again at the Opengroup page. In the EXTENDED DESCRIPTION, I previously overlooked a paragraph which states "Whenever the pattern space is written to standard output or a named file, sed will immediately follow it with a newline character". Obviously, this patch breaks that statement.

Now I'm confused :) My use case is mostly the handling of files containing plain text, but without a newline on the last line (such as the files written by Eclipse). I consider them text files, but that doesn't match the definition of POSIX. I personnaly prefer the behavior of GNU sed here, but that's not what the standard says.

In D431#8, @dumbbell wrote:

In D431#4, @jilles wrote:

Although there is a change to tests/, I see new breakage in tests/regress.multitest.out/7.8. This test prints the last line of a non-empty file followed by /dev/null; this used to output an empty line but now outputs nothing. Perhaps this is correct, but if so the expected result should be changed.

This test is already broken without my change. I studied this a bit, but I'm not sure what to think:

The sed documentation from the Opengroup website speaks about "the last line of input", without talking about file boundaries.

Our 7.8 testcase expects this behavior.

In GNU sed, its documentation tells that the '$' address "matches the last line of the last file of input, or the last line of each file when the -i or -s options are specified". However the real behavior (in GNU sed 4.4.2 from ports) seems to match the Opengroup description.

Our sed(1) implements what's said in GNU sed's documentation.

According to XBD 3.397 Text File, an empty file (or /dev/null) is a valid text file with 0 lines. Therefore, the last line of input may not be in the last file (if the last file is empty).

Example: echo a | sed -ne '$p' /dev/stdin /dev/null should print a.

As a result, sed must read the next byte of input before processing the last line of a file. A file containing at least one byte of data contains at least one line (or a "line" without newline).

I think "last line of the last file" does not make much sense: that line may not exist even if there are lines.

I wonder what happens if a file ends with a "line" without newline, and a subsequent file contains data. GNU sed adds a newline there, but your patched sed does not. Example: printf a >fa; printf b >fb; sed -e '' fa fb | hd

Is it expected that 'x' and similar functions preserve the "append newline" flag from the old pattern space?

I didn't bring much thought in this case, because I admit I never used these commands. I just checked what GNU sed does here and his behavior matches our current sed(1). The patch keeps it.

Hmm, I guess GNU sed does that to make it less likely a "line" without newline ends up somewhere else than at the end of a file. Fine with me.

In D431#9, @dumbbell wrote:

In D431#7, @jilles wrote:

Another thing, it may be useful to document this extension to POSIX (processing binary data) in the man page. Per POSIX, sed input files must be text files (zero or more lines ending in '\n' of at most {LINE_MAX} and not containing '\0').

Your comments made me look again at the Opengroup page. In the EXTENDED DESCRIPTION, I previously overlooked a paragraph which states "Whenever the pattern space is written to standard output or a named file, sed will immediately follow it with a newline character". Obviously, this patch breaks that statement.

Now I'm confused :) My use case is mostly the handling of files containing plain text, but without a newline on the last line (such as the files written by Eclipse). I consider them text files, but that doesn't match the definition of POSIX. I personnaly prefer the behavior of GNU sed here, but that's not what the standard says.

POSIX says the application shall ensure the input files are text files. If not, the application is outside POSIX territory and behaviour may not be as POSIX prescribes. So as per the standard, both the behaviour before and after this patch are acceptable.

Changes in this new patch:

Change how lastline() is implemented: instead of just checking if there are any input after the current one, check if those following files have actual content (eg. they could be all empty).
If the non-newline-terminated last line of the current file isn't the last line of the whole stream (ie. following files have more content), force a newline anyway to not mix this line with the next one.

Both changes match GNU sed behavior.

As a side effect of the lastline() improvement, the "tests/regress.multitest.out/7.8" test now passes, because the new implementation corresponds to the definition of "last line" in the Open Group sed specification.

In D431#10, @jilles wrote:

According to XBD 3.397 Text File, an empty file (or /dev/null) is a valid text file with 0 lines. Therefore, the last line of input may not be in the last file (if the last file is empty).

That's what I understand too.

As a result, sed must read the next byte of input before processing the last line of a file. A file containing at least one byte of data contains at least one line (or a "line" without newline).

I wonder what happens if a file ends with a "line" without newline, and a subsequent file contains data. GNU sed adds a newline there, but your patched sed does not. Example: printf a >fa; printf b >fb; sed -e '' fa fb | hd

I updated the patch with a new function (called next_files_have_lines(), maybe a bit long), called by lastline(). this function checks if following files have actual content. I use lastline() now when deciding if the a newline should be appended.

This fixes the "tests/regress.multitest.out/7.8" test (now it matches Open Group definition of last line) and your test case. This brings our sed closer to GNU sed.

Now I'm confused :) My use case is mostly the handling of files containing plain text, but without a newline on the last line (such as the files written by Eclipse). I consider them text files, but that doesn't match the definition of POSIX. I personnaly prefer the behavior of GNU sed here, but that's not what the standard says.

POSIX says the application shall ensure the input files are text files. If not, the application is outside POSIX territory and behaviour may not be as POSIX prescribes. So as per the standard, both the behaviour before and after this patch are acceptable.

Oh I see, that makes perfect sense.

Thank you for your help so far!