Why should text files end with a newline?

2018-06-04 02:13:43

I assume everyone here is familiar with the adage that all text files should end with a newline. I've known of this "rule" for years but I've always wondered — why?

Because that's how the POSIX standard defines a line :

3.206 Line A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

Therefore, lines not ending in a newline character aren't considered actual lines. That's why some programs have problems processing the last line of a file if it isn't newline terminated.

There's at least one hard advantage to this guideline when working on a terminal emulator: All Unix tools expect this convention and work with it. For instance, when concatenating files with cat , a file terminated by newline will have a different effect than one without:

$ more a.txt
foo$ more b.txt
bar
$ more c.txt
baz
$ cat *.txt
foobar
baz

And, as the previous example also demonstrates, when displaying the file on the command line (eg via more ), a newline-terminated file results in a correct display. An improperly terminated file might be garbled (second line).

For consistency, it's very helpful to follow this rule – doing otherwise will incur extra work when dealing with the default Unix tools.

Now, on non POSIX compliant systems (nowadays that's mostly Windows), the point is moot: files don't generally end with a newline, and the (informal) definition of a line might for instance be “text that is separated by newlines” (note the emphasis). This is entirely valid. However, for structured data (eg programming code) it makes parsing minimally more complicated: it generally means that parsers have to be rewritten. If a parser was originally written with the POSIX definition in mind, then it might be easier to modify the token stream rather than the parser — in other words, add an “artificial newline” token to the end of the input.

Each line should be terminated in a newline character, including the last one. Some programs have problems processing the last line of a file if it isn't newline terminated.

GCC warns about it not because it can't process the file, but because it has to as part of the standard.

The C language standard says A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character.

Since this is a "shall" clause, we must emit a diagnostic message for a violation of this rule.

This is in section 2.1.1.2 of the ANSI C 1989 standard. Section 5.1.1.2 of the ISO C 1999 standard (and probably also the ISO C 1990 standard).

Reference: The GCC/GNU mail archive.

This answer is an attempt at a technical answer rather than opinion.

If we want to be POSIX purists, we define a line as:

A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

Source: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

An incomplete line as:

A sequence of one or more non- <newline> characters at the end of the file.

Source: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_195

A text file as:

A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character. Although POSIX.1-2008 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.

Source: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_397

A string as:

A contiguous sequence of bytes terminated by and including the first null byte.

Source: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_396

From this then, we can derive that the only time we will potentially encounter any type of issues are if we deal with the concept of a line of a file or a file as a text file (being that a text file is an organization of zero or more lines, and a line we know must terminate with a <newline>).

Case in point: wc -l filename .

From the wc 's manual we read:

A line is defined as a string of characters delimited by a <newline> character.

What are the implications to JavaScript, HTML, and CSS files then being that they are text files?

In browsers, modern IDEs, and other front-end applications there are no issues with skipping EOL at EOF. The applications will parse the files properly. It has to since not all Operating Systems conform to the POSIX standard, so it would be impractical for non-OS tools (eg browsers) to handle files according to the POSIX standard (or any OS-level standard).

As a result, we can be relatively confident that EOL at EOF will have virtually no negative impact at the application level - regardless if it is running on a UNIX OS.

At this point we can confidently say that skipping EOL at EOF is safe when dealing with JS, HTML, CSS on the client-side. Actually, we can state that minifying any one of these files, containing no <newline> is safe.

We can take this one step further and say that as far as NodeJS is concerned it too cannot adhere to the POSIX standard being that it can run in non-POSIX compliant environments.

What are we left with then? System level tooling.

This means the only issues that may arise are with tools that make an effort to adhere their functionality to the semantics of POSIX (eg definition of a line as shown in wc ).

Even so, not all shells will automatically adhere to POSIX. Bash for example does not default to POSIX behavior. There is a switch to enable it: POSIXLY_CORRECT .

Food for thought on the value of EOL being <newline>: http://www.rfc-editor.org/EOLstory.txt

Staying on the tooling track, for all practical intents and purposes, let's consider this:

Let's work with a file that has no EOL. As of this writing the file in this example is a minified JavaScript with no EOL.

curl http://cdnjs.cloudflare.com/ajax/libs/AniJS/0.5.0/anijs-min.js -o x.js
curl http://cdnjs.cloudflare.com/ajax/libs/AniJS/0.5.0/anijs-min.js -o y.js

$ cat x.js y.js > z.js

-rw-r--r--  1 milanadamovsky   7905 Aug 14 23:17 x.js
-rw-r--r--  1 milanadamovsky   7905 Aug 14 23:17 y.js
-rw-r--r--  1 milanadamovsky  15810 Aug 14 23:18 z.js

Notice the cat file size is exactly the sum of its individual parts. If the concatenation of JavaScript files is a concern for JS files, the more appropriate concern would be to start each JavaScript file with a semi-colon.

As someone else mentioned in this thread: what if you want to cat two files whose output becomes just one line instead of two? In other words, cat does what it's supposed to do.

The man of cat only mentions reading input up to EOF, not <newline>. Note that the -n switch of cat will also print out a non- <newline> terminated line (or incomplete line) as a line - being that the count starts at 1 (according to the man .)

-n Number the output lines, starting at 1.

Now that we understand how POSIX defines a line , this behavior becomes ambiguous, or really, non-compliant.

Understanding a given tool's purpose and compliance will help in determining how critical it is to end files with an EOL. In C, C++, Java (JARs), etc... some standards will dictate a newline for validity - no such standard exists for JS, HTML, CSS.

For example, instead of using wc -l filename one could do awk '{x++}END{ print x}' filename , and rest assured that the task's success is not jeopardized by a file we may want to process that we did not write (eg a third party library such as the minified JS we curl d) - unless our intent was truly to count lines in the POSIX compliant sense.

Conclusion

There will be very few real life use cases where skipping EOL at EOF for certain text files such as JS, HTML, and CSS will have a negative impact - if at all. If we rely on <newline> being present, we are restricting the reliability of our tooling only to the files that we author and open ourselves up to potential errors introduced by third party files.

Moral of the story: Engineer tooling that does not have the weakness of relying on EOL at EOF.

Feel free to post use cases as they apply to JS, HTML and CSS where we can examine how skipping EOL has an adverse effect.

链接地址: http://www.djcxy.com/p/13520.html

上一篇: 如何使用print（）打印类或类的对象？

下一篇: 为什么文本文件以换行符结束？