Using awk to remove the byte order mark
What would an awk script (presumably a one-liner) for removing a BOM look like?
Specification: print every line after the first (NR > 1); if the first line starts with #FE #FF or #FF #FE, remove those and print the rest.

Try this:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
On the first record (line), remove the BOM characters. Print every record.
Or slightly shorter, using the knowledge that the default action in awk is to print the record:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE
1 is the shortest condition that always evaluates to true, so each record is printed.
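A minimal sketch for trying the one-liner on a known input, assuming a POSIX shell. The function name strip_bom is made up for illustration. The octal escapes \357\273\277 are the same three bytes as \xef\xbb\xbf; they are used here because POSIX awk specifies octal escapes but not hexadecimal ones. LC_ALL=C is an added precaution so that awk compares raw bytes.

```shell
# strip_bom is a hypothetical helper name, not part of the original answer.
# It reads the named files (or stdin) and writes the result to stdout.
strip_bom() {
    # On the first record only, delete a leading EF BB BF; print every record.
    LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") } 1' "$@"
}
```

It can be used like the one-liner, for example strip_bom INFILE > OUTFILE, or at the end of a pipeline. Input without a BOM passes through unchanged.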
Enjoy!
-- ADDENDUM --
The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:
Bytes | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF | UTF-16, big-endian
FF FE | UTF-16, little-endian
EF BB BF | UTF-8
Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes from the above table.
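The table can also be turned into a small check that reports which BOM, if any, a file starts with. This is only a sketch: detect_bom is a made-up name, and it assumes od and tr behave as POSIX describes. The UTF-32 little-endian pattern is tested before the UTF-16 little-endian one because FF FE 00 00 begins with FF FE. A UTF-16 little-endian file whose first character is NUL would be misreported as UTF-32, which is an inherent ambiguity of the byte patterns.

```shell
# detect_bom is a hypothetical helper name used for illustration.
detect_bom() {
    # Dump the first four bytes as lowercase hex with no spaces or newlines.
    hex=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case $hex in
        0000feff) echo "UTF-32, big-endian" ;;
        fffe0000) echo "UTF-32, little-endian" ;;   # must precede the fffe* case
        efbbbf*)  echo "UTF-8" ;;
        feff*)    echo "UTF-16, big-endian" ;;
        fffe*)    echo "UTF-16, little-endian" ;;
        *)        echo "no BOM" ;;
    esac
}
```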
Using GNU sed (on Linux or Cygwin):
# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt
On FreeBSD:
sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt
Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.
On Mac:
The awk solution in another answer works, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes à la \xef.
A similar trick can be achieved with any program by piping to the sponge tool from moreutils:
awk '…' INFILE | sponge INFILE
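Where neither sed -i nor sponge is available, a temporary file gives the same in-place effect. The following is a rough sketch, not a drop-in replacement: strip_bom_in_place is a made-up name, mktemp is assumed to be installed (it is common but not required by POSIX), and the awk program is written with octal escapes under LC_ALL=C as a portability assumption.

```shell
# strip_bom_in_place is a hypothetical helper name used for illustration.
strip_bom_in_place() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # Write the BOM-free copy to the temporary file first.
        if LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") } 1' "$f" > "$tmp"; then
            # Copy the contents back so the original file keeps its permissions.
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}
```

For example, strip_bom_in_place *.txt processes every text file in the current directory. Note that awk terminates the last line with a newline even if the input did not end with one.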
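Not awk, but simpler (note that this drops the first three bytes unconditionally, so use it only on files known to start with a UTF-8 BOM):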
tail -c +4 UTF8 > UTF8.nobom
To check for a BOM:
hd -n 3 UTF8
If a BOM is present you'll see: 00000000 ef bb bf ...
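The check and the tail command can be combined so that the BOM is removed only when it is actually there. This is a sketch under a few assumptions: the helper names has_utf8_bom and strip_bom_if_present are invented here, and od is used instead of hd because od is specified by POSIX while hd is not installed everywhere.

```shell
# has_utf8_bom and strip_bom_if_present are hypothetical helper names.
has_utf8_bom() {
    # Succeeds when the first three bytes of the file are EF BB BF.
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]
}

strip_bom_if_present() {
    if has_utf8_bom "$1"; then
        tail -c +4 "$1"   # start output at byte 4, skipping the BOM
    else
        cat "$1"          # no BOM: pass the file through untouched
    fi
}
```

Usage would mirror the tail example: strip_bom_if_present UTF8 > UTF8.nobom.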