Using awk to remove the Byte-order mark

What would an awk script (presumably a one-liner) for removing a BOM look like?

Specification:

  • print every line after the first (NR > 1)
  • for the first line: if it starts with #FE #FF or #FF #FE, remove those bytes and print the rest

  • Try this:

    awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
    

    On the first record (line), remove the BOM characters. Print every record.

    Or slightly shorter, using the knowledge that the default action in awk is to print the record:

    awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE
    

    1 is the shortest condition that always evaluates to true, so each record is printed.
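
Not every awk understands \x escapes inside a regular expression (POSIX awk only specifies octal escapes), so a more portable variant of the one-liner above can spell the same three bytes in octal. This is a minimal sketch, not part of the original answer: the function name strip_bom and the sample text are made up for illustration, and LC_ALL=C is an assumption that keeps awk matching raw bytes rather than multibyte characters.

```shell
# strip_bom: hypothetical helper that removes a leading UTF-8 BOM from stdin.
strip_bom() {
    # \357\273\277 is EF BB BF written in octal; LC_ALL=C forces byte semantics.
    LC_ALL=C awk 'NR==1{sub(/^\357\273\277/,"")}1'
}

# Demo: a two-line sample that starts with a BOM.
printf '\357\273\277hello\nworld\n' | strip_bom
```

Input without a BOM passes through unchanged, because sub() simply finds nothing to replace on the first record.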

    Enjoy!

    -- ADDENDUM --

    The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:

    Bytes         |  Encoding Form
    --------------------------------------
    00 00 FE FF   |  UTF-32, big-endian
    FF FE 00 00   |  UTF-32, little-endian
    FE FF         |  UTF-16, big-endian
    FF FE         |  UTF-16, little-endian
    EF BB BF      |  UTF-8
    

    Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes in the table above.
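
The table can also be turned into a small check that reports which BOM, if any, a file starts with. The following is only a sketch under stated assumptions: detect_bom is a hypothetical name, and od with -An, -tx1 and -N is assumed to be available (those options are specified by POSIX). The UTF-32 patterns are tested before the UTF-16 ones because FF FE 00 00 also begins with the UTF-16 little-endian BOM, so a UTF-16 file whose first character is NUL would be reported as UTF-32.

```shell
# detect_bom FILE: hypothetical helper that names the BOM at the start of FILE.
detect_bom() {
    # First four bytes as contiguous lowercase hex, e.g. "efbbbf68".
    sig=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case $sig in
        0000feff) echo "UTF-32, big-endian" ;;
        fffe0000) echo "UTF-32, little-endian" ;;
        efbbbf*)  echo "UTF-8" ;;
        feff*)    echo "UTF-16, big-endian" ;;
        fffe*)    echo "UTF-16, little-endian" ;;
        *)        echo "no BOM" ;;
    esac
}
```

Calling detect_bom INFILE prints one of the encoding names from the table, or "no BOM".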


    Using GNU sed (on Linux or Cygwin):

    # Removing BOM from all text files in current directory:
    sed -i '1 s/^\xef\xbb\xbf//' *.txt
    

    On FreeBSD:

    sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt
    

    Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and updates the files without the need for redirections or weird tricks.

    On Mac:

    The awk solution from another answer works, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes such as \xef.

    A similar in-place effect can be achieved with any program by piping to the sponge tool from moreutils:

    awk '…' INFILE | sponge INFILE
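
Where sponge is not installed, the same effect can be approximated with a temporary file. This is a rough sketch rather than a drop-in replacement: strip_bom_in_place is a hypothetical name, mktemp is assumed to be available (it is common but not required by POSIX), and the awk program is an octal-escape spelling of the UTF-8 BOM pattern, chosen here as an assumption so that it also works with awks that lack \x escapes.

```shell
# strip_bom_in_place FILE: hypothetical helper that rewrites FILE without its UTF-8 BOM.
strip_bom_in_place() {
    tmp=$(mktemp) || return 1
    # Write the cleaned text to the temp file first; only overwrite FILE if awk succeeded.
    LC_ALL=C awk 'NR==1{sub(/^\357\273\277/,"")}1' "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage (INFILE is a placeholder): strip_bom_in_place INFILE
```

Copying the result back with cat instead of mv keeps the original file's permissions and inode, at the cost of a short window in which the file is partially written.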
    

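    Not awk, but simpler (note that this drops the first three bytes unconditionally, so it assumes the file really starts with a UTF-8 BOM):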

    tail -c +4 UTF8 > UTF8.nobom
    

    To check for BOM:

    hd -n 3 UTF8
    

    If a BOM is present you'll see: 00000000 ef bb bf ...
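
hd is not installed everywhere, and tail -c +4 removes three bytes whether or not they are a BOM. A minimal sketch that combines the check and the removal, using od in place of hd, could look like the following; the function name strip_utf8_bom is hypothetical and not from the answer above.

```shell
# strip_utf8_bom FILE: hypothetical helper that prints FILE without a leading UTF-8 BOM.
strip_utf8_bom() {
    # First three bytes of the file as contiguous hex, e.g. "efbbbf".
    first3=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        tail -c +4 "$1"   # start output at byte 4, skipping the BOM
    else
        cat "$1"          # no BOM: pass the file through unchanged
    fi
}

# Usage, mirroring the file names above: strip_utf8_bom UTF8 > UTF8.nobom
```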
