How do I grep for all non-ASCII characters?
I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:
grep -e "[\x{00FF}-\x{FFFF}]" file.xml
But this returns every line in the file, regardless of whether the line contains a character in the range specified.
Do I have the syntax wrong or am I doing something else wrong? I've also tried:
egrep "[\x{00FF}-\x{FFFF}]" file.xml
(with both single and double quotes surrounding the pattern).
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number and will highlight the non-ASCII characters in red.
On some systems, depending on your settings, the above will not work, so you can grep by the inverse instead:
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also that the important bit is the -P flag, which equates to --perl-regexp, so grep will interpret your pattern as a Perl regular expression. The grep man page also says that this is highly experimental and that grep -P may warn of unimplemented features.
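To try the inverse pattern on something small before running it on the large XML files, here is a minimal sketch. It assumes GNU grep built with PCRE support; the file name sample.txt and its contents are made up for illustration.

```shell
# Hypothetical three-line sample; line 2 ends in "é", written as its
# UTF-8 bytes 0xC3 0xA9 using printf octal escapes.
printf 'plain ascii\ncaf\303\251\nmore ascii\n' > sample.txt

# -P turns on Perl-compatible regexes so the \xNN escapes are understood;
# -n prefixes each matching line with its line number.
matches=$(grep -P -n '[^\x00-\x7F]' sample.txt)
echo "$matches"
```

Only the second line should be reported, since it is the only one containing anything outside \x00-\x7F.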
Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.
So the first solution for instance would become:
grep --color='auto' -P -n '[^\x00-\x7F]' file.xml
(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)
On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:
pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml
Any pros or cons that anyone can think of?
The following works for me:
grep -P "[\x80-\xFF]" file.xml
Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
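To see the byte-level view described here, the following sketch dumps the raw bytes of one accented character and then counts the bytes that fall in the 0x80-0xFF range. It assumes GNU grep with PCRE support and forces the C locale so that matching is done on raw bytes; the file name bytes.txt is made up for illustration, and -a is added only so grep treats the input as text.

```shell
# "é" in UTF-8 is the two bytes 0xC3 0xA9, both inside 0x80-0xFF.
printf 'caf\303\251\n' > bytes.txt

# Dump the raw bytes in hex to confirm what is actually in the file.
od -An -tx1 bytes.txt

# LC_ALL=C makes grep compare bytes instead of decoded characters;
# -o prints every match on its own line, so counting lines counts matching bytes.
count=$(LC_ALL=C grep -a -o -P '[\x80-\xFF]' bytes.txt | wc -l | tr -d ' ')
echo "$count"
```

In a UTF-8 locale the same pattern may be interpreted as a range of code points rather than bytes, which would explain why it does not work on every system.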