Why are regular expressions so controversial?

When exploring regular expressions (otherwise known as regexes), you meet many people who seem to see them as the Holy Grail: something that looks so complicated simply must be the answer to any question. They tend to think that every problem is solvable using regular expressions.

On the other hand, there are also many people who try to avoid regular expressions at all costs. They look for a way around regular expressions and accept additional coding just for the sake of it, even when a regular expression would be a more compact solution.

Why are regular expressions considered so controversial? Are there widespread misunderstandings about how they work? Or could it be a broad belief that regular expressions are generally slow?


I don't think people object to regular expressions because they're slow, but rather because they're hard to read and write, as well as tricky to get right. While there are some situations where regular expressions provide an effective, compact solution to the problem, they are sometimes shoehorned into situations where it's better to use an easy-to-read, maintainable section of code instead.


Making Regexes Maintainable

A major advance toward demystifying the patterns formerly known as “regular expressions” is Perl's /x regex flag (sometimes written (?x) when embedded), which allows whitespace (line breaks, indentation) and comments. This seriously improves readability and therefore maintainability. The white space allows for cognitive chunking, so you can see what groups with what.
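As a rough illustration of what /x buys you, here is the same pattern written twice. The date format and the variable names $terse and $readable are made up for this sketch; only the /x flag itself comes from the text above.

```perl
use strict;
use warnings;

# The same ISO-style date pattern, first crammed onto one line...
my $terse = qr/^(\d{4})-(\d{2})-(\d{2})$/;

# ...and then laid out with /x, so whitespace is ignored
# and each piece can carry its own comment.
my $readable = qr{
    ^
    ( \d{4} )   # year:  four digits
    -
    ( \d{2} )   # month: two digits
    -
    ( \d{2} )   # day:   two digits
    $
}x;

# Both patterns accept and reject the same strings.
for my $date ("2024-02-29", "24-2-29") {
    my $old = $date =~ $terse    ? "match" : "no match";
    my $new = $date =~ $readable ? "match" : "no match";
    print "$date: terse=$old, readable=$new\n";
}
```

The two patterns behave identically; the only difference is how much work the next reader has to do.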

Modern patterns also support both relatively numbered and named backreferences. That means you no longer need to count capture groups to figure out that you need $4 or \7. This helps when creating patterns that can be included in further patterns.

Here is an example of a relatively numbered capture group:

$dupword = qr{ \b (?: ( \w+ ) (?: \s+ \g{-1} )+ ) \b }xi;
$quoted  = qr{ ( ["'] ) $dupword  \1 }x;

And here is an example of the superior approach of named captures:

$dupword = qr{ \b (?: (?<word> \w+ ) (?: \s+ \k<word> )+ ) \b }xi;
$quoted  = qr{ (?<quote> ["'] ) $dupword  \g{quote} }x;
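For what it's worth, here is a minimal sketch of how the named version might be used. The sample text is invented; the captured text is read back by name through Perl's %+ hash rather than by counting groups.

```perl
use strict;
use warnings;
use 5.010;   # named captures and \k<name> need Perl 5.10 or later

# Same duplicate-word pattern as above, using a named capture.
my $dupword = qr{ \b (?: (?<word> \w+ ) (?: \s+ \k<word> )+ ) \b }xi;

my $text = "Paris in the the spring, and and so on";   # made-up sample text

while ($text =~ /$dupword/g) {
    # %+ maps each capture name to the text it matched
    say "repeated word: ", $+{word};
}
```

Because the capture is looked up by name, the pattern can be embedded in a larger one without the surrounding groups shifting its number.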

Grammatical Regexes

Best of all, these named captures can be placed within a (?(DEFINE)...) block, so that you can separate out the declaration from the execution of individual named elements of your patterns. This makes them act rather like subroutines within the pattern.
A good example of this sort of “grammatical regex” can be found in this answer and this one. These look much more like a grammatical declaration.

As the latter reminds you:

… make sure never to write line‐noise patterns. You don't have to, and you shouldn't. No programming language can be maintainable that forbids white space, comments, subroutines, or alphanumeric identifiers. So use all those things in your patterns.

This cannot be over-emphasized. Of course, if you don't use those things in your patterns, you will often create a nightmare. But if you do use them, you need not.
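To see the (?(DEFINE)...) idea at a small scale first, here is a sketch that validates a dotted-quad IPv4 address. The rule names octet and quad and the test strings are my own choices for illustration, not taken from any of the linked answers.

```perl
use strict;
use warnings;
use 5.010;   # (?(DEFINE)...) and (?&name) need Perl 5.10 or later

my $ipv4 = qr{
    (?(DEFINE)
        # Declarations only: nothing in this block matches by itself.
        (?<octet>   25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )   # 0 to 255
        (?<quad>    (?&octet) (?: \. (?&octet) ){3} )
    )

    # Execution: call the named rule like a subroutine.
    \A (?&quad) \z
}x;

for my $candidate ("192.168.0.1", "256.1.1.1", "10.0.0") {
    say "$candidate: ", ($candidate =~ $ipv4 ? "valid" : "invalid");
}
```

The declarations sit in one place and the part that actually matches is a single readable line at the bottom.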

Here's another example of a modern grammatical pattern, this one for parsing RFC 5322:

use 5.10.0;

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

Isn't that remarkable — and splendid? You can take a BNF-style grammar and translate it directly into code without losing its fundamental structure!

If modern grammatical patterns still aren't enough for you, then Damian Conway's brilliant Regexp::Grammars module offers an even cleaner syntax, with superior debugging, too. Here's the same code for parsing RFC 5322 recast into a pattern from that module:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;
use Data::Dumper "Dumper";

my $rfc5322 = do {
    use Regexp::Grammars;    # ...the magic is lexically scoped
    qr{

    # Keep the big stick handy, just in case...
    # <debug:on>

    # Match this...
    <address>

    # As defined by these...
    <token: address>         <mailbox> | <group>
    <token: mailbox>         <name_addr> | <addr_spec>
    <token: name_addr>       <display_name>? <angle_addr>
    <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
    <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
    <token: display_name>    <phrase>
    <token: mailbox_list>    <[mailbox]> ** (,)

    <token: addr_spec>       <local_part> \@ <domain>
    <token: local_part>      <dot_atom> | <quoted_string>
    <token: domain>          <dot_atom> | <domain_literal>
    <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
                             \] <CFWS>?

    <token: dcontent>        <dtext> | <quoted_pair>
    <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]

    <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+\-/=?^_`{|}~]
    <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
    <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
    <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*

    <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
    <token: quoted_pair>     \\ <.text>

    <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
    <token: qcontent>        <.qtext> | <.quoted_pair>
    <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                             <.FWS>? <.DQUOTE> <.CFWS>?

    <token: word>            <.atom> | <.quoted_string>
    <token: phrase>          <.word>+

    # Folding white space
    <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
    <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
    <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
    <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
    <token: CFWS>            (?: <.FWS>? <.comment>)*
                             (?: (?:<.FWS>? <.comment>) | <.FWS>)

    # No whitespace control
    <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]

    <token: ALPHA>           [A-Za-z]
    <token: DIGIT>           [0-9]
    <token: CRLF>            \x0d \x0a
    <token: DQUOTE>          "
    <token: WSP>             [\x20\x09]

    }x;

};


while (my $input = <>) {
    if ($input =~ $rfc5322) {
        say Dumper \%/;       # ...the parse tree of any successful match
                              # appears in this punctuation variable
    }
}

There's a lot of good stuff in the perlre manpage, but these dramatic improvements in fundamental regex design features are by no means limited to Perl alone. Indeed the pcrepattern manpage may be an easier read, and covers the same territory.

Modern patterns have almost nothing in common with the primitive things you were taught in your finite automata class.


Regexes are a great tool, but people think "Hey, what a great tool, I will use it to do X!" where X is something that a different tool is better for (usually a parser). It is the standard "using a hammer where you need a screwdriver" problem.
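A small sketch of what I mean, assuming the job is splitting a comma-separated line that may contain quoted fields. The input line is invented, and Text::ParseWords (a module shipped with Perl) stands in here for "a different tool"; a real CSV parser would be the next step up.

```perl
use strict;
use warnings;
use 5.010;
use Text::ParseWords qw(quotewords);   # core module shipped with Perl

my $line = q{alpha,"beta, with comma",gamma};   # made-up input

# The quick regex approach: split on every comma.
# This cuts the quoted field in two.
my @naive = split /,/, $line;

# quotewords() understands quoting; the 0 tells it to drop the quote marks.
my @parsed = quotewords(',', 0, $line);

say scalar(@naive),  " fields from split";
say scalar(@parsed), " fields from quotewords";
say "  [$_]" for @parsed;
```

The regex is not wrong as a regex; it is just the wrong tool once the input has structure such as quoting.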
