Better regex syntax ideas
I need some help to complete my idea about regexes.
Introduction
There was a question about better syntax for regexes on SE, but I don't think I'd use the fluent syntax. It's surely nice for newbies, but in case of a complicated regex, you replace a line of gibberish by a whole page of slightly better gibberish. I like the approach by Martin Fowler, where a regex gets composed of smaller pieces. His solution is readable, but hand-made; he proposes a smart way to build a complicated regex instead of a class supporting it.
I'm trying to turn it into a class, using something like this (see his example first):
final MyPattern pattern = MyPattern.builder()
    .caseInsensitive()
    .define("numberOfPoints", "\\d+")
    .define("numberOfNights", "\\d+")
    .define("hotelName", ".*")
    .define(' ', "\\s+")
    .build("score `numberOfPoints` for `numberOfNights` nights? at `hotelName`");
MyMatcher m = pattern.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
System.out.println(m.group("numberOfPoints")); // prints 400
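where the fluent syntax is used for combining regexes, extended as follows:
- `name` creates a named group
- `:name` creates a non-capturing group, like (?: ... )
- `-name` creates a backreference
- individual characters can be redefined, but only some of them (e.g. ~ @ # % ") are allowed; redefining + or ( would be extremely confusing, so it's not allowed
- something like define('#', "\\\\") for matching backslashes could make the pattern much more readable
- escape sequences such as \s or \w could be redefined as well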
The named patterns serve as a sort of local variables, helping to decompose a complicated expression into small, easy-to-understand pieces. A well-chosen name often makes a comment unnecessary.
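To make the idea concrete, here is a minimal sketch of how the backtick expansion could be layered on top of java.util.regex. It is not the MyPattern class from the question: the class name NamedPatternSketch and its methods are made up for illustration, and single-character definitions are left out, so the spacing is written as \s+ by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: expands `name`, `:name` and `-name` references. */
final class NamedPatternSketch {
    // Java group names may contain only ASCII letters and digits (no underscores).
    private static final Pattern REF =
        Pattern.compile("`([:-]?)([A-Za-z][A-Za-z0-9]*)`");

    private final Map<String, String> definitions = new LinkedHashMap<>();

    NamedPatternSketch define(String name, String regex) {
        definitions.put(name, regex);
        return this;
    }

    /** Replaces each reference by the fragment it stands for (no cycle check). */
    String expand(String template) {
        Matcher m = REF.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(template, last, m.start());
            String kind = m.group(1);
            String name = m.group(2);
            if (kind.equals("-")) {
                out.append("\\k<").append(name).append('>');   // backreference
            } else {
                String body = definitions.get(name);
                if (body == null) {
                    throw new IllegalArgumentException("undefined pattern: " + name);
                }
                // `name` captures, `:name` only groups
                out.append(kind.equals(":") ? "(?:" : "(?<" + name + ">");
                out.append(expand(body)).append(')');
            }
            last = m.end();
        }
        return out.append(template.substring(last)).toString();
    }

    Pattern build(String template, int flags) {
        return Pattern.compile(expand(template), flags);
    }

    public static void main(String[] args) {
        Pattern p = new NamedPatternSketch()
            .define("numberOfPoints", "\\d+")
            .define("numberOfNights", "\\d+")
            .define("hotelName", ".*")
            .build("score\\s+`numberOfPoints`\\s+for\\s+`numberOfNights`"
                 + "\\s+nights?\\s+at\\s+`hotelName`", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
        if (m.matches()) {
            System.out.println(m.group("numberOfPoints")); // 400
            System.out.println(m.group("hotelName"));      // Minas Tirith Airport
        }
    }
}
```

One limitation shows up immediately: referencing the same `name` twice produces two groups with the same name, which java.util.regex rejects, so a real implementation would need `:name` or some renaming for repeated use.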
Questions
The above shouldn't be hard to implement (I've already done most of it) and could be really useful, I hope. Do you think so?
However, I'm not sure how it should behave inside brackets. Sometimes it's meaningful to use the definitions there and sometimes not, e.g. in
.define(' ', "\\s") // a blank character
.define('~', "/\\*[^*]+\\*/") // an inline comment (simplified)
.define("something", "[ ~\\d]")
expanding the space to \s makes sense, but expanding the tilde doesn't. Maybe there should be a separate syntax for defining one's own character classes?
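One possible answer, sketched below, is to keep two tables: one for definitions used outside brackets and one for definitions that are valid inside a character class. Everything here (the class BracketAwareExpansion, the method defineInClass) is a hypothetical illustration, and the scanner is deliberately naive: it does not handle nested classes or a literal ] at the start of a class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: single-character definitions with bracket awareness. */
final class BracketAwareExpansion {
    private final Map<Character, String> outside = new HashMap<>();
    private final Map<Character, String> insideClass = new HashMap<>();

    /** Definition used everywhere except inside [...]. */
    BracketAwareExpansion define(char c, String regex) {
        outside.put(c, regex);
        return this;
    }

    /** Definition used only inside [...]; must be a valid class fragment. */
    BracketAwareExpansion defineInClass(char c, String classFragment) {
        insideClass.put(c, classFragment);
        return this;
    }

    String expand(String template) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false;
        for (int i = 0; i < template.length(); i++) {
            char c = template.charAt(i);
            if (c == '\\' && i + 1 < template.length()) {
                // an escaped character is copied untouched
                out.append(c).append(template.charAt(++i));
            } else if (c == '[' && !inClass) {
                inClass = true;
                out.append(c);
            } else if (c == ']' && inClass) {
                inClass = false;
                out.append(c);
            } else {
                String replacement = (inClass ? insideClass : outside).get(c);
                out.append(replacement != null ? replacement : String.valueOf(c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        BracketAwareExpansion e = new BracketAwareExpansion()
            .define(' ', "\\s+")
            .defineInClass(' ', "\\s")
            .define('~', "/\\*[^*]+\\*/");
        // The tilde inside the brackets stays a literal tilde.
        System.out.println(e.expand("a b[ ~\\d]~"));
    }
}
```

With this rule the space means "any whitespace" in both contexts, while the tilde is expanded to the comment pattern only outside brackets.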
Can you think of some examples where the named patterns are very useful or not useful at all? I'd need some edge cases and some ideas for improvement.
Reaction to tchrist's answer
Comments to his objections
It looks like you don't like Java. I'd be happy to see some syntax improvements there, but there's nothing I can do about it. I'm looking for something that works with current Java.
RFC 5322
Your example can be easily written using my syntax:
final MyPattern pattern = MyPattern.builder()
    .define(" ", "") // ignore spaces
    .useForBackslash('#') // (1): see (2)
    .define("address", "`mailbox` | `group`")
    .define("WSP", "[\u0020\u0009]")
    .define("DQUOTE", "\"")
    .define("CRLF", "\r\n")
    .define("DIGIT", "[0-9]")
    .define("ALPHA", "[A-Za-z]")
    .define("NO_WS_CTL", "[\u0001-\u0008\u000b\u000c\u000e-\u001f\u007f]") // No whitespace control
    ...
    .define("domain_literal", "`CFWS`? #[ (?: `FWS`? `dcontent`)* `FWS`? #] `CFWS`?") // (2): see (1)
    ...
    .define("group", "`display_name` : (?:`mailbox_list` | `CFWS`)? ; `CFWS`?")
    .define("angle_addr", "`CFWS`? < `addr_spec` > `CFWS`?")
    .define("name_addr", "`display_name`? `angle_addr`")
    .define("mailbox", "`name_addr` | `addr_spec`")
    .define("address", "`mailbox` | `group`")
    .build("`address`");
Disadvantages
While rewriting your example I encountered the following issues:
- There are no \xdd escape sequences, so \udddd must be used
On the bright side:
- Ignoring spaces is no problem
- Comments are no problem
- The readability is good
And most important: It's plain Java and uses the existing regex-engine as is.
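For comparison, the stock engine already covers the whitespace and comment part through Pattern.COMMENTS (the equivalent of the inline (?x) flag). The snippet below is only a sketch of that flag applied to the hotel example from the question. The class name CommentsFlagDemo and the helper hotelOf are made up, and the double backslashes of Java string literals remain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: whitespace and # comments ignored by the standard engine. */
final class CommentsFlagDemo {
    static final Pattern SCORE = Pattern.compile(
          "score \\s+ (?<points> \\d+ )    # number of points\n"
        + "\\s+ for \\s+ (?<nights> \\d+ ) # number of nights\n"
        + "\\s+ nights? \\s+ at \\s+\n"
        + "(?<hotel> .* )                  # hotel name\n",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);

    /** Returns the hotel name, or null if the input does not match. */
    static String hotelOf(String input) {
        Matcher m = SCORE.matcher(input);
        return m.matches() ? m.group("hotel") : null;
    }

    public static void main(String[] args) {
        // prints Minas Tirith Airport
        System.out.println(hotelOf("Score 400 FOR 2 nights at Minas Tirith Airport"));
    }
}
```

What this does not give is the reuse of named pieces, which is the part the define(...) calls above are meant to add.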
Named Capture Examples
Can you think of some examples where the named patterns are very useful or not useful at all?
In answer to your question, here is an example where named patterns are especially useful. It's a Perl or PCRE pattern for parsing an RFC 5322 mail address. First, it's in /x mode by virtue of (?x). Second, it separates out the definitions from the invocation; the named group address is the thing that does the full recursive-descent parse. Its definition follows it in the non-executing (?(DEFINE)…) block.
(?x) # allow whitespace and comments
(?&address) # this is the capture we call as a "regex subroutine"
# the rest is all definitions, in a nicely BNF-style
(?(DEFINE)
(?<address> (?&mailbox) | (?&group))
(?<mailbox> (?&name_addr) | (?&addr_spec))
(?<name_addr> (?&display_name)? (?&angle_addr))
(?<angle_addr> (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
(?<group> (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
(?<display_name> (?&phrase))
(?<mailbox_list> (?&mailbox) (?: , (?&mailbox))*)
(?<addr_spec> (?&local_part) @ (?&domain))
(?<local_part> (?&dot_atom) | (?&quoted_string))
(?<domain> (?&dot_atom) | (?&domain_literal))
(?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
\] (?&CFWS)?)
(?<dcontent> (?&dtext) | (?&quoted_pair))
(?<dtext> (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
(?<atext> (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
(?<atom> (?&CFWS)? (?&atext)+ (?&CFWS)?)
(?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
(?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+)*)
(?<text> [\x01-\x09\x0b\x0c\x0e-\x7f])
(?<quoted_pair> \\ (?&text))
(?<qtext> (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
(?<qcontent> (?&qtext) | (?&quoted_pair))
(?<quoted_string> (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
(?&FWS)? (?&DQUOTE) (?&CFWS)?)
(?<word> (?&atom) | (?&quoted_string))
(?<phrase> (?&word)+)
# Folding white space
(?<FWS> (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
(?<ctext> (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
(?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment))
(?<comment> \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
(?<CFWS> (?: (?&FWS)? (?&comment))*
(?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
# No whitespace control
(?<NO_WS_CTL> [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
(?<ALPHA> [A-Za-z])
(?<DIGIT> [0-9])
(?<CRLF> \x0d \x0a)
(?<DQUOTE> ")
(?<WSP> [\x20\x09])
)
I strongly suggest not reïnventing a perfectly good wheel. Start with becoming PCRE-compatible. If you wish to go beyond basic Perl5 patterns like the RFC5322-parser above, there's always Perl6 patterns to draw upon.
It really, really pays to do research into existing practice and literature before haring off on an open-ended R&D mission. These problems have all long ago been solved, sometimes quite elegantly.
Improving Java Regex Syntax
If you truly want better regex syntax ideas for Java, you must first address these particular flaws in Java's regexes:
- "foo".matches(pattern) cannot be changed to use a better pattern library, partly but not solely because of final classes that are not overridable.
Of these, the first 3 have been addressed in several JVM languages, including both Groovy and Scala; even Clojure goes part-way there.
The second set of 3 steps will be tougher, but they are absolutely mandatory. The last one, the absence of even the most basic Unicode support in regexes, simply kills Java for Unicode work. This is completely inexcusable this late in the game. I can provide plenty of examples if need be, but you should trust me, because I really do know what I'm talking about here.
Only once you have accomplished all these should you be worried about fixing up Java's regexes so they can catch up with the current state of the art in pattern matching. Until and unless you take care of these past oversights, you can't begin to look to the present, let alone to the future.
I think that perhaps a regular expression isn't really what is desired after all, but rather something such as a parser-combinator library (one that can work on characters and/or include regular expressions within its constructs).
That is, step beyond the realm of regular expressions (as irregularly as they may be implemented -- tchrist definitely enjoys the Perl implementation ;-) and into context-free grammars -- or at least those that can be represented in LL(n), preferably with minimal backtracking.
Scala: The Magic Behind Parser Combinators. Note how it looks quite similar to BNF. Has a nice introduction.
Haskell: Parsec. Ditto.
Some examples in Java are JParsec and JPC.
Java, as a language, however, is not as conducive to such seamless DSL extensions as some competitors ;-)
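To give a feel for the approach in plain Java, here is a deliberately tiny sketch of a combinator core with regexes as leaf parsers. It is not the API of JParsec or JPC; all names (CombinatorSketch, Parser, Result, regex, then, skip, or) are invented for this illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of parser combinators built on java.util.regex. */
final class CombinatorSketch {
    /** A successful parse: the value and the input position after it. */
    static final class Result<T> {
        final T value;
        final int next;
        Result(T value, int next) { this.value = value; this.next = next; }
    }

    /** A parser reads from pos and returns empty on failure. */
    interface Parser<T> {
        Optional<Result<T>> parse(String input, int pos);

        /** Sequence: this, then other; keeps the value of other. */
        default <U> Parser<U> then(Parser<U> other) {
            return (in, pos) -> parse(in, pos).flatMap(r -> other.parse(in, r.next));
        }

        /** Sequence: this, then other; keeps the value of this. */
        default <U> Parser<T> skip(Parser<U> other) {
            return (in, pos) -> parse(in, pos).flatMap(r ->
                other.parse(in, r.next).map(r2 -> new Result<T>(r.value, r2.next)));
        }

        /** Ordered choice: try this first, fall back to other. */
        default Parser<T> or(Parser<T> other) {
            return (in, pos) -> {
                Optional<Result<T>> r = parse(in, pos);
                return r.isPresent() ? r : other.parse(in, pos);
            };
        }
    }

    /** Leaf parser: the regex must match exactly at the current position. */
    static Parser<String> regex(String re) {
        Pattern p = Pattern.compile(re);
        return (in, pos) -> {
            Matcher m = p.matcher(in).region(pos, in.length());
            if (!m.lookingAt()) {
                return Optional.empty();
            }
            return Optional.of(new Result<String>(m.group(), m.end()));
        };
    }

    public static void main(String[] args) {
        Parser<String> points = regex("(?i)score\\s+").then(regex("\\d+"));
        Parser<String> nights = regex("(?i)\\s+for\\s+").then(regex("\\d+"))
                                    .skip(regex("(?i)\\s+nights?"));
        String input = "Score 400 FOR 2 nights";
        Result<String> p = points.parse(input, 0).get();
        Result<String> n = nights.parse(input, p.next).get();
        System.out.println(p.value + " points, " + n.value + " nights"); // 400 points, 2 nights
    }
}
```

Each named piece is an ordinary Java value, so the decomposition into small named parts comes for free; the price is noticeably more ceremony than in Scala or Haskell.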