Regular expression for regular expressions?
Possible Duplicate:
Is there a regular expression to detect a valid regular expression?
Regular expression for finding a regular expression?
I have an application that enables the user to input a regular expression. How do I check against any input of regular expressions and make sure they are valid ones, because if they are not there there will be preg_match errors?
I don't want to use the '@' before preg_match, so if there's a way to check the validity of the user input of regular expressions that'd be great.
The regular expression system of PHP seems to be rather too complicated for me to come up with a regular expression for them.
preg_match()
returns FALSE
if an error occurred.
preg_match
on an empty string You can either use Ajax to validate real time, or validate after form submit.
You can also try to validate by feeding the expression to javascript regexp engine, but js regexp syntax is not 100% compatible with the php one.
Mathematically, it is impossible to validate a regular expression using a regular expression. This is so because (formal) regular expressions can only recognize regular languages. A language is any set of strings. For example, the set of all decimal numbers is a language (which by the way can be described using a regular expression); the set of all valid regular expressions is also a language. Regular languages are languages that require only fixed finite memory (not a function of the input size) to be recognized.
The language that contains all valid regular expressions is not a regular language; hence it is impossible to recognize a regular expression using a regular expression.
To understand this, notice that regular expressions contain parentheses in them that must match. Hence, if an "(" has occurred, a ")" must occur later on. This is impossible to describe with a machine that has only fixed finite memory. For, if there were a way to do it, and your regular expression had a finite memory of K different states (for some integer K), an expression with K opening parentheses followed by K closing parentheses, though a valid regular expression would have been unable to be recognized by that machine -- a contradiction (notice that in formal languages, our assumption is that text processing occurs one character at a time, from left to right, which is the same for applied regular expressions). We call languages such as the one that describes regular expressions context-free and not regular.
(It is trivial to prove that regular expressions do not form a regular language using the Pumping Lemma)
So, there is a fundamental computer science problem in recognizing regular expressions using regular expressions: It is mathematically impossible to do so.
Regular languages are possible to be recognized by finite-state automata, ie machines with finite states but without memory. To overcome your problem, you need to add some memory which is dependant on the input size. Regular expressions, as they are context-free (fortunately they're not some obscure, hard-to-recognize type of language) can be recognized in linear time using a push-down automaton. This is a "for" loop that goes through the expression one token (usually a character) at a time and keeps track of what it's seen on a stack, ie it "pushes" data that it laters "pops" in a first-in-last-out fashion. (Example of data pushed to the stack: "I need to remember to find a matching `)' later on!"; you can "push" this as many times as you need; you can "pop" it later, when you need to check if you actually needed to have matched an opening parenthesis previously).
Of course, writing your own recognition engine for regular expressions would be a bit of an overhead -- but if you want to do it, you should know the above limitations. It would be more wise to employ an already existing mechanism to do it -- I suspect you could give that job to a regular expression library or a language that is more keen on handling regular expressions such as Perl; but the @-method doesn't sound like too bad of an idea after all: It may be slow, but your users may input terribly slow regular expressions anyway; and it may be a bad practice, but in your case it seems the best possible solution available.
Some related articles in Wikipedia:
I hope this helped!
Letting users submit regular expressions is almost certainly a Bad Idea.
Some expressions are very expensive. Try this:
preg_match('/(.*){1,32000}[bc]/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
and that's just 30 characters of input! They don't all look like that either: /^(?:(d+)|::)*$/
is also exponential-time in PCRE.
上一篇: 测试一个字符串是否为正则表达式
下一篇: 正则表达式的正则表达式?