Regex ranges and groups to DFA implemented as table

I am currently toying around with the conversion from a regex (no capture groups no backtracking) to a table driven DFA. I implemented this by creating a NFA from the Regex and then converting the NFA to a DFA. I currently handle groups such as "[az]" with an naive implementation by just replacing the group with "(a|b|...|y|z)" which works and the resulting DFA table is still reasonable sized. Same goes with negative groups such as "[^abc]" which will be replaced with "(u0000|u0001|...)" excluding the escaped versions of abc but this results in giant tables.

How do I implement groups and ranges so that the table handles them "elegant" rather than brute force by putting all characters in the table?


The table you are constructing has as many columns as there are distinct choices to reach another state. Once a certain character yields a unique result, it has to have it's own entry in the table and hence the table is irreducible. So you have to construct the entire table and remove duplicate columns by grouping them. Say for instance that a and b always yield the same transition, then you can group them in [ab] .

If you wish to be more constructive, identify equivalent symbols beforehand. You can accomplish this by looping over all states and initially setting the different groups as one big group containing everything. Next for each state split each group in as many groups as there are transitions from the current state. Split them according to the transitions and remove singletons, no need to take them into account because they are atomic.

链接地址: http://www.djcxy.com/p/74800.html

上一篇: 将短信与数千个正则表达式进行高效匹配

下一篇: 正则表达式的范围和组以DFA的形式实现