Find simplest regular expression matching all given strings

2018-06-25 05:32:26

Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings?

It is probably unrealistic to find such a algorithm for grammars of regular expressions with very "complicated" syntax (including arbitrary repetitions, assertions etc.), so let's start with a simplified one which only allows for an OR of substrings:

foo(a|b|cd)bar should match fooabar , foobbar and foocdbar .

Examples

Given the set of strings h_q1_a , h_q1_b , h_q1_c , h_p2_a , h_p2_b , h_p2_c , the desired output of the algorithm would be h_(q1|p2)_(a|b|c) .

Given the set of strings h_q1_a , h_q1_b , h_p2_a , the desired output of the algorithm would be h_(q1_(a|b)|p2_a) . Note that h_(q1|p2)_(a|b) would not be correct because that expand to 4 strings, including h_p2_b , which was not in the original set of strings.

Use case

I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.

(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than the best. The trivial answer is of course to just concatenate all strings like A|B|C|D|... which is, however, not helpful.)

You can try to use Aho-Corasick algorithm to create a finite state machine from the input strings, after which it should be somewhat easy to generate the simplified regex. Your input strings as example:

h_q1_a
h_q1_b
h_q1_c
h_p2_a
h_p2_b
h_p2_c

will generate a finite machine that most probably look like this:

      [h_]         <-level 0
     /   
  [q1]  [p2]       <-level 1
        /
      [_]          <-level 2
      /  
     /    
    a    b  c      <-level 3

Now for every level/depth of the trie all the stings (if multiple) will go under OR brackets, so

h_(q1|p2)_(a|b|c)
L0   L1  L2  L3

There is a straight-forward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only concatenation and disjunction ( | ) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)

The procedure is simple, if you have access to standard FSM algorithms:

Make a non-deterministic finite-state automaton (NFA) by just adding every string as a sequence of states, with each sequence starting from the start state. Clearly O(N) in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.

Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst-case for the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it directly into a DFA. The prefix tree cannot have more nodes than the original number of characters (and can have the same number of nodes if all the strings start with different characters), so its output is O(N) in size, but I don't have a proof off the top of my head that it is also O(N) in time.

Minimize the DFA.

DFA minimization is a well-studied problem. The Hopcroft algorithm is worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Normally, S would be considered a constant; in any event, the expected time of the Hopcroft algorithm is much better.

For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.

链接地址: http://www.djcxy.com/p/70640.html

上一篇: Tomcat 7上运行的Web应用程序中的NoClassDefFoundError

下一篇: 找到匹配所有给定字符串的最简单正则表达式