Techniques for extracting regular expressions out of a labeled data set

Let's suppose I have a data set of several hundred thousand strings (which happen to be natural language sentences, if it matters) which are each tagged with a certain "label". Each sentence is tagged with exactly one label, and there are about 10 labels, each with approximately 10% of the data set belonging to them. There is a high degree of similarity to the structure of sentences within a label.

I know the above sounds like a classical example of a machine learning problem, but I want to ask a slightly different question. Are there any known techniques for programatically generating a set of regular expressions for each label, which can successfully classify the training data while still generalizing to future test data?

I would be very happy with references to the literature; I realize that this will not be a straightforward algorithm :)

PS: I know that the normal way to do classification is with machine learning techniques like an SVM or such. I am, however, explicitly looking for a way to generate regular expressions. (I would be happy with with machine learning techniques for generating the regular expressions, just not with machine learning techniques for doing the classification itself!)


So far as I know, this is the subject of current research in evolutionary computation.

Here are some examples:

See slides 40-44 at

https://cs.byu.edu/sites/default/files/Ken_De_Jong_slides.pdf (slides exist as of the posting of this answer).

Also, see

http://www.citeulike.org/user/bartolialberto/article/10710768

for a more detailed review of a system presented at GECCO 2012.


This problem is usually framed as how to generate finite automata from sets of strings, rather than regular expressions, though you can obviously generate REs from FAs since they are equivalent.

If you search around for automata induction , you should be able to find quite a lot of literature on this topic, including GA approaches.


Note: May this would help someway. This below function generates RegEx pattern for a given value of a and b . Where a and b both are alpha-strings. And the function would generate a fair RegEx pattern to match the range between a and b . The function would take only first three chars to produce the pattern and produces a result that might be something like starts-with() function in some language with hint of a general RegEx favor.

A simple VB.NET example

Public Function GetRangePattern(ByVal f_surname As String, ByVal l_surname As String) As String
        Dim f_sn, l_sn As String
        Dim mnLength% = 0, mxLength% = 0, pdLength% = 0, charPos% = 0
        Dim fsn_slice$ = "", lsn_slice$ = ""
        Dim rPattern$ = "^"
        Dim alphas As New Collection
        Dim tmpStr1$ = "", tmpStr2$ = "", tmpStr3$ = ""

        '///init local variables
        f_sn = f_surname.ToUpper.Trim
        l_sn = l_surname.ToUpper.Trim

        '///do null check
        If f_sn.Length = 0 Or l_sn.Length = 0 Then
            Return "-!ERROR!-"
        End If

        '///return if both equal
        If StrComp(f_sn, l_sn, CompareMethod.Text) = 0 Then
            Return "^" & f_sn & "$"
        End If

        '///return if 1st_name present in 2nd_name
        If InStr(1, l_sn, f_sn, CompareMethod.Text) > 0 Then
            tmpStr1 = f_sn
            tmpStr2 = l_sn.Replace(f_sn, vbNullString)
            If Len(tmpStr2) > 1 Then
                tmpStr3 = "[A-" & tmpStr2.Substring(1) & "]*"
            Else
                tmpStr3 = tmpStr2 & "*"
            End If
            tmpStr1 = "^" & tmpStr1 & tmpStr3 & ".*$"
            tmpStr1 = tmpStr1.ToUpper
            Return tmpStr1
        End If

        '///initialize alphabets
        alphas.Add("A", CStr(Asc("A")))
        alphas.Add("B", CStr(Asc("B")))
        alphas.Add("C", CStr(Asc("C")))
        alphas.Add("D", CStr(Asc("D")))
        alphas.Add("E", CStr(Asc("E")))
        alphas.Add("F", CStr(Asc("F")))
        alphas.Add("G", CStr(Asc("G")))
        alphas.Add("H", CStr(Asc("H")))
        alphas.Add("I", CStr(Asc("I")))
        alphas.Add("J", CStr(Asc("J")))
        alphas.Add("K", CStr(Asc("K")))
        alphas.Add("L", CStr(Asc("L")))
        alphas.Add("M", CStr(Asc("M")))
        alphas.Add("N", CStr(Asc("N")))
        alphas.Add("O", CStr(Asc("O")))
        alphas.Add("P", CStr(Asc("P")))
        alphas.Add("Q", CStr(Asc("Q")))
        alphas.Add("R", CStr(Asc("R")))
        alphas.Add("S", CStr(Asc("S")))
        alphas.Add("T", CStr(Asc("T")))
        alphas.Add("U", CStr(Asc("U")))
        alphas.Add("V", CStr(Asc("V")))
        alphas.Add("W", CStr(Asc("W")))
        alphas.Add("X", CStr(Asc("X")))
        alphas.Add("Y", CStr(Asc("Y")))
        alphas.Add("Z", CStr(Asc("Z")))

        '///populate max-min length values
        mxLength = f_sn.Length
        If l_sn.Length > mxLength Then
            mnLength = mxLength
            mxLength = l_sn.Length
        Else
            mnLength = l_sn.Length
        End If
        '///padding values
        pdLength = mxLength - mnLength
        f_sn = f_sn.PadRight(mxLength, "A")
        'f_sn = f_sn.PadRight(mxLength, "~")
        l_sn = l_sn.PadRight(mxLength, "Z")
        'l_sn = l_sn.PadRight(mxLength, "~")

        '///get a range like A??-B??
        If f_sn.Substring(0, 1).ToUpper <> l_sn.Substring(0, 1).ToUpper Then
            fsn_slice = f_sn.Substring(0, 3).ToUpper
            lsn_slice = l_sn.Substring(0, 3).ToUpper
            tmpStr1 = fsn_slice.Substring(0, 1) & fsn_slice.Substring(1, 1) & "[" & fsn_slice.Substring(2, 1) & "-Z]"
            tmpStr2 = lsn_slice.Substring(0, 1) & lsn_slice.Substring(1, 1) & "[A-" & lsn_slice.Substring(2, 1) & "]"
            tmpStr3 = "^(" & tmpStr1 & "|" & tmpStr2 & ").*$"
            Return tmpStr3
        End If

        '///looping charwise
        For charPos = 0 To mxLength
            fsn_slice = f_sn.Substring(charPos, 1)
            lsn_slice = l_sn.Substring(charPos, 1)
            If StrComp(fsn_slice, lsn_slice, CompareMethod.Text) = 0 Then
                rPattern = rPattern & fsn_slice
            Else
                'rPattern = rPattern & "("
                If charPos < mxLength Then
                    Try
                        If Asc(fsn_slice) < Asc(lsn_slice) Then
                            tmpStr1 = fsn_slice & "[" & f_sn.Substring(charPos + 1, 1) & "-Z" & "]|"
                            If CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) < CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) Then
                                tmpStr2 = "[" & CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) & "-" & CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) & "]|"
                            Else
                                tmpStr2 = vbNullString
                            End If
                            tmpStr3 = lsn_slice & "[A-" & l_sn.Substring(charPos + 1, 1) & "]"
                            rPattern = rPattern & "(" & tmpStr1 & tmpStr2 & tmpStr3 & ").*$"
                            'MsgBox("f_sn:= " & f_sn & " -- l_sn:= " & l_sn & vbCr & rPattern)
                            Exit For
                        Else
                            Return "-#ERROR#-"
                        End If
                    Catch ex As Exception
                        Return "-|ERROR|-" & ex.Message
                    End Try
                End If
            End If
        Next charPos
        Return rPattern
    End Function

And it is called as

?GetRangePattern("ABC","DEF")

produces this

"^(AB[C-Z]|DE[A-F]).*$"
链接地址: http://www.djcxy.com/p/59836.html

上一篇: 如何处理REST中的陈旧数据?

下一篇: 从标注的数据集中提取正则表达式的技术