Techniques for extracting regular expressions out of a labeled data set
Let's suppose I have a data set of several hundred thousand strings (which happen to be natural language sentences, if it matters) which are each tagged with a certain "label". Each sentence is tagged with exactly one label, and there are about 10 labels, each with approximately 10% of the data set belonging to them. There is a high degree of similarity to the structure of sentences within a label.
I know the above sounds like a classical example of a machine learning problem, but I want to ask a slightly different question. Are there any known techniques for programatically generating a set of regular expressions for each label, which can successfully classify the training data while still generalizing to future test data?
I would be very happy with references to the literature; I realize that this will not be a straightforward algorithm :)
PS: I know that the normal way to do classification is with machine learning techniques like an SVM or such. I am, however, explicitly looking for a way to generate regular expressions. (I would be happy with with machine learning techniques for generating the regular expressions, just not with machine learning techniques for doing the classification itself!)
So far as I know, this is the subject of current research in evolutionary computation.
Here are some examples:
See slides 40-44 at
https://cs.byu.edu/sites/default/files/Ken_De_Jong_slides.pdf (slides exist as of the posting of this answer).
Also, see
http://www.citeulike.org/user/bartolialberto/article/10710768
for a more detailed review of a system presented at GECCO 2012.
This problem is usually framed as how to generate finite automata from sets of strings, rather than regular expressions, though you can obviously generate REs from FAs since they are equivalent.
If you search around for automata induction , you should be able to find quite a lot of literature on this topic, including GA approaches.
Note: May this would help someway. This below function generates RegEx pattern for a given value of a
and b
. Where a
and b
both are alpha-strings. And the function would generate a fair RegEx
pattern to match the range between a
and b
. The function would take only first three chars to produce the pattern and produces a result
that might be something like starts-with()
function in some language with hint of a general RegEx favor.
A simple VB.NET example
Public Function GetRangePattern(ByVal f_surname As String, ByVal l_surname As String) As String
Dim f_sn, l_sn As String
Dim mnLength% = 0, mxLength% = 0, pdLength% = 0, charPos% = 0
Dim fsn_slice$ = "", lsn_slice$ = ""
Dim rPattern$ = "^"
Dim alphas As New Collection
Dim tmpStr1$ = "", tmpStr2$ = "", tmpStr3$ = ""
'///init local variables
f_sn = f_surname.ToUpper.Trim
l_sn = l_surname.ToUpper.Trim
'///do null check
If f_sn.Length = 0 Or l_sn.Length = 0 Then
Return "-!ERROR!-"
End If
'///return if both equal
If StrComp(f_sn, l_sn, CompareMethod.Text) = 0 Then
Return "^" & f_sn & "$"
End If
'///return if 1st_name present in 2nd_name
If InStr(1, l_sn, f_sn, CompareMethod.Text) > 0 Then
tmpStr1 = f_sn
tmpStr2 = l_sn.Replace(f_sn, vbNullString)
If Len(tmpStr2) > 1 Then
tmpStr3 = "[A-" & tmpStr2.Substring(1) & "]*"
Else
tmpStr3 = tmpStr2 & "*"
End If
tmpStr1 = "^" & tmpStr1 & tmpStr3 & ".*$"
tmpStr1 = tmpStr1.ToUpper
Return tmpStr1
End If
'///initialize alphabets
alphas.Add("A", CStr(Asc("A")))
alphas.Add("B", CStr(Asc("B")))
alphas.Add("C", CStr(Asc("C")))
alphas.Add("D", CStr(Asc("D")))
alphas.Add("E", CStr(Asc("E")))
alphas.Add("F", CStr(Asc("F")))
alphas.Add("G", CStr(Asc("G")))
alphas.Add("H", CStr(Asc("H")))
alphas.Add("I", CStr(Asc("I")))
alphas.Add("J", CStr(Asc("J")))
alphas.Add("K", CStr(Asc("K")))
alphas.Add("L", CStr(Asc("L")))
alphas.Add("M", CStr(Asc("M")))
alphas.Add("N", CStr(Asc("N")))
alphas.Add("O", CStr(Asc("O")))
alphas.Add("P", CStr(Asc("P")))
alphas.Add("Q", CStr(Asc("Q")))
alphas.Add("R", CStr(Asc("R")))
alphas.Add("S", CStr(Asc("S")))
alphas.Add("T", CStr(Asc("T")))
alphas.Add("U", CStr(Asc("U")))
alphas.Add("V", CStr(Asc("V")))
alphas.Add("W", CStr(Asc("W")))
alphas.Add("X", CStr(Asc("X")))
alphas.Add("Y", CStr(Asc("Y")))
alphas.Add("Z", CStr(Asc("Z")))
'///populate max-min length values
mxLength = f_sn.Length
If l_sn.Length > mxLength Then
mnLength = mxLength
mxLength = l_sn.Length
Else
mnLength = l_sn.Length
End If
'///padding values
pdLength = mxLength - mnLength
f_sn = f_sn.PadRight(mxLength, "A")
'f_sn = f_sn.PadRight(mxLength, "~")
l_sn = l_sn.PadRight(mxLength, "Z")
'l_sn = l_sn.PadRight(mxLength, "~")
'///get a range like A??-B??
If f_sn.Substring(0, 1).ToUpper <> l_sn.Substring(0, 1).ToUpper Then
fsn_slice = f_sn.Substring(0, 3).ToUpper
lsn_slice = l_sn.Substring(0, 3).ToUpper
tmpStr1 = fsn_slice.Substring(0, 1) & fsn_slice.Substring(1, 1) & "[" & fsn_slice.Substring(2, 1) & "-Z]"
tmpStr2 = lsn_slice.Substring(0, 1) & lsn_slice.Substring(1, 1) & "[A-" & lsn_slice.Substring(2, 1) & "]"
tmpStr3 = "^(" & tmpStr1 & "|" & tmpStr2 & ").*$"
Return tmpStr3
End If
'///looping charwise
For charPos = 0 To mxLength
fsn_slice = f_sn.Substring(charPos, 1)
lsn_slice = l_sn.Substring(charPos, 1)
If StrComp(fsn_slice, lsn_slice, CompareMethod.Text) = 0 Then
rPattern = rPattern & fsn_slice
Else
'rPattern = rPattern & "("
If charPos < mxLength Then
Try
If Asc(fsn_slice) < Asc(lsn_slice) Then
tmpStr1 = fsn_slice & "[" & f_sn.Substring(charPos + 1, 1) & "-Z" & "]|"
If CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) < CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) Then
tmpStr2 = "[" & CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) & "-" & CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) & "]|"
Else
tmpStr2 = vbNullString
End If
tmpStr3 = lsn_slice & "[A-" & l_sn.Substring(charPos + 1, 1) & "]"
rPattern = rPattern & "(" & tmpStr1 & tmpStr2 & tmpStr3 & ").*$"
'MsgBox("f_sn:= " & f_sn & " -- l_sn:= " & l_sn & vbCr & rPattern)
Exit For
Else
Return "-#ERROR#-"
End If
Catch ex As Exception
Return "-|ERROR|-" & ex.Message
End Try
End If
End If
Next charPos
Return rPattern
End Function
And it is called as
?GetRangePattern("ABC","DEF")
produces this
"^(AB[C-Z]|DE[A-F]).*$"
链接地址: http://www.djcxy.com/p/59836.html
上一篇: 如何处理REST中的陈旧数据?
下一篇: 从标注的数据集中提取正则表达式的技术