Regular expression to search for Gadaffi
I'm trying to search for the word Gadaffi. What's the best regular expression to search for this?
My best attempt so far is:
b[KG]h?add?af?fi$b
But I still seem to be missing some journals. Any suggestions?
Update: I found a pretty extensive list here: http://blogs.abcnews.com/theworldnewser/2009/09/how-many-different-ways-can-you-spell-gaddafi.html
The answer below matches all the 30 variants:
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi Khadaffy Khadafy Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
b[KGQ]h?add?h?af?fib
Arabic transcription is (Wiki says) "Qaḏḏāfī", so maybe adding a Q. And one H ("Gadhafi", as the article (see below) mentions).
Btw, why is there a $
at the end of the regex?
Btw, nice article on the topic:
Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader's name spelled so many different ways?.
EDIT
To match all the names in the article you've mentioned later, this should match them all. Let's just hope it won't match a lot of other stuff :D
b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]b
Easy... (Qadaffi|Khadafy|Qadafi|
... )
... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.
Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.
One interesting thing to note from your list of potential spellings is that there's only 3 Soundex values for the contained list (if you ignore the outlier 'Kazzafi')
G310, K310, Q310
Now, there are false positives in there ('Godby' also is G310), but by combining the limited metaphone hits as well, you can eliminate them.
<?
$soundexMatch = array('G310','K310','Q310');
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');
$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";
$wordArray = preg_split('/[s,.;-]+/',$text);
foreach ($wordArray as $item){
$rate = in_array(soundex($item),$soundexMatch) + in_array(metaphone($item),$metaphoneMatch);
if ($rate > 1){
$matches[] = $item;
}
}
$pattern = implode("|",$matches);
$text = preg_replace("/($pattern)/","<b>$1</b>",$text);
echo $text;
?>
A few tweaks, and lets say some cyrillic transliteration, and you'll have a fairly robust solution.
链接地址: http://www.djcxy.com/p/13398.html上一篇: 为什么匹配的子字符串在JavaScript中返回“undefined”?
下一篇: 正则表达式来搜索Gadaffi