Finding groups of similar strings in a large set of strings
I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.
As an example let's say the input list is on the left below, and the output groups are on the right.
Input Output
----------------- -----------------
Jane Doe Mr Philip Roberts
Mr Philip Roberts Phil Roberts
Foo McBar Philip Roberts
David Jones
Phil Roberts Foo McBar
Davey Jones =>
John Smith David Jones
Philip Roberts Dave Jones
Dave Jones Davey Jones
Jonny Smith
Jane Doe
John Smith
Jonny Smith
Does anybody know of any ways to solve this reasonably efficiently?
The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.
An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists
Does anybody have any thoughts/pointers?
UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.
Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.
Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:
http://matpalm.com/resemblance/
The problem you're trying to solve is a typical clusterization problem.
Start with simple K-Means algorithm and use Levenshtein distance as a function for calculating distance between elements and clusters centers.
BTW, algorithm for Levenshtein distance calculation is implemented in Apache Commons StringUtils - StringUtils.getLevenshteinDistance
The main problem of K-Means is that you should specify the number of clusters (subgroups in your terms). So, you'll have 2 options: improve K-Means with some euristic or use another clusterization algorithm which doesn't require specifying clusters number (but that algorithm can show worse performance and can be very difficult in implemenation if you decide to implement it yourself).
如果我们正在谈论实际的代名词,比较他们的metaphone(的开始)可能会有所帮助:
MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts
FLPRBRTS: Philip Roberts
FMKBR: Foo McBar
TFTJNS: David Jones
TFJNS: Dave Jones
TFJNS: Davey Jones
JNT: Jane Doe
JNSM0: John Smith
JNSM0: Jonny Smith
链接地址: http://www.djcxy.com/p/61372.html
上一篇: 数据挖掘中分类和聚类的区别?
下一篇: 在一大组字符串中查找类似字符串的组