Alternative to Levenshtein and Trigram

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My software receives free text inputs from a data source, and it should match those free texts to the pre-defined strings in the database (the ones above).

For example, if the software gets the string 'Alabama University', it should recognize that this is more similar to (1) than it is to (2).

At first, I thought of using a well-known string metric like Damerau-Levenshtein or trigrams, but this leads to unwanted results, as you can see here:

http://fuzzy-string.com/Compare/Transform.aspx?r=Levi+Watkins+Learning+Center+-+Alabama+State+University&q=Alabama+University

http://fuzzy-string.com/Compare/Transform.aspx?r=ETH+Library&q=Alabama+University

Difference to (1): 37
Difference to (2): 14

(2) wins because it is much shorter than (1), even though (1) contains both words (Alabama and University) of the search string.

I also tried trigrams (using the JavaScript library FuzzySet), but I got similar results there.

Is there a string metric that would recognize the similarity of the search string to (1) ?


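You could try the Word Mover's Distance (https://github.com/mkusner/wmd) instead. Its main advantage is that it takes the meaning of words into account when computing the distance between documents, so related words count as close even when they are spelled differently. The algorithm is described in the paper "From Word Embeddings To Document Distances" (Kusner et al., ICML 2015).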


You can try the normalized Levenshtein distance:

Li Yujian, Liu Bo, "A Normalized Levenshtein Distance Metric," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091-1095, June 2007, doi:10.1109/TPAMI.2007.1078 http://www.computer.org/csdl/trans/tp/2007/06/i1091-abs.html

They propose to normalize the Levenshtein distance by taking the lengths of the compared sequences into account. As a result, a difference of one character between sequences of length two weighs more than the same difference between sequences of length 10.
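To illustrate the idea, here is a minimal sketch in Python. Note the assumption: it divides the plain Levenshtein distance by the length of the longer string, which is a simpler normalization than the formula in the paper, and the function names are made up for this example.

```python
def levenshtein(a, b):
    """Plain character-level Levenshtein distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def normalized_levenshtein(a, b):
    """Assumption: normalize by the longer length (not the paper's formula)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


query = "Alabama University"
candidates = [
    "Levi Watkins Learning Center - Alabama State University",
    "ETH Library",
]
for candidate in candidates:
    print(candidate, round(normalized_levenshtein(candidate, query), 3))
```

With the strings from the question, the raw distances are 37 and 14 as reported, but the normalized values are 37/55 (about 0.67) for (1) and 14/18 (about 0.78) for (2), so (1) comes out closer under this particular normalization. Other normalizations can rank the two differently, so it is worth checking against your own data.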


You should change your approach:

Levenshtein distance is good at calculating similarity in terms of units, whether those units are 'characters' or 'words'.

Conceptually, you are treating Alabama and University (2 words) as 2 units, and you want the distance in terms of words. In that sense the Levenshtein distance should tell you how many words lie between Alabama and University, which is 1.

But you are applying a Levenshtein algorithm that is implemented for characters within a word. That implementation only works for matching single words, NOT sentences.

You would be better off implementing your own Levenshtein algorithm (using a BK-tree) for 'words' at the top level, and then, within each match, comparing the individual words again using Levenshtein for 'characters'.

With that algorithm, your result for (1) should be a match with distance 1, and (2) should be no match.
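A rough Python sketch of the two-level idea follows. It is a simplification and an assumption on my part, not the BK-tree approach itself: it splits both strings on whitespace, and for every word of the search string it takes the closest word in the candidate, measured by character-level Levenshtein distance divided by the longer word's length. Extra words in the candidate are not penalized. The name `query_coverage_distance` is invented for this example.

```python
def levenshtein(a, b):
    """Plain character-level Levenshtein distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def query_coverage_distance(query, candidate):
    """Average distance from each query word to its best candidate word.

    0.0 means every query word occurs verbatim in the candidate;
    values near 1.0 mean no query word has a close counterpart.
    """
    q_words = query.lower().split()
    c_words = candidate.lower().split()
    if not q_words or not c_words:
        return 1.0
    total = 0.0
    for qw in q_words:
        # Word level: pick the candidate word closest to this query word.
        # Character level: score that pair with normalized Levenshtein.
        total += min(levenshtein(qw, cw) / max(len(qw), len(cw))
                     for cw in c_words)
    return total / len(q_words)


query = "Alabama University"
for candidate in ("Levi Watkins Learning Center - Alabama State University",
                  "ETH Library"):
    print(candidate, round(query_coverage_distance(query, candidate), 3))
```

Because both words of the search string appear in (1), its score is 0.0, while (2) gets a clearly larger score. The character-level step still tolerates typos such as 'Univercity'. A BK-tree would only become relevant for speeding up lookups over a large set of database strings.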
