我如何测量2个字符串之间的相似度?

给定两个字符串text1text2

public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
{
     // DO SOMETHING HERE TO COMPARE
}

例子:

  • 第一个字符串:StackOverflow

    第二个字符串:StaqOverflow

    回报率:相似度为91%

    回报可以在%或类似的东西。

  • 第一个字符串:简单的文本测试

    第二个字符串:复杂的文本测试

    返回:这些值可以被认为是相等的

  • 有任何想法吗? 做这个的最好方式是什么?


    有多种不同的方式来做到这一点。 查看维基百科“字符串相似性度量”页面,了解链接到带算法的其他页面的链接。

    然而,我认为这些算法都不考虑声音 - 所以“staq溢出”与“堆栈溢出”类似,尽管第一个在发音方面更为相似。

    我刚刚发现了另一个页面,它提供了更多的选项......特别是,Soundex算法(维基百科)可能更接近你以后的内容。


    Levenshtein距离可能是你正在寻找的。


    以下是我为正在编写的项目编写的一些代码。 我需要知道字符串的相似比和基于字符串的字的相似比。 最后一个,我想知道最小字符串的词相似比(所以如果所有单词存在并匹配较大的字符串,结果将是100%)和较大字符串的字相似比(我称之为RealWordsRatio )。 我使用Levenshtein算法找到距离。 到目前为止,代码尚未优化,但它按预期工作。 希望对你有帮助。

    public static int Compute(string s, string t)
        {
            int n = s.Length;
            int m = t.Length;
            int[,] d = new int[n + 1, m + 1];
    
            // Step 1
            if (n == 0)
            {
                return m;
            }
    
            if (m == 0)
            {
                return n;
            }
    
            // Step 2
            for (int i = 0; i <= n; d[i, 0] = i++)
            {
            }
    
            for (int j = 0; j <= m; d[0, j] = j++)
            {
            }
    
            // Step 3
            for (int i = 1; i <= n; i++)
            {
                //Step 4
                for (int j = 1; j <= m; j++)
                {
                    // Step 5
                    int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
    
                    // Step 6
                    d[i, j] = Math.Min(
                        Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                        d[i - 1, j - 1] + cost);
                }
            }
            // Step 7
            return d[n, m];
        }
    
    double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
        {
            double theResult = 0;
            String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
            String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
            if (Splitted1.Length < Splitted2.Length)
            {
                String[] Temp = Splitted2;
                Splitted2 = Splitted1;
                Splitted1 = Temp;
            }
            int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
            int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.
    
            for (int loop = 0; loop < Splitted1.Length; loop++) 
            {
                for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
                BestWord[loop] = -1;
            }
            int WordsMatched = 0;
            for (int loop = 0; loop < Splitted1.Length; loop++)
            {
                String String1 = Splitted1[loop];
                for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
                {
                    String String2 = Splitted2[loop1];
                    int LevenshteinDistance = Compute(String1, String2);
                    theScores[loop, loop1] = LevenshteinDistance;
                    if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
                }
            }
    
            for (int loop = 0; loop < Splitted1.Length; loop++)
            {
                if (theScores[loop, BestWord[loop]] == 1000) continue;
                for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
                {
                    if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
                    if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
                    {
                        //The first in order has the advantage of keeping the word in equality
                        if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
                        {
                            theScores[loop1, BestWord[loop1]] = 1000;
                            int CurrentBest = -1;
                            int CurrentScore = 1000;
                            for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                            {
                                //Find next bestword
                                if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
                                {
                                    CurrentBest = loop2;
                                    CurrentScore = theScores[loop1, loop2];
                                }
                            }
                            BestWord[loop1] = CurrentBest;
                        }
                        else//the latter has a better score
                        {
                            theScores[loop, BestWord[loop]] = 1000;
                            int CurrentBest = -1;
                            int CurrentScore = 1000;
                            for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                            {
                                //Find next bestword
                                if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
                                {
                                    CurrentBest = loop2;
                                    CurrentScore = theScores[loop, loop2];
                                }
                            }
                            BestWord[loop] = CurrentBest;
                        }
    
                        loop = -1;
                        break;//recalculate all
                    }
                }
            }
            for (int loop = 0; loop < Splitted1.Length; loop++)
            {
                if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
                else
                {
                    theResult += theScores[loop, BestWord[loop]];
                    if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
                }
            }
            int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
            if(theResult > theLength) theResult = theLength;
            theResult = (1 - (theResult / theLength)) * 100;
            WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
            RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
            return theResult;
        }
    
    链接地址: http://www.djcxy.com/p/75193.html

    上一篇: How can I measure the similarity between 2 strings?

    下一篇: How to ignore the case sensitivity in List<string>