GroupBy on strings (hashcode collision)

I have an array of strings (approx 2000), and I'd like to use IEnumerable.GroupBy to group the equal ones.

The problem is though that there are many hash collisions such as "mysteriously" and "well". This is probably due to the fact that GroupBy uses GetHashCode() , which returns an int, which is too small (or the GetHashCode function for the class String hasn't been implemented nicely).

I guess you could try implementing an overridden GetHashCode function or define a custom IEqualityComparer and use a different hashcode, but isn't there any way to compare them directly or differently? I know it'll take much longer, but at a small amount it's acceptable. How could I fix this?


GroupBy on strings will only group equal strings together, regardless of whether they have the same hash code. Since GroupBy uses a hash table under the hood, lots of different strings with the same hash code may degrade performance slightly, but will still give the correct answer.

To prove this to yourself, note that GroupBy works great even with a custom IEqualityComparer that has a terrible hashing function:

void Main()
{
    var groups = new[] { "a", "a", "b", "b", "c", "c" }.GroupBy(s => s, new BadComparer())
        .Select(g => string.Join(",", g))
        .ToArray();
    Console.WriteLine(string.Join(Environment.NewLine, groups));
    // this prints:
    // a,a
    // b,b
    // c,c      
}

public class BadComparer : IEqualityComparer<string> {
    public bool Equals(string a, string b) { return a == b; }
    public int GetHashCode(string s) { return 0; }
}

Note also that it's important to group by the string itself rather than by its hash code:

myStrings.GroupBy(s => s) // works

myStrings.GroupBy(s => s.GetHashCode()) // doesn't work

链接地址: http://www.djcxy.com/p/39740.html

上一篇: 实现正确的GetHashCode

下一篇: GroupBy对字符串(hashcode碰撞)