Why does the GetHashCode method need to do a shift in C#?
I couldn't say why they chose this particular hash code implementation, but with regard to this question:
The `ShiftAndWrap()` method here is a generic implementation of an algorithm for left-shifting a value by N bits and wrapping the overflow back around to the low end. So before doing the shift, it first grabs the left-most N bits so it can append them onto the end.
Here's what calling `ShiftAndWrap()` would look like if we were working with 8-bit values (`byte`s) and called it with `value` = 11010010 (binary) and `positions` = 3:
```
value     = 11010010
positions = 3

wrapped = value >> (8 - positions)
        = 11010010 >> (8 - 3)
        = 11010010 >> 5
        = 00000110

result = value << positions | wrapped
       = 11010010 << 3 | 00000110     (the top 3 bits fall off an 8-bit value)
       = 10010000 | 00000110
       = 10010110
```
We can see that the return value `10010110` is the result of shifting `11010010` left by three bits and wrapping the overflowed bits back around to the end.
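For concreteness, here is a minimal C# sketch of that rotate for 32-bit values, modeled on the documentation's `ShiftAndWrap` helper (treat it as an illustration rather than the exact library code):

```csharp
private static int ShiftAndWrap(int value, int positions)
{
    // Keep the shift count in the 0-31 range.
    positions &= 0x1F;

    // Work on the raw bit pattern as unsigned so the sign bit
    // doesn't propagate during the right shift.
    uint number = unchecked((uint)value);

    // Capture the top 'positions' bits that the left shift would discard...
    uint wrapped = number >> (32 - positions);

    // ...and splice them back onto the low end.
    return unchecked((int)((number << positions) | wrapped));
}
```

A `Point.GetHashCode()` can then combine the coordinates along the lines of `ShiftAndWrap(x.GetHashCode(), 2) ^ y.GetHashCode()`.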
As to the question of why they don't just use `x ^ y`, I suspect it's because plain XOR is symmetric: `Point(N, M)` would always produce the same hash code as `Point(M, N)`. By rotating the `x` value first, we get a hash code that takes into account not only the `x` and `y` values but also their order, whereas `x ^ y` alone ignores order.
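A quick demonstration of the difference, reusing the `ShiftAndWrap` sketch above (the coordinate values here are arbitrary):

```csharp
int x = 5, y = 9;

// Plain XOR is symmetric, so the two orderings collide by construction:
Console.WriteLine(x ^ y);                  // 12
Console.WriteLine(y ^ x);                  // 12

// Rotating one operand first makes the order matter:
Console.WriteLine(ShiftAndWrap(x, 2) ^ y); // 29, as if Point(5, 9)
Console.WriteLine(ShiftAndWrap(y, 2) ^ x); // 33, as if Point(9, 5)
```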
When hashing a data structure that contains sub-components of the same type, it's common for the hash function to treat each sub-component differently so that its position matters. For example, Java uses this hash formula for strings (here `^` denotes an exponent, not XOR):

```
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
```
We can see that each character is multiplied by a different power of 31, so that `stop` gets a different hash code from `pots`.
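Translated to C# (using the iterative form `h = h * 31 + c`, which computes the same sum), the two anagrams diverge:

```csharp
using System;

static int JavaStyleStringHash(string s)
{
    int h = 0;
    foreach (char c in s)
    {
        // Horner's method: each earlier character ends up
        // multiplied by a higher power of 31.
        h = unchecked(h * 31 + c);
    }
    return h;
}

Console.WriteLine(JavaStyleStringHash("stop")); // 3540994
Console.WriteLine(JavaStyleStringHash("pots")); // 3446974
```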
As for why they chose `2` as the number of positions to shift, that might be arbitrary, or they may have done some evaluation to see which shift amounts tend to produce the best distribution.
The point of `GetHashCode` is to create a distribution so that data structures can allocate the data into certain buckets. It's not meant for equality.
If you look at the internals of `HashSet<T>`, you can see that the class uses the hash code to identify the correct bucket and then uses the `Equals` method to determine equality:
```csharp
/// <summary>
/// Checks if this hashset contains the item
/// </summary>
/// <param name="item">item to check for containment</param>
/// <returns>true if item contained; false if not</returns>
public bool Contains(T item) {
    if (m_buckets != null) {
        int hashCode = InternalGetHashCode(item);
        // see note at "HashSet" level describing why "- 1" appears in for loop
        for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
            if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                return true;
            }
        }
    }
    // either m_buckets is null or wasn't found
    return false;
}
```
So collisions are fine; the hash code is just there to ensure a relatively even distribution that allows for faster identification and retrieval. Meaning that, in your case, both of those points will be placed in the same bucket, but their `Equals` method will be used to tell them apart.
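To make that concrete, here is a small sketch (the `CollidingPoint` type is hypothetical) whose hash code is deliberately constant, so every instance lands in the same bucket, yet the set still tells the values apart via `Equals`:

```csharp
using System;
using System.Collections.Generic;

var set = new HashSet<CollidingPoint>
{
    new CollidingPoint(1, 2),
    new CollidingPoint(2, 1),
};

Console.WriteLine(set.Count);                              // 2: both survive the hash collision
Console.WriteLine(set.Contains(new CollidingPoint(1, 2))); // True: Equals picks the right slot

// Deliberately pathological type: every instance hashes identically.
sealed class CollidingPoint : IEquatable<CollidingPoint>
{
    public int X { get; }
    public int Y { get; }

    public CollidingPoint(int x, int y) { X = x; Y = y; }

    // Worst-case hash function: all instances share one bucket.
    public override int GetHashCode() => 42;

    public bool Equals(CollidingPoint other) =>
        other != null && X == other.X && Y == other.Y;

    public override bool Equals(object obj) => Equals(obj as CollidingPoint);
}
```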