Java HashMap performance optimization / alternative
I want to create a large HashMap but the put() performance is not good enough. Any ideas?
Other data structure suggestions are welcome but I need the lookup feature of a Java Map:
map.get(key)
In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.
Also, does anyone know if using different hash code distributions for the keys could help?
My hashCode() method:
byte[] a = new byte[2];
byte[] b = new byte[3];
...
public int hashCode() {
    int hash = 503;
    hash = hash * 5381 + (a[0] + a[1]);
    hash = hash * 5381 + (b[0] + b[1] + b[2]);
    return hash;
}
I am using the associative property of addition to ensure that equal objects have the same hashcode. The arrays are bytes with values in the range 0 - 51. Values are only used once in either array. The objects are equal if the a arrays contain the same values (in either order) and the same goes for the b array. So a = {0,1} b = {45,12,33} and a = {1,0} b = {33,45,12} are equal.
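For reference, a minimal sketch of an equals() matching that description (the class name Key and its fields are placeholders of mine, since the question doesn't show the real class; it would be paired with one of the hashCode() variants discussed below):

import java.util.Arrays;

// Hypothetical key class: two unordered byte arrays with values in 0-51.
public class Key {
    private final byte[] a; // 2 values, no repeats
    private final byte[] b; // 3 values, no repeats

    public Key(byte[] a, byte[] b) {
        this.a = a.clone();
        this.b = b.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key other = (Key) o;
        // Compare sorted copies so that element order does not matter.
        return Arrays.equals(sortedCopy(a), sortedCopy(other.a))
                && Arrays.equals(sortedCopy(b), sortedCopy(other.b));
    }

    private static byte[] sortedCopy(byte[] arr) {
        byte[] copy = arr.clone();
        Arrays.sort(copy);
        return copy;
    }
}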
EDIT, some notes:
A few people have criticized using a hash map or other data structure to store 26 million entries. I cannot see why this would seem strange. It looks like a classic data structures and algorithms problem to me. I have 26 million items and I want to be able to quickly insert them into and look them up from a data structure: give me the data structure and algorithms.
Setting the initial capacity of the default Java HashMap to 26 million decreases the performance.
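For what it's worth, the initial capacity interacts with the load factor: HashMap resizes and rehashes once the entry count exceeds capacity × load factor (0.75 by default). A rough sketch of presizing so that no rehash happens while filling the map; whether that helps here depends on the hash distribution, as discussed below:

import java.util.HashMap;
import java.util.Map;

// Sketch only: Key is a placeholder key class, Integer a stand-in value type.
static Map<Key, Integer> presizedMap(int expectedEntries) {
    float loadFactor = 0.75f; // HashMap's default
    // Keep expectedEntries below capacity * loadFactor so the table never
    // has to resize and rehash while it is being filled.
    int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
    return new HashMap<Key, Integer>(initialCapacity, loadFactor);
}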
Some people have suggested using databases; in some other situations that is definitely the smart option. But I am really asking a data structures and algorithms question. A full database would be overkill and much slower than a good data structure solution (after all, the database is just software, but would have communication and possibly disk overhead).
As many people pointed out, the hashCode() method was to blame. It was only generating around 20,000 codes for 26 million distinct objects. That is an average of 1,300 objects per hash bucket = very, very bad. However, if I turn the two arrays into a number in base 52 I am guaranteed to get a unique hash code for every object:
public int hashCode() {
    // assume that both a and b are sorted
    return a[0] + powerOf52(a[1], 1) + powerOf52(b[0], 2) + powerOf52(b[1], 3) + powerOf52(b[2], 4);
}

public static int powerOf52(byte b, int power) {
    int result = b;
    for (int i = 0; i < power; i++) {
        result *= 52;
    }
    return result;
}
The arrays are sorted to ensure this method fulfills the hashCode() contract that equal objects have the same hash code. Using the old method, the average number of puts per second, measured over blocks of 100,000 puts from 100,000 up to 2,000,000 total insertions, was:
168350.17
109409.195
81344.91
64319.023
53780.79
45931.258
39680.29
34972.676
31354.514
28343.062
25562.371
23850.695
22299.22
20998.006
19797.799
18702.951
17702.434
16832.182
16084.52
15353.083
Using the new method gives:
337837.84
337268.12
337078.66
336983.97
313873.2
317460.3
317748.5
320000.0
309704.06
310752.03
312944.5
265780.75
275540.5
264350.44
273522.97
270910.94
279008.7
276285.5
283455.16
289603.25
Much much better. The old method tailed off very quickly while the new one keeps up a good throughput.
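For context, one way per-block throughput like this might be measured. This is a rough sketch, not necessarily how the figures above were produced; keys and map are assumed to already exist:

// 'keys' is assumed to be a java.util.List of the key objects, 'map' the HashMap under test.
long blockStart = System.nanoTime();
for (int i = 0; i < keys.size(); i++) {
    map.put(keys.get(i), i);
    if ((i + 1) % 100000 == 0) {
        double seconds = (System.nanoTime() - blockStart) / 1e9;
        System.out.println((i + 1) + ": " + (100000 / seconds) + " puts/s");
        blockStart = System.nanoTime();
    }
}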
One thing I notice in your hashCode() method is that the order of the elements in the arrays a[] and b[] doesn't matter. Thus (a[]={1,2,3}, b[]={99,100}) will hash to the same value as (a[]={3,1,2}, b[]={100,99}). Actually, all keys k1 and k2 where sum(k1.a)==sum(k2.a) and sum(k1.b)==sum(k2.b) will result in collisions. I suggest assigning a weight to each position of the array:
hash = hash * 5381 + (c0*a[0] + c1*a[1]);
hash = hash * 5381 + (c0*b[0] + c1*b[1] + c3*b[2]);
where c0, c1 and c3 are distinct constants (you can use different constants for b if necessary). That should even things out a bit more.
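A concrete sketch of that suggestion, with arbitrarily chosen constants. One caveat: because the OP's equality ignores element order, the arrays would need to be in a canonical (e.g. sorted) order before hashing, or equal objects could end up with different hash codes:

// Weighted variant; c0, c1, c3 are arbitrary distinct constants.
// Assumes a and b are already in a canonical (sorted) order.
public int hashCode() {
    final int c0 = 23, c1 = 31, c3 = 41;
    int hash = 503;
    hash = hash * 5381 + (c0 * a[0] + c1 * a[1]);
    hash = hash * 5381 + (c0 * b[0] + c1 * b[1] + c3 * b[2]);
    return hash;
}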
To elaborate on Pascal: Do you understand how a HashMap works? You have some number of slots in your hash table. The hash value for each key is found, and then mapped to an entry in the table. If two hash values map to the same entry -- a "hash collision" -- HashMap builds a linked list.
Hash collisions can kill the performance of a hash map. In the extreme case, if all your keys have the same hash code, or if they have different hash codes but they all map to the same slot, then your hash map turns into a linked list.
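To illustrate with a deliberately bad, hypothetical key class: the map still returns correct answers, it just has to walk the collision chain on every operation:

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Deliberately terrible hash: every instance lands in the same bucket.
    static class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int hashCode() { return 42; } // all keys collide
    }

    public static void main(String[] args) {
        Map<BadKey, String> m = new HashMap<BadKey, String>();
        m.put(new BadKey(1), "one");
        m.put(new BadKey(2), "two");
        // Both entries share one bucket; get() still works, but it has to
        // compare against the chained keys one by one.
        System.out.println(m.get(new BadKey(2))); // prints "two"
    }
}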
So if you're seeing performance problems, the first thing I'd check is: am I getting a random-looking distribution of hash codes? If not, you need a better hash function. Well, "better" in this case may mean "better for my particular set of data". For example, suppose you were working with strings, and you took the length of the string for the hash value. (That's not how Java's String.hashCode works, but I'm just making up a simple example.) If your strings have widely varying lengths, from 1 to 10,000, and are fairly evenly distributed across that range, then this could be a very good hash function. But if your strings are all 1 or 2 characters, this would be a very bad hash function.
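One quick way to check is to count how many distinct hash codes your key set actually produces. A rough helper sketch, where keys is whatever collection you are about to insert:

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// If this count is tiny compared to the number of keys, the hash function is the problem.
static int distinctHashCodes(Collection<?> keys) {
    Set<Integer> codes = new HashSet<Integer>();
    for (Object k : keys) {
        codes.add(k.hashCode());
    }
    return codes.size();
}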
Edit: I should add: every time you add a new entry, HashMap checks whether the key is a duplicate. When there's a hash collision, it has to compare the incoming key against every key that mapped to that slot. So in the worst case, where everything hashes to a single slot, the second key is compared to the first key, the third key is compared to #1 and #2, the fourth key is compared to #1, #2, and #3, and so on. Inserting n keys this way costs n*(n-1)/2 comparisons; by the time you get to key #1 million, you've done roughly half a trillion compares.
@Oscar: Umm, I don't see how that's a "not really". It's more like a "let me clarify". But yes, it's true that if you make a new entry with the same key as an existing entry, this overwrites the first entry. That's what I meant when I talked about looking for duplicates in the last paragraph: whenever a key hashes to the same slot, HashMap must check whether it's a duplicate of an existing key, or whether the two just landed in the same slot by coincidence of the hash function. I don't know that that's the "whole point" of a HashMap: I would say that the "whole point" is that you can retrieve elements by key quickly.
But anyway, that doesn't affect the "whole point" that I was trying to make: When you have two keys -- yes, different keys, not the same key showing up again -- that map to the same slot in the table, HashMap builds a linked list. Then, because it has to check each new key to see if it is in fact a duplicate of an existing key, each attempt to add a new entry that maps to this same slot must chase the linked list examining each existing entry to see if this is a duplicate of a previously-seen key, or if it is a new key.
Update long after the original post
I just got an up-vote on this answer 6 years after posting which led me to re-read the question.
The hash function given in the question is not a good hash for 26 million entries.
It adds together a[0]+a[1] and b[0]+b[1]+b[2]. He says values of each byte range from 0 to 51, so that gives only (51*2+1)*(51*3+1)=15,862 possible hash values. With 26 million entries, this means an average of about 1639 entries per hash value. That is lots and lots of collisions, requiring lots and lots of sequential searches through linked lists.
The OP says that different orders within array a and array b should be considered equal, ie [[1,2],[3,4,5]].equals([[2,1],[5,3,4]]), and so to fulfill the contract they must have equal hash codes. Okay. Still, there are a lot more than 15,000 possible values. His second proposed hash function is much better, giving a broader range.
Though, as someone else commented, it seems inappropriate for a hash function to modify the data it is hashing. It would make more sense to "normalize" the object when it is created, or to have the hash function work from copies of the arrays. Also, using a loop to calculate constants every time through the function is inefficient. As there are only four values here, I would have either written
return a[0]+a[1]*52+b[0]*52*52+b[1]*52*52*52+b[2]*52*52*52*52;
which would cause the compiler to perform the calculation once at compile time; or have 4 static constants defined in the class.
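A sketch of that second option, combined with the normalize-on-construction idea mentioned above (Key and its fields are placeholder names; the arrays are sorted once when the object is built, so hashCode() never has to touch them):

// Powers of 52 computed once, instead of a loop on every hashCode() call.
private static final int P1 = 52;
private static final int P2 = 52 * 52;
private static final int P3 = 52 * 52 * 52;
private static final int P4 = 52 * 52 * 52 * 52;

// Normalize at construction so equals()/hashCode() can rely on sorted arrays.
public Key(byte[] a, byte[] b) {
    this.a = a.clone();
    this.b = b.clone();
    java.util.Arrays.sort(this.a);
    java.util.Arrays.sort(this.b);
}

@Override
public int hashCode() {
    return a[0] + a[1] * P1 + b[0] * P2 + b[1] * P3 + b[2] * P4;
}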
Also, the first draft of the hash function has several calculations that do nothing to add to the range of outputs. Note that he first sets hash = 503, then multiplies by 5381 before even considering values from the class. So... in effect this just adds a large constant to every hash value. What does this accomplish? Adding a constant to every hash value just burns CPU cycles without accomplishing anything useful. Lesson here: adding complexity to a hash function is not the goal. The goal is to get a broad range of different values, not just to add complexity for the sake of complexity.