Fastest Java HashSet<Integer> library

2018-06-22 05:09:21

Following up on this quite old post, I need something that uses primitives and speeds up an application that contains lots of HashSets of Integers:

Set<Integer> set = new HashSet<Integer>();

People mention libraries like Guava, Javolution and Trove, but I have not found a proper comparison of them in terms of benchmarks and performance results, or at least a good answer based on real experience. From what I see, many recommend Trove's TIntHashSet, but others say it is not that good. Some say Guava is supercool and manageable, but I do not need beauty or maintainability, only execution time, so Python-style Guava goes home :) Javolution? I've visited the website; it seems too old to me and thus wacky.

The library should provide the best achievable execution time; memory does not matter.

"Thinking in Java" describes the idea of creating a custom HashMap with int[] as keys. I would like to see something similar for a HashSet, or simply download and use an amazing library.

EDIT (in response to the comments below): In my project I start with about 50 HashSet<Integer> collections, then I call a function about 1000 times, and each call creates up to 10 HashSet<Integer> collections internally. If I change the initial parameters, these numbers may grow exponentially. I only use the add(), contains() and clear() methods on those collections, which is why they were chosen.

Now I'm looking for a library that implements HashSet or something similar, but is faster because it avoids the Integer autoboxing overhead and maybe something else I do not know about. My data actually comes in as ints, and I store them in those HashSets.

Have you tried working with the initial capacity and load factor parameters while creating your HashSet?

HashSet doc

Initial capacity, as you might expect, is the number of buckets the empty HashSet is created with, and the load factor is the threshold that determines when the hash table grows. Normally you want to keep the ratio of used buckets to total buckets below two thirds, which is commonly regarded as a good ratio for stable performance in a hash table.

Dynamic resizing of a hash table

So basically, try to set an initial capacity that fits your needs (to avoid re-creating the hash table and reassigning its values when it grows), and experiment with the load factor until you find a sweet spot.
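As a rough sketch of that advice: the two-argument constructor HashSet(int initialCapacity, float loadFactor) is part of the JDK, while the helper capacityFor, the expected size of 10,000 and the load factor of 0.5 are assumptions chosen only for illustration. Dividing the expected element count by the load factor gives a capacity that should let the set fill up without rehashing.

```java
import java.util.HashSet;
import java.util.Set;

class PresizedHashSetExample {
    // Hypothetical helper: smallest initial capacity that should hold
    // expectedSize elements at the given load factor without a resize.
    static int capacityFor(int expectedSize, float loadFactor) {
        return (int) Math.ceil(expectedSize / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expectedSize = 10_000; // assumed workload size
        float loadFactor = 0.5f;   // assumed value; the JDK default is 0.75

        Set<Integer> set = new HashSet<>(capacityFor(expectedSize, loadFactor), loadFactor);
        for (int i = 0; i < expectedSize; i++) {
            set.add(i); // still boxes each int into an Integer
        }
        System.out.println(set.contains(42));

        // clear() keeps the allocated table, so the same set can be reused.
        set.clear();
    }
}
```

Note that this only removes the resizing cost; the boxing of every int into an Integer remains, so it is worth measuring before and after.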

For your particular data distribution and access pattern, a lower load factor might help (a higher one is unlikely to, but your mileage may vary).

Trove is an excellent choice.

The reason why it is much faster than generic collections is memory use.

A java.util.HashSet<Integer> uses a java.util.HashMap<Integer, Object> internally. In a HashMap, each element is held in an Entry<Integer, Object>. These objects take an estimated 24 bytes for the Entry + 16 bytes for the actual Integer + 4 bytes in the actual hash table. This yields 44 bytes, as opposed to 4 bytes in Trove, an up to 11x memory overhead (note that unoccupied entries in the main table will make the difference smaller in practice).
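To make the primitive layout concrete, here is a minimal sketch of an int set based on open addressing with linear probing. It is not Trove's implementation; the class name IntHashSet, the mixing constant and the 50% maximum fill are assumptions made for illustration. It supports only add(), contains() and clear(), the three operations the question uses.

```java
import java.util.Arrays;

/** Hypothetical minimal primitive int set (open addressing, linear probing). */
class IntHashSet {
    private int[] keys;     // 4 bytes per slot, no Entry or Integer objects
    private boolean[] used; // 1 byte per slot, so any int (including 0) can be stored
    private int size;

    IntHashSet(int expectedSize) {
        if (expectedSize < 0 || expectedSize > (1 << 29)) {
            throw new IllegalArgumentException("unsupported size: " + expectedSize);
        }
        int capacity = 16;
        while (capacity < expectedSize * 2L) capacity <<= 1; // power of two, at most half full
        keys = new int[capacity];
        used = new boolean[capacity];
    }

    private int slot(int key) {
        int h = key * 0x9E3779B9; // multiplicative mixing to spread clustered keys
        return (h ^ (h >>> 16)) & (keys.length - 1);
    }

    /** Returns true if the key was not already present. */
    boolean add(int key) {
        int i = slot(key);
        while (used[i]) {
            if (keys[i] == key) return false;
            i = (i + 1) & (keys.length - 1); // probe the next slot, wrapping around
        }
        used[i] = true;
        keys[i] = key;
        if (++size * 2 > keys.length) grow(); // keep the table at most half full
        return true;
    }

    boolean contains(int key) {
        int i = slot(key);
        while (used[i]) {
            if (keys[i] == key) return true;
            i = (i + 1) & (keys.length - 1);
        }
        return false;
    }

    /** Empties the set but keeps the arrays, so reuse allocates nothing. */
    void clear() {
        Arrays.fill(used, false);
        size = 0;
    }

    int size() {
        return size;
    }

    private void grow() {
        int[] oldKeys = keys;
        boolean[] oldUsed = used;
        keys = new int[oldKeys.length * 2];
        used = new boolean[oldUsed.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) add(oldKeys[i]); // reinsert into the larger table
        }
    }
}
```

Usage mirrors HashSet: create it with new IntHashSet(1000), then call add(42), contains(42) and clear(). Whether it actually beats a tuned library for a given workload is something only a benchmark on real data can tell.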