hashCode() and equals()

I came across this fun bug yesterday.

In our application we use a compound object that serves as an ID for other objects. Since this ID object never changes, there are tons of IDs, and each one is fairly heavy, it makes sense to have only one instance of each ID in the system. To do that we use a kind of flyweight pattern: we keep a cache of unique IDs, and each time we want an ID we first try to get it from the cache; only if that fails do we create a new object and add it to the cache:

/**
 * The real method receives several strings as parameters.
 */
private IDObject getID(String id) {

	// tmpID is initialized with new values every time.
	// There is only one tmpID instance to avoid creating many objects.
	tmpID.initialize(id);

	// Check if we already have this ID object
	IDObject cachedID = idCache.get(tmpID);
	if (cachedID == null) {

		// If not, create a new cache item
		cachedID = (IDObject)tmpID.clone();
		idCache.put(cachedID, cachedID);
	}

	return cachedID;
}

Interning the IDs also pays off because we compare these objects to one another all the time, and the comparison happens in some extremely hot pieces of code. Most of the time the IDs we compare are equal (if they are not equal, it’s a bug), so being able to use reference comparison instead of string comparison makes our equals() method very fast:

@Override
public boolean equals(Object other) {

	// Most of the time we will hit this condition
	// and exit quickly
	if (this == other) return true;

	// Failed. Fall back to some slow comparisons...
}

Yesterday I investigated some performance bottlenecks in our system and found out that not only did we have multiple instances of the same ID in the cache, but we were also comparing strings all the time! As it turns out, as the definition of this ID object changed over time, the implementation of its hashCode() method evolved with it, but the implementation of equals() hadn’t. Why is that a problem?

HashMap keeps its data in “buckets”. For each entry we put in the hash map, a hash code is computed on the key. This hash is then used to find the bucket in which the entry should be kept. For instance, let’s imagine there are 10 possible values for a hash (I’ll be original and say… 0 to 9) and we have two buckets. In this case hashes 0 to 4 go in the first bucket and 5 to 9 in the second. Now it is obvious we have a problem: there are more hashes than buckets! Let’s look at the implementation of HashMap.put() (JRE 1.5, comments are mine, some code removed):

public V put(K key, V value) {

	// ...

	// Null keys are mapped to an internal marker object
	K k = maskNull(key);

	// Get the hash for the given key.
	// This call uses k.hashCode() internally
	int hash = hash(k);
	int i = indexFor(hash, table.length);

	// Once you find the bucket for the object,
	// loop on the entries in this table index
	for (Entry<K,V> e = table[i]; e != null; e = e.next) {

		// If the current entry has the same hash as the key
		// and the keys are equal as per equals() method
		// replace the old object with the new one
		if (e.hash == hash && eq(k, e.key)) {
			V oldValue = e.value;
			e.value = value;
			e.recordAccess(this);
			return oldValue;
		}
	}

	// If we got here, we haven't found an existing key
	// so we just add the new one.
	addEntry(hash, k, value, i);
	return null;
}
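
The bucket selection in indexFor() is a one-liner in the same source: the table length is always a power of two, so masking the hash with length - 1 is a cheap way of computing the hash modulo the number of buckets:

static int indexFor(int h, int length) {
	return h & (length - 1);
}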

In order to deal with the fact that there are more hashes than buckets, each bucket holds not a single entry but a list of entries. On put we traverse the list and compare (remember that equals()?) each entry’s key to the new key[1]. If the keys are equal, the key already exists and we replace the old value with the new one; otherwise it’s a new key.

So what happens if we don’t keep hashCode() and equals() consistent? Given two objects that are supposed to be equal, one of two things may happen:

Equal objects with different hash (faulty hashCode())
In this case the two objects may end up in different buckets, and the HashMap implementation won’t be able to identify the duplicate. In my use case this results in memory bloat, because too many ID objects are created, and in slow ID comparison, because we can no longer count on reference equality to do the work for us.
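
This failure mode is easy to reproduce in isolation. BrokenHashID below is a made-up class that overrides equals() but not hashCode():

import java.util.HashMap;
import java.util.Map;

final class BrokenHashID {

	final String value;

	BrokenHashID(String value) {
		this.value = value;
	}

	@Override
	public boolean equals(Object other) {
		return other instanceof BrokenHashID
			&& ((BrokenHashID) other).value.equals(value);
	}

	// No hashCode() override: the identity hash inherited from Object
	// is (almost certainly) different for each instance
}

public class BrokenHashDemo {

	public static void main(String[] args) {
		Map<BrokenHashID, BrokenHashID> cache = new HashMap<BrokenHashID, BrokenHashID>();
		BrokenHashID a = new BrokenHashID("42");
		BrokenHashID b = new BrokenHashID("42");
		cache.put(a, a);
		cache.put(b, b);

		// a.equals(b) is true, yet the map keeps both entries
		System.out.println(cache.size()); // almost always prints 2
	}
}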

Same hashCode, but equals is false (faulty equals()) [2]
In that case the HashMap will fail to identify the duplicate during the lookup stage and will simply append the new ID to the existing linked list. Here again we get a gigantic linked list and no fast reference comparison.
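
And the mirror image: a made-up BrokenEqualsID that overrides hashCode() but leaves equals() as Object’s reference comparison, so the logical duplicates pile up in the same bucket:

import java.util.HashMap;
import java.util.Map;

final class BrokenEqualsID {

	final String value;

	BrokenEqualsID(String value) {
		this.value = value;
	}

	@Override
	public int hashCode() {
		return value.hashCode();
	}

	// No equals() override: Object's reference equality never matches
	// a distinct instance, even one holding an equal value
}

public class BrokenEqualsDemo {

	public static void main(String[] args) {
		Map<BrokenEqualsID, BrokenEqualsID> cache = new HashMap<BrokenEqualsID, BrokenEqualsID>();
		BrokenEqualsID a = new BrokenEqualsID("42");
		BrokenEqualsID b = new BrokenEqualsID("42");
		cache.put(a, a);
		cache.put(b, b);

		// Same hash, same bucket, but the equals() check fails,
		// so both entries end up in that bucket's list
		System.out.println(cache.size()); // always prints 2
	}
}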

Conclusion? Keep your hashCode and equals implementations consistent!
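
In practice, consistent means both methods are derived from exactly the same fields. Here is a sketch using the classic recipe from Effective Java; the field names are hypothetical, since the real IDObject is built from several strings:

@Override
public boolean equals(Object other) {
	if (this == other) return true;
	if (!(other instanceof IDObject)) return false;
	IDObject that = (IDObject) other;
	// firstPart and secondPart are hypothetical fields
	return this.firstPart.equals(that.firstPart)
		&& this.secondPart.equals(that.secondPart);
}

@Override
public int hashCode() {
	// Combines exactly the same fields used by equals()
	int result = 17;
	result = 31 * result + firstPart.hashCode();
	result = 31 * result + secondPart.hashCode();
	return result;
}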


[1] This is why a good distribution of hashes is so important. If all entries end up in a small number of buckets, we traverse some really long linked lists, which is bad for performance.
[2] It is important to note that this is only a problem if the objects are equal. Two different objects can legally have the same hashCode; hash codes are not unique!
