Previous Up Next

This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly.

You might prefer to read the PDF version.

Or you can buy this book on Amazon.com.

Chapter 10  Hashing

In this chapter, I define MyBetterMap, a better implementation of the Map interface than MyLinearMap, and introduce hashing, which makes MyBetterMap more efficient.

10.1  Hashing

To improve the performance of MyLinearMap, we’ll write a new class, called MyBetterMap, that contains a collection of MyLinearMap objects. It divides the keys among the embedded maps, so the number of entries in each map is smaller, which speeds up findEntry and the methods that depend on it.

Here’s the beginning of the class definition:

public class MyBetterMap<K, V> implements Map<K, V> {
    
    protected List<MyLinearMap<K, V>> maps;
    
    public MyBetterMap(int k) {
        makeMaps(k);
    }

    protected void makeMaps(int k) {
        maps = new ArrayList<MyLinearMap<K, V>>(k);
        for (int i=0; i<k; i++) {
            maps.add(new MyLinearMap<K, V>());
        }
    }
}

The instance variable, maps, is a collection of MyLinearMap objects. The constructor takes a parameter, k, that determines how many maps to use, at least initially. Then makeMaps creates the embedded maps and stores them in an ArrayList.

Now, the key to making this work is that we need some way to look at a key and decide which of the embedded maps it should go into. When we put a new key, we choose one of the maps; when we get the same key, we have to remember where we put it.

One possibility is to choose one of the sub-maps at random and keep track of where we put each key. But how should we keep track? It might seem like we could use a Map to look up the key and find the right sub-map, but the whole point of the exercise is to write an efficient implementation of a Map. We can’t assume we already have one.

A better approach is to use a hash function, which takes an Object, any Object, and returns an integer called a hash code. Importantly, if it sees the same Object more than once, it always returns the same hash code. That way, if we use the hash code to store a key, we’ll get the same hash code when we look it up.

In Java, every Object provides a method called hashCode that computes a hash function. The implementation of this method is different for different objects; we’ll see an example soon.

Here’s a helper method that chooses the right sub-map for a given key:

protected MyLinearMap<K, V> chooseMap(Object key) {
    int index = 0;
    if (key != null) { 
        index = Math.abs(key.hashCode()) % maps.size();
    }
    return maps.get(index);
}

If key is null, we choose the sub-map with index 0, arbitrarily. Otherwise we use hashCode to get an integer, apply Math.abs to make sure it is non-negative, then use the remainder operator, \%, which guarantees that the result is between 0 and maps.size()-1. So index is always a valid index into maps. Then chooseMap returns a reference to the map it chose.

We use chooseMap in both put and get, so when we look up a key, we get the same map we chose when we added the key. At least, we should — I’ll explain a little later why this might not work.

Here’s my implementation of put and get:

public V put(K key, V value) {
  MyLinearMap<K, V> map = chooseMap(key);
    return map.put(key, value);
}

public V get(Object key) {
    MyLinearMap<K, V> map = chooseMap(key);
    return map.get(key);
}

Pretty simple, right? In both methods, we use chooseMap to find the right sub-map and then invoke a method on the sub-map. That’s how it works; now let’s think about performance.

If there are n entries split up among k sub-maps, there will be n/k entries per map, on average. When we look up a key, we have to compute its hash code, which takes some time, then we search the corresponding sub-map.

Because the entry lists in MyBetterMap are k times shorter than the entry list in MyLinearMap, we expect the search to be k times faster. But the run time is still proportional to n, so MyBetterMap is still linear. In the next exercise, you’ll see how we can fix that.

10.2  How does hashing work?

The fundamental requirement for a hash function is that the same object should produce the same hash code every time. For immutable objects, that’s relatively easy. For objects with mutable state, we have to think harder.

As an example of an immutable object, I’ll define a class called SillyString that encapsulates a String:

public class SillyString {
    private final String innerString;

    public SillyString(String innerString) {
        this.innerString = innerString;
    }

    public String toString() {
        return innerString;
    }

This class is not very useful, which is why it’s called SillyString, but I’ll use it to show how a class can define its own hash function:

    @Override
    public boolean equals(Object other) {
        return this.toString().equals(other.toString());
    }
    
    @Override
    public int hashCode() {
        int total = 0;
        for (int i=0; i<innerString.length(); i++) {
            total += innerString.charAt(i);
        }
        return total;
    }

Notice that SillyString overrides both equals and hashCode. This is important. In order to work properly, equals has to be consistent with hashCode, which means that if two objects are considered equal — that is, equals returns true — they should have the same hash code. But this requirement only works one way; if two objects have the same hash code, they don’t necessarily have to be equal.

equals works by invoking toString, which returns innerString. So two SillyString objects are equal if their innerString instance variables are equal.

hashCode works by iterating through the characters in the String and adding them up. When you add a character to an int, Java converts the character to an integer using its Unicode code point. You don’t need to know anything about Unicode to understand this example, but if you are curious, you can read more at http://thinkdast.com/codepoint.

This hash function satisfies the requirement: if two SillyString objects contain embedded strings that are equal, they will get the same hash code.

This works correctly, but it might not yield good performance, because it returns the same hash code for many different strings. If two strings contain the same letters in any order, they will have the same hash code. And even if they don’t contain the same letters, they might yield the same total, like "ac" and "bb".

If many objects have the same hash code, they end up in the same sub-map. If some sub-maps have more entries than others, the speedup when we have k maps might be much less than k. So one of the goals of a hash function is to be uniform; that is, it should be equally likely to produce any value in the range. You can read more about designing good hash functions at http://thinkdast.com/hash.

10.3  Hashing and mutation

Strings are immutable, and SillyString is also immutable because innerString is declared to be final. Once you create a SillyString, you can’t make innerString refer to a different String, and you can’t modify the String it refers to. Therefore, it will always have the same hash code.

But let’s see what happens with a mutable object. Here’s a definition for SillyArray, which is identical to SillyString, except that it uses an array of characters instead of a String:

public class SillyArray {
    private final char[] array;

    public SillyArray(char[] array) {
        this.array = array;
    }

    public String toString() {
        return Arrays.toString(array);
    }
    
    @Override
    public boolean equals(Object other) {
        return this.toString().equals(other.toString());
    }
    
    @Override
    public int hashCode() {
        int total = 0;
        for (int i=0; i<array.length; i++) {
            total += array[i];
        }
        System.out.println(total);
        return total;
    }

SillyArray also provides setChar, which makes it possible to modify the characters in the array:

public void setChar(int i, char c) {
    this.array[i] = c;
}

Now suppose we create a SillyArray and add it to a map:

SillyArray array1 = new SillyArray("Word1".toCharArray());
map.put(array1, 1);

The hash code for this array is 461. Now if we modify the contents of the array and then try to look it up, like this:

array1.setChar(0, 'C');
Integer value = map.get(array1);

the hash code after the mutation is 441. With a different hash code, there’s a good chance we’ll go looking in the wrong sub-map. In that case, we won’t find the key, even though it is in the map. And that’s bad.

In general, it is dangerous to use mutable objects as keys in data structures that use hashing, which includes MyBetterMap and HashMap. If you can guarantee that the keys won’t be modified while they are in the map, or that any changes won’t affect the hash code, it might be OK. But it is probably a good idea to avoid it.

10.4  Exercise 8

In this exercise, you will finish off the implementation of MyBetterMap. In the repository for this book, you’ll find the source files for this exercise:

  • MyLinearMap.java contains our solution to the previous exercise, which we will build on in this exercise.
  • MyBetterMap.java contains the code from the previous chapter with some methods you will fill in.
  • MyHashMap.java contains the outline of a hash table that grows when needed, which you will complete.
  • MyLinearMapTest.java contains the unit tests for MyLinearMap.
  • MyBetterMapTest.java contains the unit tests for MyBetterMap.
  • MyHashMapTest.java contains the unit tests for MyHashMap.
  • Profiler.java contains code for measuring and plotting run time versus problem size.
  • ProfileMapPut.java contains code that profiles the Map.put method.

As usual, you should run ant build to compile the source files. Then run ant MyBetterMapTest. Several tests should fail, because you have some work to do!

Review the implementation of put and get from the previous chapter. Then fill in the body of containsKey. HINT: use chooseMap. Run ant MyBetterMapTest again and confirm that testContainsKey passes.

Fill in the body of containsValue. HINT: don’t use chooseMap. Run ant MyBetterMapTest again and confirm that testContainsValue passes. Notice that we have to do more work to find a value than to find a key.

Like put and get, this implementation of containsKey is linear, because it has to search one of the embedded sub-maps. In the next chapter, we’ll see how we can improve this implementation even more.

Are you using one of our books in a class?

We'd like to know about it. Please consider filling out this short survey.


Think Data Structures

Think DSP

Think Java

Think Bayes

Think Python 2e

Think Stats 2e

Think Complexity


Previous Up Next