This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly.

You might prefer to read the PDF version.

Or you can buy this book on Amazon.com.

Chapter 11 HashMap

In the previous chapter, we wrote an implementation of the Map interface that uses hashing. We expect this version to be faster, because the lists it searches are shorter, but the order of growth is still linear.

If there are n entries and k sub-maps, the size of the sub-maps is n/k on average, which is still proportional to n. But if we increase k along with n, we can limit the size of n/k.

For example, suppose we double k every time n exceeds k; in that case the number of entries per map would be less than 1 on average, and pretty much always less than 10, as long as the hash function spreads out the keys reasonably well.

If the number of entries per sub-map is constant, we can search a single sub-map in constant time. And computing the hash function is generally constant time (it might depend on the size of the key, but does not depend on the number of keys). That makes the core Map methods, put and get, constant time.

In the next exercise, you’ll see the details.

11.1 Exercise 9

In MyHashMap.java, I provide the outline of a hash table that grows when needed. Here’s the beginning of the definition:

public class MyHashMap<K, V> extends MyBetterMap<K, V> implements Map<K, V> {

    // average number of entries per sub-map before we rehash
    private static final double FACTOR = 1.0;

    @Override
    public V put(K key, V value) {
        V oldValue = super.put(key, value);

        // check if the number of elements per sub-map exceeds the threshold
        if (size() > maps.size() * FACTOR) {
            rehash();
        }
        return oldValue;
    }
}

MyHashMap extends MyBetterMap, so it inherits the methods defined there. The only method it overrides is put which calls put in the superclass — that is, it calls the version of put in MyBetterMap — and then it checks whether it has to rehash. Calling size returns the total number of entries, n. Calling maps.size returns the number of embedded maps, k.

The constant FACTOR, which is called the load factor, determines the maximum number of entries per sub-map, on average. If n > k * FACTOR, that means n/k > FACTOR, which means the number of entries per sub-map exceeds the threshold, so we call rehash.

Run ant build to compile the source files. Then run ant MyHashMapTest. It should fail because the implementation of rehash throws an exception. Your job is to fill it in.

Fill in the body of rehash to collect the entries in the table, resize the table, and then put the entries back in. I provide two methods that might come in handy: MyBetterMap.makeMaps and MyLinearMap.getEntries. Your solution should double the number of maps, k, each time it is called.

11.2 Analyzing `MyHashMap`

If the number of entries in the biggest sub-map is proportional to n/k, and k grows in proportion to n, several of the core MyBetterMap methods become constant time:

    public boolean containsKey(Object target) {
        MyLinearMap<K, V> map = chooseMap(target);
        return map.containsKey(target);
    }

    public V get(Object key) {
        MyLinearMap<K, V> map = chooseMap(key);
        return map.get(key);
    }

    public V remove(Object key) {
        MyLinearMap<K, V> map = chooseMap(key);
        return map.remove(key);
    }

Each method hashes a key, which is constant time, and then invokes a method on a sub-map, which is constant time.

So far, so good. But the other core method, put, is a little harder to analyze. When we don’t have to rehash, it is constant time, but when we do, it’s linear. In that way, it’s similar to ArrayList.add, which we analyzed in Section 3.2.

For the same reason, MyHashMap.put turns out to be constant time if we average over a series of invocations. Again, the argument is based on amortized analysis (see Section 3.2).

Suppose the initial number of sub-maps, k, is 2, and the load factor is 1. Now let’s see how much work it takes to put a series of keys. As the basic “unit of work”, we’ll count the number of times we have to hash a key and add it to a sub-map.

The first time we call put it takes 1 unit of work. The second time also takes 1 unit. The third time we have to rehash, so it takes 2 units to rehash the existing keys and 1 unit to hash the new key.

Now the size of the hash table is 4, so the next time we call put, it takes 1 unit of work. But the next time we have to rehash, which takes 4 units to rehash the existing keys and 1 unit to hash the new key.

Figure 11.1 shows the pattern, with the normal work of hashing a new key shown across the bottom and extra work of rehashing shown as a tower.

Figure 11.1: Representation of the work done to add elements to a hash table.

As the arrows suggest, if we knock down the towers, each one fills the space before the next tower. The result is a uniform height of 2 units, which shows that the average work per put is about 2 units. And that means that put is constant time on average.

This diagram also shows why it is important to double the number of sub-maps, k, when we rehash. If we only add to k instead of multiplying, the towers would be too close together and they would start piling up. And that would not be constant time.

11.3 The tradeoffs

We’ve shown that containsKey, get, and remove are constant time, and put is constant time on average. We should take a minute to appreciate how remarkable that is. The performance of these operations is pretty much the same no matter how big the hash table is. Well, sort of.

Remember that our analysis is based on a simple model of computation where each “unit of work” takes the same amount of time. Real computers are more complicated than that. In particular, they are usually fastest when working with data structures small enough to fit in cache; somewhat slower if the structure doesn’t fit in cache but still fits in memory; and much slower if the structure doesn’t fit in memory.

Another limitation of this implementation is that hashing doesn’t help if we are given a value rather than a key: containsValue is linear because it has to search all of the sub-maps. And there is no particularly efficient way to look up a value and find the corresponding key (or possibly keys).

And there’s one more limitation: some of the methods that were constant time in MyLinearMap have become linear. For example:

    public void clear() {
        for (int i=0; i<maps.size(); i++) {
            maps.get(i).clear();
        }
    }

clear has to clear all of the sub-maps, and the number of sub-maps is proportional to n, so it’s linear. Fortunately, this operation is not used very often, so for most applications this tradeoff is acceptable.

11.4 Profiling `MyHashMap`

Before we go on, we should check whether MyHashMap.put is really constant time.

Run ant build to compile the source files. Then run ant ProfileMapPut. It measures the run time of HashMap.put (provided by Java) with a range of problem sizes, and plots run time versus problem size on a log-log scale. If this operation is constant time, the total time for n operations should be linear, so the result should be a straight line with slope 1. When I ran this code, the estimated slope was close to 1, which is consistent with our analysis. You should get something similar.

Modify ProfileMapPut.java so it profiles your implementation, MyHashMap, instead of Java’s HashMap. Run the profiler again and see if the slope is near 1. You might have to adjust startN and endMillis to find a range of problem sizes where the run times are more than a few milliseconds, but not more than a few thousand.

When I ran this code, I got a surprise: the slope was about 1.7, which suggests that this implementation is not constant time after all. It contains a “performance bug”.

Before you read the next section, you should track down the error, fix it, and confirm that put is now constant time, as expected.

11.5 Fixing `MyHashMap`

The problem with MyHashMap is in size, which is inherited from MyBetterMap:

    public int size() {
        int total = 0;
        for (MyLinearMap<K, V> map: maps) {
            total += map.size();
        }
        return total;
    }

To add up the total size it has to iterate the sub-maps. Since we increase the number of sub-maps, k, as the number of entries, n, increases, k is proportional to n, so size is linear.

And that makes put linear, too, because it uses size:

    public V put(K key, V value) {
        V oldValue = super.put(key, value);

        if (size() > maps.size() * FACTOR) {
            rehash();
        }
        return oldValue;
    }

Everything we did to make put constant time is wasted if size is linear!

Fortunately, there is a simple solution, and we have seen it before: we have to keep the number of entries in an instance variable and update it whenever we call a method that changes it.

You’ll find my solution in the repository for this book, in MyFixedHashMap.java. Here’s the beginning of the class definition:

public class MyFixedHashMap<K, V> extends MyHashMap<K, V> implements Map<K, V> {

    private int size = 0;

    public void clear() {
        super.clear();
        size = 0;
    }

Rather than modify MyHashMap, I define a new class that extends it. It adds a new instance variable, size, which is initialized to zero.

Updating clear is straightforward; we invoke clear in the superclass (which clears the sub-maps), and then update size.

Updating remove and put is a little more difficult because when we invoke the method on the superclass, we can’t tell whether the size of the sub-map changed. Here’s how I worked around that:

    public V remove(Object key) {
        MyLinearMap<K, V> map = chooseMap(key);
        size -= map.size();
        V oldValue = map.remove(key);
        size += map.size();
        return oldValue;
    }

remove uses chooseMap to find the right sub-map, then subtracts away the size of the sub-map. It invokes remove on the sub-map, which may or may not change the size of the sub-map, depending on whether it finds the key. But either way, we add the new size of the sub-map back to size, so the final value of size is correct.

The rewritten version of put is similar:

    public V put(K key, V value) {
        MyLinearMap<K, V> map = chooseMap(key);
        size -= map.size();
        V oldValue = map.put(key, value);
        size += map.size();

        if (size() > maps.size() * FACTOR) {
            size = 0;
            rehash();
        }
        return oldValue;
    }

We have the same problem here: when we invoke put on the sub-map, we don’t know whether it added a new entry. So we use the same solution, subtracting off the old size and then adding in the new size.

Now the implementation of the size method is simple:

    public int size() {
        return size;
    }

And that’s pretty clearly constant time.

When I profiled this solution, I found that the total time for putting n keys is proportional to n, which means that each put is constant time, as it’s supposed to be.

11.6 UML class diagrams

One challenge of working with the code in this chapter is that we have several classes that depend on each other. Here are some of the relationships between the classes:

MyLinearMap contains a LinkedList and implements Map.
MyBetterMap contains many MyLinearMap objects and implements Map.
MyHashMap extends MyBetterMap, so it also contains MyLinearMap objects, and it implements Map.
MyFixedHashMap extends MyHashMap and implements Map.

To help keep track of relationships like these, software engineers often use UML class diagrams. UML stands for Unified Modeling Language (see http://thinkdast.com/uml). A “class diagram” is one of several graphical standards defined by UML.

In a class diagram, each class is represented by a box, and relationships between classes are represented by arrows. Figure 11.2 shows a UML class diagram for the classes from the previous exercise, generated using the online tool yUML at http://yuml.me/.

Figure 11.2: UML diagram for the classes in this chapter.

Different relationships are represented by different arrows:

Arrows with a solid head indicate HAS-A relationships. For example, each instance of MyBetterMap contains multiple instances of MyLinearMap, so they are connected by a solid arrow.
Arrows with a hollow head and a solid line indicate IS-A relationships. For example, MyHashMap extends MyBetterMap, so they are connected by an IS-A arrow.
Arrows with a hollow head and a dashed line indicate that a class implements an interface; in this diagram, every class implements Map.

UML class diagrams provide a concise way to represent a lot of information about a collection of classes. They are used during design phases to communicate about alternative designs, during implementation phases to maintain a shared mental map of the project, and during deployment to document the design.