   This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly.

Or you can buy this book on Amazon.com.

# Chapter 11  HashMap

In the previous chapter, we wrote an implementation of the `Map` interface that uses hashing. We expect this version to be faster, because the lists it searches are shorter, but the order of growth is still linear.

If there are n entries and k sub-maps, the size of the sub-maps is n/k on average, which is still proportional to n. But if we increase k along with n, we can limit the size of n/k.

For example, suppose we double k every time n exceeds k; in that case the number of entries per map would be less than 1 on average, and pretty much always less than 10, as long as the hash function spreads out the keys reasonably well.

If the number of entries per sub-map is constant, we can search a single sub-map in constant time. And computing the hash function is generally constant time (it might depend on the size of the key, but does not depend on the number of keys). That makes the core `Map` methods, `put` and `get`, constant time.

In the next exercise, you’ll see the details.

## 11.1  Exercise 9

In `MyHashMap.java`, I provide the outline of a hash table that grows when needed. Here’s the beginning of the definition:

```public class MyHashMap<K, V> extends MyBetterMap<K, V> implements Map<K, V> {

// average number of entries per sub-map before we rehash
private static final double FACTOR = 1.0;

@Override
public V put(K key, V value) {
V oldValue = super.put(key, value);

// check if the number of elements per sub-map exceeds the threshold
if (size() > maps.size() * FACTOR) {
rehash();
}
return oldValue;
}
}
```

`MyHashMap` extends `MyBetterMap`, so it inherits the methods defined there. The only method it overrides is `put` which calls `put` in the superclass — that is, it calls the version of `put` in `MyBetterMap` — and then it checks whether it has to rehash. Calling `size` returns the total number of entries, n. Calling `maps.size` returns the number of embedded maps, k.

The constant `FACTOR`, which is called the load factor, determines the maximum number of entries per sub-map, on average. If `n > k * FACTOR`, that means `n/k > FACTOR`, which means the number of entries per sub-map exceeds the threshold, so we call `rehash`.

Run `ant build` to compile the source files. Then run `ant   MyHashMapTest`. It should fail because the implementation of `rehash` throws an exception. Your job is to fill it in.

Fill in the body of `rehash` to collect the entries in the table, resize the table, and then put the entries back in. I provide two methods that might come in handy: `MyBetterMap.makeMaps` and `MyLinearMap.getEntries`. Your solution should double the number of maps, k, each time it is called.

## 11.2  Analyzing `MyHashMap`

If the number of entries in the biggest sub-map is proportional to n/k, and k grows in proportion to n, several of the core `MyBetterMap` methods become constant time:

```    public boolean containsKey(Object target) {
MyLinearMap<K, V> map = chooseMap(target);
return map.containsKey(target);
}

public V get(Object key) {
MyLinearMap<K, V> map = chooseMap(key);
return map.get(key);
}

public V remove(Object key) {
MyLinearMap<K, V> map = chooseMap(key);
return map.remove(key);
}
```

Each method hashes a key, which is constant time, and then invokes a method on a sub-map, which is constant time.

So far, so good. But the other core method, `put`, is a little harder to analyze. When we don’t have to rehash, it is constant time, but when we do, it’s linear. In that way, it’s similar to `ArrayList.add`, which we analyzed in Section 3.2.

For the same reason, `MyHashMap.put` turns out to be constant time if we average over a series of invocations. Again, the argument is based on amortized analysis (see Section 3.2).

Suppose the initial number of sub-maps, k, is 2, and the load factor is 1. Now let’s see how much work it takes to `put` a series of keys. As the basic “unit of work”, we’ll count the number of times we have to hash a key and add it to a sub-map.

The first time we call `put` it takes 1 unit of work. The second time also takes 1 unit. The third time we have to rehash, so it takes 2 units to rehash the existing keys and 1 unit to hash the new key.

Now the size of the hash table is 4, so the next time we call `put`, it takes 1 unit of work. But the next time we have to rehash, which takes 4 units to rehash the existing keys and 1 unit to hash the new key.

Figure 11.1 shows the pattern, with the normal work of hashing a new key shown across the bottom and extra work of rehashing shown as a tower. Figure 11.1: Representation of the work done to add elements to a hash table.

As the arrows suggest, if we knock down the towers, each one fills the space before the next tower. The result is a uniform height of 2 units, which shows that the average work per `put` is about 2 units. And that means that `put` is constant time on average.

This diagram also shows why it is important to double the number of sub-maps, k, when we rehash. If we only add to k instead of multiplying, the towers would be too close together and they would start piling up. And that would not be constant time.

We’ve shown that `containsKey`, `get`, and `remove` are constant time, and `put` is constant time on average. We should take a minute to appreciate how remarkable that is. The performance of these operations is pretty much the same no matter how big the hash table is. Well, sort of.

Remember that our analysis is based on a simple model of computation where each “unit of work” takes the same amount of time. Real computers are more complicated than that. In particular, they are usually fastest when working with data structures small enough to fit in cache; somewhat slower if the structure doesn’t fit in cache but still fits in memory; and much slower if the structure doesn’t fit in memory.

Another limitation of this implementation is that hashing doesn’t help if we are given a value rather than a key: `containsValue` is linear because it has to search all of the sub-maps. And there is no particularly efficient way to look up a value and find the corresponding key (or possibly keys).

And there’s one more limitation: some of the methods that were constant time in `MyLinearMap` have become linear. For example:

```    public void clear() {
for (int i=0; i<maps.size(); i++) {
maps.get(i).clear();
}
}
```

`clear` has to clear all of the sub-maps, and the number of sub-maps is proportional to n, so it’s linear. Fortunately, this operation is not used very often, so for most applications this tradeoff is acceptable.

## 11.4  Profiling `MyHashMap`

Before we go on, we should check whether `MyHashMap.put` is really constant time.

Run `ant build` to compile the source files. Then run `ant ProfileMapPut`. It measures the run time of `HashMap.put` (provided by Java) with a range of problem sizes, and plots run time versus problem size on a log-log scale. If this operation is constant time, the total time for n operations should be linear, so the result should be a straight line with slope 1. When I ran this code, the estimated slope was close to 1, which is consistent with our analysis. You should get something similar.

Modify `ProfileMapPut.java` so it profiles your implementation, `MyHashMap`, instead of Java’s `HashMap`. Run the profiler again and see if the slope is near 1. You might have to adjust `startN` and `endMillis` to find a range of problem sizes where the run times are more than a few milliseconds, but not more than a few thousand.

When I ran this code, I got a surprise: the slope was about 1.7, which suggests that this implementation is not constant time after all. It contains a “performance bug”.

Before you read the next section, you should track down the error, fix it, and confirm that `put` is now constant time, as expected.

## 11.5  Fixing `MyHashMap`

The problem with `MyHashMap` is in `size`, which is inherited from `MyBetterMap`:

```    public int size() {
int total = 0;
for (MyLinearMap<K, V> map: maps) {
total += map.size();
}
}
```

To add up the total size it has to iterate the sub-maps. Since we increase the number of sub-maps, k, as the number of entries, n, increases, k is proportional to n, so `size` is linear.

And that makes `put` linear, too, because it uses `size`:

```    public V put(K key, V value) {
V oldValue = super.put(key, value);

if (size() > maps.size() * FACTOR) {
rehash();
}
return oldValue;
}
```

Everything we did to make `put` constant time is wasted if `size` is linear!

Fortunately, there is a simple solution, and we have seen it before: we have to keep the number of entries in an instance variable and update it whenever we call a method that changes it.

You’ll find my solution in the repository for this book, in `MyFixedHashMap.java`. Here’s the beginning of the class definition:

```public class MyFixedHashMap<K, V> extends MyHashMap<K, V> implements Map<K, V> {

private int size = 0;

public void clear() {
super.clear();
size = 0;
}
```

Rather than modify `MyHashMap`, I define a new class that extends it. It adds a new instance variable, `size`, which is initialized to zero.

Updating `clear` is straightforward; we invoke `clear` in the superclass (which clears the sub-maps), and then update `size`.

Updating `remove` and `put` is a little more difficult because when we invoke the method on the superclass, we can’t tell whether the size of the sub-map changed. Here’s how I worked around that:

```    public V remove(Object key) {
MyLinearMap<K, V> map = chooseMap(key);
size -= map.size();
V oldValue = map.remove(key);
size += map.size();
return oldValue;
}
```

`remove` uses `chooseMap` to find the right sub-map, then subtracts away the size of the sub-map. It invokes `remove` on the sub-map, which may or may not change the size of the sub-map, depending on whether it finds the key. But either way, we add the new size of the sub-map back to `size`, so the final value of `size` is correct.

The rewritten version of `put` is similar:

```    public V put(K key, V value) {
MyLinearMap<K, V> map = chooseMap(key);
size -= map.size();
V oldValue = map.put(key, value);
size += map.size();

if (size() > maps.size() * FACTOR) {
size = 0;
rehash();
}
return oldValue;
}
```

We have the same problem here: when we invoke `put` on the sub-map, we don’t know whether it added a new entry. So we use the same solution, subtracting off the old size and then adding in the new size.

Now the implementation of the `size` method is simple:

```    public int size() {
return size;
}
```

And that’s pretty clearly constant time.

When I profiled this solution, I found that the total time for putting n keys is proportional to n, which means that each `put` is constant time, as it’s supposed to be.

## 11.6  UML class diagrams

One challenge of working with the code in this chapter is that we have several classes that depend on each other. Here are some of the relationships between the classes:

• `MyLinearMap` contains a `LinkedList` and implements `Map`.
• `MyBetterMap` contains many `MyLinearMap` objects and implements `Map`.
• `MyHashMap` extends `MyBetterMap`, so it also contains `MyLinearMap` objects, and it implements `Map`.
• `MyFixedHashMap` extends `MyHashMap` and implements `Map`.

To help keep track of relationships like these, software engineers often use UML class diagrams. UML stands for Unified Modeling Language (see http://thinkdast.com/uml). A “class diagram” is one of several graphical standards defined by UML.

In a class diagram, each class is represented by a box, and relationships between classes are represented by arrows. Figure 11.2 shows a UML class diagram for the classes from the previous exercise, generated using the online tool yUML at http://yuml.me/. Figure 11.2: UML diagram for the classes in this chapter.

Different relationships are represented by different arrows:

• Arrows with a solid head indicate HAS-A relationships. For example, each instance of `MyBetterMap` contains multiple instances of `MyLinearMap`, so they are connected by a solid arrow.
• Arrows with a hollow head and a solid line indicate IS-A relationships. For example, `MyHashMap` extends `MyBetterMap`, so they are connected by an IS-A arrow.
• Arrows with a hollow head and a dashed line indicate that a class implements an interface; in this diagram, every class implements `Map`.

UML class diagrams provide a concise way to represent a lot of information about a collection of classes. They are used during design phases to communicate about alternative designs, during implementation phases to maintain a shared mental map of the project, and during deployment to document the design.

#### Are you using one of our books in a class?

We'd like to know about it. Please consider filling out this short survey.

Think Data Structures   Think DSP   Think Java   Think Bayes   Think Python 2e   Think Stats 2e   Think Complexity      