This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book.
In particular, some of the symbols are not rendered correctly.
You might prefer to read the PDF version.
Or you can buy this book on Amazon.com.
Chapter 11 HashMap
In the previous chapter, we wrote an implementation of the
Map interface that uses hashing. We expect this version
to be faster, because the lists it searches are shorter, but
the order of growth is still linear.
If there are n entries and k sub-maps, the size of the sub-maps is
n/k on average, which is still proportional to n. But if we
increase k along with n, we can limit the size of n/k. For example, suppose we double k every
time n exceeds k; in that case the number of entries
per map would be less than 1 on average, and pretty much always less
than 10, as long as the hash function spreads out the keys reasonably
well.
If the number of entries per sub-map is constant, we can search a single
sub-map in constant time. And computing the hash function is generally
constant time (it might depend on the size of the key, but does not
depend on the number of keys). That makes the core Map methods,
put and get , constant time. In the next exercise, you’ll see the details.
11.1 Exercise 9
In MyHashMap.java , I provide the outline of a hash table that
grows when needed. Here’s the beginning of the definition: public class MyHashMap<K, V> extends MyBetterMap<K, V> implements Map<K, V> {
// average number of entries per sub-map before we rehash
private static final double FACTOR = 1.0;
@Override
public V put(K key, V value) {
V oldValue = super.put(key, value);
// check if the number of elements per sub-map exceeds the threshold
if (size() > maps.size() * FACTOR) {
rehash();
}
return oldValue;
}
}
MyHashMap extends MyBetterMap , so it inherits the
methods defined there. The only method it overrides is put
which calls put in the superclass — that is, it calls the
version of put in MyBetterMap — and then it checks
whether it has to rehash. Calling size returns the total number
of entries, n. Calling maps.size returns the number of
embedded maps, k.
The constant FACTOR , which is called the load factor,
determines the maximum number of entries per sub-map, on average. If
n > k * FACTOR , that means
n/k > FACTOR , which means the number of entries
per sub-map exceeds the threshold, so we call rehash .
Run ant build to compile the source files. Then run ant MyHashMapTest . It should fail because the implementation of
rehash throws an exception. Your job is to fill it in.
Fill in the body of rehash to collect the entries in the table,
resize the table, and then put the entries back in. I provide two
methods that might come in handy: MyBetterMap.makeMaps and
MyLinearMap.getEntries . Your solution should double the number
of maps, k, each time it is called.
11.2 Analyzing MyHashMap
If the number of entries in the biggest sub-map is proportional to
n/k, and k grows in proportion to n, several of the core
MyBetterMap methods become constant time: public boolean containsKey(Object target) {
MyLinearMap<K, V> map = chooseMap(target);
return map.containsKey(target);
}
public V get(Object key) {
MyLinearMap<K, V> map = chooseMap(key);
return map.get(key);
}
public V remove(Object key) {
MyLinearMap<K, V> map = chooseMap(key);
return map.remove(key);
}
Each method hashes a key, which is constant time, and then invokes a
method on a sub-map, which is constant time.
So far, so good. But the other core method, put , is a little
harder to analyze. When we don’t have to rehash, it is constant time,
but when we do, it’s linear. In that way, it’s similar to
ArrayList.add , which we analyzed in Section 3.2.
For the same reason, MyHashMap.put turns out to be
constant time if we average over a series of invocations.
Again, the argument is based on amortized analysis
(see Section 3.2).
Suppose the initial number of sub-maps, k, is 2, and the load
factor is 1. Now let’s see how much work it takes to put a
series of keys. As the basic “unit of work”, we’ll count the number of
times we have to hash a key and add it to a sub-map.
The first time we call put it takes 1 unit of work. The second
time also takes 1 unit. The third time we have to rehash, so it takes 2
units to rehash the existing keys and 1 unit to hash the new key. Now the size of the hash table is 4, so the next time we call
put , it takes 1 unit of work. But the next time we have to
rehash, which takes 4 units to rehash the existing keys and 1 unit to
hash the new key.
Figure 11.1 shows the pattern, with the normal work of hashing
a new key shown across the bottom and extra work of rehashing shown as a
tower.
Figure 11.1: Representation of the work done to add elements to a hash table. |
As the arrows suggest, if we knock down the towers, each one fills the
space before the next tower. The result is a uniform height of 2 units,
which shows that the average work per put is about 2 units. And
that means that put is constant time on average. This diagram also shows why it is important to double the number of
sub-maps, k, when we rehash. If we only add to k
instead of multiplying, the towers would be too close together and
they would start piling up. And that would not be constant
time.
11.3 The tradeoffs
We’ve shown that containsKey , get , and remove
are constant time, and put is constant time on average. We
should take a minute to appreciate how remarkable that is. The
performance of these operations is pretty much the same no matter how
big the hash table is. Well, sort of.
Remember that our analysis is based on a simple model of computation
where each “unit of work” takes the same amount of time. Real
computers are more complicated than that. In particular, they are
usually fastest when working with data structures small enough to fit in
cache; somewhat slower if the structure doesn’t fit in cache but still
fits in memory; and much slower if the structure doesn’t fit in
memory.
Another limitation of this implementation is that hashing doesn’t help
if we are given a value rather than a key: containsValue is
linear because it has to search all of the sub-maps. And there
is no particularly efficient way to look up a value and find the
corresponding key (or possibly keys).
And there’s one more limitation: some of the methods that were constant
time in MyLinearMap have become linear. For example: public void clear() {
for (int i=0; i<maps.size(); i++) {
maps.get(i).clear();
}
}
clear has to clear all of the sub-maps, and the number of
sub-maps is proportional to n, so it’s linear. Fortunately,
this operation is not used very often, so for most applications this
tradeoff is acceptable.
11.4 Profiling MyHashMap
Before we go on, we should check whether MyHashMap.put is really
constant time.
Run ant build to compile the source
files. Then run ant ProfileMapPut . It measures the run time of
HashMap.put (provided by Java) with a range of problem sizes,
and plots run time versus problem size on a log-log scale. If this
operation is constant time, the total time for n operations
should be linear, so the result should be a straight line with slope
1. When I ran this code, the estimated slope was close to 1, which is
consistent with our analysis. You should get something similar. Modify ProfileMapPut.java so it profiles your implementation,
MyHashMap , instead of Java’s HashMap . Run the
profiler again and see if the slope is near 1. You might have to
adjust startN and endMillis to find a range of
problem sizes where the run times are more than a few milliseconds, but
not more than a few thousand. When I ran this code, I got a surprise: the slope was about 1.7,
which suggests that this implementation is not constant time after all.
It contains a “performance bug”.
Before you read the next section, you should track down the error, fix
it, and confirm that put is now constant time, as expected.
11.5 Fixing MyHashMap
The problem with MyHashMap is in size , which is
inherited from MyBetterMap : public int size() {
int total = 0;
for (MyLinearMap<K, V> map: maps) {
total += map.size();
}
return total;
}
To add up the total size it has to iterate the sub-maps. Since we
increase the number of sub-maps, k, as the number of entries,
n, increases, k is proportional to n, so
size is linear.
And that makes put linear, too, because it uses size : public V put(K key, V value) {
V oldValue = super.put(key, value);
if (size() > maps.size() * FACTOR) {
rehash();
}
return oldValue;
}
Everything we did to make put constant time is wasted if
size is linear!
Fortunately, there is a simple solution, and we have seen it before: we
have to keep the number of entries in an instance variable and update it
whenever we call a method that changes it.
You’ll find my solution in the repository for this book, in
MyFixedHashMap.java . Here’s the beginning of the class definition: public class MyFixedHashMap<K, V> extends MyHashMap<K, V> implements Map<K, V> {
private int size = 0;
public void clear() {
super.clear();
size = 0;
}
Rather than modify MyHashMap , I define a new class that
extends it. It adds a new instance variable, size , which is
initialized to zero. Updating clear is straightforward; we invoke clear in
the superclass (which clears the sub-maps), and then update
size .
Updating remove and put is a little more difficult
because when we invoke the method on the superclass, we can’t tell
whether the size of the sub-map changed. Here’s how I worked around
that: public V remove(Object key) {
MyLinearMap<K, V> map = chooseMap(key);
size -= map.size();
V oldValue = map.remove(key);
size += map.size();
return oldValue;
}
remove uses chooseMap to find the right sub-map, then
subtracts away the size of the sub-map. It invokes remove on
the sub-map, which may or may not change the size of the sub-map,
depending on whether it finds the key. But either way, we add the new
size of the sub-map back to size , so the final value of
size is correct.
The rewritten version of put is similar: public V put(K key, V value) {
MyLinearMap<K, V> map = chooseMap(key);
size -= map.size();
V oldValue = map.put(key, value);
size += map.size();
if (size() > maps.size() * FACTOR) {
size = 0;
rehash();
}
return oldValue;
}
We have the same problem here: when we invoke put on the
sub-map, we don’t know whether it added a new entry. So we use the same
solution, subtracting off the old size and then adding in the new size.
Now the implementation of the size method is simple: public int size() {
return size;
}
And that’s pretty clearly constant time.
When I profiled this solution, I found that the total time for putting
n keys is proportional to n, which means that each put is
constant time, as it’s supposed to be.
11.6 UML class diagrams
One challenge of working with the code in this chapter is that we have
several classes that depend on each other. Here are some of the
relationships between the classes: MyLinearMap contains a LinkedList and implements
Map .
MyBetterMap contains many MyLinearMap objects and
implements Map .
MyHashMap extends MyBetterMap , so it also contains
MyLinearMap objects, and it implements Map .
MyFixedHashMap extends MyHashMap and
implements Map .
To help keep track of relationships like these, software engineers
often use UML class diagrams. UML stands for Unified Modeling
Language
(see http://thinkdast.com/uml).
A “class diagram” is one of several graphical standards defined by UML. In a class diagram, each class is represented by a box, and
relationships between classes are represented by
arrows. Figure 11.2 shows a UML class diagram for the classes
from the previous exercise, generated using the online tool yUML at
http://yuml.me/.
Figure 11.2: UML diagram for the classes in this chapter. |
Different relationships are represented by different arrows: - Arrows with a solid head indicate HAS-A relationships. For example,
each instance of
MyBetterMap contains multiple instances of
MyLinearMap , so they are connected by a solid arrow. - Arrows with a hollow head and a solid line indicate IS-A
relationships. For example,
MyHashMap extends
MyBetterMap , so they are connected by an IS-A arrow. - Arrows with a hollow head and a dashed line indicate that a class
implements an interface; in this diagram, every class implements
Map .
UML class diagrams provide a concise way to represent a lot of
information about a collection of classes. They are used during design
phases to communicate about alternative designs, during implementation
phases to maintain a shared mental map of the project, and during
deployment to document the design.
|