This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly. You might prefer to read the PDF version. Or you can buy this book on Amazon.com. Chapter 11 HashMapIn the previous chapter, we wrote an implementation of the
If there are n entries and k sub-maps, the size of the sub-maps is n/k on average, which is still proportional to n. But if we increase k along with n, we can limit the size of n/k. For example, suppose we double k every time n exceeds k; in that case the number of entries per map would be less than 1 on average, and pretty much always less than 10, as long as the hash function spreads out the keys reasonably well. If the number of entries per sub-map is constant, we can search a single
sub-map in constant time. And computing the hash function is generally
constant time (it might depend on the size of the key, but does not
depend on the number of keys). That makes the core In the next exercise, you’ll see the details. 11.1 Exercise 9In public class MyHashMap<K, V> extends MyBetterMap<K, V> implements Map<K, V> { // average number of entries per sub-map before we rehash private static final double FACTOR = 1.0; @Override public V put(K key, V value) { V oldValue = super.put(key, value); // check if the number of elements per sub-map exceeds the threshold if (size() > maps.size() * FACTOR) { rehash(); } return oldValue; } }
The constant Run Fill in the body of 11.2 Analyzing
|
Figure 11.1: Representation of the work done to add elements to a hash table. |
As the arrows suggest, if we knock down the towers, each one fills the
space before the next tower. The result is a uniform height of 2 units,
which shows that the average work per put
is about 2 units. And
that means that put
is constant time on average.
This diagram also shows why it is important to double the number of sub-maps, k, when we rehash. If we only add to k instead of multiplying, the towers would be too close together and they would start piling up. And that would not be constant time.
We’ve shown that containsKey
, get
, and remove
are constant time, and put
is constant time on average. We
should take a minute to appreciate how remarkable that is. The
performance of these operations is pretty much the same no matter how
big the hash table is. Well, sort of.
Remember that our analysis is based on a simple model of computation where each “unit of work” takes the same amount of time. Real computers are more complicated than that. In particular, they are usually fastest when working with data structures small enough to fit in cache; somewhat slower if the structure doesn’t fit in cache but still fits in memory; and much slower if the structure doesn’t fit in memory.
Another limitation of this implementation is that hashing doesn’t help
if we are given a value rather than a key: containsValue
is
linear because it has to search all of the sub-maps. And there
is no particularly efficient way to look up a value and find the
corresponding key (or possibly keys).
And there’s one more limitation: some of the methods that were constant
time in MyLinearMap
have become linear. For example:
public void clear() { for (int i=0; i<maps.size(); i++) { maps.get(i).clear(); } }
clear
has to clear all of the sub-maps, and the number of
sub-maps is proportional to n, so it’s linear. Fortunately,
this operation is not used very often, so for most applications this
tradeoff is acceptable.
MyHashMap
Before we go on, we should check whether MyHashMap.put
is really
constant time.
Run ant build
to compile the source
files. Then run ant ProfileMapPut
. It measures the run time of
HashMap.put
(provided by Java) with a range of problem sizes,
and plots run time versus problem size on a log-log scale. If this
operation is constant time, the total time for n operations
should be linear, so the result should be a straight line with slope
1. When I ran this code, the estimated slope was close to 1, which is
consistent with our analysis. You should get something similar.
Modify ProfileMapPut.java
so it profiles your implementation,
MyHashMap
, instead of Java’s HashMap
. Run the
profiler again and see if the slope is near 1. You might have to
adjust startN
and endMillis
to find a range of
problem sizes where the run times are more than a few milliseconds, but
not more than a few thousand.
When I ran this code, I got a surprise: the slope was about 1.7, which suggests that this implementation is not constant time after all. It contains a “performance bug”.
Before you read the next section, you should track down the error, fix
it, and confirm that put
is now constant time, as expected.
MyHashMap
The problem with MyHashMap
is in size
, which is
inherited from MyBetterMap
:
public int size() { int total = 0; for (MyLinearMap<K, V> map: maps) { total += map.size(); } return total; }
To add up the total size it has to iterate the sub-maps. Since we
increase the number of sub-maps, k, as the number of entries,
n, increases, k is proportional to n, so
size
is linear.
And that makes put
linear, too, because it uses size
:
public V put(K key, V value) { V oldValue = super.put(key, value); if (size() > maps.size() * FACTOR) { rehash(); } return oldValue; }
Everything we did to make put
constant time is wasted if
size
is linear!
Fortunately, there is a simple solution, and we have seen it before: we have to keep the number of entries in an instance variable and update it whenever we call a method that changes it.
You’ll find my solution in the repository for this book, in
MyFixedHashMap.java
. Here’s the beginning of the class definition:
public class MyFixedHashMap<K, V> extends MyHashMap<K, V> implements Map<K, V> { private int size = 0; public void clear() { super.clear(); size = 0; }
Rather than modify MyHashMap
, I define a new class that
extends it. It adds a new instance variable, size
, which is
initialized to zero.
Updating clear
is straightforward; we invoke clear
in
the superclass (which clears the sub-maps), and then update
size
.
Updating remove
and put
is a little more difficult
because when we invoke the method on the superclass, we can’t tell
whether the size of the sub-map changed. Here’s how I worked around
that:
public V remove(Object key) { MyLinearMap<K, V> map = chooseMap(key); size -= map.size(); V oldValue = map.remove(key); size += map.size(); return oldValue; }
remove
uses chooseMap
to find the right sub-map, then
subtracts away the size of the sub-map. It invokes remove
on
the sub-map, which may or may not change the size of the sub-map,
depending on whether it finds the key. But either way, we add the new
size of the sub-map back to size
, so the final value of
size
is correct.
The rewritten version of put
is similar:
public V put(K key, V value) { MyLinearMap<K, V> map = chooseMap(key); size -= map.size(); V oldValue = map.put(key, value); size += map.size(); if (size() > maps.size() * FACTOR) { size = 0; rehash(); } return oldValue; }
We have the same problem here: when we invoke put
on the
sub-map, we don’t know whether it added a new entry. So we use the same
solution, subtracting off the old size and then adding in the new size.
Now the implementation of the size
method is simple:
public int size() { return size; }
And that’s pretty clearly constant time.
When I profiled this solution, I found that the total time for putting
n keys is proportional to n, which means that each put
is
constant time, as it’s supposed to be.
One challenge of working with the code in this chapter is that we have several classes that depend on each other. Here are some of the relationships between the classes:
MyLinearMap
contains a LinkedList
and implements
Map
.
MyBetterMap
contains many MyLinearMap
objects and
implements Map
.
MyHashMap
extends MyBetterMap
, so it also contains
MyLinearMap
objects, and it implements Map
.
MyFixedHashMap
extends MyHashMap
and
implements Map
.
To help keep track of relationships like these, software engineers often use UML class diagrams. UML stands for Unified Modeling Language (see http://thinkdast.com/uml). A “class diagram” is one of several graphical standards defined by UML.
In a class diagram, each class is represented by a box, and relationships between classes are represented by arrows. Figure 11.2 shows a UML class diagram for the classes from the previous exercise, generated using the online tool yUML at http://yuml.me/.
Different relationships are represented by different arrows:
MyBetterMap
contains multiple instances of
MyLinearMap
, so they are connected by a solid arrow.MyHashMap
extends
MyBetterMap
, so they are connected by an IS-A arrow.Map
.UML class diagrams provide a concise way to represent a lot of information about a collection of classes. They are used during design phases to communicate about alternative designs, during implementation phases to maintain a shared mental map of the project, and during deployment to document the design.