This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book.
In particular, some of the symbols are not rendered correctly.
You might prefer to read the PDF version.
Or you can buy this book on Amazon.com.
Chapter 10 Hashing
In this chapter, I define
MyBetterMap , a better implementation of the Map interface
than MyLinearMap , and introduce
hashing, which makes MyBetterMap more efficient.
10.1 Hashing
To improve the performance of MyLinearMap , we’ll write a new
class, called MyBetterMap , that contains a collection of
MyLinearMap objects. It divides the keys among the embedded
maps, so the number of entries in each map is smaller, which speeds up
findEntry and the methods that depend on it. Here’s the beginning of the class definition: public class MyBetterMap<K, V> implements Map<K, V> {
protected List<MyLinearMap<K, V>> maps;
public MyBetterMap(int k) {
makeMaps(k);
}
protected void makeMaps(int k) {
maps = new ArrayList<MyLinearMap<K, V>>(k);
for (int i=0; i<k; i++) {
maps.add(new MyLinearMap<K, V>());
}
}
}
The instance variable, maps , is a collection of
MyLinearMap objects. The constructor takes a parameter,
k , that determines how many maps to use, at least initially.
Then makeMaps creates the embedded maps and stores them in an
ArrayList .
Now, the key to making this work is that we need some way to look at a
key and decide which of the embedded maps it should go into. When we
put a new key, we choose one of the maps; when we get
the same key, we have to remember where we put it.
One possibility is to choose one of the sub-maps at random and keep
track of where we put each key. But how should we keep track? It might
seem like we could use a Map to look up the key and find the
right sub-map, but the whole point of the exercise is to write an
efficient implementation of a Map . We can’t assume we already
have one. A better approach is to use a hash function, which takes an
Object , any Object , and returns an integer called a
hash code. Importantly, if it sees the same Object
more than once, it always returns the same hash code. That way, if we
use the hash code to store a key, we’ll get the same hash code when we
look it up.
In Java, every Object provides a method called
hashCode that computes a hash function. The implementation of
this method is different for different objects; we’ll see an example
soon.
Here’s a helper method that chooses the right sub-map for a
given key: protected MyLinearMap<K, V> chooseMap(Object key) {
int index = 0;
if (key != null) {
index = Math.abs(key.hashCode()) % maps.size();
}
return maps.get(index);
}
If key is null , we choose the sub-map with index 0,
arbitrarily. Otherwise we use hashCode to get an integer,
apply Math.abs to make sure it is non-negative,
then use the remainder operator, \% , which guarantees that the
result is between 0 and maps.size()-1 . So index is
always a valid index into maps . Then chooseMap returns
a reference to the map it chose.
We use chooseMap in both put and get , so when
we look up a key, we get the same map we chose when we added the key. At
least, we should — I’ll explain a little later why this might not
work. Here’s my implementation of put and get : public V put(K key, V value) {
MyLinearMap<K, V> map = chooseMap(key);
return map.put(key, value);
}
public V get(Object key) {
MyLinearMap<K, V> map = chooseMap(key);
return map.get(key);
}
Pretty simple, right? In both methods, we use chooseMap to find
the right sub-map and then invoke a method on the sub-map.
That’s how it works; now let’s think about performance.
If there are n entries split up among k sub-maps,
there will be n/k entries per map, on average. When we look up
a key, we have to compute its hash code, which takes some time, then we
search the corresponding sub-map. Because the entry lists in
MyBetterMap are k times shorter than the entry list in
MyLinearMap , we expect the search to be k times
faster. But the run time is still proportional to n, so
MyBetterMap is still linear. In the next exercise, you’ll see how we
can fix that.
10.2 How does hashing work?
The fundamental requirement for a hash function is that the same object
should produce the same hash code every time. For immutable objects,
that’s relatively easy. For objects with mutable state, we have to think
harder.
As an example of an immutable object, I’ll define a class called
SillyString that encapsulates a String : public class SillyString {
private final String innerString;
public SillyString(String innerString) {
this.innerString = innerString;
}
public String toString() {
return innerString;
}
This class is not very useful, which is why it’s called
SillyString , but I’ll use it to show how a class can define
its own hash function: @Override
public boolean equals(Object other) {
return this.toString().equals(other.toString());
}
@Override
public int hashCode() {
int total = 0;
for (int i=0; i<innerString.length(); i++) {
total += innerString.charAt(i);
}
return total;
}
Notice that SillyString overrides both equals and
hashCode . This is important. In order to work properly,
equals has to be consistent with hashCode , which means
that if two objects are considered equal — that is, equals
returns true — they should have the same hash code. But this
requirement only works one way; if two objects have the same hash code,
they don’t necessarily have to be equal.
equals works by invoking toString , which returns
innerString . So two SillyString objects are equal if
their innerString instance variables are equal.
hashCode works by iterating through the characters in the
String and adding them up. When you add a character to an int ,
Java converts the character to an integer using its Unicode code point.
You don’t need to know anything about Unicode to understand this
example, but if you are curious, you can read more at
http://thinkdast.com/codepoint.
This hash function satisfies the requirement: if two
SillyString objects contain embedded strings that are equal,
they will get the same hash code. This works correctly, but it might not yield good performance,
because it returns the same hash code for many different strings. If two
strings contain the same letters in any order, they will have the same
hash code. And even if they don’t contain the same letters, they might
yield the same total, like "ac" and "bb" . If many objects have the same hash code, they end up in the same
sub-map. If some sub-maps have more entries than others, the speedup
when we have k maps might be much less than k. So one of the goals
of a hash function is to be uniform; that is, it should be equally
likely to produce any value in the range. You can read more about
designing good hash functions at
http://thinkdast.com/hash.
10.3 Hashing and mutation
String s are immutable, and SillyString is also immutable
because innerString is declared to be final . Once you
create a SillyString , you can’t make innerString refer
to a different String , and you can’t modify the String it
refers to. Therefore, it will always have the same hash code.
But let’s see what happens with a mutable object. Here’s a definition
for SillyArray , which is identical to SillyString ,
except that it uses an array of characters instead of a String : public class SillyArray {
private final char[] array;
public SillyArray(char[] array) {
this.array = array;
}
public String toString() {
return Arrays.toString(array);
}
@Override
public boolean equals(Object other) {
return this.toString().equals(other.toString());
}
@Override
public int hashCode() {
int total = 0;
for (int i=0; i<array.length; i++) {
total += array[i];
}
System.out.println(total);
return total;
}
SillyArray also provides setChar , which makes it
possible to modify the characters in the array:
public void setChar(int i, char c) {
this.array[i] = c;
}
Now suppose we create a SillyArray and add it to a map: SillyArray array1 = new SillyArray("Word1".toCharArray());
map.put(array1, 1);
The hash code for this array is 461. Now if we modify the contents of
the array and then try to look it up, like this: array1.setChar(0, 'C');
Integer value = map.get(array1);
the hash code after the mutation is 441. With a different hash code,
there’s a good chance we’ll go looking in the wrong sub-map. In that
case, we won’t find the key, even though it is in the map. And that’s
bad.
In general, it is dangerous to use mutable objects as keys in data
structures that use hashing, which includes MyBetterMap and
HashMap . If you can guarantee that the keys won’t be modified
while they are in the map, or that any changes won’t affect the hash
code, it might be OK. But it is probably a good idea to avoid it.
10.4 Exercise 8
In this exercise, you will finish off the implementation of
MyBetterMap . In the repository for this book,
you’ll find the source files for this exercise: MyLinearMap.java contains our solution to the previous exercise,
which we will build on in this exercise.MyBetterMap.java contains the code from the previous chapter
with some methods you will fill in.MyHashMap.java contains the outline of a hash table that
grows when needed, which you will complete.MyLinearMapTest.java contains the unit tests for
MyLinearMap .MyBetterMapTest.java contains the unit tests for
MyBetterMap .MyHashMapTest.java contains the unit tests for
MyHashMap .Profiler.java contains code for measuring and plotting
run time versus problem size.ProfileMapPut.java contains code that profiles the
Map.put method.
As usual, you should run ant build to compile the source
files. Then run ant MyBetterMapTest . Several tests should fail,
because you have some work to do!
Review the implementation of put and get from the
previous chapter. Then fill in the body of containsKey . HINT:
use chooseMap . Run ant MyBetterMapTest again and confirm
that testContainsKey passes.
Fill in the body of containsValue . HINT: don’t use
chooseMap . Run ant MyBetterMapTest again and confirm
that testContainsValue passes. Notice that we have to do more
work to find a value than to find a key. Like put and get , this implementation of
containsKey is linear, because it has to search one of the
embedded sub-maps. In the next chapter, we’ll see how we can
improve this implementation even more.
|