This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly.

Or you can buy this book on Amazon.com.

# Chapter 10  Hashing

In this chapter, I define `MyBetterMap`, a better implementation of the `Map` interface than `MyLinearMap`, and introduce hashing, which makes `MyBetterMap` more efficient.

## 10.1  Hashing

To improve the performance of `MyLinearMap`, we’ll write a new class, called `MyBetterMap`, that contains a collection of `MyLinearMap` objects. It divides the keys among the embedded maps, so the number of entries in each map is smaller, which speeds up `findEntry` and the methods that depend on it.

Here’s the beginning of the class definition:

```public class MyBetterMap<K, V> implements Map<K, V> {

protected List<MyLinearMap<K, V>> maps;

public MyBetterMap(int k) {
makeMaps(k);
}

protected void makeMaps(int k) {
maps = new ArrayList<MyLinearMap<K, V>>(k);
for (int i=0; i<k; i++) {
}
}
}
```

The instance variable, `maps`, is a collection of `MyLinearMap` objects. The constructor takes a parameter, `k`, that determines how many maps to use, at least initially. Then `makeMaps` creates the embedded maps and stores them in an `ArrayList`.

Now, the key to making this work is that we need some way to look at a key and decide which of the embedded maps it should go into. When we `put` a new key, we choose one of the maps; when we `get` the same key, we have to remember where we put it.

One possibility is to choose one of the sub-maps at random and keep track of where we put each key. But how should we keep track? It might seem like we could use a `Map` to look up the key and find the right sub-map, but the whole point of the exercise is to write an efficient implementation of a `Map`. We can’t assume we already have one.

A better approach is to use a hash function, which takes an `Object`, any `Object`, and returns an integer called a hash code. Importantly, if it sees the same `Object` more than once, it always returns the same hash code. That way, if we use the hash code to store a key, we’ll get the same hash code when we look it up.

In Java, every `Object` provides a method called `hashCode` that computes a hash function. The implementation of this method is different for different objects; we’ll see an example soon.

Here’s a helper method that chooses the right sub-map for a given key:

```protected MyLinearMap<K, V> chooseMap(Object key) {
int index = 0;
if (key != null) {
index = Math.abs(key.hashCode()) % maps.size();
}
return maps.get(index);
}
```

If `key` is `null`, we choose the sub-map with index 0, arbitrarily. Otherwise we use `hashCode` to get an integer, apply `Math.abs` to make sure it is non-negative, then use the remainder operator, `\%`, which guarantees that the result is between 0 and `maps.size()-1`. So `index` is always a valid index into `maps`. Then `chooseMap` returns a reference to the map it chose.

We use `chooseMap` in both `put` and `get`, so when we look up a key, we get the same map we chose when we added the key. At least, we should — I’ll explain a little later why this might not work.

Here’s my implementation of `put` and `get`:

```public V put(K key, V value) {
MyLinearMap<K, V> map = chooseMap(key);
return map.put(key, value);
}

public V get(Object key) {
MyLinearMap<K, V> map = chooseMap(key);
return map.get(key);
}
```

Pretty simple, right? In both methods, we use `chooseMap` to find the right sub-map and then invoke a method on the sub-map. That’s how it works; now let’s think about performance.

If there are n entries split up among k sub-maps, there will be n/k entries per map, on average. When we look up a key, we have to compute its hash code, which takes some time, then we search the corresponding sub-map.

Because the entry lists in `MyBetterMap` are k times shorter than the entry list in `MyLinearMap`, we expect the search to be k times faster. But the run time is still proportional to n, so `MyBetterMap` is still linear. In the next exercise, you’ll see how we can fix that.

## 10.2  How does hashing work?

The fundamental requirement for a hash function is that the same object should produce the same hash code every time. For immutable objects, that’s relatively easy. For objects with mutable state, we have to think harder.

As an example of an immutable object, I’ll define a class called `SillyString` that encapsulates a `String`:

```public class SillyString {
private final String innerString;

public SillyString(String innerString) {
this.innerString = innerString;
}

public String toString() {
return innerString;
}
```

This class is not very useful, which is why it’s called `SillyString`, but I’ll use it to show how a class can define its own hash function:

```    @Override
public boolean equals(Object other) {
return this.toString().equals(other.toString());
}

@Override
public int hashCode() {
int total = 0;
for (int i=0; i<innerString.length(); i++) {
total += innerString.charAt(i);
}
}
```

Notice that `SillyString` overrides both `equals` and `hashCode`. This is important. In order to work properly, `equals` has to be consistent with `hashCode`, which means that if two objects are considered equal — that is, `equals` returns `true` — they should have the same hash code. But this requirement only works one way; if two objects have the same hash code, they don’t necessarily have to be equal.

`equals` works by invoking `toString`, which returns `innerString`. So two `SillyString` objects are equal if their `innerString` instance variables are equal.

`hashCode` works by iterating through the characters in the `String` and adding them up. When you add a character to an `int`, Java converts the character to an integer using its Unicode code point. You don’t need to know anything about Unicode to understand this example, but if you are curious, you can read more at http://thinkdast.com/codepoint.

This hash function satisfies the requirement: if two `SillyString` objects contain embedded strings that are equal, they will get the same hash code.

This works correctly, but it might not yield good performance, because it returns the same hash code for many different strings. If two strings contain the same letters in any order, they will have the same hash code. And even if they don’t contain the same letters, they might yield the same total, like `"ac"` and `"bb"`.

If many objects have the same hash code, they end up in the same sub-map. If some sub-maps have more entries than others, the speedup when we have k maps might be much less than k. So one of the goals of a hash function is to be uniform; that is, it should be equally likely to produce any value in the range. You can read more about designing good hash functions at http://thinkdast.com/hash.

## 10.3  Hashing and mutation

`String`s are immutable, and `SillyString` is also immutable because `innerString` is declared to be `final`. Once you create a `SillyString`, you can’t make `innerString` refer to a different `String`, and you can’t modify the `String` it refers to. Therefore, it will always have the same hash code.

But let’s see what happens with a mutable object. Here’s a definition for `SillyArray`, which is identical to `SillyString`, except that it uses an array of characters instead of a `String`:

```public class SillyArray {
private final char[] array;

public SillyArray(char[] array) {
this.array = array;
}

public String toString() {
return Arrays.toString(array);
}

@Override
public boolean equals(Object other) {
return this.toString().equals(other.toString());
}

@Override
public int hashCode() {
int total = 0;
for (int i=0; i<array.length; i++) {
total += array[i];
}
System.out.println(total);
}
```

`SillyArray` also provides `setChar`, which makes it possible to modify the characters in the array:

```public void setChar(int i, char c) {
this.array[i] = c;
}
```

Now suppose we create a `SillyArray` and add it to a map:

```SillyArray array1 = new SillyArray("Word1".toCharArray());
map.put(array1, 1);
```

The hash code for this array is 461. Now if we modify the contents of the array and then try to look it up, like this:

```array1.setChar(0, 'C');
Integer value = map.get(array1);
```

the hash code after the mutation is 441. With a different hash code, there’s a good chance we’ll go looking in the wrong sub-map. In that case, we won’t find the key, even though it is in the map. And that’s bad.

In general, it is dangerous to use mutable objects as keys in data structures that use hashing, which includes `MyBetterMap` and `HashMap`. If you can guarantee that the keys won’t be modified while they are in the map, or that any changes won’t affect the hash code, it might be OK. But it is probably a good idea to avoid it.

## 10.4  Exercise 8

In this exercise, you will finish off the implementation of `MyBetterMap`. In the repository for this book, you’ll find the source files for this exercise:

• `MyLinearMap.java` contains our solution to the previous exercise, which we will build on in this exercise.
• `MyBetterMap.java` contains the code from the previous chapter with some methods you will fill in.
• `MyHashMap.java` contains the outline of a hash table that grows when needed, which you will complete.
• `MyLinearMapTest.java` contains the unit tests for `MyLinearMap`.
• `MyBetterMapTest.java` contains the unit tests for `MyBetterMap`.
• `MyHashMapTest.java` contains the unit tests for `MyHashMap`.
• `Profiler.java` contains code for measuring and plotting run time versus problem size.
• `ProfileMapPut.java` contains code that profiles the `Map.put` method.

As usual, you should run `ant build` to compile the source files. Then run `ant MyBetterMapTest`. Several tests should fail, because you have some work to do!

Review the implementation of `put` and `get` from the previous chapter. Then fill in the body of `containsKey`. HINT: use `chooseMap`. Run `ant MyBetterMapTest` again and confirm that `testContainsKey` passes.

Fill in the body of `containsValue`. HINT: don’t use `chooseMap`. Run `ant MyBetterMapTest` again and confirm that `testContainsValue` passes. Notice that we have to do more work to find a value than to find a key.

Like `put` and `get`, this implementation of `containsKey` is linear, because it has to search one of the embedded sub-maps. In the next chapter, we’ll see how we can improve this implementation even more.

#### Are you using one of our books in a class?

We'd like to know about it. Please consider filling out this short survey.