This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly.

You might prefer to read the PDF version.

Or you can buy this book on Amazon.com.

Chapter 13 Binary search tree

This chapter presents solutions to the previous exercise, then tests the performance of the tree-backed map. I present a problem with the implementation and explain how Java’s TreeMap solves it.

13.1 A simple `MyTreeMap`

In the previous exercise I gave you the outline of MyTreeMap and asked you to fill in the missing methods. Now I’ll present a solution, starting with findNode:

private Node findNode(Object target) {
    // some implementations can handle null as a key, but not this one
    if (target == null) {
            throw new IllegalArgumentException();
    }

    // something to make the compiler happy
    @SuppressWarnings("unchecked")
    Comparable<? super K> k = (Comparable<? super K>) target;

    // the actual search
    Node node = root;
    while (node != null) {
        int cmp = k.compareTo(node.key);
        if (cmp < 0)
            node = node.left;
        else if (cmp > 0)
            node = node.right;
        else
            return node;
    }
    return null;
}

findNode is a private method used by containsKey and get; it is not part of the Map interface. The parameter target is the key we’re looking for. I explained the first part of this method in the previous exercise:

In this implementation, null is not a legal value for a key.
Before we can invoke compareTo on target, we have to typecast it to some kind of Comparable. The “type wildcard” used here is as permissive as possible; that is, it works with any type that implements Comparable and whose compareTo method accepts K or any supertype of K.

After all that, the actual search is relatively simple. We initialize a loop variable node so it refers to the root node. Each time through the loop, we compare the target to node.key. If the target is less than the current key, we move to the left child. If it’s greater, we move to the right child. And if it’s equal, we return the current node.

If we get to the bottom of the tree without finding the target, we conclude that it is not in the tree and return null.

13.2 Searching for values

As I explained in the previous exercise, the run time of findNode is proportional to the height of the tree, not the number of nodes, because we don’t have to search the whole tree. But for containsValue, we have to search the values, not the keys; the BST property doesn’t apply to the values, so we have to search the whole tree.

My solution is recursive:

public boolean containsValue(Object target) {
    return containsValueHelper(root, target);
}

private boolean containsValueHelper(Node node, Object target) {
    if (node == null) {
        return false;
    }
    if (equals(target, node.value)) {
        return true;
    }
    if (containsValueHelper(node.left, target)) {
        return true;
    }
    if (containsValueHelper(node.right, target)) {
        return true;
    }
    return false;
}

containsValue takes the target value as a parameter and immediately invokes containsValueHelper, passing the root of the tree as an additional parameter.

Here’s how containsValueHelper works:

The first if statement checks the base case of the recursion. If node is null, that means we have recursed to the bottom of the tree without finding the target, so we should return false. Note that this only means that the target did not appear on one path through the tree; it is still possible that it will be found on another.
The second case checks whether we’ve found what we’re looking for. If so, we return true. Otherwise, we have to keep going.
The third case makes a recursive call to search for target in the left subtree. If we find it, we can return true immediately, without searching the right subtree. Otherwise, we keep going.
The fourth case searches the right subtree. Again, if we find what we are looking for, we return true. Otherwise, having searched the whole tree, we return false.

This method “visits” every node in the tree, so it takes time proportional to the number of nodes.

13.3 Implementing put

The put method is a little more complicated than get because it has to deal with two cases: (1) if the given key is already in the tree, it replaces and returns the old value; (2) otherwise it has to add a new node to the tree, in the right place.

In the previous exercise, I provided this starter code:

public V put(K key, V value) {
    if (key == null) {
        throw new IllegalArgumentException();
    }
    if (root == null) {
        root = new Node(key, value);
        size++;
        return null;
    }
    return putHelper(root, key, value);
}

And asked you to fill in putHelper. Here’s my solution:

private V putHelper(Node node, K key, V value) {
    Comparable<? super K> k = (Comparable<? super K>) key;
    int cmp = k.compareTo(node.key);

    if (cmp < 0) {
        if (node.left == null) {
            node.left = new Node(key, value);
            size++;
            return null;
        } else {
            return putHelper(node.left, key, value);
        }
    }
    if (cmp > 0) {
        if (node.right == null) {
            node.right = new Node(key, value);
            size++;
            return null;
        } else {
            return putHelper(node.right, key, value);
        }
    }
    V oldValue = node.value;
    node.value = value;
    return oldValue;
}

The first parameter, node, is initially the root of the tree, but each time we make a recursive call, it refers to a different subtree. As in get, we use the compareTo method to figure out which path to follow through the tree. If cmp < 0, the key we’re adding is less than node.key, so we want to look in the left subtree. There are two cases:

If the left subtree is empty, that is, if node.left is null, we have reached the bottom of the tree without finding key. At this point, we know that key isn’t in the tree, and we know where it should go. So we create a new node and add it as the left child of node.
Otherwise we make a recursive call to search the left subtree.

If cmp > 0, the key we’re adding is greater than node.key, so we want to look in the right subtree. And we handle the same two cases as in the previous branch. Finally, if cmp == 0, we found the key in the tree, so we replace and return the old value.

I wrote this method recursively to make it more readable, but it would be straightforward to rewrite it iteratively, which you might want to do as an exercise.

13.4 In-order traversal

The last method I asked you to write is keySet, which returns a Set that contains the keys from the tree in ascending order. In other implementations of Map, the keys returned by keySet are in no particular order, but one of the capabilities of the tree implementation is that it is simple and efficient to sort the keys. So we should take advantage of that.

Here’s my solution:

public Set<K> keySet() {
    Set<K> set = new LinkedHashSet<K>();
    addInOrder(root, set);
    return set;
}

private void addInOrder(Node node, Set<K> set) {
    if (node == null) return;
    addInOrder(node.left, set);
    set.add(node.key);
    addInOrder(node.right, set);        
}

In keySet, we create a LinkedHashSet, which is a Set implementation that keeps the elements in order (unlike most other Set implementations). Then we call addInOrder to traverse the tree.

The first parameter, node, is initially the root of the tree, but as you should expect by now, we use it to traverse the tree recursively. addInOrder performs a classic “in-order traversal” of the tree.

If node is null, that means the subtree is empty, so we return without adding anything to set. Otherwise we:

Traverse the left subtree in order.
Add node.key.
Traverse the right subtree in order.

Remember that the BST property guarantees that all nodes in the left subtree are less than node.key, and all nodes in the right subtree are greater. So we know that node.key has been added in the correct order.

Applying the same argument recursively, we know that the elements from the left subtree are in order, as well as the elements from the right subtree. And the base case is correct: if the subtree is empty, no keys are added. So we can conclude that this method adds all keys in the correct order.

Because this method visits every node in the tree, like containsValue, it takes time proportional to n.

13.5 The logarithmic methods

In MyTreeMap, the methods get and put take time proportional to the height of the tree, h. In the previous exercise, we showed that if the tree is full — if every level of the tree contains the maximum number of nodes — the height of the tree is proportional to logn.

And I claimed that get and put are logarithmic; that is, they take time proportional to logn. But for most applications, there’s no guarantee that the tree is full. In general, the shape of the tree depends on the keys and what order they are added.

To see how this works out in practice, we’ll test our implementation with two sample datasets: a list of random strings and a list of timestamps in increasing order.

Here’s the code that generates random strings:

Map<String, Integer> map = new MyTreeMap<String, Integer>();

for (int i=0; i<n; i++) {
    String uuid = UUID.randomUUID().toString();
    map.put(uuid, 0);
}

UUID is a class in the java.util package that can generate a random “universally unique identifier”. UUIDs are useful for a variety of applications, but in this example we’re taking advantage of an easy way to generate random strings.

I ran this code with n=16384 and measured the run time and the height of the final tree. Here’s the output:

Time in milliseconds = 151
Final size of MyTreeMap = 16384
log base 2 of size of MyTreeMap = 14.0
Final height of MyTreeMap = 33

I included “log base 2 of size of MyTreeMap” to see how tall the tree would be if it were full. The result indicates that a full tree with height 14 would contain 16,384 nodes.

The actual tree of random strings has height 33, which is substantially more than the theoretical minimum, but not too bad. To find one key in a collection of 16,384, we only have to make 33 comparisons. Compared to a linear search, that’s almost 500 times faster.

This performance is typical with random strings or other keys that are added in no particular order. The final height of the tree might be 2-3 times the theoretical minimum, but it is still proportional to logn, which is much less than n. In fact, logn grows so slowly as n increases, it can be difficult to distinguish logarithmic time from constant time in practice.

However, binary search trees don’t always behave so well. Let’s see what happens when we add keys in increasing order. Here’s an example that measures timestamps in nanoseconds and uses them as keys:

MyTreeMap<String, Integer> map = new MyTreeMap<String, Integer>();

for (int i=0; i<n; i++) {
    String timestamp = Long.toString(System.nanoTime());
    map.put(timestamp, 0);
}

System.nanoTime returns an integer with type long that indicates elapsed time in nanoseconds. Each time we call it, we get a somewhat bigger number. When we convert these timestamps to strings, they appear in increasing alphabetical order.

And let’s see what happens when we run it:

Time in milliseconds = 1158
Final size of MyTreeMap = 16384
log base 2 of size of MyTreeMap = 14.0
Final height of MyTreeMap = 16384

The run time is more than seven times longer than in the previous case. longer. If you wonder why, take a look at the final height of the tree: 16384!

Figure 13.1: Binary search trees, balanced (left) and unbalanced (right).

If you think about how put works, you can figure out what’s going on. Every time we add a new key, it’s larger than all of the keys in the tree, so we always choose the right subtree, and always add the new node as the right child of the rightmost node. The result is an “unbalanced” tree that only contains right children.

The height of this tree is proportional to n, not logn, so the performance of get and put is linear, not logarithmic.

Figure 13.1 shows an example of a balanced and unbalanced tree. In the balanced tree, the height is 4 and the total number of nodes is 2⁴−1 = 15. In the unbalanced tree with the same number of nodes, the height is 15.

13.6 Self-balancing trees

There are two possible solutions to this problem:

You could avoid adding keys to the Map in order. But this is not always possible.
You could make a tree that does a better job of handling keys if they happen to be in order.

The second solution is better, and there are several ways to do it. The most common is to modify put so that it detects when the tree is starting to become unbalanced and, if so, rearranges the nodes. Trees with this capability are called “self-balancing”. Common self-balancing trees include the AVL tree (“AVL” are the initials of the inventors), and the red-black tree, which is what the Java TreeMap uses.

In our example code, if we replace MyTreeMap with the Java TreeMap, the run times are about the same for the random strings and the timestamps. In fact, the timestamps run faster, even though they are in order, probably because they take less time to hash.

In summary, a binary search tree can implement get and put in logarithmic time, but only if the keys are added in an order that keeps the tree sufficiently balanced. Self-balancing trees avoid this problem by doing some additional work each time a new key is added.

You can read more about self-balancing trees at http://thinkdast.com/balancing.

13.7 One more exercise

In the previous exercise you didn’t have to implement remove, but you might want to try it. If you remove a node from the middle of the tree, you have to rearrange the remaining nodes to restore the BST property. You can probably figure out how to do that on your own, or you can read the explanation at http://thinkdast.com/bstdel.

Removing a node and rebalancing a tree are similar operations: if you do this exercise, you will have a better idea of how self-balancing trees work.