

Chapter 6  Tree traversal

This chapter introduces the application we will develop during the rest of the book, a web search engine. I describe the elements of a search engine and introduce the first application, a Web crawler that downloads and parses pages from Wikipedia. This chapter also presents a recursive implementation of depth-first search and an iterative implementation that uses a Java Deque to implement a “last in, first out” stack.

6.1  Search engines

A web search engine, like Google Search or Bing, takes a set of “search terms” and returns a list of web pages that are relevant to those terms (I’ll discuss what “relevant” means later). You can read more at http://thinkdast.com/searcheng, but I’ll explain what you need as we go along.

The essential components of a search engine are:

  • Crawling: We’ll need a program that can download a web page, parse it, and extract the text and any links to other pages.
  • Indexing: We’ll need a data structure that makes it possible to look up a search term and find the pages that contain it.
  • Retrieval: And we’ll need a way to collect results from the Index and identify pages that are most relevant to the search terms.

We’ll start with the crawler. The goal of a crawler is to discover and download a set of web pages. For search engines like Google and Bing, the goal is to find all web pages, but often crawlers are limited to a smaller domain. In our case, we will only read pages from Wikipedia.

As a first step, we’ll build a crawler that reads a Wikipedia page, finds the first link, follows the link to another page, and repeats. We will use this crawler to test the “Getting to Philosophy” conjecture, which states:

Clicking on the first lowercase link in the main text of a Wikipedia article, and then repeating the process for subsequent articles, usually eventually gets one to the Philosophy article.

This conjecture is stated at http://thinkdast.com/getphil, and you can read its history there.

Testing the conjecture will allow us to build the basic pieces of a crawler without having to crawl the entire web, or even all of Wikipedia. And I think the exercise is kind of fun!

In a few chapters, we’ll work on the indexer, and then we’ll get to the retriever.

6.2  Parsing HTML

When you download a web page, the contents are written in HyperText Markup Language, aka HTML. For example, here is a minimal HTML document:

<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>

The phrases “This is a title” and “Hello world!” are the text that actually appears on the page; the other elements are tags that indicate how the text should be displayed.

When our crawler downloads a page, it will need to parse the HTML in order to extract the text and find the links. To do that, we’ll use jsoup, which is an open-source Java library that downloads and parses HTML.

The result of parsing HTML is a Document Object Model tree, or DOM tree, that contains the elements of the document, including text and tags. The tree is a linked data structure made up of nodes; the nodes represent text, tags, and other document elements.

The relationships between the nodes are determined by the structure of the document. In the example above, the first node, called the root, is the <html> tag, which contains links to the two nodes it contains, <head> and <body>; these nodes are the children of the root node.

The <head> node has one child, <title>, and the <body> node has one child, <p> (which stands for “paragraph”). Figure 6.1 represents this tree graphically.


Figure 6.1: DOM tree for a simple HTML page.

Each node contains links to its children; in addition, each node contains a link to its parent, so from any node it is possible to navigate up and down the tree. The DOM tree for real pages is usually more complicated than this example.

Most web browsers provide tools for inspecting the DOM of the page you are viewing. In Chrome, you can right-click on any part of a web page and select “Inspect” from the menu that pops up. In Firefox, you can right-click and select “Inspect Element” from the menu. Safari provides a tool called Web Inspector, which you can read about at http://thinkdast.com/safari. For Internet Explorer, you can read the instructions at http://thinkdast.com/explorer.


Figure 6.2: Screenshot of the Chrome DOM Inspector.

Figure 6.2 shows a screenshot of the DOM for the Wikipedia page on Java, http://thinkdast.com/java. The element that’s highlighted is the first paragraph of the main text of the article, which is contained in a <div> element with id="mw-content-text". We’ll use this element id to identify the main text of each article we download.

6.3  Using jsoup

jsoup makes it easy to download and parse web pages, and to navigate the DOM tree. Here’s an example:

    String url = "http://en.wikipedia.org/wiki/Java_(programming_language)";

    // download and parse the document
    Connection conn = Jsoup.connect(url);
    Document doc = conn.get();

    // select the content text and pull out the paragraphs.
    Element content = doc.getElementById("mw-content-text");
    Elements paragraphs = content.select("p");

Jsoup.connect takes a URL as a String and makes a connection to the web server; the get method downloads the HTML, parses it, and returns a Document object, which represents the DOM.

Document provides methods for navigating the tree and selecting nodes. In fact, it provides so many methods, it can be confusing. This example demonstrates two ways to select nodes:

  • getElementById takes a String and searches the tree for an element that has a matching “id” field. Here it selects the node <div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">, which appears on every Wikipedia page to identify the <div> element that contains the main text of the page, as opposed to the navigation sidebar and other elements.

    The return value from getElementById is an Element object that represents this <div> and contains the elements in the <div> as children, grandchildren, etc.

  • select takes a String, traverses the tree, and returns all the elements with tags that match the String. In this example, it returns all paragraph tags that appear in content. The return value is an Elements object.

Before you go on, you should skim the documentation of these classes so you know what they can do. The most important classes are Element, Elements, and Node, which you can read about at http://thinkdast.com/jsoupelt, http://thinkdast.com/jsoupelts, and http://thinkdast.com/jsoupnode.

Node represents a node in the DOM tree; there are several subclasses that extend Node, including Element, TextNode, DataNode, and Comment. Elements is a Collection of Element objects.


Figure 6.3: UML diagram for selected classes provided by jsoup.

Figure 6.3 is a UML diagram showing the relationships among these classes. In a UML class diagram, a line with a hollow arrow head indicates that one class extends another. For example, this diagram indicates that Elements extends ArrayList. We’ll get back to UML diagrams in Section 11.6.

6.4  Iterating through the DOM

To make your life easier, I provide a class called WikiNodeIterable that lets you iterate through the nodes in a DOM tree. Here’s an example that shows how to use it:

    Elements paragraphs = content.select("p");
    Element firstPara = paragraphs.get(0);

    Iterable<Node> iter = new WikiNodeIterable(firstPara);
    for (Node node: iter) {
        if (node instanceof TextNode) {
            System.out.print(node);
        }
    }

This example picks up where the previous one leaves off. It selects the first paragraph in paragraphs and then creates a WikiNodeIterable, which implements Iterable<Node>. WikiNodeIterable performs a “depth-first search”, which produces the nodes in the order they would appear on the page.

In this example, we print a Node only if it is a TextNode and ignore other types of Node, specifically the Element objects that represent tags. The result is the plain text of the HTML paragraph without any markup. The output is:

Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented,[13] and specifically designed …

6.5  Depth-first search

There are several ways you might reasonably traverse a tree, each with different applications. We’ll start with “depth-first search”, or DFS. DFS starts at the root of the tree and selects the first child. If the child has children, it selects the first child again. When it gets to a node with no children, it backtracks, moving up the tree to the parent node, where it selects the next child if there is one; otherwise it backtracks again. When it has explored the last child of the root, it’s done.

There are two common ways to implement DFS, recursively and iteratively. The recursive implementation is simple and elegant:

private static void recursiveDFS(Node node) {
    if (node instanceof TextNode) {
        System.out.print(node);
    }
    for (Node child: node.childNodes()) {
        recursiveDFS(child);
    }
}

This method gets invoked on every Node in the tree, starting with the root. If the Node it gets is a TextNode, it prints the contents. If the Node has any children, it invokes recursiveDFS on each one of them in order.

In this example, we print the contents of each TextNode before traversing the children, so this is an example of a “pre-order” traversal. You can read about “pre-order”, “post-order”, and “in-order” traversals at http://thinkdast.com/treetrav. For this application, the traversal order doesn’t matter.

By making recursive calls, recursiveDFS uses the call stack (http://thinkdast.com/callstack) to keep track of the child nodes and process them in the right order. As an alternative, we can use a stack data structure to keep track of the nodes ourselves; if we do that, we can avoid the recursion and traverse the tree iteratively.

6.6  Stacks in Java

Before I explain the iterative version of DFS, I’ll explain the stack data structure. We’ll start with the general concept of a stack, which I’ll call a “stack” with a lowercase “s”. Then we’ll talk about two Java interfaces that define stack methods: Stack and Deque.

A stack is a data structure that is similar to a list: it is a collection that maintains the order of the elements. The primary difference between a stack and a list is that the stack provides fewer methods. In the usual convention, it provides:

  • push: which adds an element to the top of the stack.
  • pop: which removes and returns the top-most element from the stack.
  • peek: which returns the top-most element without modifying the stack.
  • isEmpty: which indicates whether the stack is empty.

Because pop always returns the top-most element, a stack is also called a “LIFO”, which stands for “last in, first out”. An alternative to a stack is a “queue”, which returns elements in the same order they are added; that is, “first in, first out”, or FIFO.
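The difference between the two orders is easy to see with ArrayDeque, which can act as either a stack or a queue. This sketch uses only standard Java collections; the variable names are mine, not part of the book's code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LifoFifo {
    public static void main(String[] args) {
        // Used as a stack: push and pop both work at the head (LIFO).
        Deque<String> stack = new ArrayDeque<>();
        stack.push("a");
        stack.push("b");
        stack.push("c");
        System.out.println(stack.pop());  // c, the last element added

        // Used as a queue: add at the tail, remove from the head (FIFO).
        Deque<String> queue = new ArrayDeque<>();
        queue.addLast("a");
        queue.addLast("b");
        queue.addLast("c");
        System.out.println(queue.removeFirst());  // a, the first element added
    }
}
```

The same class supports both disciplines; only the pair of methods you call determines whether you get LIFO or FIFO order.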

It might not be obvious why stacks and queues are useful: they don’t provide any capabilities that aren’t provided by lists; in fact, they provide fewer capabilities. So why not use lists for everything? There are two reasons:

  1. If you limit yourself to a small set of methods — that is, a small API — your code will be more readable and less error-prone. For example, if you use a list to represent a stack, you might accidentally remove an element in the wrong order. With the stack API, this kind of mistake is literally impossible. And the best way to avoid errors is to make them impossible.
  2. If a data structure provides a small API, it is easier to implement efficiently. For example, a simple way to implement a stack is a singly-linked list. When we push an element onto the stack, we add it to the beginning of the list; when we pop an element, we remove it from the beginning. For a linked list, adding and removing from the beginning are constant time operations, so this implementation is efficient. Conversely, big APIs are harder to implement efficiently.

To implement a stack in Java, you have three options:

  1. Go ahead and use ArrayList or LinkedList. If you use ArrayList, be sure to add and remove from the end, which is a constant time operation. And be careful not to add elements in the wrong place or remove them in the wrong order.
  2. Java provides a class called Stack that provides the standard set of stack methods. But this class is an old part of Java: it is not consistent with the Java Collections Framework, which came later.
  3. Probably the best choice is to use one of the implementations of the Deque interface, like ArrayDeque.

“Deque” stands for “double-ended queue”; it’s supposed to be pronounced “deck”, but some people say “deek”. In Java, the Deque interface provides push, pop, peek, and isEmpty, so you can use a Deque as a stack. It provides other methods, which you can read about at http://thinkdast.com/deque, but we won’t use them for now.

6.7  Iterative DFS

Here is an iterative version of DFS that uses an ArrayDeque to represent a stack of Node objects:

    private static void iterativeDFS(Node root) {
        Deque<Node> stack = new ArrayDeque<Node>();
        stack.push(root);

        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node instanceof TextNode) {
                System.out.print(node);
            }

            List<Node> nodes = new ArrayList<Node>(node.childNodes());
            Collections.reverse(nodes);

            for (Node child: nodes) {
                stack.push(child);
            }
        }
    }

The parameter, root, is the root of the tree we want to traverse, so we start by creating the stack and pushing the root onto it.

The loop continues until the stack is empty. Each time through, it pops a Node off the stack. If it gets a TextNode, it prints the contents. Then it pushes the children onto the stack. In order to process the children in the right order, we have to push them onto the stack in reverse order; we do that by copying the children into an ArrayList, reversing the elements in place, and then iterating through the reversed ArrayList.

One advantage of the iterative version of DFS is that it is easier to implement as a Java Iterator; you’ll see how in the next chapter.

But first, one last note about the Deque interface: in addition to ArrayDeque, Java provides another implementation of Deque, our old friend LinkedList. LinkedList implements both interfaces, List and Deque. Which interface's methods you can invoke depends on the declared type of the variable. For example, if you assign a LinkedList object to a Deque variable, like this:

Deque<Node> deque = new LinkedList<Node>();

you can use the methods in the Deque interface, but not all methods in the List interface. If you assign it to a List variable, like this:

List<Node> deque = new LinkedList<Node>();

you can use List methods but not all Deque methods. And if you assign it like this:

LinkedList<Node> deque = new LinkedList<Node>();

you can use all the methods. But if you combine methods from different interfaces, your code will be less readable and more error-prone.
