Chapter 6  Tree traversal

This chapter introduces the application we will develop during the
rest of the book, a web search engine.
I describe the elements of a search engine and
introduce the first application, a Web crawler that downloads and parses
pages from Wikipedia. This chapter also presents a recursive implementation of
depth-first search and an iterative implementation that uses a Java Deque.
6.1  Search engines

A web search engine, like Google Search or Bing, takes a set of “search terms” and returns a list of web pages that are relevant to those terms (I’ll discuss what “relevant” means later). You can read more at http://thinkdast.com/searcheng, but I’ll explain what you need as we go along. The essential components of a search engine are:

- Crawling: a crawler downloads web pages, extracts the text, and finds the links to other pages.
- Indexing: an indexer builds a data structure that makes it possible to look up a search term and find the pages that contain it.
- Retrieval: a retriever collects results from the index and identifies the pages that are most relevant to the search terms.
We’ll start with the crawler. The goal of a crawler is to discover and download a set of web pages. For search engines like Google and Bing, the goal is to find all web pages, but often crawlers are limited to a smaller domain. In our case, we will only read pages from Wikipedia.

As a first step, we’ll build a crawler that reads a Wikipedia page, finds the first link, follows the link to another page, and repeats. We will use this crawler to test the “Getting to Philosophy” conjecture, which states:

    Clicking on the first lowercase link in the main text of a Wikipedia article, and then repeating the process for subsequent articles, usually eventually gets one to the Philosophy article.

This conjecture is stated at http://thinkdast.com/getphil, and you can read its history there. Testing the conjecture will allow us to build the basic pieces of a crawler without having to crawl the entire web, or even all of Wikipedia. And I think the exercise is kind of fun!

In a few chapters, we’ll work on the indexer, and then we’ll get to the retriever.

6.2  Parsing HTML

When you download a web page, the contents are written in HyperText Markup Language, aka HTML. For example, here is a minimal HTML document:

    <!DOCTYPE html>
    <html>
      <head>
        <title>This is a title</title>
      </head>
      <body>
        <p>Hello world!</p>
      </body>
    </html>

The phrases “This is a title” and “Hello world!” are the text that actually appears on the page; the other elements are tags that indicate how the text should be displayed.

When our crawler downloads a page, it will need to parse the HTML in order to extract the text and find the links. To do that, we’ll use jsoup, which is an open-source Java library that downloads and parses HTML. The result of parsing HTML is a Document Object Model tree, or DOM tree, that contains the elements of the document, including text and tags. The tree is a linked data structure made up of nodes; the nodes represent text, tags, and other document elements.
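To get a feel for the DOM tree, here is a minimal sketch that asks jsoup to parse the document above from a string rather than downloading it; it assumes the jsoup library is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseMinimal {
    public static void main(String[] args) {
        String html = "<!DOCTYPE html><html><head>"
            + "<title>This is a title</title></head>"
            + "<body><p>Hello world!</p></body></html>";

        // Jsoup.parse builds a DOM tree from the string (no download needed)
        Document doc = Jsoup.parse(html);

        // the text we care about is reachable through the tree
        System.out.println(doc.title());        // prints "This is a title"
        System.out.println(doc.body().text());  // prints "Hello world!"
    }
}
```

The same parse method is what we get when we download a real page, so experiments like this are a cheap way to explore the tree before crawling anything.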
The relationships between the nodes are determined by the structure of
the document. In the example above, the first node, called the root, is the <html> tag, which contains links to the two nodes it contains, <head> and <body>; these nodes are the children of the root node. The <head> node has one child, <title>, and the <body> node has one child, <p>.

Each node contains links to its children; in addition, each node contains a link to its parent, so from any node it is possible to navigate up and down the tree. The DOM tree for real pages is usually more complicated than this example.

Most web browsers provide tools for inspecting the DOM of the page you are viewing. In Chrome, you can right-click on any part of a web page and select “Inspect” from the menu that pops up. In Firefox, you can right-click and select “Inspect Element” from the menu. Safari provides a tool called Web Inspector, which you can read about at http://thinkdast.com/safari. For Internet Explorer, you can read the instructions at http://thinkdast.com/explorer.

Figure 6.2
shows a screenshot of the DOM for the Wikipedia page on Java,
http://thinkdast.com/java.
The element that’s highlighted is the first paragraph of the main text
of the article, which is contained in a <div> element with the id “mw-content-text”.

6.3  Using jsoup

jsoup makes it easy to download and parse web pages, and to navigate the DOM tree. Here’s an example:

    String url = "http://en.wikipedia.org/wiki/Java_(programming_language)";

    // download and parse the document
    Connection conn = Jsoup.connect(url);
    Document doc = conn.get();

    // select the content text and pull out the paragraphs.
    Element content = doc.getElementById("mw-content-text");
    Elements paragraphs = content.select("p");
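If you want to experiment with these calls without hitting the network, you can run them on an HTML string instead. This sketch assumes jsoup is on the classpath and uses a made-up fragment that mimics the “mw-content-text” container Wikipedia uses:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectDemo {
    public static void main(String[] args) {
        // a made-up fragment with the same id Wikipedia uses for its content
        String html = "<html><body><div id=\"mw-content-text\">"
            + "<p>First paragraph.</p><p>Second paragraph.</p>"
            + "</div></body></html>";
        Document doc = Jsoup.parse(html);

        // same calls as above: find the content div, then its paragraphs
        Element content = doc.getElementById("mw-content-text");
        Elements paragraphs = content.select("p");

        System.out.println(paragraphs.size());        // prints 2
        System.out.println(paragraphs.get(0).text()); // prints "First paragraph."
    }
}
```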
Before you go on, you should skim the documentation of these classes so
you know what they can do. The most important classes are Element, Elements, and Node.
Figure 6.3 is a UML diagram showing the relationships among
these classes. In a UML class diagram, a line with a hollow arrow
head indicates that one class extends another. For example, this
diagram indicates that Elements extends ArrayList.

6.4  Iterating through the DOM

To make your life easier, I provide a class called WikiNodeIterable that lets you iterate through the nodes in a DOM tree. Here’s an example that shows how to use it:
    Elements paragraphs = content.select("p");
    Element firstPara = paragraphs.get(0);

    Iterable<Node> iter = new WikiNodeIterable(firstPara);
    for (Node node: iter) {
        if (node instanceof TextNode) {
            System.out.print(node);
        }
    }

This example picks up where the previous one leaves off. It selects the first paragraph in paragraphs, then creates a WikiNodeIterable and iterates through the nodes in the tree rooted at that paragraph.

In this example, we print a Node only if it is a TextNode; the result is the plain text of the paragraph, without any markup, which begins:

    Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented,[13] and specifically designed …

6.5  Depth-first search

There are several ways you might reasonably traverse a tree, each with different applications. We’ll start with “depth-first search”, or DFS. DFS starts at the root of the tree and selects the first child. If the child has children, it selects the first child again. When it gets to a node with no children, it backtracks, moving up the tree to the parent node, where it selects the next child if there is one; otherwise it backtracks again. When it has explored the last child of the root, it’s done.

There are two common ways to implement DFS, recursively and iteratively. The recursive implementation is simple and elegant:

    private static void recursiveDFS(Node node) {
        if (node instanceof TextNode) {
            System.out.print(node);
        }
        for (Node child: node.childNodes()) {
            recursiveDFS(child);
        }
    }

This method gets invoked on every Node in the tree, starting with the root. If the node is a TextNode, it prints the contents; either way, the method then invokes itself on each of the node’s children, in order.

In this example, we print the contents of each TextNode before traversing the children, so this is an example of a “pre-order” traversal.

By making recursive calls, recursiveDFS uses the call stack to keep track of the child nodes and process them in the right order. As an alternative, we can use a stack data structure to keep track of the nodes ourselves; if we do that, we can avoid the recursion and traverse the tree iteratively.

6.6  Stacks in Java

Before I explain the iterative version of DFS, I’ll explain
the stack data structure. We’ll start with the general concept of a
stack, which I’ll call a “stack” with a lowercase “s”. Then we’ll
talk about two Java implementations, the Stack class and the Deque interface.

A stack is a data structure that is similar to a list: it is a collection that maintains the order of the elements. The primary difference between a stack and a list is that the stack provides fewer methods. In the usual convention, it provides:

- push: which adds an element to the top of the stack.
- pop: which removes and returns the top-most element.
- peek: which returns the top-most element without modifying the stack.
- isEmpty: which indicates whether the stack is empty.

Because pop returns the most recently added element, a stack is sometimes called a “LIFO”, which stands for “last in, first out”. An alternative to a stack is a “queue”, which returns elements in the same order they were added; that is, “first in, first out”, or “FIFO”.

It might not be obvious why stacks and queues are useful: they don’t provide any capabilities that aren’t provided by lists; in fact, they provide fewer capabilities. So why not use lists for everything? There are two reasons:

- If you limit yourself to a small set of methods (that is, a small API), your code will be more readable and less error-prone. For example, if you use a list to represent a stack, you might accidentally remove an element in the wrong order; the stack API makes that kind of mistake impossible.
- If a data structure provides a small API, it is easier to implement efficiently. For example, a simple way to implement a stack is a singly-linked list: pushing an element adds it to the beginning of the list, and popping removes it from the beginning, so both operations are constant time.

To implement a stack in Java, you have three options:

- Use ArrayList or LinkedList directly. If you do, you have to be careful to add and remove elements only at the same end.
- Java provides a class called Stack that provides the standard set of stack methods. But this class is an old part of Java: it is not consistent with the Java Collections Framework, which came later.
- Probably the best choice is to use one of the implementations of the Deque interface, like ArrayDeque.
“Deque” stands for “double-ended queue”; it’s supposed to be
pronounced “deck”, but some people say “deek”. In Java, the Deque interface provides push, pop, peek, and isEmpty, so you can use a Deque as a stack. It provides other methods, too, but we won’t use them for now.

6.7  Iterative DFS

Here is an iterative version of DFS that uses an ArrayDeque to represent a stack of Node objects:

    private static void iterativeDFS(Node root) {
        Deque<Node> stack = new ArrayDeque<Node>();
        stack.push(root);

        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node instanceof TextNode) {
                System.out.print(node);
            }

            List<Node> nodes = new ArrayList<Node>(node.childNodes());
            Collections.reverse(nodes);

            for (Node child: nodes) {
                stack.push(child);
            }
        }
    }

The parameter, root, is the root of the tree we want to traverse, so we start by creating the stack and pushing the root onto it.

The loop continues until the stack is empty. Each time through, it pops a Node off the stack. If it gets a TextNode, it prints the contents. Then it pushes the children onto the stack. In order to process the children in the right order, we have to push them onto the stack in reverse order; we do that by copying them into an ArrayList and reversing it in place before the loop that pushes them.

One advantage of the iterative version of DFS is that it is easier to implement as a Java Iterator; you’ll see how in the next chapter.

But first, one last note about the Deque interface: in addition to ArrayDeque, Java provides another implementation of Deque, our old friend LinkedList. LinkedList implements both interfaces, List and Deque. Which interface you get depends on how you use it. For example, if you assign a LinkedList object to a Deque variable, like this:

    Deque<Node> deque = new LinkedList<Node>();

you can use the methods in the Deque interface, but not all methods in the List interface. If you assign it to a List variable, like this:

    List<Node> deque = new LinkedList<Node>();

you can use List methods but not all Deque methods. And if you assign it like this:

    LinkedList<Node> deque = new LinkedList<Node>();

you can use all the methods. But if you combine methods from different interfaces, your code will be less readable and more error-prone.
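To see the last-in-first-out behavior concretely, here is a minimal sketch that uses ArrayDeque through the Deque interface; the strings and their order are just for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class StackDemo {
    public static void main(String[] args) {
        // using ArrayDeque through the Deque interface, as recommended above
        Deque<String> stack = new ArrayDeque<String>();
        stack.push("a");
        stack.push("b");
        stack.push("c");

        // pop returns the most recently pushed element first: last in, first out
        while (!stack.isEmpty()) {
            System.out.print(stack.pop());  // prints "cba"
        }
        System.out.println();
    }
}
```

Because the elements come off in reverse order, iterativeDFS has to push each node’s children in reverse so that the first child is the first one popped.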