Chapter 7 Getting to Philosophy
The goal of this chapter is to develop a Web crawler that tests the
“Getting to Philosophy” conjecture, which we presented in
Section 6.1.
7.1 Getting started
In the repository for this book, you'll find some code to help you get started:

- WikiNodeExample.java contains the code from the previous chapter, demonstrating recursive and iterative implementations of depth-first search (DFS) in a DOM tree.
- WikiNodeIterable.java contains an Iterable class for traversing a DOM tree. I'll explain this code in the next section.
- WikiFetcher.java contains a utility class that uses jsoup to download pages from Wikipedia. To help you comply with Wikipedia's terms of service, this class limits how fast you can download pages; if you request more than one page per second, it sleeps before downloading the next page.
- WikiPhilosophy.java contains an outline of the code you will write for this exercise. We'll walk through it below.

You'll also find the Ant build file build.xml. If you run ant WikiPhilosophy, it will run a simple bit of starter code.
7.2 Iterables and Iterators
In the previous chapter, I presented an iterative depth-first search
(DFS), and suggested that an advantage of the iterative version,
compared to the recursive version, is that it is easier to wrap in an
Iterator object. In this section we’ll see how to do that.
If you are not familiar with the Iterator and Iterable
interfaces, you can read about them at
http://thinkdast.com/iterator
and
http://thinkdast.com/iterable.

Take a look at the contents of WikiNodeIterable.java. The outer class, WikiNodeIterable, implements the Iterable<Node> interface, so we can use it in a for loop like this:

Node root = ...
Iterable<Node> iter = new WikiNodeIterable(root);
for (Node node: iter) {
    visit(node);
}

where root is the root of the tree we want to traverse and visit is a method that does whatever we want when we "visit" a Node.
The implementation of WikiNodeIterable follows a conventional formula:

- The constructor takes and stores a reference to the root Node.
- The iterator method creates and returns an Iterator object.
Here's what it looks like:

public class WikiNodeIterable implements Iterable<Node> {

    private Node root;

    public WikiNodeIterable(Node root) {
        this.root = root;
    }

    @Override
    public Iterator<Node> iterator() {
        return new WikiNodeIterator(root);
    }
}
The inner class, WikiNodeIterator, does all the real work:

private class WikiNodeIterator implements Iterator<Node> {

    Deque<Node> stack;

    public WikiNodeIterator(Node node) {
        stack = new ArrayDeque<Node>();
        stack.push(node);
    }

    @Override
    public boolean hasNext() {
        return !stack.isEmpty();
    }

    @Override
    public Node next() {
        if (stack.isEmpty()) {
            throw new NoSuchElementException();
        }
        Node node = stack.pop();
        List<Node> nodes = new ArrayList<Node>(node.childNodes());
        Collections.reverse(nodes);
        for (Node child: nodes) {
            stack.push(child);
        }
        return node;
    }
}
This code is almost identical to the iterative version of DFS, but now it's split into three methods:

- The constructor initializes the stack (which is implemented using an ArrayDeque) and pushes the root node onto it.
- hasNext checks whether the stack is empty.
- next pops the next Node off the stack, pushes its children in reverse order, and returns the Node it popped. If someone invokes next on an empty Iterator, it throws an exception.
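By the way, the enhanced for loop in the earlier example is just syntactic sugar; it is roughly equivalent to this sketch, which calls the Iterator methods directly (assuming root and visit are defined as before):

// Roughly what the enhanced for loop does under the hood.
Iterator<Node> iterator = new WikiNodeIterable(root).iterator();
while (iterator.hasNext()) {
    Node node = iterator.next();
    visit(node);
}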
It might not be obvious that it is worthwhile to rewrite a perfectly
good method with two classes and five methods. But now that we’ve
done it, we can use WikiNodeIterable anywhere an
Iterable is called for, which makes it easy and syntactically
clean to separate the logic of the iteration (DFS) from whatever
processing we are doing on the nodes.
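For example, here is a sketch of a method that reuses the same iteration for a different task, counting the text nodes in a tree. countTextNodes is a hypothetical helper, not part of the repository:

// A sketch: the same DFS iteration, reused for a different task.
// countTextNodes is a hypothetical helper, not repository code.
private static int countTextNodes(Node root) {
    int count = 0;
    for (Node node: new WikiNodeIterable(root)) {
        if (node instanceof TextNode) {   // org.jsoup.nodes.TextNode
            count++;
        }
    }
    return count;
}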
7.3 WikiFetcher
When you write a Web crawler, it is easy to download too many pages too
fast, which might violate the terms of service for the server you are
downloading from. To help you avoid that, I provide a class called
WikiFetcher that does two things:

- It encapsulates the code we demonstrated in the previous chapter for downloading pages from Wikipedia, parsing the HTML, and selecting the content text.
- It measures the time between requests and, if we don't leave enough time between requests, it sleeps until a reasonable interval has elapsed. By default, the interval is one second.
Here's the definition of WikiFetcher:

public class WikiFetcher {
    private long lastRequestTime = -1;
    private long minInterval = 1000;

    /**
     * Fetches and parses a URL string,
     * returning a list of paragraph elements.
     *
     * @param url the URL of the page to fetch
     * @return the paragraph elements from the content text
     * @throws IOException
     */
    public Elements fetchWikipedia(String url) throws IOException {
        sleepIfNeeded();

        Connection conn = Jsoup.connect(url);
        Document doc = conn.get();
        Element content = doc.getElementById("mw-content-text");
        Elements paragraphs = content.select("p");
        return paragraphs;
    }

    private void sleepIfNeeded() {
        if (lastRequestTime != -1) {
            long currentTime = System.currentTimeMillis();
            long nextRequestTime = lastRequestTime + minInterval;
            if (currentTime < nextRequestTime) {
                try {
                    Thread.sleep(nextRequestTime - currentTime);
                } catch (InterruptedException e) {
                    System.err.println(
                        "Warning: sleep interrupted in fetchWikipedia.");
                }
            }
        }
        lastRequestTime = System.currentTimeMillis();
    }
}
The only public method is fetchWikipedia, which takes a URL as a String and returns an Elements collection that contains one DOM element for each paragraph in the content text. This code should look familiar.

The new code is in sleepIfNeeded, which checks the time since the last request and sleeps if the elapsed time is less than minInterval, which is in milliseconds.

That's all there is to WikiFetcher. Here's an example that demonstrates how it's used:

WikiFetcher wf = new WikiFetcher();

for (String url: urlList) {
    Elements paragraphs = wf.fetchWikipedia(url);
    processParagraphs(paragraphs);
}
In this example, we assume that urlList is a collection of Strings, and processParagraphs is a method that does something with the Elements object returned by fetchWikipedia.

This example demonstrates something important: you should create one WikiFetcher object and use it to handle all requests. If you have multiple instances of WikiFetcher, they won't enforce the minimum interval between requests.
NOTE: My implementation of WikiFetcher is simple, but it would
be easy for someone to misuse it by creating multiple instances. You
could avoid this problem by making WikiFetcher a “singleton”,
which you can read about at
http://thinkdast.com/singleton.
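Here is a minimal sketch of the singleton pattern applied to WikiFetcher; this is an illustration of the pattern, not the code in the repository:

// A minimal singleton sketch (an illustration, not the book's code).
// The private constructor means the only instance is the one
// returned by getInstance, so the rate limit is enforced globally.
public class WikiFetcher {
    private static final WikiFetcher instance = new WikiFetcher();

    private WikiFetcher() {}

    public static WikiFetcher getInstance() {
        return instance;
    }

    // fetchWikipedia and sleepIfNeeded as before...
}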
7.4 Exercise 5
In WikiPhilosophy.java you’ll find a simple main
method that shows how to use some of these pieces. Starting with this
code, your job is to write a crawler that:

1. Takes a URL for a Wikipedia page, downloads it, and parses it.
2. Traverses the resulting DOM tree to find the first valid link. I'll explain what "valid" means below.
3. If the page has no links, or if the first link is a page we have already seen, indicates failure and exits.
4. If the link matches the URL of the Wikipedia page on philosophy, indicates success and exits.
5. Otherwise, goes back to Step 1.
The program should build a List of the URLs it visits and
display the results at the end (whether it succeeds or fails).
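If it helps to see the overall structure, here is one possible outline of that loop, offered as a sketch rather than a solution. getFirstValidLink is a hypothetical helper you would have to write:

// One possible outline of the crawler (a sketch, not the solution).
// getFirstValidLink is a hypothetical helper that returns the first
// valid link in the given paragraphs, or null if there is none.
public static void testConjecture(WikiFetcher wf, String source)
        throws IOException {
    List<String> visited = new ArrayList<String>();
    String url = source;

    while (true) {
        if (visited.contains(url)) {
            System.err.println("Failure: already visited " + url);
            break;
        }
        visited.add(url);

        Elements paragraphs = wf.fetchWikipedia(url);
        url = getFirstValidLink(paragraphs);

        if (url == null) {
            System.err.println("Failure: no valid link found.");
            break;
        }
        if (url.endsWith("/wiki/Philosophy")) {
            System.out.println("Success: got to Philosophy!");
            break;
        }
    }
    System.out.println(visited);
}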
So what should we consider a "valid" link? You have some choices here. Various versions of the "Getting to Philosophy" conjecture use slightly different rules, but here are some options:

- The link should be in the content text of the page, not in a sidebar or boxout.
- It should not be in italics or in parentheses.
- You should skip external links, links to the current page, and red links.
- In some versions, you should skip a link if the text starts with an uppercase letter.

You don't have to enforce all of these rules, but we recommend that you at least handle parentheses, italics, and links to the current page.

If you feel like you have enough information to get started, go ahead.
Or you might want to read these hints:

- As you traverse the tree, the two kinds of Node you will need to deal with are TextNode and Element. If you find an Element, you will probably have to typecast it to access the tag and other information.
- When you find an Element that contains a link, you can check whether it is in italics by following parent links up the tree. If there is an <i> or <em> tag in the parent chain, the link is in italics. (There is a sketch of this check after the list.)
- To check whether a link is in parentheses, you will have to scan through the text as you traverse the tree and keep track of opening and closing parentheses (ideally your solution should be able to handle nested parentheses (like this)).
- If you start from the Java page, you should get to Philosophy after following seven links, unless something has changed since I ran the code.
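As promised, here is a sketch of the italics check from the second hint; isItalic is a hypothetical helper, not code from the repository:

// A sketch of the italics check (isItalic is a hypothetical helper,
// not repository code). It follows parent links up the tree,
// looking for an <i> or <em> tag.
private static boolean isItalic(Node node) {
    for (Node n = node; n != null; n = n.parent()) {
        if (n instanceof Element) {
            String tag = ((Element) n).tagName();
            if (tag.equals("i") || tag.equals("em")) {
                return true;
            }
        }
    }
    return false;
}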
OK, that’s all the help you’re going to get. Now it’s up to you.
Have fun!