Chapter 15  Crawling Wikipedia

In this chapter, I present a solution to the previous exercise and analyze the performance of Web indexing algorithms. Then we build a simple Web crawler.

15.1  The Redis-backed indexer

In my solution, we store two kinds of structures in Redis:

- For each search term, we have a URLSet, which is a Redis Set of URLs that contain the search term.

- For each URL, we have a TermCounter, which is a Redis Hash that maps each search term to the number of times it appears on the page.
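To make the layout concrete, here is a hypothetical peek at the stored data using Jedis directly; the key names follow the scheme described below, but the example URL and search term are made up for illustration:

Jedis jedis = JedisMaker.make();   // helper from the previous exercise

// the Set of URLs that contain the term "java"
Set<String> urls = jedis.smembers("URLSet:java");

// the Hash mapping each term on one page to its count
String url = "https://en.wikipedia.org/wiki/Java_(programming_language)";
Map<String, String> counts = jedis.hgetAll("TermCounter:" + url);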
We discussed these data types in the previous chapter. You can also read about Redis Sets and Hashes at http://thinkdast.com/redistypes.

In JedisIndex, I provide a method that takes a search term and returns the Redis key of its URLSet:

private String urlSetKey(String term) {
    return "URLSet:" + term;
}

And a method that takes a URL and returns the Redis key of its TermCounter:
private String termCounterKey(String url) {
    return "TermCounter:" + url;
}

Here's the implementation of indexPage, which takes a URL and a jsoup Elements object containing the paragraphs we want to index:

public void indexPage(String url, Elements paragraphs) {
    System.out.println("Indexing " + url);

    // make a TermCounter and count the terms in the paragraphs
    TermCounter tc = new TermCounter(url);
    tc.processElements(paragraphs);

    // push the contents of the TermCounter to Redis
    pushTermCounterToRedis(tc);
}

To index a page, we make a TermCounter for the contents of the page, then push the contents of the TermCounter to Redis.
Here's the new code that pushes a TermCounter to Redis:

public List<Object> pushTermCounterToRedis(TermCounter tc) {
    Transaction t = jedis.multi();

    String url = tc.getLabel();
    String hashname = termCounterKey(url);

    // if this page has already been indexed, delete the old hash
    t.del(hashname);

    // for each term, add an entry in the TermCounter and a new
    // member of the index
    for (String term: tc.keySet()) {
        Integer count = tc.get(term);
        t.hset(hashname, term, count.toString());
        t.sadd(urlSetKey(term), url);
    }
    List<Object> res = t.exec();
    return res;
}

This method uses a Transaction, which collects the operations and sends them to the Redis server all at once; that is much faster than sending a series of small operations. It loops through the terms in the TermCounter; for each one, it adds the term and its count to the TermCounter hash on Redis, and adds the URL to the URLSet for that term.
If the page has already been indexed, we delete its old TermCounter before pushing the new contents.
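As a usage sketch (not code from the book), indexing a single page might look like this, assuming JedisMaker and WikiFetcher from the previous exercises:

// (assumes an enclosing method that declares throws IOException)
Jedis jedis = JedisMaker.make();
JedisIndex index = new JedisIndex(jedis);
WikiFetcher wf = new WikiFetcher();

String url = "https://en.wikipedia.org/wiki/Java_(programming_language)";
Elements paragraphs = wf.fetchWikipedia(url);   // download and parse the page
index.indexPage(url, paragraphs);               // push the results to Redis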
That's it for indexing new pages.

The second part of the exercise asked you to write getCounts, which takes a search term and returns a map from each URL where the term appears to the number of times it appears there. Here is my solution:

public Map<String, Integer> getCounts(String term) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    Set<String> urls = getURLs(term);
    for (String url: urls) {
        Integer count = getCount(url, term);
        map.put(url, count);
    }
    return map;
}

This method uses two helper methods:

- getURLs takes a search term and returns the Set of URLs where the term appears.

- getCount takes a URL and a term and returns the number of times the term appears at the given URL.
Here are the implementations:

public Set<String> getURLs(String term) {
    Set<String> set = jedis.smembers(urlSetKey(term));
    return set;
}

public Integer getCount(String url, String term) {
    String redisKey = termCounterKey(url);
    String count = jedis.hget(redisKey, term);
    return new Integer(count);
}

Because of the way we designed the index, these methods are simple and efficient.

15.2  Analysis of lookup

Suppose we have indexed N pages and discovered M unique search terms. How long will it take to look up a search term? Think about your answer before you continue.

To look up a search term, we run getCounts, which makes a new map, calls getURLs, and then loops through the URLs it returns. getURLs takes time proportional to the number of URLs that contain the term. For rare terms, that might be a small number, but for common terms it can be as large as N.
Inside the loop, we run getCount, which finds a TermCounter on Redis, looks up a term, and adds an entry to a HashMap. Those are all constant time operations, so the overall complexity of getCounts is O(N) in the worst case. In practice, the run time is proportional to the number of pages that contain the term, which is usually much less than N.

This algorithm is as efficient as it can be, in terms of
algorithmic complexity, but it is very slow because it sends many small operations to Redis. You can make it faster using a Transaction, which batches the operations into a single round trip to the server; a sketch of the idea follows.
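Here is a minimal sketch of a Transaction-based lookup, assuming the same jedis instance and key helpers shown above; the method name getCountsFaster is just a label chosen here, not part of the book's API:

public Map<String, Integer> getCountsFaster(String term) {
    // convert the Set of URLs to a List so the order is stable
    List<String> urls = new ArrayList<String>(getURLs(term));

    // queue up one HGET per URL and send them in a single round trip
    Transaction t = jedis.multi();
    for (String url: urls) {
        t.hget(termCounterKey(url), term);
    }
    List<Object> res = t.exec();

    // the replies come back in the same order the commands were queued,
    // so we can match them up with the URLs
    Map<String, Integer> map = new HashMap<String, Integer>();
    int i = 0;
    for (String url: urls) {
        String count = (String) res.get(i++);
        map.put(url, Integer.valueOf(count));
    }
    return map;
}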
15.3  Analysis of indexing

Using the data structures we designed, how long will it take to index a page? Again, think about your answer before you continue.

To index a page, we traverse its DOM tree, find all the text nodes, and split the strings into search terms. All of that takes time proportional to the number of words on the page.
For each term, we increment a counter in a HashMap, which is a constant time operation. So making the TermCounter takes time proportional to the number of words on the page.

Pushing the TermCounter to Redis requires deleting the old hash; then, for each term, we add an entry to a Redis Hash and add a member to a Redis Set.
Both of these are constant time operations, so the total time to push the TermCounter is proportional to the number of unique search terms.

In summary, making the TermCounter is proportional to the number of words on the page, and pushing the TermCounter to Redis is proportional to the number of unique terms. Since the number of words on the page usually exceeds the number of unique search terms, the overall complexity is proportional to the number of words on the page. In theory a page might contain all search terms in the index, so the worst case performance is O(M), but we don't expect to see the worst case in practice.

This analysis suggests a way to improve performance: we should probably avoid indexing very common words. First of all, they take up a lot of time and space, because they appear in almost every URLSet and TermCounter. Also, they are not very useful for search, because they do little to distinguish one page from another.

Most search engines avoid indexing common words, which are known in this context as stop words (http://thinkdast.com/stopword).

15.4  Graph traversal

If you did the “Getting to Philosophy” exercise in Chapter 7, you already have a program that reads a Wikipedia page, finds the first link, uses the link to load the next page, and repeats. This program is a specialized kind of crawler, but when people say “Web crawler” they usually mean a program that:

- Loads a starting page and indexes the contents,

- Finds all the links on the page and adds the linked URLs to a collection, and

- Works its way through the collection, loading pages, indexing them, and adding new URLs, skipping any URL that has already been indexed.
You can think of the Web as a graph where each page is a node and each link is a directed edge from one node to another. If you are not familiar with graphs, you can read about them at http://thinkdast.com/graph.

Starting from a source node, a crawler traverses this graph, visiting each reachable node once. The collection we use to store the URLs determines what kind of traversal the crawler performs (the sketch after this list illustrates the difference):

- If it's a first-in-first-out (FIFO) queue, the crawler performs a breadth-first traversal.

- If it's a last-in-first-out (LIFO) stack, the crawler performs a depth-first traversal.

- More generally, the items in the collection might be prioritized; for example, we might give higher priority to pages that have not been indexed for a long time.
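As a rough, self-contained illustration (not code from the book), here is a toy traversal over a made-up link graph; the only thing that changes between breadth-first and depth-first is which end of the collection we take the next URL from:

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TraversalDemo {

    public static void main(String[] args) {
        // a toy "Web": each page maps to the pages it links to
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("D"));
        links.put("C", Arrays.asList("D"));
        links.put("D", Arrays.asList("A"));

        Deque<String> collection = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        collection.addLast("A");

        while (!collection.isEmpty()) {
            // FIFO gives breadth-first order; use removeLast() instead
            // for LIFO, which gives depth-first order
            String url = collection.removeFirst();
            if (!visited.add(url)) {
                continue;                  // skip pages we've already seen
            }
            System.out.println(url);       // stand-in for "index this page"
            for (String link: links.get(url)) {
                collection.addLast(link);  // newly discovered URLs go on the end
            }
        }
    }
}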
You can read more about graph traversal at http://thinkdast.com/graphtrav.

15.5  Exercise 12

Now it's time to write the crawler. In the repository for this book, you'll find the source files for this exercise:

- WikiCrawler.java, which contains starter code for your crawler.

- WikiCrawlerTest.java, which contains test code for WikiCrawler.

- JedisIndex.java, which is my solution to the previous exercise.
You'll also need some of the helper classes we've used in previous exercises, including JedisMaker.java, WikiFetcher.java, and TermCounter.java.
Before you run WikiCrawler, you have to provide a file with information about your Redis server; if you did this for the previous exercise, you should be all set.

Run ant build to compile the source files, then run ant JedisMaker to make sure it is configured to connect to your Redis server.

Now run ant WikiCrawlerTest. It should fail, because you have work to do!

Here's the beginning of the WikiCrawler class I provided:

public class WikiCrawler {
    public final String source;
    private JedisIndex index;
    private Queue<String> queue = new LinkedList<String>();
    final static WikiFetcher wf = new WikiFetcher();

    public WikiCrawler(String source, JedisIndex index) {
        this.source = source;
        this.index = index;
        queue.offer(source);
    }

    public int queueSize() {
        return queue.size();
    }

The instance variables are:

- source, the URL where we start crawling,

- index, the JedisIndex where the results should go,

- queue, a LinkedList where we keep track of URLs that have been discovered but not yet indexed, and

- wf, the WikiFetcher we'll use to read and parse Web pages.
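To give a sense of how the class is meant to be used, here is a hypothetical driver, a sketch rather than code from the book; it assumes JedisMaker.make() from the previous exercise and the crawl method you are about to write:

// (assumes an enclosing method that declares throws IOException)
Jedis jedis = JedisMaker.make();
JedisIndex index = new JedisIndex(jedis);

String source = "https://en.wikipedia.org/wiki/Java_(programming_language)";
WikiCrawler wc = new WikiCrawler(source, index);

// keep crawling until a page is successfully indexed or the queue runs dry
String res;
do {
    res = wc.crawl(false);    // returns null if the page was already indexed
} while (res == null && wc.queueSize() > 0);
System.out.println("Indexed " + res);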
Your job is to fill in crawl. Here's the prototype:

public String crawl(boolean testing) throws IOException {}

The parameter testing will be true when this method is called from WikiCrawlerTest and should be false otherwise.

When testing is true, the crawl method should:

- Choose and remove a URL from the queue in FIFO order.

- Read the contents of the page using WikiFetcher.readWikipedia, which reads cached copies of pages included in the repository for testing purposes (to avoid problems if the live Wikipedia pages change).

- Index the page regardless of whether it is already indexed.

- Find all the internal links on the page and add them to the queue in the order they appear. “Internal links” are links to other Wikipedia pages.

- Return the URL of the page it indexed.
When testing is false, this method should:

- Choose and remove a URL from the queue in FIFO order.

- If the URL is already indexed, it should not index it again, and should return null.

- Otherwise it should read the contents of the page using WikiFetcher.fetchWikipedia, which reads current content from the Web.

- Then it should index the page, add the internal links to the queue, and return the URL of the page it indexed.
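Here is a minimal sketch of one way crawl might be structured under the spec above. It is not the book's solution: isIndexed and queueInternalLinks are assumed helpers (the first on JedisIndex, the second one you would write in WikiCrawler), so treat both as hypothetical.

public String crawl(boolean testing) throws IOException {
    // choose and remove a URL from the queue in FIFO order
    String url = queue.poll();
    if (url == null) {
        return null;
    }

    // in normal operation, skip pages that have already been indexed
    // (isIndexed is an assumed helper on JedisIndex)
    if (!testing && index.isIndexed(url)) {
        return null;
    }

    // read a cached copy while testing, a live page otherwise
    Elements paragraphs = testing ? wf.readWikipedia(url)
                                  : wf.fetchWikipedia(url);

    // index the page and add its internal links to the queue
    index.indexPage(url, paragraphs);
    queueInternalLinks(paragraphs);   // assumed helper that queues the links

    return url;
}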
When your crawler is working as specified, WikiCrawlerTest should pass. Good luck!