This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly. You might prefer to read the PDF version. Or you can buy this book on Amazon.com. Chapter 8 IndexerAt this point we have built a basic Web crawler; the next piece we will work on is the index. In the context of web search, an index is a data structure that makes it possible to look up a search term and find the pages where that term appears. In addition, we would like to know how many times the search term appears on each page, which will help identify the pages most relevant to the term. For example, if a user submits the search terms “Java” and “programming”, we would look up both search terms and get two sets of pages. Pages with the word “Java” would include pages about the island of Java, the nickname for coffee, and the programming language. Pages with the word “programming” would include pages about different programming languages, as well as other uses of the word. By selecting pages with both terms, we hope to eliminate irrelevant pages and find the ones about Java programming. Now that we understand what the index is and what operations it performs, we can design a data structure to represent it. 8.1 Data structure selectionThe fundamental operation of the index is a lookup; specifically, we need the ability to look up a term and find all pages that contain it. The simplest implementation would be a collection of pages. Given a search term, we could iterate through the contents of the pages and select the ones that contain the search term. But the run time would be proportional to the total number of words on all the pages, which would be way too slow. A better alternative is a map, which is a data structure that
represents a collection of key-value pairs and provides a fast
way to look up a key and find the corresponding value.
For example, the first map we’ll construct is a Java provides an interface called
Java provides several implementations of In addition to the In this case, we’ll need to combine two or more collections and find the pages that appear in all of them. You might recognize this operation as set intersection: the intersection of two sets is the set of elements that appear in both. As you might expect by now, Java provides a
Java provides several implementations of Now that we’ve designed our data structures from the top down, we’ll
implement them from the inside out, starting with 8.2 TermCounter
public class TermCounter { private Map<String, Integer> map; private String label; public TermCounter(String label) { this.label = label; this.map = new HashMap<String, Integer>(); } } The instance variables are To implement the mapping, I chose
public void put(String term, int count) { map.put(term, count); } public Integer get(String term) { Integer count = map.get(term); return count == null ? 0 : count; }
On the other hand, public void incrementTermCount(String term) { put(term, get(term) + 1); } If the term has not been seen before, In addition, public void processElements(Elements paragraphs) { for (Node node: paragraphs) { processTree(node); } } public void processTree(Node root) { for (Node node: new WikiNodeIterable(root)) { if (node instanceof TextNode) { processText(((TextNode) node).text()); } } } public void processText(String text) { String[] array = text.replaceAll("\\pP", " "). toLowerCase(). split("\\s+"); for (int i=0; i<array.length; i++) { String term = array[i]; incrementTermCount(term); } }
Finally, here’s an example that demonstrates how String url = "http://en.wikipedia.org/wiki/Java_(programming_language)"; WikiFetcher wf = new WikiFetcher(); Elements paragraphs = wf.fetchWikipedia(url); TermCounter counter = new TermCounter(url); counter.processElements(paragraphs); counter.printCounts(); This example uses a In the next section, you’ll have a chance to run this code and test your understanding by filling in a missing method. 8.3 Exercise 6In the repository for this book, you’ll find the source files for this exercise:
You’ll also find the Ant build file
Run genericservlet, 2 configurations, 1 claimed, 1 servletresponse, 2 occur, 2 Total of all counts = -1 When you run it, the order of the terms might be different. The last line is supposed to print the total of the term counts, but
it returns Run For the second part of the exercise, I’ll present an implementation of an
public class Index { private Map<String, Set<TermCounter>> index = new HashMap<String, Set<TermCounter>>(); public void add(String term, TermCounter tc) { Set<TermCounter> set = get(term); // if we're seeing a term for the first time, make a new Set if (set == null) { set = new HashSet<TermCounter>(); index.put(term, set); } // otherwise we can modify an existing Set set.add(tc); } public Set<TermCounter> get(String term) { return index.get(term); } The instance variable, The Finally, the This data structure is moderately complicated. To review, an
Figure 8.1 is an object diagram that shows these
objects. The Each The method public void printIndex() { // loop through the search terms for (String term: keySet()) { System.out.println(term); // for each term, print pages where it appears and frequencies Set<TermCounter> tcs = get(term); for (TermCounter tc: tcs) { Integer count = tc.get(term); System.out.println(" " + tc.getLabel() + " " + count); } } } The outer loop iterates the search terms. The inner loop iterates the
Run Your job is to fill in public void indexPage(String url, Elements paragraphs) { // make a TermCounter and count the terms in the paragraphs // for each term in the TermCounter, add the TermCounter to the index } When it’s working, run ... configurations http://en.wikipedia.org/wiki/Programming_language 1 http://en.wikipedia.org/wiki/Java_(programming_language) 1 claimed http://en.wikipedia.org/wiki/Java_(programming_language) 1 servletresponse http://en.wikipedia.org/wiki/Java_(programming_language) 2 occur http://en.wikipedia.org/wiki/Java_(programming_language) 2 The order of the search terms might be different when you run it. Also, run |
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.
|