Chapter 8  Indexer

At this point we have built a basic Web crawler; the next piece we will work on is the index. In the context of web search, an index is a data structure that makes it possible to look up a search term and find the pages where that term appears. In addition, we would like to know how many times the search term appears on each page, which will help identify the pages most relevant to the term.

For example, if a user submits the search terms “Java” and “programming”, we would look up both search terms and get two sets of pages. Pages with the word “Java” would include pages about the island of Java, the nickname for coffee, and the programming language. Pages with the word “programming” would include pages about different programming languages, as well as other uses of the word. By selecting pages with both terms, we hope to eliminate irrelevant pages and find the ones about Java programming.

Now that we understand what the index is and what operations it performs, we can design a data structure to represent it.

8.1  Data structure selection

The fundamental operation of the index is a lookup; specifically, we need the ability to look up a term and find all pages that contain it. The simplest implementation would be a collection of pages. Given a search term, we could iterate through the contents of the pages and select the ones that contain the search term. But the run time would be proportional to the total number of words on all the pages, which would be way too slow.

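To make that cost concrete, here is a minimal sketch of the brute-force approach, assuming (hypothetically) that we keep each page's text in a Map from URL to content; these names do not appear in the book's code, and matching a substring is a simplification. Every query has to scan the full text of every page.

    // Hypothetical brute-force lookup: pages maps each URL to the page's full text.
    // The run time is proportional to the total size of all pages.
    public static List<String> linearSearch(Map<String, String> pages, String term) {
        List<String> results = new ArrayList<String>();
        for (Map.Entry<String, String> entry: pages.entrySet()) {
            // substring match, a simplification of real word matching
            if (entry.getValue().toLowerCase().contains(term.toLowerCase())) {
                results.add(entry.getKey());
            }
        }
        return results;
    }
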
A better alternative is a map, which is a data structure that represents a collection of key-value pairs and provides a fast way to look up a key and find the corresponding value. For example, the first map we’ll construct is a TermCounter, which maps from each search term to the number of times it appears in a page. The keys are the search terms and the values are the counts (also called “frequencies”).

Java provides an interface called Map that specifies the methods a map should provide; the most important are:

  • get(key): This method looks up a key and returns the corresponding value.
  • put(key, value): This method adds a new key-value pair to the Map, or if the key is already in the map, it replaces the value associated with key.

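To make these two methods concrete, here is a small sketch that builds the kind of term-to-count map described above; the terms and counts are made up for illustration.

    Map<String, Integer> counts = new HashMap<String, Integer>();
    counts.put("java", 1);                   // adds a new key-value pair
    counts.put("java", 2);                   // replaces the value associated with "java"
    Integer n = counts.get("java");          // returns 2
    Integer missing = counts.get("python");  // returns null, because the key is absent
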
Java provides several implementations of Map, including the two we will focus on, HashMap and TreeMap. In upcoming chapters, we’ll look at these implementations and analyze their performance.

In addition to the TermCounter, which maps from search terms to counts, we will define a class called Index, which maps from a search term to a collection of pages where it appears. And that raises the next question, which is how to represent a collection of pages. Again, if we think about the operations we want to perform, that guides our decision.

In this case, we’ll need to combine two or more collections and find the pages that appear in all of them. You might recognize this operation as set intersection: the intersection of two sets is the set of elements that appear in both.

As you might expect by now, Java provides a Set interface that defines the operations a set should perform. It doesn’t actually provide set intersection, but it provides methods that make it possible to implement intersection and other set operations efficiently. The core Set methods are:

  • add(element): This method adds an element to a set; if the element is already in the set, it has no effect.
  • contains(element): This method checks whether the given element is in the set.

Java provides several implementations of Set, including HashSet and TreeSet.

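For example, here is one way to compute an intersection with the methods Java does provide; retainAll, which every Set inherits from Collection, removes the elements that do not appear in its argument. The sets in this sketch are made up for illustration, and we copy one set first because retainAll modifies the set in place.

    Set<String> pagesWithJava = new HashSet<String>(
            Arrays.asList("page1", "page2", "page3"));
    Set<String> pagesWithProgramming = new HashSet<String>(
            Arrays.asList("page2", "page3", "page4"));

    // copy one set, then keep only the elements that also appear in the other
    Set<String> both = new HashSet<String>(pagesWithJava);
    both.retainAll(pagesWithProgramming);
    // both now contains "page2" and "page3"
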
Now that we’ve designed our data structures from the top down, we’ll implement them from the inside out, starting with TermCounter.

8.2  TermCounter

TermCounter is a class that represents a mapping from search terms to the number of times they appear in a page. Here is the first part of the class definition:

public class TermCounter {

    private Map<String, Integer> map;
    private String label;

    public TermCounter(String label) {
        this.label = label;
        this.map = new HashMap<String, Integer>();
    }
}

The instance variables are map, which contains the mapping from terms to counts, and label, which identifies the document the terms came from; we’ll use it to store URLs.

To implement the mapping, I chose HashMap, which is the most commonly used Map implementation. In a few chapters, you will see how it works and why it is a common choice.

TermCounter provides put and get, which are defined like this:

    public void put(String term, int count) {
        map.put(term, count);
    }

    public Integer get(String term) {
        Integer count = map.get(term);
        return count == null ? 0 : count;
    }

put is just a wrapper method; when you call put on a TermCounter, it calls put on the embedded map.

On the other hand, get actually does some work. When you call get on a TermCounter, it calls get on the map and then checks the result. If the term does not appear in the map, TermCounter.get returns 0. Defining get this way makes it easier to write incrementTermCount, which takes a term and increments the counter associated with it.

    public void incrementTermCount(String term) {
        put(term, get(term) + 1);
    }

If the term has not been seen before, get returns 0; we add 1, then use put to add a new key-value pair to the map. If the term is already in the map, we get the old count, add 1, and then store the new count, which replaces the old value.

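Here is a short sketch of how these three methods behave together; the label and terms are made up for illustration.

    TermCounter counter = new TermCounter("http://example.com/page");
    counter.incrementTermCount("java");          // get returns 0, so the count becomes 1
    counter.incrementTermCount("java");          // now 2
    System.out.println(counter.get("java"));     // prints 2
    System.out.println(counter.get("python"));   // prints 0, not null
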
In addition, TermCounter provides these other methods to help with indexing Web pages:

    public void processElements(Elements paragraphs) {
        for (Node node: paragraphs) {
            processTree(node);
        }
    }

    public void processTree(Node root) {
        for (Node node: new WikiNodeIterable(root)) {
            if (node instanceof TextNode) {
                processText(((TextNode) node).text());
            }
        }
    }

    public void processText(String text) {
        String[] array = text.replaceAll("\\pP", " ").
                              toLowerCase().
                              split("\\s+");

        for (int i=0; i<array.length; i++) {
            String term = array[i];
            incrementTermCount(term);
        }
    }

  • processElements takes an Elements object, which is a collection of jsoup Element objects. It iterates through the collection and calls processTree on each.
  • processTree takes a jsoup Node that represents the root of a DOM tree. It iterates through the tree to find the nodes that contain text; then it extracts the text and passes it to processText.
  • processText takes a String that contains words, spaces, punctuation, etc. It removes punctuation characters by replacing them with spaces, converts the remaining letters to lowercase, then splits the text into words. Then it loops through the words it found and calls incrementTermCount on each. The replaceAll and split methods take regular expressions as parameters; you can read more about them at http://thinkdast.com/regex.

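Before moving on, here is a small sketch of what the chain of calls in processText does to a made-up sentence:

    String text = "Java is a programming language. Java runs on the JVM!";
    String[] words = text.replaceAll("\\pP", " ")  // punctuation becomes spaces
                         .toLowerCase()            // "Java" and "java" become the same term
                         .split("\\s+");           // split on runs of whitespace
    // words contains "java", "is", "a", "programming", "language",
    // "java", "runs", "on", "the", "jvm"
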
Finally, here’s an example that demonstrates how TermCounter is used:

    String url = "http://en.wikipedia.org/wiki/Java_(programming_language)";
    WikiFetcher wf = new WikiFetcher();
    Elements paragraphs = wf.fetchWikipedia(url);

    TermCounter counter = new TermCounter(url);
    counter.processElements(paragraphs);
    counter.printCounts();

This example uses a WikiFetcher to download a page from Wikipedia and parse the main text. Then it creates a TermCounter and uses it to count the words in the page.

In the next section, you’ll have a chance to run this code and test your understanding by filling in a missing method.

8.3  Exercise 6

In the repository for this book, you’ll find the source files for this exercise:

  • TermCounter.java contains the code from the previous section.
  • TermCounterTest.java contains test code for TermCounter.java.
  • Index.java contains the class definition for the next part of this exercise.
  • WikiFetcher.java contains the class we used in the previous exercise to download and parse Web pages.
  • WikiNodeIterable.java contains the class we used to traverse the nodes in a DOM tree.

You’ll also find the Ant build file build.xml.

Run ant build to compile the source files. Then run ant TermCounter; it should run the code from the previous section and print a list of terms and their counts. The output should look something like this:

genericservlet, 2
configurations, 1
claimed, 1
servletresponse, 2
occur, 2
Total of all counts = -1

When you run it, the order of the terms might be different.

The last line is supposed to print the total of the term counts, but it returns -1 because the method size is incomplete. Fill in this method and run ant TermCounter again. The result should be 4798.

Run ant TermCounterTest to confirm that this part of the exercise is complete and correct.

For the second part of the exercise, I’ll present an implementation of an Index object and you will fill in a missing method. Here’s the beginning of the class definition:

public class Index {

    private Map<String, Set<TermCounter>> index = 
        new HashMap<String, Set<TermCounter>>();

    public void add(String term, TermCounter tc) {
        Set<TermCounter> set = get(term);

        // if we're seeing a term for the first time, make a new Set
        if (set == null) {
            set = new HashSet<TermCounter>();
            index.put(term, set);
        }
        // otherwise we can modify an existing Set
        set.add(tc);
    }

    public Set<TermCounter> get(String term) {
        return index.get(term);
    }

The instance variable, index, is a map from each search term to a set of TermCounter objects. Each TermCounter represents a page where the search term appears.

The add method adds a new TermCounter to the set associated with a term. When we index a term that has not appeared before, we have to create a new set. Otherwise we can just add a new element to an existing set. In that case, set.add modifies a set that lives inside index, but doesn’t modify index itself. The only time we modify index is when we add a new term.

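As an aside, if you are compiling with Java 8 or later, the check-then-create pattern in add can be written more compactly with computeIfAbsent. This is just an equivalent sketch, not the version used in the book's code.

    public void add(String term, TermCounter tc) {
        // creates the set the first time we see a term, then adds the TermCounter
        index.computeIfAbsent(term, k -> new HashSet<TermCounter>()).add(tc);
    }
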
Finally, the get method takes a search term and returns the corresponding set of TermCounter objects.

This data structure is moderately complicated. To review, an Index contains a Map from each search term to a Set of TermCounter objects, and each TermCounter is a map from search terms to counts.


Figure 8.1: Object diagram of an Index.

Figure 8.1 is an object diagram that shows these objects. The Index object has an instance variable named index that refers to a Map. In this example the Map contains only one string, "Java", which maps to a Set that contains two TermCounter objects, one for each page where the word “Java” appears.

Each TermCounter contains label, which is the URL of the page, and map, which is a Map that contains the words on the page and the number of times each word appears.

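Here is a small sketch that builds the structure in Figure 8.1 by hand, using only the methods shown so far; the URLs match the figure, but the counts are placeholders rather than the real ones.

    TermCounter tc1 = new TermCounter(
            "http://en.wikipedia.org/wiki/Java_(programming_language)");
    TermCounter tc2 = new TermCounter(
            "http://en.wikipedia.org/wiki/Programming_language");
    tc1.put("java", 2);   // placeholder counts
    tc2.put("java", 1);

    Index index = new Index();
    index.add("java", tc1);
    index.add("java", tc2);
    // index now maps "java" to a Set containing tc1 and tc2
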
The method printIndex shows how to unpack this data structure:

    public void printIndex() {
        // loop through the search terms
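        // (keySet is an Index method, not shown in this excerpt,
        // that returns the set of indexed search terms)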
        for (String term: keySet()) {
            System.out.println(term);

            // for each term, print pages where it appears and frequencies
            Set<TermCounter> tcs = get(term);
            for (TermCounter tc: tcs) {
                Integer count = tc.get(term);
                System.out.println("    " + tc.getLabel() + " " + count);
            }
        }
    }

The outer loop iterates through the search terms; the inner loop iterates through the TermCounter objects associated with each term.

Run ant build to make sure your source code is compiled, and then run ant Index. It downloads two Wikipedia pages, indexes them, and prints the results; but when you run it you won’t see any output because we’ve left one of the methods empty.

Your job is to fill in indexPage, which takes a URL (as a String) and an Elements object, and updates the index. The comments below sketch what it should do:

public void indexPage(String url, Elements paragraphs) {
    // make a TermCounter and count the terms in the paragraphs

    // for each term in the TermCounter, add the TermCounter to the index
}

When it’s working, run ant Index again, and you should see output like this:

...
configurations
    http://en.wikipedia.org/wiki/Programming_language 1
    http://en.wikipedia.org/wiki/Java_(programming_language) 1
claimed
    http://en.wikipedia.org/wiki/Java_(programming_language) 1
servletresponse
    http://en.wikipedia.org/wiki/Java_(programming_language) 2
occur
    http://en.wikipedia.org/wiki/Java_(programming_language) 2

The order of the search terms might be different when you run it.

Also, run ant TestIndex to confirm that this part of the exercise is complete.
