Previous Up Next

This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly.

You might prefer to read the PDF version.

Or you can buy this book on Amazon.com.

Chapter 16  Boolean search

In this chapter I present a solution to the previous exercise. Then you will write code to combine multiple search results and sort them by their relevance to the search terms.

16.1  Crawler solution

First, let’s go over our solution to the previous exercise. I provided an outline of WikiCrawler; your job was to fill in crawl. As a reminder, here are the fields in the WikiCrawler class:

public class WikiCrawler {
    // keeps track of where we started
    private final String source;

    // the index where the results go
    private JedisIndex index;

    // queue of URLs to be indexed
    private Queue<String> queue = new LinkedList<String>();

    // fetcher used to get pages from Wikipedia
    final static WikiFetcher wf = new WikiFetcher();
}

When we create a WikiCrawler, we provide source and index. Initially, queue contains only one element, source.

Notice that the implementation of queue is a LinkedList, so we can add elements at the end — and remove them from the beginning — in constant time. By assigning a LinkedList object to a Queue variable, we limit ourselves to using methods in the Queue interface; specifically, we’ll use offer to add elements and poll to remove them.

Here’s my implementation of WikiCrawler.crawl:

    public String crawl(boolean testing) throws IOException {
        if (queue.isEmpty()) {
            return null;
        }
        String url = queue.poll();
        System.out.println("Crawling " + url);

        if (testing==false && index.isIndexed(url)) {
            System.out.println("Already indexed.");
            return null;
        }

        Elements paragraphs;
        if (testing) {
            paragraphs = wf.readWikipedia(url);
        } else {
            paragraphs = wf.fetchWikipedia(url);
        }
        index.indexPage(url, paragraphs);
        queueInternalLinks(paragraphs);
        return url;
    }

Most of the complexity in this method is there to make it easier to test. Here’s the logic:

  • If the queue is empty, it returns null to indicate that it did not index a page.
  • Otherwise it removes and stores the next URL from the queue.
  • If the URL has already been indexed, crawl doesn’t index it again, unless it’s in testing mode.
  • Next it reads the contents of the page: if it’s in testing mode, it reads from a file; otherwise it reads from the Web.
  • It indexes the page.
  • It parses the page and adds internal links to the queue.
  • Finally, it returns the URL of the page it indexed.

I presented an implementation of Index.indexPage in Section 15.1. So the only new method is WikiCrawler.queueInternalLinks.

I wrote two versions of this method with different parameters: one takes an Elements object containing one DOM tree per paragraph; the other takes an Element object that contains a single paragraph.

The first version just loops through the paragraphs. The second version does the real work.

    void queueInternalLinks(Elements paragraphs) {
        for (Element paragraph: paragraphs) {
            queueInternalLinks(paragraph);
        }
    }

    private void queueInternalLinks(Element paragraph) {
        Elements elts = paragraph.select("a[href]");
        for (Element elt: elts) {
            String relURL = elt.attr("href");

            if (relURL.startsWith("/wiki/")) {
                String absURL = elt.attr("abs:href");
                queue.offer(absURL);
            }
        }
    }

To determine whether a link is “internal,” we check whether the URL starts with “/wiki/”. This might include some pages we don’t want to index, like meta-pages about Wikipedia. And it might exclude some pages we want, like links to pages in non-English languages. But this simple test is good enough to get started.

That’s all there is to it. This exercise doesn’t have a lot of new material; it is mostly a chance to bring the pieces together.

16.2  Information retrieval

The next phase of this project is to implement a search tool. The pieces we’ll need include:

  1. An interface where users can provide search terms and view results.
  2. A lookup mechanism that takes each search term and returns the pages that contain it.
  3. Mechanisms for combining search results from multiple search terms.
  4. Algorithms for ranking and sorting search results.

The general term for processes like this is “information retrieval”, which you can read more about at http://thinkdast.com/infret.

In this exercise, we’ll focus on steps 3 and 4. We’ve already built a simple version of 2. If you are interested in building Web applications, you might consider working on step 1.

16.3  Boolean search

Most search engines can perform “boolean searches”, which means you can combine the results from multiple search terms using boolean logic. For example:

  • The search “java AND programming” might return only pages that contain both search terms: “java” and “programming”.
  • “java OR programming” might return pages that contain either term but not necessarily both.
  • “java -indonesia” might return pages that contain “java” and do not contain “indonesia”.

Expressions like these that contain search terms and operators are called “queries”.

When applied to search results, the boolean operators AND, OR, and - correspond to the set operations intersection, union, and difference. For example, suppose

  • s1 is the set of pages containing “java”,
  • s2 is the set of pages containing “programming”, and
  • s3 is the set of pages containing “indonesia”.

In that case:

  • The intersection of s1 and s2 is the set of pages containing “java” AND “programming”.
  • The union of s1 and s2 is the set of pages containing “java” OR “programming”.
  • The difference of s1 and s2 is the set of pages containing “java” and not “indonesia”.

In the next section you will write a method to implement these operations.

16.4  Exercise 13

In the repository for this book you’ll find the source files for this exercise:

  • WikiSearch.java, which defines an object that contains search results and performs operations on them.
  • WikiSearchTest.java, which contains test code for WikiSearch.
  • Card.java, which demonstrates how to use the sort method in java.util.Collections.

You will also find some of the helper classes we’ve used in previous exercises.

Here’s the beginning of the WikiSearch class definition:

public class WikiSearch {

    // map from URLs that contain the term(s) to relevance score
    private Map<String, Integer> map;

    public WikiSearch(Map<String, Integer> map) {
        this.map = map;
    }

    public Integer getRelevance(String url) {
        Integer relevance = map.get(url);
        return relevance==null ? 0: relevance;
    }
}

A WikiSearch object contains a map from URLs to their relevance score. In the context of information retrieval, a “relevance score” is a number intended to indicate how well a page meets the needs of the user as inferred from the query. There are many ways to construct a relevance score, but most of them are based on “term frequency”, which is the number of times the search terms appear on the page. A common relevance score is called TF-IDF, which stands for “term frequency – inverse document frequency”. You can read more about it at http://thinkdast.com/tfidf.

You’ll have the option to implement TF-IDF later, but we’ll start with something even simpler, TF:

  • If a query contains a single search term, the relevance of a page is its term frequency; that is, the number of time the term appears on the page.
  • For queries with multiple terms, the relevance of a page is the sum of the term frequencies; that is, the total number of times any of the search terms appear.

Now you’re ready to start the exercise. Run ant build to compile the source files, then run ant WikiSearchTest. As usual, it should fail, because you have work to do.

In WikiSearch.java, fill in the bodies of and, or, and minus so that the relevant tests pass. You don’t have to worry about testSort yet.

You can run WikiSearchTest without using Jedis because it doesn’t depend on the index in your Redis database. But if you want to run a query against your index, you have to provide a file with information about your Redis server. See Section 14.3 for details.

Run ant JedisMaker to make sure it is configured to connect to your Redis server. Then run WikiSearch, which prints results from three queries:

  • “java”
  • “programming”
  • “java AND programming”

Initially the results will be in no particular order, because WikiSearch.sort is incomplete.

Fill in the body of sort so the results are returned in increasing order of relevance. I suggest you use the sort method provided by java.util.Collections, which sorts any kind of List. You can read the documentation at http://thinkdast.com/collections.

There are two versions of sort:

  • The one-parameter version takes a list and sorts the elements using the compareTo method, so the elements have to be Comparable.
  • The two-parameter version takes a list of any object type and a Comparator, which is an object that provides a compare method that compares elements.

If you are not familiar with the Comparable and Comparator interfaces, I explain them in the next section.

16.5  Comparable and Comparator

The repository for this book includes Card.java, which demonstrates two ways to sort a list of Card objects. Here’s the beginning of the class definition:

public class Card implements Comparable<Card> {

    private final int rank;
    private final int suit;

    public Card(int rank, int suit) {
        this.rank = rank;
        this.suit = suit;
    }

A Card object has two integer fields, rank and suit. Card implements Comparable<Card>, which means that it provides compareTo:

    public int compareTo(Card that) {
        if (this.suit < that.suit) {
            return -1;
        }
        if (this.suit > that.suit) {
            return 1;
        }
        if (this.rank < that.rank) {
            return -1;
        }
        if (this.rank > that.rank) {
            return 1;
        }
        return 0;
    }

The specification of compareTo indicates that it should return a negative number if this is considered less than that, a positive number if it is considered greater, and 0 if they are considered equal.

If you use the one-parameter version of Collections.sort, it uses the compareTo method provided by the elements to sort them. To demonstrate, we can make a list of 52 cards like this:

    public static List<Card> makeDeck() {
        List<Card> cards = new ArrayList<Card>();
        for (int suit = 0; suit <= 3; suit++) {
            for (int rank = 1; rank <= 13; rank++) {
                Card card = new Card(rank, suit);
                cards.add(card);
            }
        }
        return cards;
    }

And sort them like this:

        Collections.sort(cards);

This version of sort puts the elements in what’s called their “natural order” because it’s determined by the objects themselves.

But it is possible to impose a different ordering by providing a Comparator object. For example, the natural order of Card objects treats Aces as the lowest rank, but in some card games they have the highest rank. We can define a Comparator that considers “Aces high”, like this:

        Comparator<Card> comparator = new Comparator<Card>() {
            @Override
            public int compare(Card card1, Card card2) {
                if (card1.getSuit() < card2.getSuit()) {
                    return -1;
                }
                if (card1.getSuit() > card2.getSuit()) {
                    return 1;
                }
                int rank1 = getRankAceHigh(card1);
                int rank2 = getRankAceHigh(card2);

                if (rank1 < rank2) {
                    return -1;
                }
                if (rank1 > rank2) {
                    return 1;
                }
                return 0;
            }

            private int getRankAceHigh(Card card) {
                int rank = card.getRank();
                if (rank == 1) {
                    return 14;
                } else {
                    return rank;
                }
            }
        };

This code defines an anonymous class that implements compare, as required. Then it creates an instance of the newly-defined, unnamed class. If you are not familiar with anonymous classes in Java, you can read about them at http://thinkdast.com/anonclass.

Using this Comparator, we can invoke sort like this:

        Collections.sort(cards, comparator);

In this ordering, the Ace of Spades is considered the highest class in the deck; the two of Clubs is the lowest.

The code in this section is in Card.java if you want to experiment with it. As an exercise, you might want to write a comparator that sorts by rank first and then by suit, so all the Aces should be together, and all the twos, etc.

16.6  Extensions

If you get a basic version of this exercise working, you might want to work on these optional exercises:

  • Read about TF-IDF at http://thinkdast.com/tfidf and implement it. You might have to modify JavaIndex to compute document frequencies; that is, the total number of times each term appears on all pages in the index.
  • For queries with more than one search term, the total relevance for each page is currently the sum of the relevance for each term. Think about when this simple version might not work well, and try out some alternatives.
  • Build a user interface that allows users to enter queries with boolean operators. Parse the queries, generate the results, then sort them by relevance and display the highest-scoring URLs. Consider generating “snippets” that show where the search terms appeared on the page. If you want to make a Web application for your user interface, consider using Heroku as a simple option for developing and deploying Web applications using Java. See http://thinkdast.com/heroku.

Are you using one of our books in a class?

We'd like to know about it. Please consider filling out this short survey.


Think Data Structures

Think DSP

Think Java

Think Bayes

Think Python 2e

Think Stats 2e

Think Complexity


Previous Up Next