|
This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly. You might prefer to read the PDF version. Or you can buy this book on Amazon.com. Chapter 16 Boolean searchIn this chapter I present a solution to the previous exercise. Then you will write code to combine multiple search results and sort them by their relevance to the search terms. 16.1 Crawler solutionFirst, let’s go over our solution to the previous exercise. I provided an
outline of public class WikiCrawler {
// keeps track of where we started
private final String source;
// the index where the results go
private JedisIndex index;
// queue of URLs to be indexed
private Queue<String> queue = new LinkedList<String>();
// fetcher used to get pages from Wikipedia
final static WikiFetcher wf = new WikiFetcher();
}
When we create a Notice that the implementation of Here’s my implementation of public String crawl(boolean testing) throws IOException {
if (queue.isEmpty()) {
return null;
}
String url = queue.poll();
System.out.println("Crawling " + url);
if (testing==false && index.isIndexed(url)) {
System.out.println("Already indexed.");
return null;
}
Elements paragraphs;
if (testing) {
paragraphs = wf.readWikipedia(url);
} else {
paragraphs = wf.fetchWikipedia(url);
}
index.indexPage(url, paragraphs);
queueInternalLinks(paragraphs);
return url;
}
Most of the complexity in this method is there to make it easier to test. Here’s the logic:
I presented an implementation of I wrote two versions of this method with different parameters: one
takes an The first version just loops through the paragraphs. The second version does the real work. void queueInternalLinks(Elements paragraphs) {
for (Element paragraph: paragraphs) {
queueInternalLinks(paragraph);
}
}
private void queueInternalLinks(Element paragraph) {
Elements elts = paragraph.select("a[href]");
for (Element elt: elts) {
String relURL = elt.attr("href");
if (relURL.startsWith("/wiki/")) {
String absURL = elt.attr("abs:href");
queue.offer(absURL);
}
}
}
To determine whether a link is “internal,” we check whether the URL starts with “/wiki/”. This might include some pages we don’t want to index, like meta-pages about Wikipedia. And it might exclude some pages we want, like links to pages in non-English languages. But this simple test is good enough to get started. That’s all there is to it. This exercise doesn’t have a lot of new material; it is mostly a chance to bring the pieces together. 16.2 Information retrievalThe next phase of this project is to implement a search tool. The pieces we’ll need include:
The general term for processes like this is “information retrieval”, which you can read more about at http://thinkdast.com/infret. In this exercise, we’ll focus on steps 3 and 4. We’ve already built a simple version of 2. If you are interested in building Web applications, you might consider working on step 1. 16.3 Boolean searchMost search engines can perform “boolean searches”, which means you can combine the results from multiple search terms using boolean logic. For example:
Expressions like these that contain search terms and operators are called “queries”. When applied to search results, the boolean operators
In that case:
In the next section you will write a method to implement these operations. 16.4 Exercise 13In the repository for this book you’ll find the source files for this exercise:
You will also find some of the helper classes we’ve used in previous exercises. Here’s the beginning of the public class WikiSearch {
// map from URLs that contain the term(s) to relevance score
private Map<String, Integer> map;
public WikiSearch(Map<String, Integer> map) {
this.map = map;
}
public Integer getRelevance(String url) {
Integer relevance = map.get(url);
return relevance==null ? 0: relevance;
}
}
A You’ll have the option to implement TF-IDF later, but we’ll start with something even simpler, TF:
Now you’re ready to start the exercise.
Run In You can run Run
Initially the results will be in no particular order, because
Fill in the body of There are two versions of
If you are not familiar with the 16.5 Comparable and ComparatorThe repository for this book includes public class Card implements Comparable<Card> {
private final int rank;
private final int suit;
public Card(int rank, int suit) {
this.rank = rank;
this.suit = suit;
}
A public int compareTo(Card that) {
if (this.suit < that.suit) {
return -1;
}
if (this.suit > that.suit) {
return 1;
}
if (this.rank < that.rank) {
return -1;
}
if (this.rank > that.rank) {
return 1;
}
return 0;
}
The specification of If you use the one-parameter version of public static List<Card> makeDeck() {
List<Card> cards = new ArrayList<Card>();
for (int suit = 0; suit <= 3; suit++) {
for (int rank = 1; rank <= 13; rank++) {
Card card = new Card(rank, suit);
cards.add(card);
}
}
return cards;
}
And sort them like this: Collections.sort(cards); This version of But it is possible to impose a different ordering by providing a
Comparator<Card> comparator = new Comparator<Card>() {
@Override
public int compare(Card card1, Card card2) {
if (card1.getSuit() < card2.getSuit()) {
return -1;
}
if (card1.getSuit() > card2.getSuit()) {
return 1;
}
int rank1 = getRankAceHigh(card1);
int rank2 = getRankAceHigh(card2);
if (rank1 < rank2) {
return -1;
}
if (rank1 > rank2) {
return 1;
}
return 0;
}
private int getRankAceHigh(Card card) {
int rank = card.getRank();
if (rank == 1) {
return 14;
} else {
return rank;
}
}
};
This code defines an anonymous class that implements Using this Collections.sort(cards, comparator); In this ordering, the Ace of Spades is considered the highest class in the deck; the two of Clubs is the lowest. The code in this section is in 16.6 ExtensionsIf you get a basic version of this exercise working, you might want to work on these optional exercises:
|
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.
|