Previous Up Next

This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly.

You might prefer to read the PDF version.

Or you can buy this book on Amazon.com.

Chapter 14  Persistence

In the next few exercises we will get back to building a web search engine. To review, the components of a search engine are:

  • Crawling: We’ll need a program that can download a web page, parse it, and extract the text and any links to other pages.
  • Indexing: We’ll need an index that makes it possible to look up a search term and find the pages that contain it.
  • Retrieval: And we’ll need a way to collect results from the index and identify pages that are most relevant to the search terms.

If you did Exercise 8.3, you implemented an index using Java maps. In this exercise, we’ll revisit the indexer and make a new version that stores the results in a database.

If you did Exercise 7.4, you built a crawler that follows the first link it finds. In the next exercise, we’ll make a more general version that stores every link it finds in a queue and explores them in order.

And then, finally, you will work on the retrieval problem.

In these exercises, I provide less starter code, and you will make more design decisions. These exercises are also more open-ended. I will suggest some minimal goals you should try to reach, but there are many ways you can go farther if you want to challenge yourself.

Now, let’s get started on a new version of the indexer.

14.1  Redis

The previous version of the indexer stores the index in two data structures: a TermCounter that maps from a search term to the number of times it appears on a web page, and an Index that maps from a search term to the set of pages where it appears.

These data structures are stored in the memory of a running Java program, which means that when the program stops running, the index is lost. Data stored only in the memory of a running program is called “volatile”, because it vaporizes when the program ends.

Data that persists after the program that created it ends is called “persistent”. In general, files stored in a file system are persistent, as well as data stored in databases.

A simple way to make data persistent is to store it in a file. Before the program ends, it could translate its data structures into a format like JSON (http://thinkdast.com/json) and then write them into a file. When it starts again, it could read the file and rebuild the data structures.

But there are several problems with this solution:

  1. Reading and writing large data structures (like a Web index) would be slow.
  2. The entire data structure might not fit into the memory of a single running program.
  3. If a program ends unexpectedly (for example, due to a power outage), any changes made since the program last started would be lost.

A better alternative is a database that provides persistent storage and the ability to read and write parts of the database without reading and writing the whole thing.

There are many kinds of database management systems (DBMS) that provide different capabilities. You can read an overview at http://thinkdast.com/database.

The database I recommend for this exercise is Redis, which provides persistent data structures that are similar to Java data structures. Specifically, it provides:

  • Lists of strings, similar to Java List.
  • Hashes, similar to Java Map.
  • Sets of strings, similar to Java Set.

Redis is a “key-value database”, which means that the data structures it contains (the values) are identified by unique strings (the keys). A key in Redis plays the same role as a reference in Java: it identifies an object. We’ll see some examples soon.

14.2  Redis clients and servers

Redis is usually run as a remote service; in fact, the name stands for “REmote DIctionary Server”. To use Redis, you have to run the Redis server somewhere and then connect to it using a Redis client. There are many ways to set up a server and many clients you could use. For this exercise, I recommend:

  1. Rather than install and run the server yourself, consider using a service like RedisToGo (http://thinkdast.com/redistogo), which runs Redis in the cloud. They offer a free plan with enough resources for the exercise.
  2. For the client I recommend Jedis, which is a Java library that provides classes and methods for working with Redis.

Here are more detailed instructions to help you get started:

  • Create an account on RedisToGo, at http://thinkdast.com/redissign, and select the plan you want (probably the free plan to get started).
  • Create an “instance”, which is a virtual machine running the Redis server. If you click on the “Instances” tab, you should see your new instance, identified by a host name and a port number. For example, I have an instance named “dory-10534”.
  • Click on the instance name to get the configuration page. Make a note of the URL near the top of the page, which looks like this:
      redis://redistogo:1234567feedfacebeefa1e1234567@dory.redistogo.com:10534
      

This URL contains the server’s host name, dory.redistogo.com, the port number, 10534, and the password you will need to connect to the server, which is the long string of letters and numbers in the middle. You will need this information for the next step.

14.3  Making a Redis-backed index

In the repository for this book, you’ll find the source files for this exercise:

  • JedisMaker.java contains example code for connecting to a Redis server and running a few Jedis methods.
  • JedisIndex.java contains starter code for this exercise.
  • JedisIndexTest.java contains test code for JedisIndex.
  • WikiFetcher.java contains the code we saw in previous exercises to read web pages and parse them using jsoup.

You will also need these files, which you worked on in previous exercises:

  • Index.java implements an index using Java data structures.
  • TermCounter.java represents a map from terms to their frequencies.
  • WikiNodeIterable.java iterates through the nodes in a DOM tree produced by jsoup.

If you have working versions of these files, you can use them for this exercise. If you didn’t do the previous exercises, or you are not confident in your solutions, you can copy my solutions from the solutions folder.

The first step is to use Jedis to connect to your Redis server. RedisMaker.java shows how to do this. It reads information about your Redis server from a file, connects to it and logs in using your password, then returns a Jedis object you can use to perform Redis operations.

If you open JedisMaker.java, you should see the JedisMaker class, which is a helper class that provides one static method, make, which creates a Jedis object. Once this object is authenticated, you can use it to communicate with your Redis database.

JedisMaker reads information about your Redis server from a file named redis_url.txt, which you should put in the directory src/resources:

  • Use a text editor to create end edit ThinkDataStructures/code/src/resources/redis_url.txt.
  • Paste in the URL of your server. If you are using RedisToGo, the URL will look like this:

    redis://redistogo:1234567feedfacebeefa1e1234567@dory.redistogo.com:10534

Because this file contains the password for your Redis server, you should not put this file in a public repository. To help you avoid doing that by accident, the repository contains a .gitignore file that makes it harder (but not impossible) to put this file in your repo.

Now run ant build to compile the source files and ant JedisMaker to run the example code in main:

    public static void main(String[] args) {

        Jedis jedis = make();
        
        // String
        jedis.set("mykey", "myvalue");
        String value = jedis.get("mykey");
        System.out.println("Got value: " + value);
        
        // Set
        jedis.sadd("myset", "element1", "element2", "element3");
        System.out.println("element2 is member: " + 
                           jedis.sismember("myset", "element2"));
        
        // List
        jedis.rpush("mylist", "element1", "element2", "element3");
        System.out.println("element at index 1: " + 
                           jedis.lindex("mylist", 1));
        
        // Hash
        jedis.hset("myhash", "word1", Integer.toString(2));
        jedis.hincrBy("myhash", "word2", 1);
        System.out.println("frequency of word1: " + 
                           jedis.hget("myhash", "word1"));
        System.out.println("frequency of word1: " + 
                            jedis.hget("myhash", "word2"));
        
        jedis.close();
    }

This example demonstrates the data types and methods you are most likely to use for this exercise. When you run it, the output should be:

Got value: myvalue
element2 is member: true
element at index 1: element2
frequency of word1: 2
frequency of word2: 1

In the next section, I’ll explain how the code works.

14.4  Redis data types

Redis is basically a map from keys, which are strings, to values, which can be one of several data types. The most basic Redis data type is a string. I will write Redis types in italics to distinguish them from Java types.

To add a string to the database, use jedis.set, which is similar to Map.put; the parameters are the new key and the corresponding value. To look up a key and get its value, use jedis.get:

        jedis.set("mykey", "myvalue");
        String value = jedis.get("mykey");

In this example, the key is "mykey" and the value is "myvalue".

Redis provides a set structure, which is similar to a Java Set<String>. To add elements to a Redis set, you choose a key to identify the set and then use jedis.sadd:

        jedis.sadd("myset", "element1", "element2", "element3");
        boolean flag = jedis.sismember("myset", "element2");

You don’t have to create the set as a separate step. If it doesn’t exist, Redis creates it. In this case, it creates a set named myset that contains three elements.

The method jedis.sismember checks whether an element is in a set. Adding elements and checking membership are constant time operations.

Redis also provides a list structure, which is similar to a Java List<String>. The method jedis.rpush adds elements to the end (right side) of a list:

        jedis.rpush("mylist", "element1", "element2", "element3");
        String element = jedis.lindex("mylist", 1);

Again, you don’t have to create the structure before you start adding elements. This example creates a list named “mylist” that contains three elements.

The method jedis.lindex takes an integer index and returns the indicated element of a list. Adding and accessing elements are constant time operations.

Finally, Redis provides a hash structure, which is similar to a Java Map<String, String>. The method jedis.hset adds a new entry to the hash:

        jedis.hset("myhash", "word1", Integer.toString(2));
        String value = jedis.hget("myhash", "word1");

This example creates a hash named myhash that contains one entry, which maps from the key word1 to the value "2".

The keys and values are strings, so if we want to store an Integer, we have to convert it to a String before we call hset. And when we look up the value using hget, the result is a String, so we might have to convert it back to Integer.

Working with Redis hashes can be confusing, because we use a key to identify which hash we want, and then another key to identify a value in the hash. In the context of Redis, the second key is called a “field”, which might help keep things straight. So a “key” like myhash identifies a particular hash, and then a “field” like word1 identifies a value in the hash.

For many applications, the values in a Redis hash are integers, so Redis provides a few special methods, like hincrby, that treat the values as numbers:

        jedis.hincrBy("myhash", "word2", 1);

This method accesses myhash, gets the current value associated with word2 (or 0 if it doesn’t already exist), increments it by 1, and writes the result back to the hash.

Setting, getting, and incrementing entries in a hash are constant time operations.

You can read more about Redis data types at http://thinkdast.com/redistypes.

14.5  Exercise 11

At this point you have the information you need to make a web search index that stores results in a Redis database.

Now run ant JedisIndexTest. It should fail, because you have some work to do!

JedisIndexTest tests these methods:

  • JedisIndex, which is the constructor that takes a Jedis object as a parameter.
  • indexPage, which adds a Web page to the index; it takes a String URL and a jsoup Elements object that contains the elements of the page that should be indexed.
  • getCounts, which takes a search term and returns a Map<String, Integer> that maps from each URL that contains the search term to the number of times it appears on that page.

Here’s an example of how these methods are used:

        WikiFetcher wf = new WikiFetcher();
        String url1 = 
            "http://en.wikipedia.org/wiki/Java_(programming_language)";
        Elements paragraphs = wf.readWikipedia(url1);

        Jedis jedis = JedisMaker.make();
        JedisIndex index = new JedisIndex(jedis);
        index.indexPage(url1, paragraphs);
        Map<String, Integer> map = index.getCounts("the");

If we look up url1 in the result, map, we should get 339, which is the number of times the word “the” appears on the Java Wikipedia page (that is, the version we saved).

If we index the same page again, the new results should replace the old ones.

One suggestion for translating data structures from Java to Redis: remember that each object in a Redis database is identified by a unique key, which is a string. If you have two kinds of objects in the same database, you might want to add a prefix to the keys to distinguish between them. For example, in our solution, we have two kinds of objects:

  • We define a URLSet to be a Redis set that contains the URLs that contain a given search term. The key for each URLSet starts with "URLSet:", so to get the URLs that contain the word “the”, we access the set with the key "URLSet:the".
  • We define a TermCounter to be a Redis hash that maps from each term that appears on a page to the number of times it appears. The key for each TermCounter starts with "TermCounter:" and ends with the URL of the page we’re looking up.

In my implementation, there is one URLSet for each term and one TermCounter for each indexed page. I provide two helper methods, urlSetKey and termCounterKey, to assemble these keys.

14.6  More suggestions if you want them

At this point you have all the information you need to do the exercise, so you can get started if you are ready. But I have a few suggestions you might want to read first:

  • For this exercise I provide less guidance than in previous exercises. You will have to make some design decisions; in particular, you will have to figure out how to divide the problem into pieces that you can test one at a time, and then assemble the pieces into a complete solution. If you try to write the whole thing at once, without testing smaller pieces, it might take a very long time to debug.
  • One of the challenges of working with persistent data is that it is persistent. The structures stored in the database might change every time you run the program. If you mess something up in the database, you will have to fix it or start over before you can proceed. To help you keep things under control, I’ve provided methods called deleteURLSets, deleteTermCounters, and deleteAllKeys, which you can use to clean out the database and start fresh. You can also use printIndex to print the contents of the index.
  • Each time you invoke a Jedis method, your client sends a message to the server, then the server performs the action you requested and sends back a message. If you perform many small operations, it will probably take a long time. You can improve performance by grouping a series of operations into a Transaction.

For example, here’s a simple version of deleteAllKeys:

    public void deleteAllKeys() {
        Set<String> keys = jedis.keys("*");
        for (String key: keys) {
            jedis.del(key);
        }
    }

Each time you invoke del requires a round-trip from the client to the server and back. If the index contains more than a few pages, this method would take a long time to run. We can speed it up with a Transaction object:

    public void deleteAllKeys() {
        Set<String> keys = jedis.keys("*");
        Transaction t = jedis.multi();
        for (String key: keys) {
            t.del(key);
        }
        t.exec();
    }

jedis.multi returns a Transaction object, which provides all the methods of a Jedis object. But when you invoke a method on a Transaction, it doesn’t run the operation immediately, and it doesn’t communicate with the server. It saves up a batch of operations until you invoke exec. Then it sends all of the saved operations to the server at the same time, which is usually much faster.

14.7  A few design hints

Now you really have all the information you need; you should start working on the exercise. But if you get stuck, or if you really don’t know how to get started, you can come back for a few more hints.

Don’t read the following until you have run the test code, tried out some basic Redis commands, and written a few methods in JedisIndex.java.

OK, if you are really stuck, here are some methods you might want to work on:

    /**
     * Adds a URL to the set associated with term.
     */
    public void add(String term, TermCounter tc) {}

    /**
     * Looks up a search term and returns a set of URLs.
     */
    public Set<String> getURLs(String term) {}

    /**
     * Returns the number of times the given term appears at the given URL.
     */
    public Integer getCount(String url, String term) {}

    /**
     * Pushes the contents of the TermCounter to Redis.
     */
    public List<Object> pushTermCounterToRedis(TermCounter tc) {}

These are the methods I used in my solution, but they are certainly not the only way to divide things up. So please take these suggestions if they help, but ignore them if they don’t.

For each method, consider writing the tests first. When you figure out how to test a method, you often get ideas about how to write it.

Good luck!

Are you using one of our books in a class?

We'd like to know about it. Please consider filling out this short survey.


Think Data Structures

Think DSP

Think Java

Think Bayes

Think Python 2e

Think Stats 2e

Think Complexity


Previous Up Next