This HTML version of Think Data Structures is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly. You might prefer to read the PDF version. Or you can buy this book on Amazon.com. Chapter 14 PersistenceIn the next few exercises we will get back to building a web search engine. To review, the components of a search engine are:
If you did Exercise 8.3, you implemented an index using Java maps. In this exercise, we’ll revisit the indexer and make a new version that stores the results in a database. If you did Exercise 7.4, you built a crawler that follows the first link it finds. In the next exercise, we’ll make a more general version that stores every link it finds in a queue and explores them in order. And then, finally, you will work on the retrieval problem. In these exercises, I provide less starter code, and you will make more design decisions. These exercises are also more open-ended. I will suggest some minimal goals you should try to reach, but there are many ways you can go farther if you want to challenge yourself. Now, let’s get started on a new version of the indexer. 14.1 RedisThe previous version of the indexer stores the index in two data
structures: a These data structures are stored in the memory of a running Java program, which means that when the program stops running, the index is lost. Data stored only in the memory of a running program is called “volatile”, because it vaporizes when the program ends. Data that persists after the program that created it ends is called “persistent”. In general, files stored in a file system are persistent, as well as data stored in databases. A simple way to make data persistent is to store it in a file. Before the program ends, it could translate its data structures into a format like JSON (http://thinkdast.com/json) and then write them into a file. When it starts again, it could read the file and rebuild the data structures. But there are several problems with this solution:
A better alternative is a database that provides persistent storage and the ability to read and write parts of the database without reading and writing the whole thing. There are many kinds of database management systems (DBMS) that provide different capabilities. You can read an overview at http://thinkdast.com/database. The database I recommend for this exercise is Redis, which provides persistent data structures that are similar to Java data structures. Specifically, it provides:
Redis is a “key-value database”, which means that the data structures it contains (the values) are identified by unique strings (the keys). A key in Redis plays the same role as a reference in Java: it identifies an object. We’ll see some examples soon. 14.2 Redis clients and serversRedis is usually run as a remote service; in fact, the name stands for “REmote DIctionary Server”. To use Redis, you have to run the Redis server somewhere and then connect to it using a Redis client. There are many ways to set up a server and many clients you could use. For this exercise, I recommend:
Here are more detailed instructions to help you get started:
This URL contains the server’s host name, 14.3 Making a Redis-backed indexIn the repository for this book, you’ll find the source files for this exercise:
You will also need these files, which you worked on in previous exercises:
If you have working versions of these files, you can use them for this exercise. If you didn’t do the previous exercises, or you are not confident in your solutions, you can copy my solutions from the solutions folder. The first step is to use Jedis to connect to your Redis server.
If you open
Because this file contains the password for your Redis server, you should not put this file in a public repository. To help you avoid doing that by accident, the repository contains a .gitignore file that makes it harder (but not impossible) to put this file in your repo. Now run public static void main(String[] args) { Jedis jedis = make(); // String jedis.set("mykey", "myvalue"); String value = jedis.get("mykey"); System.out.println("Got value: " + value); // Set jedis.sadd("myset", "element1", "element2", "element3"); System.out.println("element2 is member: " + jedis.sismember("myset", "element2")); // List jedis.rpush("mylist", "element1", "element2", "element3"); System.out.println("element at index 1: " + jedis.lindex("mylist", 1)); // Hash jedis.hset("myhash", "word1", Integer.toString(2)); jedis.hincrBy("myhash", "word2", 1); System.out.println("frequency of word1: " + jedis.hget("myhash", "word1")); System.out.println("frequency of word1: " + jedis.hget("myhash", "word2")); jedis.close(); } This example demonstrates the data types and methods you are most likely to use for this exercise. When you run it, the output should be: Got value: myvalue element2 is member: true element at index 1: element2 frequency of word1: 2 frequency of word2: 1 In the next section, I’ll explain how the code works. 14.4 Redis data typesRedis is basically a map from keys, which are strings, to values, which can be one of several data types. The most basic Redis data type is a string. I will write Redis types in italics to distinguish them from Java types. To add a string to the database,
use jedis.set("mykey", "myvalue"); String value = jedis.get("mykey"); In this example, the key is Redis provides a set structure, which is
similar to a Java
jedis.sadd("myset", "element1", "element2", "element3"); boolean flag = jedis.sismember("myset", "element2"); You don’t have to create the set as a separate step. If it doesn’t
exist, Redis creates it. In this case, it creates a set named
The method Redis also provides a list structure, which is
similar to a Java
jedis.rpush("mylist", "element1", "element2", "element3"); String element = jedis.lindex("mylist", 1); Again, you don’t have to create the structure before you start adding elements. This example creates a list named “mylist” that contains three elements. The method Finally, Redis provides a hash structure, which is similar to a Java
jedis.hset("myhash", "word1", Integer.toString(2)); String value = jedis.hget("myhash", "word1"); This example creates a hash named The keys and values are strings, so if we want to store
an Working with Redis hashes can be confusing, because we use a key to
identify which hash we want, and then another key to identify a value in
the hash. In the context of Redis, the second key is called a “field”,
which might help keep things straight. So a “key” like For many applications, the values in a Redis hash are integers, so Redis
provides a few special methods, like jedis.hincrBy("myhash", "word2", 1); This method accesses Setting, getting, and incrementing entries in a hash are constant time operations. You can read more about Redis data types at http://thinkdast.com/redistypes. 14.5 Exercise 11At this point you have the information you need to make a web search index that stores results in a Redis database. Now run
Here’s an example of how these methods are used: WikiFetcher wf = new WikiFetcher(); String url1 = "http://en.wikipedia.org/wiki/Java_(programming_language)"; Elements paragraphs = wf.readWikipedia(url1); Jedis jedis = JedisMaker.make(); JedisIndex index = new JedisIndex(jedis); index.indexPage(url1, paragraphs); Map<String, Integer> map = index.getCounts("the"); If we look up If we index the same page again, the new results should replace the old ones. One suggestion for translating data structures from Java to Redis: remember that each object in a Redis database is identified by a unique key, which is a string. If you have two kinds of objects in the same database, you might want to add a prefix to the keys to distinguish between them. For example, in our solution, we have two kinds of objects:
In my implementation, there is one 14.6 More suggestions if you want themAt this point you have all the information you need to do the exercise, so you can get started if you are ready. But I have a few suggestions you might want to read first:
For example, here’s a simple version of public void deleteAllKeys() { Set<String> keys = jedis.keys("*"); for (String key: keys) { jedis.del(key); } } Each time you invoke public void deleteAllKeys() { Set<String> keys = jedis.keys("*"); Transaction t = jedis.multi(); for (String key: keys) { t.del(key); } t.exec(); }
14.7 A few design hintsNow you really have all the information you need; you should start working on the exercise. But if you get stuck, or if you really don’t know how to get started, you can come back for a few more hints. Don’t read the following until you have run the test code, tried
out some basic Redis commands, and written a few methods in
OK, if you are really stuck, here are some methods you might want to work on: /** * Adds a URL to the set associated with term. */ public void add(String term, TermCounter tc) {} /** * Looks up a search term and returns a set of URLs. */ public Set<String> getURLs(String term) {} /** * Returns the number of times the given term appears at the given URL. */ public Integer getCount(String url, String term) {} /** * Pushes the contents of the TermCounter to Redis. */ public List<Object> pushTermCounterToRedis(TermCounter tc) {} These are the methods I used in my solution, but they are certainly not the only way to divide things up. So please take these suggestions if they help, but ignore them if they don’t. For each method, consider writing the tests first. When you figure out how to test a method, you often get ideas about how to write it. Good luck! |
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.
|