Chapter 11 Case Study: Data Structure Selection

At this point you have learned about Perl’s core data structures, and you have seen some of the algorithms that use them. This chapter presents a case study with exercises that let you think about choosing data structures and practice using them. But first, I would like to briefly introduce two conditional structures that have been left aside so far and provide a couple of new possibilities about subroutine signatures.

11.1 The Ternary Conditional Operator

Consider the following code that tests the value of a positive integer:

my $result;
if $num < 10 {
    $result = "One digit";
} else {
    $result = "More than one digit";
}
say $result;

This is quite simple, but a bit long. This can be rewritten in just one line of code:

say $num < 10 ?? "One digit" !! "More than one digit";

The operator is in two parts: the ?? and the !!, which separate three expressions (hence the name “ternary operator”): the condition to be evaluated, the expression returned when the condition is true, and the expression returned when it is false.
This statement checks if $num is less than 10 and, if it is, prints "One digit"; otherwise, it prints "More than one digit". This operator does not provide any new functionality; it just offers a more concise syntax.

It is possible to nest several ternary operators to examine successively multiple choices:

say $value < 10    ?? "One digit"
 !! $value < 100   ?? "Two digits"
 !! $value < 1000  ?? "Three digits"
 !! "More than three digits";

This construct is a form of what is sometimes called a switch statement, because the C language and many languages derived from it use the switch keyword to describe such a multiple choice conditional. This is much more concise and often more convenient than nested if ... then ... else conditionals, but the next section provides a more powerful switch type of statement.

11.2 The given ... when “Switch” Statement

Perl 6 has a “switch” statement, written with the given and when keywords. The given keyword introduces the variable or expression that will be tested, and each of the when statements specifies a condition followed by a block that will execute if the condition is true. By default, the process stops at the first condition that is satisfied. The example just above can be rewritten as follows:

given $value {
    when 0..9      { say "One digit" }
    when $_ < 99   { say "Two digits" }
    when /^\d**3$/ { say "Three digits" }
    default        { say "More than three digits" }
}

The when clauses illustrate several possible forms of condition: a range (0..9), a Boolean expression using the $_ topical variable, and a regex.
Only one message will be printed, because the
matching process stops as soon as one condition has been
satisfied, and the default clause runs only if none of the other conditions matched.

If there is no specific operator in the when clause,
then it will smart match the expression in the when
clause against the $_ topic variable:

when $foo { ... }
# equivalent to:
when $foo ~~ $_ { ... }

Note that the given keyword is not doing much more than
topicalizing its argument for the rest of the block.
The when clauses are doing the bulk of the real work. In fact,
you could even use the when clauses without a given,
provided you assign the right value to the $_ topic variable, for example with a for loop:

my $val = 7;
for $val {
    when 0..6 { say "less than" }
    when 7    { say "Exact" }
    when 8..* { say "greater than" }
}

It is possible to add a proceed clause at the end of any of the conditional code blocks to prevent the process from stopping after that code block has succeeded. For example, you might write this:

given $value {
    when 0..9      { say "One digit" }
    when 10..99    { say "Two digits"; proceed }
    when 42        { say "The response to the ultimate question" }
    when /^\d**3$/ { say "Three digits" }
    default        { say "More than three digits" }
}

Here, if $value is equal to 42, both "Two digits" and "The response to the ultimate question" will be printed, since the proceed clause in the second code block allows the matching process to continue with the next conditions.

Good, it seems, but there is a problem. The proceed clause
should be used with some care, as it can easily lead
to unexpected results. In fact, the code above
is actually wrong: if $value is a two-digit number other than 42 (say, 55), the program prints "Two digits" as expected, but the proceed clause then lets the process fall through the remaining when clauses down to the default block, so it also wrongly prints "More than three digits."

As an exercise, test the above code with various values and try to find a way to fix the bug with the proceed clause. Solution: ??

11.3 Subroutine Named and Optional Parameters

The subroutines that we have seen so far used positional parameters, i.e., parameters whose binding with the subroutine call arguments relies on their order within the list of arguments and in the signature. This is usually fine when the number of arguments passed to the subroutine is small (say, three or less). When the subroutine signature becomes longer, using positional arguments might become cumbersome and error-prone.

11.3.1 Named Parameters

Named arguments may be supplied in any order: the name of the parameter is bound to the argument having the same name. For example:

sub divide (:$dividend, :$divisor where $divisor != 0) {
    return $dividend / $divisor;
}
say divide(dividend => 2048, divisor => 128); # -> 16
# or:
say divide(divisor => 128, dividend => 2048); # -> 16
pairs using the pair-constructor syntax. In the signature, the parameters are retrieved with the so-called colon-pair syntax: the name of the parameter is simply prefixed with a colon (as in :$dividend).

These named parameters are especially useful when the number of arguments is large. For example, we haven’t covered object-oriented programming yet (see Chapter ??), but this is how we could create an object of the (user-defined) Rectangle class:

my $rect = Rectangle.new(
    origin_x => 100,
    origin_y => 200,
    width    => 23,
    length   => 42,
    color    => 'black'
);

Clearly, using five positional parameters would be impractical.

11.3.2 Optional Parameters

Sometimes, the actual number of arguments is not known in advance: for example, a subroutine may be called with a variable number of arguments. Such a subroutine is said to be variadic. You can define a parameter to be optional by postfixing it with a question mark in the subroutine signature:

sub my-sub($x, $y?) { # simple optional parameter
    if $y.defined {
        say "The second parameter has been supplied and defined";
    } else {
        say "The second parameter has not been supplied";
    }
    # ...
}

When using positional parameters, the optional parameters always have to be the last ones in the list (after the mandatory ones). A parameter can also be made optional by supplying a default value:

sub my-log($number, $base = e) { # e is a predefined constant
    return log($number) / log($base);
}
say my-log(4);       # Natural log (base e)  -> 1.38629436111989
say my-log(32, 2);   # Log base 2            -> 5
say my-log(100, 10); # Common log (base 10)  -> 2

Here, if the second argument is not supplied, the default value (e) is used instead. Conversely, if there is a second argument, it overrides the default value.
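To give a feel for how named parameters and default values combine, here is a small sketch (the format-price subroutine and its parameter names are my own invention, not from the book). Named parameters are optional by default in Perl 6, so both of them may be omitted at the call site:

sub format-price($amount, :$currency = 'USD', :$precision = 2) {
    # Round to the requested number of decimal digits
    my $rounded = $amount.round(10 ** -$precision);
    return "$rounded $currency";
}
say format-price(42.452);                     # -> 42.45 USD
say format-price(42.452, currency => 'EUR');  # -> 42.45 EUR
say format-price(1999.99, :precision(0));     # -> 2000 USD

The last call uses the colon-pair syntax (:precision(0)), which is just another way of passing a named argument.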
Sometimes, having optional or default parameters is not good
enough. For example, the subroutine may have to process a list
containing any number of values. For situations like this,
you can use a slurpy parameter, i.e., a kind of array
placed at the end of the parameter list that will slurp up
all the remaining arguments. This kind of slurpy parameter
uses the “*@” syntax (an asterisk prepended to the array sigil):

sub my-sum($first-num, *@rest) {
    say @rest; # -> [3 4 5 12 17]
    return $first-num + [+] @rest;
}
say my-sum 1, 3, 4, 5, 12, 17; # -> 42

Some further examples of slurpy parameters have been provided in Section ??.

11.4 Word Frequency Analysis

Now, let’s get to the case study. As usual, you should at least attempt the exercises before you read the suggested solutions, which are provided in the following sections of this chapter.

Exercise 1

Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase.

Exercise 2
Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format. Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before. Then modify the program to count the total number of words in the book, and the number of times each word is used. Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

Exercise 3

Modify the program from the previous exercise to print the 20 most frequently used words in a given book.

Exercise 4
Modify the previous program to read a word list (see Section ??) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

11.5 Random Numbers

Given the same inputs, most computer programs generate the same outputs every time, so they are said to be deterministic. Determinism is usually a good thing, since we expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are an obvious example, but there are more.

Making a program truly nondeterministic turns out to be difficult, but there are ways to make it at least seem nondeterministic. One of them is to use algorithms that generate pseudorandom numbers. Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.

Perl provides functions such as rand that generate pseudorandom numbers (which we will simply call “random” numbers from here on). The function rand returns a random number (of Num type) between 0.0 and 1.0 (including 0.0 but not 1.0). Each time you call rand, you get the next number in a long series. To see a sample, run this loop in the REPL:

say rand for 1..5;

Used as a method, rand returns a random number between 0.0 and the value of the invocant. For example, 10.rand returns a random number between 0 and 10 (10 not included). You might try it as a one-liner:

$ perl6 -e 'say 10.rand for 1..5'
8.23987158729588
9.83276889381497
2.52313276833335
3.44713459548771
1.82329894347025

You should hopefully get a different output than I did. If you want to run such a one-liner under Windows, remember that you’ll need to replace single quotes with double quotes.

To obtain random integers between 1 and 10, you may use the Int and rand methods:

$ perl6 -e 'say 10.rand.Int + 1 for 1..5'
5
10
1
6
3

The pick function or method takes a number of items to choose and a list, and returns that many items picked at random (without replacement) from the list:

> say <1 3 4 5 7 9 10 12 14 42>.pick(5);
(5 42 3 4 7)
> say pick 5, <1 3 4 5 7 9 10 12 14 42>;
(42 12 5 1 9)

If you ask for more items than the list contains, you will get all the items of the list in random order.

To obtain random unique integers in a range, you might use pick on a range:

> say pick 5, 1..20;
(5 3 6 18 7)
> say (1..20).pick(5);
(20 4 18 2 7)

If you don’t specify the number of random numbers, you’ll get one random pick:

> say (1..20).pick;
19

Exercise 5
Write a function named choose-from-hist that takes a histogram (a hash mapping items to frequencies) and returns a random value from the histogram, chosen with probability in proportion to frequency. Solution: ??

11.6 Word Histogram

You should attempt the previous exercises before you go on. For the purpose of presenting the solutions to the above exercises, I’ve used the plain text of Emma (1816), the novel by Jane Austen, downloaded from the Gutenberg project (http://www.gutenberg.org/files/158/158-0.txt) and saved in a file called emma.txt. Use the same text if you want to compare your solutions and results with mine.

Here is a program that reads the emma.txt file and builds a histogram of the words in the file:

my %histogram;
my $skip = True; # flag to skip the header

sub process-line(Str $line is copy) {
    if defined index $line, "*END*THE SMALL PRINT!" {
        $skip = False;
        return;
    }
    return if $skip;
    $line ~~ s:g/<[-']>/ /;         # Replacing dashes and apostrophes with spaces
    $line ~~ s:g/<[;:,!?.()"_`]>//; # removing punctuation symbols
    $line = $line.lc;               # setting string to lower case
    for $line.words -> $word {
        %histogram{$word}++;
    }
}

process-line $_ for "emma.txt".IO.lines;

The program reads each line of the emma.txt file and, for
each line, calls the process-line subroutine. The process-line subroutine skips the header lines (until the line containing "*END*THE SMALL PRINT!" is found), replaces dashes and apostrophes with spaces, removes other punctuation symbols, sets the string to lower case, and adds each word of the line to the %histogram hash.

To know whether the program is doing something like what it is supposed to do, we can display a few entries of the %histogram hash:

# displaying 20 lines of the histogram
my $count;
for %histogram -> $pair {
    say sprintf "%-24s %d", $pair.key, $pair.value;
    $count++;
    last if $count > 20;
}

This prints out the following output:

embarrassing             1
hows                     1
appealed                 2
bestow                   2
articulate               1
demands                  2
closely                  1
dull                     9
hearts                   1
column                   1
possesses                1
attributed               1
jumped                   2
forwards                 2
wittier                  2
expert                   2
attractive               2
asserted                 2
oftentimes               1
fancy                    38
finds                    1

To count the total number of words in the file, we can add up the values in the histogram:

my $word_count = [+] %histogram.values;
say "There are $word_count words in the book.";

The number of different words is just the number of items in the hash:

my $distinct-words = %histogram.elems;
say "There are $distinct-words distinct words in the book.";

Note that you could reduce the above to one code line by interpolating a code block within the output string:

say "There are {%histogram.elems} distinct words in the book.";

And the results:

There are 161991 words in the book.
There are 7110 distinct words in the book.

11.7 Most Common Words

To find the most common words in emma.txt, we can sort the %histogram hash according to the values and keep the top of the resulting list:

my @most_commons = (sort { %histogram{$^b} cmp %histogram{$^a} },
    %histogram.keys)[0..9];
say $_ for map { "$_ \t%histogram{$_} " }, @most_commons;

The sort function receives the keys of the histogram and
its comparison function compares the values associated with those keys. Since we use the key $^b before the key $^a in the comparison, the sort is in descending order of values.

This displays the following output:

to      5241
the     5205
and     4897
of      4295
i       3192
a       3130
it      2529
her     2490
was     2400
she     2364

If you want to see more interesting words, you might, as a further exercise, filter the histogram and retain only words that have more than, say, four letters.

The two statements above can also be chained into a single one:

say $_ for map { "$_ \t%histogram{$_} " },
    (sort { %histogram{$^b} cmp %histogram{$^a} },
    %histogram.keys)[0..9];

This is an example of data pipeline (or stream) programming. Such a statement
needs to be read from bottom to top and from right to left. The first step is %histogram.keys, which produces the list of the hash’s keys; the sort then ranks those keys in descending order of their associated values; the subscript keeps the first ten of them; and the map finally formats each word together with its frequency for printing.

Let me add one word of caution here: sorting the histogram by values and picking up the top 10 to get the most frequent words is probably the easiest way to solve the problem and the shortest code to do it. That’s the reason I have used this solution here. But it is not the most efficient solution from the standpoint of the run time, because it involves the cost of sorting a relatively large data set, whereas we are using only a small part of the sorted data. There are some better algorithms to do that from the standpoint of runtime efficiency, but they are more complicated. So, there is a tradeoff here between coding efficiency and performance. Assuming this is code that we want to run only once, I have chosen to favor coding efficiency.

11.8 Optional Parameters

We saw earlier in this chapter that subroutines can
take optional parameters. We can use this functionality to
write a subroutine that prints the most common words in the %histogram hash, with the number of words to display as an optional parameter:

display-most-common(%histogram, 5);

sub display-most-common (%hist, Int $num = 10) {
    say $_ for map { "$_ \t%hist{$_} " },
        (sort { %hist{$^b} cmp %hist{$^a} },
        %hist.keys)[0..$num - 1];
}

This will display the five top words of the list above (Section ??). If we call it without the second parameter:

display-most-common(%histogram);

we will get the 10 most common words (same list as the one shown in Section ?? above), because the default value for the second parameter is 10.

11.9 Hash Subtraction

Finding the words from emma.txt that are not in the word list words.txt is a problem you might recognize as set subtraction; that is, we want to find all the words from one set (the words in emma.txt) that are not in the other (the words in words.txt).

subtract takes hashes %main and %dict, and returns a new hash that contains all the keys of %main that are not in %dict:

sub subtract (%main, %dict) {
    my %difference;
    for %main.keys -> $word {
        %difference{$word} = 1 unless %dict{$word}:exists;
    }
    return %difference;
}

To find the words in emma.txt that are not in words.txt, we need to load the word list into a hash and to call subtract, passing the two hashes as arguments. We also add some code to print the first 20 words not found in the word list:

my %word-list = map { $_ => 1 }, "words.txt".IO.lines;
my %unknown-words = subtract(%histogram, %word-list);
say %unknown-words.keys.head(20);

Notice that rather than using a subscript slice, I have used here the head method to print out the first 20 words of the list. This is just another way to get the first “n” items of a list or array. There is also a tail method to retrieve the last “n” items of a list or an array.

Here are some of the results from Emma:

(penetrated unsullied bateses outstepped particularity experienced italian sunday solicitously blockhead unbleached ult 26th christian 7th untouched iii greensward houseroom tete)

Some of these words are names and ordinal numbers. Others are rare or no longer in common use. But a few are common words that should really be in the list!

Note that I have used a hash (%word-list) rather than an array to store the word list: looking up a key in a hash is much faster than searching for an item in a large array.

11.10 Constructing New Operators

Learning about hash subtraction is an excellent opportunity to introduce a very interesting functionality of Perl 6: the capacity to construct new operators or to redefine existing ones.

Since we are subtracting two lists of words, it is tempting to use the minus sign to denote this subtraction operation. It is very easy to create such an operator in Perl 6:

multi sub infix:<-> (%main, %dict) {
    my %difference;
    for %main.keys -> $word {
        %difference{$word} = 1 unless %dict{$word}:exists;
    }
    return %difference;
}

Compared to the definition of the subtract subroutine, the only differences are in the header line. We will cover the details in a later chapter, but let us briefly explain here. Most Perl 6 operators are defined as “multi” subroutines, i.e., subroutines that can be defined several times with different signatures and bodies; Perl will know which one to use depending on the signature. Here we define the minus operator as a multi subroutine whose parameters are two hashes; this will be enough for the Perl compiler to know that we don’t want the regular subtraction between numerical values, but something else that applies to two hashes. The minus operator is defined as “infix,” which means that the operator will be placed between its two operands.

Calling this new operator is now just as easy as calling the regular subtraction operator between numbers; we just need to use two hashes as operands:

my %unknown-words = %histogram - %word-list;

The rest of the program works just as before.
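To check that the new operator behaves as intended, we might exercise it on some small hand-made hashes (this test data is mine, not from the book). Since both operands are hashes, the multi dispatcher selects our candidate rather than the numeric subtraction:

my %inventory     = apple => 1, banana => 1, pear => 1;
my %shopping-list = banana => 1;
# Expected: all the keys of %inventory that are not in %shopping-list
say (%inventory - %shopping-list).keys.sort; # -> (apple pear)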
This ease of creating new operators is one of the facilities offered by Perl 6 to extend the language from within itself. We’ll come back to that and other ways to extend the language in later chapters.

As an exercise, write a multi subroutine that creates the new postfix “!” operator for computing the factorial of an integer:

say 5!; # -> 120

Also try to think about how you would test this new “!” operator against various input values. Solution: ??

11.11 Sets, Bags and Mixes

Perl 6 has a variety of data structure types called Set, Bag and Mix that provide many common set operations. They are unordered collections of unique and weighted items. They are immutable (but there also exist mutable versions of these data structures, SetHash, BagHash and MixHash).

You may create and use a set as follows:

> my $s = set <banana apple orange orange banana pear apple>;
set(banana, orange, apple, pear)
> say $s.perl
set("banana","orange","apple","pear")
> say $s.elems;
4
> say $s{'apple'}
True
> say $s{'strawberry'}
False

As you can see, duplicates have been removed. Sets only tell us whether or not at least one item of a given name has been seen. A bag, by contrast, also keeps track of how many of each item have been seen:

> my $b = bag <banana apple orange orange banana pear apple orange>;
bag(banana(2), orange(3), pear, apple(2))
> say $b{'banana'}
2

Mixes are similar to bags, except that the elements’ weights don’t have to be integers.

The interesting thing about these collections is that they can use
many set operators commonly used in mathematics, such as the membership operator (∈), the subset operator (⊂), or the intersection operator (∩):

> say "Found it!" if 'apple' ∈ $s;
Found it!
> say "Is subset!" if <orange banana> ⊂ $s;
Is subset!
> say <orange banana pineapple> ∩ $s;
set(banana, orange)

Notice that we haven’t used the set keyword to define the <orange banana> and <orange banana pineapple> lists in the last two examples: the set operators accept lists and coerce them into sets as needed.
We can rewrite the hash subtraction program using a set for the word list and the ∈ (element of) membership operator:

my %histogram;
my $skip = True; # flag to skip the header
sub process-line(Str $line is copy) {
    # (same as above)
}
process-line $_ for "emma.txt".IO.lines;

my $word-list = set "words.txt".IO.lines;
my %unknown-words = subtract(%histogram, $word-list);
say %unknown-words.keys.head(20);

sub subtract (%main, $dict) {
    my %difference;
    for %main.keys -> $word {
        %difference{$word} = 1 unless $word ∈ $dict;
    }
    return %difference;
}

The code line in the for loop could also be written as follows:

%difference{$word} = 1 unless $word (elem) $dict;
# or:
%difference{$word} = 1 if $word ∉ $dict;
# or:
%difference{$word} = 1 if $word !(elem) $dict;
# or even with the (cont) or ∋ contain operator:
%difference{$word} = 1 unless $dict (cont) $word;
# or:
%difference{$word} = 1 unless $dict ∋ $word;
# or:
%difference{$word} = 1 if $dict ∌ $word;
# etc.
The program can be simplified even further with the (-) set difference operator, which makes the subtract subroutine unnecessary:

process-line $_ for "emma.txt".IO.lines;
my $word-list = set "words.txt".IO.lines;
my $unknown-words = %histogram.keys (-) $word-list;
say $unknown-words.keys.head(20);

There are more than 30 set operators available, covering most of the set operators used in mathematics. I’ve only shown some that are the most likely to be useful. Check the official documentation (https://doc.perl6.org/language/setbagmix) if you need some others.

As an exercise, you may rewrite the process-line subroutine or replace it to use a set or a bag instead of a hash to store the words of emma.txt (and possibly adapt the rest of the program where needed), in order to find the words of emma.txt that are not in words.txt. Solution: ??

11.12 Random Words

To choose random words from the histogram, the simplest algorithm
is to build a list with multiple copies of each word, according to the observed frequency, and then choose from the list with the pick function:

my @array = map {| (.key xx .value)}, %histogram;
say pick 30, @array;

The expression (.key xx .value) uses the xx list repetition operator to produce as many copies of each word as its frequency in the histogram, and the | prefix flattens these sublists into a single list.

This algorithm works, but it is not very efficient; each time you choose one or some random words, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big. An alternative is:

1. Use keys to get a list of the words in the book.
2. Build a list that contains the cumulative sum of the word frequencies. The last item in this list is the total number of words in the book, n.
3. Choose a random number from 1 to n and use a bisection search to find the index where that number would be inserted in the cumulative sums.
4. Use the index to find the corresponding word in the word list.
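Although you should probably attempt Exercise 6 below before reading any further, here is a rough sketch of what this cumulative-sum approach might look like, assuming the %histogram hash built earlier in this chapter (the variable names and the bisection loop are mine, not the book's solution):

# Build the word list and the parallel list of cumulative frequencies
my @words = %histogram.keys;
my @cumulative;
my $running-total = 0;
for @words -> $word {
    $running-total += %histogram{$word};
    @cumulative.push: $running-total;
}
# Draw a random integer between 1 and the total number of words
my $target = $running-total.rand.Int + 1;
# Bisection: find the first index whose cumulative frequency
# reaches the target
my ($low, $high) = 0, @cumulative.end;
while $low < $high {
    my $middle = (($low + $high) / 2).Int;
    if @cumulative[$middle] < $target { $low = $middle + 1 }
    else                              { $high = $middle }
}
say @words[$low]; # a word picked in proportion to its frequency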
Exercise 6
Write a program that uses this algorithm to choose a random word from emma.txt. Solution: ?? Note that Perl offers a shortcut to perform the task
at hand: when used on a bag, pick returns a random
selection of elements, weighted by the values corresponding
to each key. Ideally, we should have used a bag instead of
a hash to store our %histogram data, but it is easy enough to convert a hash into a bag. Let’s demonstrate this on a smaller example:

> my %histo = ( banana => 5, pear => 1, orange => 12);
{banana => 5, orange => 12, pear => 1}
> my $fruit-bag = bag map { $_ xx %histo{$_}}, %histo.keys;
bag(banana(5), orange(12), pear)
> for 1..10 { say $fruit-bag.pick: 5}
(banana orange orange orange orange)
(orange orange banana orange banana)
(orange orange banana orange orange)
(pear orange banana banana orange)
(orange banana orange orange orange)
...

As you can see, the most common item, “orange,” has been picked more often than the others, and the least common, “pear,” has not been picked at all before the fourth draw.

As an exercise, you may want to adapt this code to choose random words from emma.txt.

11.13 Markov Analysis

If you choose words from emma.txt at random, you can get a sense of the vocabulary, but you probably won’t get a sentence:

this the small regard harriet which knightley's it most things

A series of random words seldom makes sense because there is no relationship between successive words. For example, in a real sentence you would expect an article like “the” to be followed by an adjective or a noun, and probably not a verb or adverb.

One way to measure these kinds of relationships is Markov analysis, which characterizes, for a given sequence of words, the probability of the words that might come next. For example, the second chapter of The Little Prince (1943), the famous novella written by French writer and pioneer aviator Antoine de Saint-Exupéry, begins:

The first night, then, I went to sleep on the sand, a thousand miles from any human habitation. I was more isolated than a shipwrecked sailor on a raft in the middle of the ocean. Thus you can imagine my amazement, at sunrise, when I was awakened by an odd little voice. It said:

“If you please--draw me a sheep!”

In this text, the word “draw” is always followed by the word “me,” and the phrase “draw me” is always followed by “a sheep.” And the phrase “a thousand” is always followed by “miles,” but the phrase “a thousand miles” may be followed by “from any human habitation” or by “from any inhabited region.”

The result of Markov analysis is a mapping from each prefix (like “draw me” and “a thousand miles”) to all possible suffixes (like “a sheep” and “from any human habitation” or “from any inhabited region”).

Given this mapping, you can generate a random text by starting with any prefix and choosing at random from the possible suffixes. Next, you can combine the end of the prefix and the new suffix to form the next prefix, and repeat.

For example, if you start with the prefix “draw me,” then the next word has to be “a sheep,” because the prefix is always followed by “a sheep” in the text. If a prefix is “a thousand miles,” the next suffix might be “from any human habitation” or “from any inhabited region.”

In this example the lengths of the prefixes are two or three words, but you can do Markov analysis with any prefix length.

Exercise 7
Write a program to read a text from a file and perform Markov analysis. The result should be a hash that maps prefixes to a collection of possible suffixes; the collection might be an array or a hash, and it is up to you to make an appropriate choice. You can test your program with a prefix length of two, but you should write it in a way that makes it easy to try other lengths. Once it works, use your mapping to generate random text similar to the original.

Credit: this case study is based on an example from Kernighan and Pike, The Practice of Programming, Addison-Wesley, 1999.

You should attempt this exercise before you go on. Then you can study our solution in Subsection ??.

11.14 Data Structures

Using Markov analysis to generate random text is fun, but there is also a point to this exercise: data structure selection. In your solution to the previous exercises, you had to choose:
- How to represent the prefixes.
- How to represent the collection of possible suffixes.
- How to represent the mapping from each prefix to the collection of possible suffixes.

The last one is easy: a hash is the obvious choice for a mapping from keys to corresponding values.

For the prefixes, the most obvious options are a string or a list of strings. For the suffixes, one option is a list; another is a histogram (hash).

How should you choose? The first step is to think about the operations you will need to implement for each data structure. For the prefixes, we need to be able to remove words from the beginning and add words to the end. For example, if the current prefix is “draw me,” and the next word is “a,” you need to be able to form the next prefix, “me a,” in order to find the next suffix, “sheep.” Your first choice might be an array, since it is easy to add and remove elements, but we also need to be able to use the prefixes as keys in a hash, so that sort of rules out arrays.

For the collection of suffixes, the operations we need to perform include adding a new suffix (or increasing the frequency of an existing one), and choosing a random suffix. Adding a new suffix is equally easy for the list implementation or the hash. Choosing a random element from a list is easy; choosing from a hash is harder to do efficiently (see Exercise ??).

So far we have been talking mostly about ease of implementation, but there are other factors to consider in choosing data structures. One is run time. Sometimes there is a theoretical reason to expect one data structure to be faster than another; for example, we mentioned that a lookup operation is faster for hashes than for arrays, especially when the number of elements is large. But often you don’t know ahead of time which implementation will be faster. One option is to implement both of them and see which is better. This approach is called benchmarking. A practical alternative is to choose the data structure that is easiest to implement, and then see if it is fast enough for the intended application. If so, there is no need to go on. If not, there are tools, like profile modules, that can identify the places in a program that take the most time.

The other factor to consider is storage space. For example, using a histogram for the collection of suffixes might take less space because you only have to store each word once, no matter how many times it appears in the text. In some cases, saving space can also make your program run faster, and in the extreme, your program might not run at all if you run out of memory. But for many applications, space is a secondary consideration after run time.

One final thought: in this discussion, we have implied that we should use one data structure for both analysis and generation. But since these are separate phases, it would also be possible to use one structure for analysis and then convert to another structure for generation. This would be a net win if the time saved during generation exceeded the time spent in conversion.

11.15 Building Your Own Data Structures

Perl has a number of compound types such as arrays and hashes that you can combine to construct arrays of arrays, arrays of hashes, hashes of arrays, hashes of hashes, arrays of arrays of arrays or hashes, and so on, and this is usually sufficient for most needs. Sometimes, however, you need something very specific that is not built in.

Over the years, computer science has studied and used scores of specific data structures such as linked lists, stacks, queues, circular lists, trees of numerous kinds, etc. We will briefly study a couple of them.
11.15.1 Linked Lists

A linked list is a collection of items in which each item holds a value (or several values) and a link to the next item of the collection. In many programming languages, arrays have a fixed size (contrary to Perl, in which arrays can usually grow according to your needs). In those programming languages, a linked list is often used to represent a variable-size collection of items.

We saw in Section ?? how to use arrays to build stacks and queues. It was fairly easy. In some lower-level programming languages, you would need linked lists for that.

In Perl, a linked list item may be represented by a pair
(an object of type Pair), with the pair’s value holding the link to the next item. The following example uses such pairs to implement a stack:

sub add-to-stack (Pair $stack-top, $item) {
    my $new-pair = $item => $stack-top;
    return $new-pair;
}
sub take-from-stack (Pair $stack-top) {
    my $new-top = $stack-top.value;
    return $stack-top.key, $new-top;
}
sub create-stack ($item) {
    return $item => Nil;
}

my $stack = create-stack (0);
for 1..5 -> $item {
    $stack = add-to-stack($stack, $item);
}
say "The stack is: ", $stack.perl;
for 1..5 {
    my $value;
    ($value, $stack) = take-from-stack($stack);
    say "$value -- ", $stack;
}

Once populated, the resulting stack looks like this:

5 => 4 => 3 => 2 => 1 => 0 => Nil

This is just given as an example for the construction of a linked list. There is usually no need to use anything like this in Perl, since it is much easier to implement a stack using an array, as seen in Section ??, although the same principle can be used for building more advanced data structures. You might still want, as an exercise, to implement a queue (see Section ??) using a linked list.

11.15.2 Trees

A tree is usually a collection of items in which each item (or node) holds a value (or possibly several values) and two or several links to other items of the collection, the children. Think of a family tree or a tree of directories on a hard disk drive to get the general idea. The ultimate nodes that don’t have their own children are often called the leaves. There is usually only one ultimate grandparent node, sometimes called the root (if there is more than one ultimate grandparent, then it is not really a tree but several trees or a “forest”).

Dozens of different types of trees have been invented and their descriptions have filled entire books about computer science algorithms. Some are designed to keep the data sorted, others to maintain balance between the tree branches, and so on. The data structure is often not very complicated, but the algorithms needed to maintain the required properties sometimes can be a bit hairy.

For example, a typical tree might hold one value and two links, one to the left child and one to the right child. You might implement such a binary tree in a similar way as the linked list described above, except that the value would no longer be a link to the next element, but an array of two elements, the links to the two children. Or you could follow more closely the linked list model above and have nested pairs. For example, a binary tree might look like this:

my $tree = 1 => (( 2 => 3 ) => (4 => (5 => ((6 => 7) => 8))));

The implementation is left as an exercise to the reader. Quite often, though, a tree may be implemented in Perl as a simpler data structure such as a nested array or hash. The next section examines such an example.

11.15.3 Binary Heaps

A binary heap is a binary tree that keeps a partial order: each node has a value larger than its parent and less than either of its two children; there is no specific order imposed between siblings. (You could also do it the other way around: you could design heaps in which any node has a value less than its parent.)

Because there is no order between siblings, it is not possible to find a particular element without potentially searching the whole heap. Therefore, a heap is not very good if you need random access to specific nodes. But if you’re interested in always finding the smallest item, then a heap is a very efficient data structure.

Heaps are used for solving a number of CS problems, and serve as the basis for an efficient and very popular sorting technique called heap sort.
For a human, it is useful to represent a heap in a tree-like
form. But a computer can store a heap as a simple array (not
even a nested array). For this, the index of an element is
used to compute the index of its parent or its two children.
The children of an element are the two locations where the
indices are about double its index; conversely, the parent
of a node is located at about half its index. If the heap
starts at index 0, the exact formulas for a node at index $n are the following: its children are located at indices 2*$n + 1 and 2*$n + 2, and its parent is at index Int(($n - 1) / 2).
The root node is at index 0. Its children are at positions 1 and 2. The children of 1 are 3 and 4 and the children of 2 are 5 and 6. The children of 3 are 7 and 8, and so on. Thus, if interpreted as a binary heap, the array:

[0, 1, 2, 3, 4, 5, 6, 7, 8]

is associated with the tree displayed in Figure ??.

Here is one way to build a heap (in partial alphabetic order) from a list of unordered letters:

sub build-heap (@array, $index is copy) {
    my $index-val = @array[$index];
    while ($index) {
        my $parent = Int( ($index - 1) / 2);
        my $parent-val = @array[$parent];
        last if $parent-val lt $index-val;
        @array[$index] = $parent-val;
        $index = $parent;
    }
    @array[$index] = $index-val;
}

sub heapify (@array) {
    for @array.keys -> $i {
        build-heap @array, $i;
    }
}

my @input = <m t f l s j p o b h v k n q g r i a d u e c>;
heapify @input;
say @input;

Note that the heap is built in place (there is no need for a second array). The resulting array is displayed as follows:

[a b g d c k j l f h e m n q p t r o i u s v]

Is this a correct heap? It’s difficult to say at first glance and checking it manually is somewhat tedious. When writing a program for building such a data structure, it is often useful to write some subroutines to display the content in a way that makes it easier to understand the result and check its correctness. The following code shows two examples of such possible subroutines:

sub print-heap (@array) {
    my $start = 0;
    my $end = 0;
    my $last = @array.end;
    my $step = 1;
    loop {
        say @array[$start..$end];
        last if $end == $last;
        $start += $step;
        $step *= 2;
        $end += $step;
        $end = $last if $end > $last;
    }
}

sub print-heap2 (@array) {
    my $step = 0;
    for @array.keys -> $current {
        my $left_child = @array[2 * $current + 1];
        say "$current\tNode = @array[$current];\tNo child"
            and next unless defined $left_child;
        my $right_child = @array[2 * $current + 2] // "' '";
        say "$current\tNode = @array[$current];\tChildren: " ~
            "$left_child and $right_child";
        $step++;
    }
}

The first one displays the related tree level by level:

(a)
(b g)
(d c k j)
(l f h e m n q p)
(t r o i u s v)

which makes it easy to draw the tree (see Figure ??). The second one shows the children for each node and makes it possible to easily check that the partial alphabetic order constraint is satisfied (i.e., each node is smaller than its children):

0   Node = a;   Children: b and g
1   Node = b;   Children: d and c
2   Node = g;   Children: k and j
3   Node = d;   Children: l and f
4   Node = c;   Children: h and e
5   Node = k;   Children: m and n
6   Node = j;   Children: q and p
7   Node = l;   Children: t and r
8   Node = f;   Children: o and i
9   Node = h;   Children: u and s
10  Node = e;   Children: v and ' '
11  Node = m;   No child
12  Node = n;   No child
(...)
21  Node = v;   No child

11.16 Debugging

When you are debugging a program, and especially if you are working on a hard bug, here are some things to try:

Reading: Examine your code, read it back to yourself, and check that it says what you meant to say.

Running: Experiment by making changes and running different versions. Often if you display the right thing at the right place in the program, the problem becomes obvious.

Ruminating: Take some time to think! What kind of error is it: syntax, runtime, or semantic? What information can you get from the error messages, or from the output of the program? What did you change last, before the problem appeared?

Retreating: At some point, the best thing to do is back off, undoing recent changes, until you get back to a program that works and that you understand. Then you can start rebuilding.
Beginning programmers sometimes get stuck on one of these activities and forget the others. Each activity comes with its own failure mode.

For example, reading your code might help if the problem is a typographical error, but not if the problem is a conceptual misunderstanding. If you don’t understand what your program does, you can read it 100 times and never see the error, because the error is in your head.

Running experiments can help, especially if you run small, simple tests. But if you run experiments without thinking or reading your code, you might fall into a pattern we call “random walk programming,” which is the process of making random changes until the program does the right thing. Needless to say, random walk programming can take a very long time.

You have to take time to think. Debugging is like an experimental science. You should have at least one hypothesis about what the problem is. If there are two or more possibilities, try to think of a test that would eliminate one of them.

But even the best debugging techniques will fail if there are too many errors, or if the code you are trying to fix is too big and complicated. Sometimes the best option is to retreat, simplifying the program until you get to something that works and that you understand.

Beginning programmers are often reluctant to retreat because they can’t stand to delete a line of code (even if it’s wrong). If it makes you feel better, copy your program into another file before you start stripping it down. Then you can copy the pieces back one at a time.

Finding a hard bug requires reading, running (with or without a debugger), ruminating, and sometimes retreating. If you get stuck on one of these activities, try the others.

11.17 Glossary
11.18 Exercises: Huffman Coding

Huffman coding is a technique used for data compression, i.e., to reduce the size of a piece of data (such as, for example, compressing a file).

11.18.1 Variable-Length Codes

If you are familiar with Morse code, you know that it is a system for encoding the letters of the alphabet as a series of dots and dashes. For example, the famous signal ...---... represents the letters SOS, which comprise an internationally recognized call for help. The table in Figure ?? shows the rest of the codes.

Morse code (invented between 1836 and 1844) was one of the very first attempts at digital encoding of the alphabet of a plain text. The only known earlier attempt is the braille alphabet (1824-1837).

Notice that some codes are longer than others. By design, the most common letters have the shortest codes. Since there are a limited number of short codes, that means that less common letters and symbols have longer codes. A typical message will have more short codes than long ones, which minimizes the average transmission time per letter. Codes like this are called variable-length codes.

In this exercise, we will look at an algorithm for generating a variable-length code called a Huffman code. It is an interesting algorithm in its own right, but it also makes a useful exercise because its implementation uses a variety of data structures. Here is an outline of what we will do until the end of this chapter: first, we will use a sample text to build a frequency table of the letters it contains; we will then use this frequency table to build a Huffman code.
11.18.2 The Frequency Table

Since the goal is to give short codes to common letters, we have to know how often each letter occurs. In Edgar Allan Poe’s short story “The Gold Bug,” one of the characters, William Legrand, uses letter frequencies to crack a cypher. He explains:

“Now, in English, the letter which most frequently occurs is e. Afterwards, the succession runs thus: a o i d h n r s t u y c f g l m w b k p q x z. E however predominates so remarkably that an individual sentence of any length is rarely seen, in which it is not the prevailing character.”

So our first mission is to see whether Poe got it right. To check, let’s use as a sample the text of “The Gold Bug” itself, which can be downloaded from Project Gutenberg (http://www.gutenberg.org/files/2147/2147-0.txt) and a variety of other websites.

Exercise 8
Write a program that counts the number of times each letter
appears in a sample text. Download the text of “The Gold Bug”
and analyze the frequency of the letters.
Solution: see Section ??

11.18.3 Building the Huffman Code

For our purposes, Morse code has one defect: it does not use just two symbols as you might think, but actually three: in addition to the dots and dashes, it is also implicitly using the space between two symbols, as well as a longer space between two letters. The reason why some space is needed is quite simple. Refer to the Morse
code table above and suppose you receive three dots (...): without the spaces, this could be the letter “s,” but also three successive “e” letters, an “e” followed by an “i,” or an “i” followed by an “e.”

In 1951, David A. Huffman invented a code-building technique avoiding this problem: provided that you know where a given letter starts, it is totally unambiguous. For example, we will meet later a Huffman code for a small subset of the alphabet that looks like this:

a => ..
e => .-
s => -.-
n => -..
t => --.
d => ---.
r => ----

If you start reading a sequence of dots and dashes representing a valid text composed with these seven letters, you can always decode it without any ambiguity. If the first symbol is a dot, then the letter is either an “a” or a “e” depending on the next symbol. If the first symbol is a dash and the next one a dot, then the letter must be either a “s” or an “n” depending on the third symbol. If the two first symbols are dashes, you can similarly determine that the current letter is a “t” (if the third symbol is a dot), or a “d” or a “r,” which you can find out by looking at the fourth symbol. In brief, you don’t need spaces between the symbols, it is always possible to unambiguously decode a letter.

How can we build such a Huffman code? Let’s do it by hand with a very simple alphabet: the four letters of the nucleo-bases of DNA: A, C, T, and G. Suppose we want to encode the following input string:

CCTATCCTCGACTCCAGTCCA

This gives the following frequency table:

C : 10    47.62
T :  5    23.81
A :  4    19.05
G :  2     9.52

To build the Huffman code, we start with the two least frequent
letters and merge them into one new temporary symbol, “[GA],” which can be thought of as having a frequency of 6 (the sum of the frequencies of “G” and “A”).

We are now left with three letters, C, T, and [GA]. We merge the two least frequent letters, “T” and “[GA],” into a new temporary symbol, “[TGA],” and can now tell that the symbol for “T” will be shorter than the symbols for “G” and “A.” Finally, we are left with only two symbols, C and [TGA]: the most frequent letter, “C,” gets the shortest code, a single dot, and the merged symbol gets a dash.

We can now unroll our dummy letters: “T” is a dash followed by a dot, and “G” and “A” are obtained by appending a dot (for “G”) or a dash (for “A”) to the two dashes of the “[GA]” symbol. The final code is:

C => .
T => -.
G => --.
A => ---

Notice that, by construction, the most frequent letter (C) has the shortest code and the least common letters (G and A) the longest codes.

Manually encoding the CCTATCCTCGACTCCAGTCCA input string with this code gives the following pseudo-Morse string:

..-.----...-..--.---.-...-----.-...---

Note that our Huffman code is not ambiguous: the first dot can
only be a “C,” and the second one also. The next symbol is a
dash, which can be the beginning of the three other letters, but
only “T” can have a dot afterwards. The next sequence of
symbols is four dashes; this can only be the three dashes of a
“A”, with the last dash being the beginning of the next letter;
and so on.

In a real-life Huffman encoding for text file compression, we would not use dots and dashes, but 0 and 1 bits; however, dots and dashes are just another nice way of representing those binary values. So, let’s just pretend that dots and dashes are really 0 and 1 binary numbers.

Did we really achieve data compression? Our pseudo-Morse string has 38 binary symbols. The original input string had 21 characters or bytes, that is 168 bits. The data has been compressed by a factor of about 4.4.

Is Huffman coding better than a fixed-length code? A string representation where each letter would be represented by two bits (two bits can represent four letters) would require 42 symbols. So, yes, we did obtain a better data compression than a fixed-length encoding (by about 10%). This may seem to be a small achievement, but this is actually quite good with such a small alphabet. With real text data, Huffman coding can achieve significant data compression.

Exercise 9