Chapter 13 Regexes and Grammars

Regular expressions or regexes were introduced in Sections ?? to ??. You might want to review those sections before reading this chapter if you don’t remember much about regexes. You don’t need to remember the details of everything we covered earlier and we will explain again briefly specific parts of the functionality that we will be using, but you are expected to understand generally how regexes work.

13.1 A Brief Refresher

Regexes, as we have studied them so far, are about string exploration using patterns. A pattern is a sequence of (often special) characters that is supposed to describe a string or part of a string. A pattern matches a string if a correspondence can be found between the pattern and the string. For example, the following code snippet searches the string for the letter “a”, followed by any number (but at least one) of letters “b” or “c”, followed by zero or more digits, followed by a “B” or a “C”:

my $str = "foo13abbccbcbcbb42Cbar";
say ~$/ if $str ~~ /a <[bc]>+ (\d*) [B|C]/;   # -> abbccbcbcbb42C
say ~$0;                                      # -> 42
This code uses the smartmatch operator (~~) to check whether the pattern matches the string. The pattern is made of the following components: the literal letter “a”; the character class <[bc]> with the + quantifier, matching one or more letters “b” or “c”; a capture group (\d*) matching zero or more digits; and a square-bracket group containing the alternation B|C (square brackets group parts of a pattern without capturing anything).
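To check this on the match we just made, the following quick inspection (a small sketch you can paste into the REPL right after the snippet above) shows that only the parentheses created a positional capture:

say $/.list.elems;   # -> 1    only the parentheses created a positional capture
say $0.Str;          # -> 42
say $/[0].Str;       # -> 42   $0 is just a shortcut for $/[0]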
If the match is successful (as is the case in this example),
the result is stored into the match object $/, and the substring captured by the parentheses is made available as $0. This is what might be called low-level matching: pattern recognition is done mostly at the individual character level. Perl 6 offers ways to group and name regex patterns so that these individual patterns can then be used as building blocks for higher level matching: recognizing words and sequences of words (rather than just characters), for the purpose of performing what is called lexical analysis (or lexing) and grammatical analysis (or parsing) on a piece of text. This chapter is mostly devoted to this higher type of matching, leading to the creation of full-fledged grammars that can analyze structured text such as XML or HTML texts, JSON or YAML documents, or even computer programs: Perl 6 programs are actually parsed using a Perl 6 grammar written in Perl 6. Grammars are a very important topic in computer science, but, obviously, most programmers don’t commonly write full-fledged grammars for parsing programming languages. However, writing a simple grammar and a simple parser might be, or perhaps should be, a much more common task. Quite often, people spend a lot of effort deciphering a simple configuration file with low-level techniques, whereas writing a simple parser might be a lot easier and much more efficient. Perl 6 offers all the tools to do that very easily. Sometimes, you also need to develop a domain-specific language (DSL), i.e., a usually relatively small sublanguage (a.k.a. slang) adapted to a specific field of knowledge (scientific, engineering, business, art, or other) with its own conventions, symbols, operators, and so on. With a grammar and Perl’s ability to create its own operators, you can often express specialized knowledge within the terminology framework of subject-matter experts.

13.2 Declarative Programming

Both regexes and grammars are examples of yet another programming paradigm that we haven’t really explored so far: declarative programming. This is a programming model in which, contrary to ordinary imperative or procedural programming, you don’t state how to do something and don’t choose your control flow. Rather, you specify a set of definitions, rules, properties, and possibly some constraints and actions, and let the program apply those to derive some new information about the input data. This form of programming is widely used in logic programming (e.g., Prolog), artificial intelligence, expert systems, data analysis, database query languages (e.g., SQL), text and source code recognition (e.g., Lex and Flex), program compilation (e.g., Yacc or Bison), configuration management, makefiles, and also in some ways functional programming.

13.3 Captures

As we noted in the regex examples at the beginning of this chapter, round parentheses not only group things together, but also capture data: they make the string matched by the subpattern within the parentheses available as a special variable:

my $str = 'number 42';
say "Number is $0" if $str ~~ /number \s+ (\d+) /;   # -> Number is 42
Here, the (\d+) subpattern matched the “42” part of the string, and that capture is available in the $0 variable. If a pattern contains several capture groups, they are numbered from left to right, starting at 0:

say "$0 $1 $2" if "abcde" ~~ /(a) b (c) d (e)/;   # -> a c e

This is fine for simple captures, but the numbering of captures can become tedious if there are many captures and somewhat complicated when there are nested parentheses in the pattern:

if 'abc' ~~ / ( a (.) (.) ) / {
    say "Outside: $0";                # Outside: abc
    say "Inside: $0[0] and $0[1]";    # Inside: b and c
}

When it gets complicated, it is often better to use another feature called named captures. The standard way to name a capture is as follows:

if 'abc;%' ~~ / $<capture_name> = \w+ / {
    say ~$<capture_name>;             # abc
}

The named capture is made available in the match object and can be retrieved with the $<capture_name> syntax (a shortcut for $/<capture_name>). Named captures can be nested using regular capture group syntax:

if 'abc' ~~ / $<overall>=( a $<part1>=(.) $<part2>=(.) ) / {
    say "Overall: $<overall>";          # Overall: abc
    say "Part 1: $<overall><part1>";    # Part 1: b
    say "Part 2: $<overall><part2>";    # Part 2: c
}

Assigning the match object to a hash gives you easy programmatic access to all named captures:

if 'abc' ~~ / $<overall>=( a $<part1>=(.) $<part2>=(.) ) / {
    my %capture = $/.hash;
    say ~%capture<overall>;             # -> abc
    for kv %capture<overall> -> $key, $val {
        say $key, " ", ~$val;           # -> part2 c \n part1 b
    }
}

But you might as well do the same thing directly on the match object without having to perform an extra hash assignment:

if 'abc' ~~ / $<overall>=( a $<part1>=(.) $<part2>=(.) ) / {
    say "Overall: $<overall>";          # -> Overall: abc
    for kv %<overall> -> $key, $val {
        say $key, " ", ~$val;           # -> part2 c \n part1 b
    }
}

Remember that, in the above code, %<overall> is just another way of accessing the <overall> named capture of the $/ match object. There is, however, a more convenient way to get named captures which is discussed in the next section.

13.4 Named Rules (a.k.a. Subrules)

It is possible to store pieces of regexes into named rules. The following example uses a named regex, which is one of the kinds of named rules, to match a text line:

my regex line { \N* \n }    # any number of characters other
                            # than new line, followed by 1 new line
if "abc\ndef" ~~ /<line> def/ {
    say "First line: ", $<line>.chomp;    # First line: abc
}

Notice that the syntax with a block of code is akin to a subroutine or method definition. This is not a coincidence; we will see that named rules are very similar to methods. Notably, rules can call each other (or even sometimes call themselves recursively) just like methods and subroutines, and we will see that this is a very powerful and expressive feature.
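As a small illustration of that recursive ability, here is a sketch (with a made-up pattern and test strings) of a named regex that matches balanced parentheses by calling itself:

my regex balanced { '(' [ <-[()]>+ || <balanced> ]* ')' }
say so "(a(b)(c(d)))" ~~ / ^ <balanced> $ /;   # -> True
say so "(a(b("        ~~ / ^ <balanced> $ /;   # -> False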
A named regex can be declared with the my declarator, followed by the regex keyword, a name, and a block containing the pattern, as we did above for the line regex; it can then be called from within another regex (or from a grammar) using the angle bracket syntax, <line>.
As you can see in the example above, a successful named regex creates a named capture with the same name. If you need a different name for the capture, you can do this with the syntax <capturename=regexname>. In this example, we call the same named regex twice and, for convenience, use a different name to distinguish the two captures:

my regex line { \N* \n }
if "abc\ndef\n" ~~ / <first=line> <second=line> / {
    say "First line: ", $<first>.chomp;      # -> First line: abc
    say "Second line: ", $<second>.chomp;    # -> Second line: def
    print $_.chomp for $<line>.list;         # -> abc def
}

Here, we have used chomp method calls to remove the new line characters from the captures. There is in fact a way to match on the new line character but exclude it from the capture:

my regex line { \N* )> \n }
if "abc\ndef\n" ~~ / <first=line> <second=line> / {
    say "First line: ", ~$<first>;           # -> First line: abc
    say "Second line: ", ~$<second>;         # -> Second line: def
    print $<line>.list;                      # -> abc def
}

This relatively little-known token, “)>”, marks the endpoint of the capture: the part of the pattern that follows it (here, the newline) still has to match, but is left out of the capture. Its counterpart, “<(”, marks the starting point of a capture in the same way.

Named regexes are only one form (and probably not the most common) of the named rules, which come in three main flavors:

- regex, the most generic form, in which the regex engine is allowed to backtrack within the pattern;
- token, which works like a regex except that it does not backtrack (it implies the :ratchet adverb);
- rule, which does not backtrack either and, in addition, treats whitespace in the pattern as significant: each whitespace sequence after a term is translated into a call to the <.ws> built-in rule (it implies both the :ratchet and :sigspace adverbs).
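To see where these differences matter, here is a small sketch with made-up patterns and strings. The first pair shows backtracking: the token ratchets forward and never gives back the digits it has consumed, so the final literal 6 can never match, whereas the regex backtracks and succeeds. The second pair shows significant whitespace in a rule:

my token digits-t { \d+ 6 }
my regex digits-r { \d+ 6 }
say so "123456" ~~ / <digits-t> /;    # -> False (no backtracking: \d+ swallowed the 6)
say so "123456" ~~ / <digits-r> /;    # -> True  (\d+ backtracks to 12345, then 6 matches)

my token greet-t { Hello World }
my rule  greet-r { Hello World }
say so "HelloWorld"  ~~ / <greet-t> /;   # -> True  (whitespace in a token pattern is ignored)
say so "HelloWorld"  ~~ / <greet-r> /;   # -> False (the rule requires whitespace or a boundary here)
say so "Hello World" ~~ / <greet-r> /;   # -> True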
In the two line-matching examples above, we did not need the regexes to backtrack. We could (and probably should) have used a named token instead of a named regex:

my token line { \N* \n }
if "abc\ndef" ~~ /<line> def/ {
    say "First line: ", $<line>.chomp;    # First line: abc
}

But, for a rule to match, we would have to remove the space from within the pattern:

my rule line { \N*\n }
if "abc\ndef" ~~ /<line> def/ {
    say "First line: ", $<line>.chomp;    # First line: abc
}

Collectively, these three types of named rules are usually referred to as rules, independently of the specific keyword used for their definition. Remember the various regexes we experimented with for extracting dates from a string in Subsection ?? (p. ??)? The last example used subpatterns as building blocks for constructing the full pattern. We could now rewrite it, with the added feature of recognizing multiple date formats, as follows:

my $string = "Christmas : 2016-12-25.";
my token year  { \d ** 4 }
my token month {
    1 <[0..2]>       # 10 to 12
    || 0 <[1..9]>    # 01 to 09
};
my token day { (\d ** 2) <?{1 <= $0 <= 31 }> }
my token sep { '/' || '-' }
my rule date {
    <year> (<sep>) <month> $0 <day>
    || <day> (<sep>) <month> $0 <year>
    || <month>\s<day>',' <year>
}
if $string ~~ /<date>/ {
    say ~$/;                              # -> 2016-12-25
    say "Day\t= "   , ~$/<date><day>;     # -> 25
    say "Month\t= " , ~$/<date><month>;   # -> 12
    say "Year\t= "  , ~$/<date><year>;    # -> 2016
}

The first four named tokens define the basic building blocks for matching the year, the month, the day, and possible separators. Then, the date named rule uses these building blocks to define an alternation between three possible date formats. This code checks that the day in the month is between 01 and 31 and that the month is between 01 and 12, and this is probably sufficient to recognize dates in a text in most cases, but this would match “2016-11-31” as a date, although November only has 30 days. We may want to be a little bit stricter about valid dates and prevent that by adding a negative code assertion to the date named rule:

my rule date {
    [ <year> (<sep>) <month> $0 <day>
      || <day> (<sep>) <month> $0 <year>
      || <month>\s<day>',' <year>
    ]
    <!{ $<day> > 30 and $<month> == 2|4|6|9|11}>
}

This is better, but we can still match an invalid date such as “2016-02-30”.

Exercise 1
As an exercise, change the code
assertion to reject a “Feb. 30” date. If you feel
courageous, you might even want to check the number of
days in February depending on whether the date occurs in
a leap year. You may also want to try to define and test
other date formats. Solution: ??
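Independently of the exercise, we can quickly check that the date rule above recognizes all three formats it advertises (a small test sketch reusing the tokens and the rule declared earlier; the test strings are made up):

for "2016-12-25", "25/12/2016", "12 25, 2016" -> $s {
    say "$s\t=> ", so $s ~~ /<date>/;    # -> True for all three formats
}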
Rules can (and usually should) be grouped in grammars; that’s in fact what they have been designed for.

13.5 Grammars

Grammars are a powerful tool used to analyze textual data and often to return data structures that have been created by interpreting that text. For example, any Perl 6 program is parsed and executed using a Perl 6 grammar written in Perl 6, and you could write a grammar for parsing (almost) any other programming language. To tell the truth, programmers rarely write grammars for parsing programming languages. But grammars are very useful for performing many tasks that are much more common than parsing programs. If you ever tried to use regexes for analyzing a piece of HTML (or XML) text, you probably found out that this quickly becomes next to impossible, except perhaps for the simplest HTML data. For analyzing any piece of such data, you need an actual parser which, in turn, will usually be based on an underlying grammar. If you didn’t like grammar in school, don’t let that scare you off grammars. Perl 6 grammars are nothing complicated; they just allow you to group named rules, just as classes allow you to group methods of regular code. A grammar creates a namespace and is introduced with the keyword grammar. It usually groups a number of named rules, in the same way a class groups a number of methods. A grammar is actually a class that inherits from the Grammar superclass, which provides methods such as .parse to analyze a string and .parsefile to analyze a file. Moreover, you can actually write some methods in a grammar, and even import some roles. And, as we shall see, grammars are often associated with some actions classes or actions objects. Unless told otherwise, the parsing methods will look for a default rule named “TOP” (which may be a named regex, token, or rule) to start the parsing. The date parsing rules used above might be assembled into a grammar as follows:

grammar My-date {
    rule TOP {
        \s*?
        [ <year> (<sep>) <month> $0 <day>
          || <day> (<sep>) <month> $0 <year>
          || <month>\s<day>',' <year>
        ] \s*
        <!{ ($<day> > 30 and $<month> == 2|4|6|9|11)}>
    }
    token year  { \d ** 4 }
    token month { 1 <[0..2]> || 0 <[1..9]> }
    token day   { (\d ** 2) <?{1 <= $0 <= 31 }> }
    token sep   { '/' || '-' }
}
for " 2016/12/25 ", " 2016-02-25 ", " 31/04/2016 " -> $string {
    my $matched = My-date.parse($string);
    say ~$matched if defined $matched;
}

This will print out:

2016/12/25
2016-02-25

The code assertion within the “TOP” rule prevents invalid dates such as “31/04/2016” from being matched; you would need to add some code for handling the end of February dates, as we did in the solution to the previous exercise (see Subsection ??) if this is important. You may want to do it as an exercise. Besides that, this code is not very different from our earlier code, but there are a few changes that are significant. I renamed the date rule as TOP because this is the default name searched by parse for the top-level rule. A grammar creates its own namespace and lexical scope, and I no longer need to declare the rules with the my declarator (which is required for rules declared outside of a grammar). Within a grammar, the order in which the rules are defined is generally not relevant, so that I could define the TOP rule first, even though it uses tokens that are defined afterwards (which again would not have been possible with rules used outside a grammar).
This is important because, within a grammar, you can have many rules that call each other (or rules that call themselves recursively), which would be impractical if the order of the rule definitions mattered. If you’re parsing the input string with the .parse method, the TOP rule is automatically anchored to the start and end of the string, which means that the grammar has to match the whole string to be successful. This is why we had to add patterns for spaces at the beginning and at the end of our TOP rule to match our strings which have some spaces before and after the date itself. There is another method, .subparse, which does not have to reach the end of the string to be successful, but we would still need to have the space pattern at the beginning of the rule.

13.6 Grammar Inheritance

A grammar can inherit from another grammar, just as a class can inherit from another class. Consider this very simple (almost simplistic) grammar for parsing a mail message:

grammar Message {
    rule TOP   { <greet> $<body>=<line>+? <end> }
    rule greet { [Hi||Hello||Hey] $<to>=\S+? ',' }
    rule end   { Later dude ',' $<from>=.+ }
    token line { \N* \n}
}

We can test it with the following code:

my $msg = "Hello Tom,
I hope you're well and that your car is now repaired.
Later dude, Liz";
my $matched = Message.parse($msg);
if defined $matched {
    say "Greeting \t= ", ~$matched<greet>.chomp;
    say "Addressee\t= $matched<greet><to>";
    say "Author \t= $matched<end><from>";
    say "Content \t= $matched<body>";
}

This will print out the following:

Greeting  = Hello Tom,
Addressee = Tom
Author    = Liz
Content   = I hope you're well and that your car is now repaired.

Suppose now that we want a similar grammar for parsing a more formal message and we figure out that we could reuse part of the Message grammar. We can have our new child grammar inherit from the existing parent:

grammar FormalMessage is Message {
    rule greet { [Dear] $<to>=\S+? ',' }
    rule end   { [Yours sincerely|Best regards] ',' $<from>=.+ }
}

The is Message trait in the header tells Perl that FormalMessage should inherit from the Message grammar. Only two rules, greet and end, need to be redefined; the others (the TOP rule and the line token) will be inherited from the Message grammar. Let’s try some code to run it:

my $formal_msg = "Dear Thomas,
enclosed is our invoice for June 2016.
Best regards, Elizabeth.";
my $matched2 = FormalMessage.parse($formal_msg);
if defined $matched2 {
    say "Greeting \t= ", ~$matched2<greet>.chomp;
    say "Addressee\t= $matched2<greet><to>";
    say "Author \t= $matched2<end><from>";
    say "Content \t= $matched2<body>";
}

This will print:

Greeting  = Dear Thomas,
Addressee = Thomas
Author    = Elizabeth.
Content   = enclosed is our invoice for June 2016.

13.7 Actions Objects

A successful grammar match gives you a parse tree of
match objects (objects of the Match class). The deeper the match tree gets and the more branches it has, the more difficult it becomes to navigate it and find the information you are actually interested in. To avoid the need for diving deep into a match tree, you can supply an actions object. After each successful match of a named rule in your grammar, it tries to call a method with the same name as the grammar rule, giving it the newly created match object as a positional argument. If no such method exists, it is skipped. (Action methods are sometimes also called reduction methods.) If it exists, the action method is often used to construct an abstract syntax tree (AST), i.e., a data structure presumably simpler to explore and to use than the match object tree, or it can do any other thing deemed to be useful. In this somewhat simplistic example of a basic arithmetic calculator, the actions don’t try to build an AST, but simply do the bulk of the calculation work between the various tokens matched by the grammar:

grammar ArithmGrammar {
    token TOP { \s* <num> \s* <operation> \s* <num> \s*}
    token operation { <[^*+/-]> }
    token num { \d+ | \d+\.\d+ | \.\d+ }
}
class ArithmActions {
    method TOP($/) {
        given $<operation> {
            when '*' { $/.make([*] $/<num>)}
            when '+' { $/.make([+] $<num>)}
            when '/' { $/.make($<num>[0] / $<num>[1]) }
            when '-' { $/.make([-] $<num>) }
            when '^' { $/.make($<num>[0] ** $<num>[1]) }
        }
    }
}
for ' 6*7 ', '46.2 -4.2', '28+ 14.0 ', '70 * .6 ', '126 /3', '6.4807407 ^ 2' -> $op {
    my $match = ArithmGrammar.parse($op, :actions(ArithmActions));
    say "$match\t= ", $match.made;
}

This prints the following output:

6*7           = 42
46.2 -4.2     = 42
28+ 14.0      = 42
70 * .6       = 42
126 /3        = 42
6.4807407 ^ 2 = 42.00000002063649

The aim of this example is not to describe how to implement a basic calculator (there are better ways to do that; we’ll come back to this), but only to show how actions may be used in conjunction with a grammar. The grammar is quite simple and is looking for two decimal numbers
separated by an infix arithmetic operator. If there is a
match, the two numbers are available in the match object as $<num> (a list of two matches) and the operator as $<operation>. The parse method is called with an :actions named argument, the ArithmActions class, which tells Perl which actions object to use with the grammar. In this example, we don’t really pass an action object, but simply the name of the actions class (actually a type object), because there is no need to instantiate an object. In other cases, for example if there was a need to initialize or somehow use some object attributes, we would need to pass an actual object that would have to be constructed beforehand. Whenever the TOP rule succeeds, the TOP method of class ArithmActions is invoked with the match object for the current rule as the argument. This method calls the make method on the match object and returns the result of the actual arithmetic operation between the two numbers. Then, the made method in the caller code (within the for loop) returns that result.

13.8 A Grammar for Parsing JSON

JSON (JavaScript Object Notation) is an open-standard format for text data derived from the object notation in the JavaScript programming language. It has become one of the commonly used standards for serializing data structures, which makes it possible, for example, to exchange them between different platforms and different programming languages, to send them over a network, and to store them permanently in files on disks.

13.8.1 The JSON Format

The JSON format is quite simple and is composed of two types of structural entities:

- objects, i.e., unordered collections of key-value pairs delimited by curly braces; each key is a string, a colon separates a key from its value, and the pairs are separated by commas;
- arrays, i.e., ordered lists of values, delimited by square brackets and with the values separated by commas.
Values can be either (recursively) objects or arrays as defined just above, or basic data types, which are: strings, numbers, Booleans (true or false), and null (empty value or undefined value). A string is a sequence of Unicode characters between quotation marks, and numbers are signed decimal numbers that may contain a fractional part and may use exponential “E” notation.

13.8.2 Our JSON Sample

To illustrate the format description above and for the purpose of our tests, we will use an example borrowed from the Wikipedia article on JSON (https://en.wikipedia.org/wiki/JSON), which is a possible JSON description of a person:

{
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": true,
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "phoneNumbers": [
        { "type": "home",   "number": "212 555-1234" },
        { "type": "office", "number": "646 555-4567" },
        { "type": "mobile", "number": "123 456-7890" }
    ],
    "children": [],
    "spouse": null,
    "Bank-account": { "credit": 2342.25 }
}
Compared to the Wikipedia example, we’ve added a “Bank-account” entry, so that our sample also contains a number with a fractional part.
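The parsing examples later in this chapter use this document stored in a string variable named $JSON-string. One simple way to set that up is a heredoc, sketched here with a shortened version of the sample (in practice, paste the full document between the delimiters):

my $JSON-string = q:to/END/;
{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "children": [],
    "spouse": null
}
END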
13.8.3 Writing the JSON Grammar Step by Step

Let’s take each of the JSON entities in turn and handle them with rules.

13.8.3.1 Numbers

The example JSON document above only has integers and decimal numbers, but we need to be able to recognize numbers such as “17,” “-138.27,” “1.2e-3,” “.35,” etc. We can use the following token to do so:

token number {
    [\+|\-]?                    # optional sign
    [ \d+ [ \. \d+ ]? ]         # integer part and optional fractional part
    | [ \. \d+ ]                # or only a fractional part
    [ <[eE]> [\+|\-]? \d+ ]?    # optional exponent
}

13.8.3.2 JSON Strings

There are many possible patterns to define a string. For our sample JSON document, the following rule will be sufficient:

token string { \" <[ \w \s \- ' ]>+ \" }

This will match a double-quoted sequence of alphanumeric characters, spaces, dashes, and apostrophes. For a real JSON parser, a rule using a negative character class excluding anything that cannot belong to a string might be better, for example:

token string { \" <-[\n " \t]>* \" }

i.e., a double-quoted sequence of any characters other than double quotes, newlines, and tabs. You might want to study the JSON standards to figure out exactly what is accepted or forbidden in a JSON string. For our purposes, the first rule above will be sufficient.

13.8.3.3 JSON Objects

JSON objects are lists of key-value pairs. These lists are delimited by curly braces, and the pairs are separated by commas. A key-value pair is a string followed by a colon, followed by a value (to be defined later). This can be defined as follows:

rule object   { '{' <pairlist> '}' }
rule pairlist { [<pair> [',' <pair>]*] }
rule pair     { <string> ':' <value> }
We can use a regex feature that we haven’t seen yet, the
quantifier modifier, to simplify the pairlist
rule. To more easily match things like comma-separated
values, you can tack on a % (modulo) symbol to a quantifier such as + or *, followed by the separator pattern that must appear between the repeated items (for example, <pair> + % \, means “one or more <pair>, separated by commas”). Thus, the pairlist rule can be rewritten as follows:

rule pairlist {<pair> + % \,}

or:

rule pairlist {<pair> * % \,}

if we accept that a pairlist may also be empty.

13.8.3.4 JSON Arrays

Arrays are comma-separated lists of values between square brackets:

rule array { '[' <valueList> ']'}
rule valueList { <value> * % \, }

Here, we again used the modified quantifier shown just above.

13.8.3.5 JSON Values

Values are objects, arrays, strings, numbers, Booleans (true or false), or null:

token value {
    | <object>
    | <array>
    | <string>
    | <number>
    | true
    | false
    | null
}

13.8.4 The JSON Grammar

We have defined all the elements of the JSON grammar; we only need to declare a grammar and to add a TOP rule to complete it:

grammar JSON-Grammar {
    token TOP      { \s* [ <object> | <array> ] \s* }
    rule object    { '{' \s* <pairlist> '}' \s* }
    rule pairlist  { <pair> * % \, }
    rule pair      { <string>':' <value> }
    rule array     { '[' <valueList> ']'}
    rule valueList { <value> * % \, }
    token string   { \" <[ \w \s \- ' ]>+ \" }
    token number   {
        [\+|\-]?
        [ \d+ [ \. \d+ ]? ] | [ \. \d+ ]
        [ <[eE]> [\+|\-]? \d+ ]?
    }
    token value    { <object> | <array> | <string> | <number> | true | false | null }
}

We can now test the grammar with our sample JSON string and try to print the match object:

my $match = JSON-Grammar.parse($JSON-string);
say ~$match if $match;

This produces the following output:

{
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": true,
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "phoneNumbers": [
        { "type": "home",   "number": "212 555-1234" },
        { "type": "office", "number": "646 555-4567" },
        { "type": "mobile", "number": "123 456-7890" }
    ],
    "children": [],
    "spouse": null,
    "Bank-account": { "credit": 2342.25 }
}

The sample JSON document has been fully matched. This JSON grammar works perfectly on it, and takes less than 20 lines of code. If you think about it, this is really powerful. Test it for yourself. Try to change the grammar in various places to see if it still works. You could also try to introduce errors into the JSON document (for example to remove a comma between two values of a list) and the match should no longer occur (or, at least, should not be the same). You may object that this grammar covers only a subset of JSON. This is sort of true, but not really: it is almost complete. True, I would not recommend using this grammar in a production environment for parsing JSON documents, because it has been built only for pedagogical purposes and may not comply with every single fine detail of the JSON standards. Take a look at the grammar of the Perl 6 JSON::Tiny module (https://github.com/moritz/json), which can parse any valid JSON document. It is not much more complicated than what we have shown here (except for the use of proto regexes, a topic that we haven’t covered here), and it is not much longer, as it contains about 35 code lines.

13.8.5 Adding Actions

The JSON grammar works fine, but printing out the tree of parse objects just for our relatively small JSON document will display about 300 lines of text, as it provides all the details of everything that has been matched, rule by rule and subpattern by subpattern. This can be very useful in helping you to understand what the grammar does (especially when it does not work as expected), but exploring that tree to extract the data can be quite tedious.
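To get a feel for why this is tedious, here is what retrieving a single field from the raw match tree looks like (a quick sketch using the $match object from the parse above; the subscripts follow the names of the grammar’s rules):

say ~$match<object><pairlist><pair>[0]<string>;   # -> "firstName"
say ~$match<object><pairlist><pair>[0]<value>;    # -> "John"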
You can use actions to populate a simpler tree structure (often called an abstract syntax tree) containing only the information you really need. Let us add an actions class to build an abstract syntax tree (AST):

class JSON-actions {
    method TOP($/)       { make $/.values.[0].made; };
    method object($/)    { make $<pairlist>.made.hash.item; }
    method pairlist($/)  { make $<pair>>>.made.flat; }
    method pair($/)      { make $<string>.made => $<value>.made; }
    method array($/)     { make $<valueList>.made.item; }
    method valueList($/) { make [$<value>.map(*.made)]; }
    method string($/)    { make ~$0 }
    method number($/)    { make +$/.Str; }
    method value($/)     {
        given ~$/ {
            when "true"  {make Bool::True;}
            when "false" {make Bool::False;}
            when "null"  {make Any;}
            default      { make $<val>.made;}
        }
    }
}

For this actions class to work, we need to make a small change to the grammar. The value method uses a val named capture to access its content; we need to add the relevant named captures to the value token:

token value {
    <val=object> | <val=array> | <val=string> |
    <val=number> | true | false | null
}

We can now call our grammar with the following syntax:

my $j-actions = JSON-actions.new();
my $match = JSON-Grammar.parse($JSON-string, :actions($j-actions));
say $match.made;

Notice that, here, we’ve used an actions object rather than simply the actions class, but this is just for the purpose of showing how to do it; we could have used the class directly as before. The last statement in the above code prints out the AST. We have reformatted the output to better show the structure of the AST:

{
    Bank-account => { credit => 2342.25 },
    address => {
        city => New York,
        postalCode => 10021-3100,
        state => NY,
        streetAddress => 21 2nd Street
    },
    age => 25,
    children => [],
    firstName => John,
    isAlive => True,
    lastName => Smith,
    phoneNumbers => [
        {number => 212 555-1234, type => home}
        {number => 646 555-4567, type => office}
        {number => 123 456-7890, type => mobile}
    ],
    spouse => (Any)
}

In this case, the top structure is a hash (it could also have been an array with a different JSON input string). We can now explore this hash to find the data of interest to us. For example:

say "Keys are: \n", $match.made.keys;
say "\nSome values:";
say $match.made{$_} for <firstName lastName isAlive>;
say $match.made<address><city>;
say "\nPhone numbers:";
say $match.made<phoneNumbers>[$_]<type number>
    for 0..$match.made<phoneNumbers>.end;

This will display the following output:

Keys are:
(lastName Bank-account phoneNumbers children address age firstName spouse isAlive)

Some values:
John
Smith
True
New York

Phone numbers:
(home 212 555-1234)
(office 646 555-4567)
(mobile 123 456-7890)

13.9 Inheritance and Mutable Grammars

The capacity for a grammar to inherit from another one opens the door to very rich possibilities in terms of extending the Perl 6 language itself. It is possible, for example in the context of a module or a framework, to “subclass” the standard Perl grammar, i.e., to write a new child grammar that inherits from the standard Perl grammar, but adds a new feature, overloads an operator, or modifies some other syntax element, and to run this program with the same Perl 6 compiler, but with this locally modified grammar. This means that it is actually possible to dynamically extend the language for new needs, often without even changing the compiler or the virtual machine. These are, however, advanced topics that are more geared towards language gurus than to beginners.
So we only mention these exciting possibilities with the hope of whetting your appetite and pushing you to study them further, but will not dwell on them in this book.

13.10 Debugging

Writing grammars is a lot of fun, but it can also be difficult or even tedious when you start. When you started to practice programming with this book, you probably made a lot of small mistakes that initially prevented your programs from compiling and running, or from doing what you expected. With practice, however, you hopefully gradually made fewer errors and spent less time chasing bugs. When you begin to learn grammars (and to a lesser extent regexes), you may feel like you are starting again at square one. Even very good programmers often make silly mistakes when they start writing grammars. It is a different programming paradigm, and it requires a new learning phase. In this case, small is beautiful. Start with small regexes and small rules, and with small test input. Test individual regexes or rules under the REPL, and add them to your code only when you’re confident that they do what you want. Write test cases at the same time as your code (or actually even before you write the code), and make sure that you pass all the relevant tests before moving on. And add new tests when you add new rules. One standard debugging technique is to add print statements to the code in order to figure out various information about the state of the program (such as the value of variables, the flow of execution of the program, etc.). You can also do that with regexes and grammars. Let’s take the example of the very simple grammar for matching dates from Section ?? and let’s suppose that you have written this grammar:

grammar My-Date {
    token TOP   { \s* <year> '-' <month> '-' <day> }
    token year  { \d ** 4 }
    token month { 1 <[0..2]> || 0 <[1..9]> }
    token day   { (\d ** 2) <?{1 <= $0 <= 31 }> }
}
my $string = " 2016-07-31 ";
say so My-Date.parse($string);    # -> False

This test fails. At this point, it has already become a bit difficult to figure out why the grammar fails (unless we have thoroughly tested each of the three tokens before building the grammar, but let’s assume for the sake of this discussion that we haven’t). Let’s try not to randomly change things here or there and see if it works better; we would be likely to spend hours doing that and probably not get anywhere. Let’s be more methodical. Let’s first test the building-block tokens, year, month, and day. We’ve seen before that the parse method looks by default for the TOP rule in the grammar, but you can specify another rule, and that’s what we need here. We can test these tokens individually:

say so My-Date.parse("2016", :rule<year>);    # -> True
say so My-Date.parse("07", :rule<month>);     # -> True
say so My-Date.parse("31", :rule<day>);       # -> True

These three tokens seem to work fine. At this point, you might be guessing where the problem is, but let’s assume you don’t. We need to debug the “TOP” token. We can just use the common debugging method of printing where we are at various stages of the program. You can insert a code block with a print statement within a named rule. Let’s try to change the TOP token to this:

token TOP {
    \s* <year>  { say "matched year"; }
    '-' <month> { say "matched month";}
    '-' <day>   { say "matched day"; }
}

This displays the following output:

matched year
matched month
matched day

So, even the “TOP” token seems to work almost to the end. At this point, we should be able to figure out that we lack final spacing in the “TOP” token.
So either we should add optional spacing at the end of the token:

token TOP { \s* <year> '-' <month> '-' <day> \s*}

or change it to a rule:

rule TOP { \s* <year> '-' <month> '-' <day> }

or it was possibly the test string that was wrong (because it wasn’t supposed to have spaces) and needed to be fixed. If you have an actions class, you can also add print statements to the actions methods. Remember also that the Perl debugger (see Section ??) can be very helpful. We briefly showed in Subsection ?? (p. ??) how to go step by step through a regex match. Most of what has been described there also applies to debugging grammars.
Finally, there is a very good module, Grammar::Debugger (which also provides Grammar::Tracer), to help with debugging grammars. If you add the following statement to your program:

use Grammar::Tracer;

then any grammar within the lexical scope will print out debugging information about the rules which tried to match, those which succeeded, and those which failed. You can also use the following:

use Grammar::Debugger;

to do the same thing step by step. Just type “h” at the prompt for a list of available commands.

13.11 Glossary
13.12 Exercise: A Grammar for an Arithmetic Calculator

The arithmetic calculator presented in Section ?? above is very simplistic. In particular, it can parse simple arithmetic expressions composed of two operands separated by one infix operator, but not much more than that. We would like to be able to parse more complicated arithmetic expressions. The calculator should also be able to handle:

- expressions with more than one arithmetic operation;
- operator precedence, so that, for example, multiplication and division take precedence over addition and subtraction;
- parenthesized subexpressions, which take precedence over everything else.
These are a few examples of expressions the calculator should parse and compute correctly:

3 + 4 + 5;
3 + 4 * 5;      # result should be 23
(3 + 4) * 5;    # result should be 35

Exercise 2
Your mission,
[Dan|Jim], should you choose to accept it,
is to write such a grammar. As usual, should you fail, the
Government shall deny any knowledge of your actions class.
There are many possible ways to accomplish this; the solution presented in Section ?? is only one of them.