Chapter 5. The Lexer and the Parser

In this chapter, we're going to examine how Perl goes about turning a piece of Perl code into an internal representation ready to be executed. The nature of the internal representation, a tree of structures representing operations, will be looked at in the next chapter, but here we'll just concern ourselves with the lexer and parser which work together to "understand" Perl code.

5.1. The Parser

The parser lives in perly.y. This is code in a language called Yacc, which is converted to C using the byacc command.

In fact, Perl needs to do some fixing up on the byacc output to have it deal with dynamic rather than static memory allocation. Hence, if you make any changes to perly.y, just running byacc isn't enough - you need to run the Make target run_byacc, which will do the fixups that Perl requires.

In order to understand this language, we need to understand how grammars work and how parsing works.

5.1.1. BNF and Parsing

Computer programmers define a language by its grammar, which is a set of rules. They usually describe this grammar in a form called "Backhaus-Naur Form" [1] or BNF. BNF tells us how phrases fit together to make sentences. For instance, here's a simple BNF for English - obviously, this isn't going to describe the whole of the English grammar, but it's a start:

	  sentence   : nounphrase verbphrase nounphrase;

	  verbphrase : VERB;

	  nounphrase : NOUN
	  | ADJECTIVE  nounphrase
	  | PRONOMINAL nounphrase
	  | ARTICLE    nounphrase;
	
Here is the prime rule of BNF: you can make the thing on the left of the colon if you see all the things on the right in sequence. So, this grammar tells us that a sentence is made up of a noun phrase, a verb phrase and then a noun phrase. The vertical bar does exactly what it does in regular expressions: you can make a noun phrase if you have a noun, or an adjective plus another noun phrase, or an article plus a noun phrase. Turning the things on the right into the thing on the left is called a reduction. The idea of parsing is to reduce all of the input down to the first thing in the grammar - a sentence.

You'll notice that things which can't be broken down any further are in capitals - there's no rule which tells us how to make a noun, for instance. This is because these are fed to us by the lexer; these are called terminal symbols, and the things which aren't in capitals are called non-terminal symbols. Why? Well, let's see what happens if we try and parse a sentence in this grammar.

The text right at the bottom - "my cat eats fish" - is what we get in from the user. The tokeniser then turns that into a series of tokens - "PRONOMINAL NOUN VERB NOUN". From that, we can start performing some reductions: we have a pronominal, so we're looking for a noun phrase to satisfy the nounphrase : PRONOMINAL nounphrase rule. Can we make a noun phrase? Yes, we can, by reducing the NOUN ("cat") into a nounphrase. Then we can use PRONOMINAL nounphrase to make another nounphrase.

Now we've got a nounphrase and a VERB. We can't do anything further with the nounphrase, so we'll switch to the VERB, and the only thing we can do with that is turn it into a verbphrase. Finally, we can reduce the noun to a nounphrase, leaving us with nounphrase verbphrase nounphrase. Since we can turn this into a sentence, we've parsed the text.

5.1.2. Parse actions and token values

It's important to note that the tree we've constructed above - the "parse tree" - is only a device to help us understand the parsing process. It doesn't actually exist as a data structure anywhere in the parser. This is actually a little inconvenient, because the whole point of parsing a piece of Perl text is to come up with a data structure pretty much like that.

Not a problem. Yacc allows us to extend BNF by adding actions to rules - every time the parser performs a reduction using a rule, it can trigger a piece of C code to be executed. Here's an extract from Perl's grammar in perly.y:

term    :   term ASSIGNOP term
	  { $$ = newASSIGNOP(OPf_STACKED, $1, $2, $3); }
        |   term ADDOP term
	  { $$ = newBINOP($2, 0, scalar($1), scalar($3)); }
	
The pieces of code in the curlies are actions to be performed. Here's the final piece of the puzzle: each symbol carries some additional information around. For instance, in our "cat" example, the first NOUN had the value "cat". You can get at the value of a symbol by a Yacc variable starting with a dollar sign: in the example above, $1 is the value of the first symbol on the right of the colon (term), $2 is the value of the second symbol (either ASSIGNOP or ADDOP depending on which line you're reading) and so on. $$ is the value of the symbol on the left. Hence information is propagated "up" the parse tree by manipulating the information on the right and assigning it to the symbol on the left.

5.1.3. Parsing some Perl

So, let's see what happens if we parse the Perl code $a = $b + $c. We have to assume that $a, $b and $c have already been parsed a little; they'll turn into term symbols. Each of these symbols will have a value, and that will be an "op". An "op" is a data structure representing an operation, and the operation to be represented will be that of retrieving the storage pointed to by the appropriate variable.

Let's start from the right[2] , and deal with $b + $c. The + is turned by the lexer into the terminal symbol ADDOP. Now, just like there can be lots of different nouns that all get tokenised to NOUN, there can be several different ADDOPs - concatenation is classified as an ADDOP, so $b . $c would look just the same to the parser. The difference, of course, is the value of the symbol - this ADDOP will have the value '+'.

Hence, we have term ADDOP term. This means we can perform a reduction, using the second rule in our snippet. When we do that, we have to perform the code in curlies underneath the rule - { $$ = newBINOP($2, 0, scalar($1), scalar($3)); }. newBINOP is a function which creates a new binary "op". The first argument is the type of binary operator, and we feed it the value of the second symbol. This is ADDOP, and as we have just noted, this symbol will have the value '+'. So although '.' and '+' look the same to the parser, they'll eventually be distinguished by the value of their symbol. Back to newBINOP. The next argument is the flags we wish to pass to the op. We don't want anything special, so we pass zero.

Then we have our arguments to the binary operator - obviously, these are the value of the symbol on the left and the value of the symbol on the right of the operator. As we mentioned above, these are both "op"s, to retrieve the values of $b and $c respectively. We assign the new "op" created by newBINOP to be the value of the symbol we're propagating upwards. Hence, we've taken two ops - the ones for $b and $c - plus an addition symbol, and turned them into a new op representing the combined action of fetching the values of $b and $c and then adding them together.

Now we do the same thing with $a = ($b+$c). I've put the right hand side in brackets to show that we've already got something which represents fetching $b and $c and adding them. = is turned into an ASSIGNOP by the tokeniser in the same way as we turned + into an ADDOP. And, in just the same way, there are various different types of assignment operator - ||= and &&= are also passed as ASSIGNOPs. From here, it's easy: we take the term representing $a, plus the ASSIGNOP, plus the term we've just constructed, reduce them all to another term, and perform the action underneath the rule. In the end, we end up with a data structure a little like this:

You can find a hypertext version of the Perl grammar at http://simon-cozens.org/hacks/grammar.pdf

Notes

[1]

Sometimes "Backhaus Normal Form"

[2]

This is slightly disingenous, as parsing is always done from left to right, but this simplification is easier than getting into the details of how Yacc grammars recognise the precendence of operators.