The tokeniser, in toke.c, is one of the most difficult parts of the Perl core to understand, primarily because there is no real "roadmap" to explain its operation. In this section, we'll try to show how the tokeniser is put together.
The core of the tokeniser is the intimidatingly long yylex function. This is the function called by the parser, yyparse, when it requests a new token of input.
First, some basics. When a token has been identified, it is placed in PL_tokenbuf. The file handle from which input is being read is PL_rsfp. The current position in the input is stored in the variable PL_bufptr, which is a pointer into the PV of the SV PL_linestr. When scanning for a token, the variable s advances from PL_bufptr towards the end of the buffer (PL_bufend) until it finds a token.
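As a toy illustration of that scan (the names echo toke.c, but this is a sketch under our own simplifying assumptions, not the real code), here is a scanner that advances from the current buffer position to the end of the buffer, copying one identifier into a token buffer:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Toy model of the scan: as in toke.c, s advances from the current
 * buffer position towards bufend, and the token found is copied into
 * a token buffer. Illustrative code, not the real tokeniser. */
static const char *scan_word_toy(const char *s, const char *bufend,
                                 char *tokenbuf, size_t maxlen)
{
    size_t i = 0;
    while (s < bufend && i < maxlen - 1 &&
           (isalnum((unsigned char)*s) || *s == '_'))
        tokenbuf[i++] = *s++;
    tokenbuf[i] = '\0';
    return s;                     /* the new current position */
}
```

The return value plays the role of the updated scan position: the caller stores it back, just as yylex records how far s got.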
The first thing yylex does is test whether the next thing in the input stream has already been identified as an identifier; when the tokeniser sees '%', '$' and the like in the input, it tests whether the character introduces a variable. If so, it puts the variable name into the token buffer. It then returns the type sigil (%, $, etc.) as a token, and sets a flag (PL_pending_ident) so that the next time yylex is called, it can pull the variable name straight out of the token buffer. Hence, right at the top of yylex, you'll see code which tests PL_pending_ident and deals with the variable name.
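The two-call handshake can be sketched like this (the names stand in for PL_tokenbuf and PL_pending_ident; this is an assumption-laden toy, not the real yylex):

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Sketch of the pending-identifier handshake: on seeing "$foo", the
 * first call returns the sigil and stashes the name; the second call
 * returns the stashed name without rescanning. Toy code only. */
static char pending[32];          /* stands in for PL_tokenbuf */
static int  pending_ident;        /* stands in for PL_pending_ident */

static int toy_lex(const char **sp, char *namebuf)
{
    const char *s = *sp;
    if (pending_ident) {          /* name stashed on the previous call */
        pending_ident = 0;
        strcpy(namebuf, pending);
        return 'W';               /* a "word" token */
    }
    if (*s == '$' || *s == '@' || *s == '%') {
        int sigil = *s++;
        size_t i = 0;
        while (*s && (isalnum((unsigned char)*s) || *s == '_'))
            pending[i++] = *s++;  /* toy: assumes the name fits */
        pending[i] = '\0';
        pending_ident = 1;
        *sp = s;
        return sigil;             /* sigil now, name next time */
    }
    return 0;
}
```

Calling it twice on "$foo" yields the sigil token first and the word token second, which is exactly the order the parser sees.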
Next, if there's no identifier in the token buffer, it checks its tokeniser state. The tokeniser uses the variable PL_lex_state to store state information.
One important state is LEX_KNOWNEXT, which occurs when Perl has had to look ahead one token to identify something. If this happens, it has tokenised not just the next token, but the one after as well. Hence, it sets LEX_KNOWNEXT to say "we've already tokenised this token, simply return it."
The functions which set LEX_KNOWNEXT are force_word, which declares that the next token has to be a word (for instance, after having seen an arrow in $foo->bar); force_ident, which makes the next token an identifier (for instance, a * seen when Perl is not expecting an operator must be a glob); force_version, on seeing a number after use; and the general force_next.
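A minimal sketch of the mechanism these functions share (the names mirror the real ones, but the bodies are ours, assuming a one-slot queue for simplicity):

```c
#include <assert.h>

/* Sketch of LEX_KNOWNEXT: force_next queues a token that has already
 * been identified, and the next call to the lexer returns it directly
 * instead of scanning. Illustrative names and values only. */
enum { LEX_NORMAL, LEX_KNOWNEXT };

static int lex_state = LEX_NORMAL;
static int nexttoke;              /* a one-slot queue of forced tokens */

static void force_next(int tok)
{
    nexttoke = tok;
    lex_state = LEX_KNOWNEXT;
}

static int toy_yylex(void)
{
    if (lex_state == LEX_KNOWNEXT) {   /* already tokenised: return it */
        lex_state = LEX_NORMAL;
        return nexttoke;
    }
    return -1;                    /* normal scanning would happen here */
}
```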
Many of the other states are to do with interpolation of double-quoted strings; we'll look at those in more detail in the next section.
After checking the lexer state, it's time to actually peek at the buffer and see what's waiting; this is the start of the giant switch statement in the middle of yylex, just following the label retry.
One of the first things we check for is character zero, which signifies either the start of the file, the end of the file, or the end of the line. If it's the end of the file, the tokeniser returns zero and the game is over; at the beginning of the file, Perl has to process the code for command-line switches such as -n and -p. Otherwise, Perl calls filter_gets to get a new line from the file through the source filter system, and calls incline to increment the line number.
The next test is for comments and newlines, which Perl skips over. After that come the tests for individual special characters. For instance, the first test is for minus, which could be unary minus if followed by a number or identifier, or the binary minus operator if Perl is expecting an operator, or the arrow operator if followed by a >, or the start of a filetest operator if followed by an appropriate letter, or part of a quoting construct such as (-foo => "bar"). Perl tests for each case, and returns the token type using one of the upper-case token macros defined at the beginning of toke.c: OPERATOR, TERM, and so on.
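The flavour of that lookahead can be captured in a few lines (the token names here are invented for the sketch, and the filetest check is far cruder than the real one):

```c
#include <assert.h>
#include <ctype.h>

/* Sketch of the '-' disambiguation: the decision rests on the next
 * character and on whether an operator is expected. Token names are
 * made up; real toke.c checks the exact filetest letters. */
enum { T_ARROW, T_BINOP, T_FILETEST, T_UNARY };

static int classify_minus(const char *s, int expect_operator)
{
    if (s[1] == '>')                   return T_ARROW;    /* $foo->bar */
    if (expect_operator)               return T_BINOP;    /* $a - $b   */
    if (isalpha((unsigned char)s[1]) && !isalnum((unsigned char)s[2]))
        return T_FILETEST;                                /* -e $file  */
    return T_UNARY;                                       /* -5, -$x   */
}
```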
If the next character isn't a symbol that Perl knows about, it should be an alphabetic character which might start a keyword: the tokeniser jumps to the label keylookup, where it checks for labels and things like CORE::function. It then calls keyword to test whether it is a valid built-in or not - if so, keyword turns it into a special constant (such as KEY_open) which can be fed into the switch statement. If it's not a keyword, Perl has to determine whether it's a bareword, a function call, or an indirect object or method call.
The final section of the switch statement deals with the KEY_ constants handed back from keyword, performing any actions necessary for using the builtins. (For instance, given __DATA__, the tokeniser sets up the DATA filehandle.)
"Sublexing" refers to the fact that inside double-quoted strings and other interpolation contexts (regular expressions, for instance) a different type of tokenisation is needed.
This is typically started after a call to scan_str, which is an exceptionally clever piece of code which extracts a string with balanced delimiters, placing it into the SV PL_lex_stuff. Then sublex_start is called which sets up the data structures used for sublexing and changes the lexer's state to LEX_INTERPPUSH, which is essentially a scoping operator for sublexing.
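The heart of scan_str - finding the matching close for an arbitrary open delimiter, nesting included - can be sketched as follows (a simplification under our own assumptions: no backslash escapes, no refilling the buffer across lines):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of scan_str's core idea: given the opening delimiter, find
 * the matching close, honouring nesting for bracketing pairs. The
 * real function also handles escapes and multi-line input. */
static const char *find_matching(const char *s)
{
    char open = *s, close;
    switch (open) {
        case '(': close = ')'; break;
        case '[': close = ']'; break;
        case '{': close = '}'; break;
        case '<': close = '>'; break;
        default:  close = open; break;   /* q!...!, q/.../ and so on */
    }
    int depth = 1;
    for (s++; *s; s++) {
        if (*s == open && open != close)
            depth++;                     /* a nested opener */
        else if (*s == close && --depth == 0)
            return s;                    /* the balancing closer */
    }
    return NULL;                         /* unterminated string */
}
```

Balanced bracketing is why q{a{b}c} works: the inner brace pair is counted, so only the final brace ends the string.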
Why does sublexing need scoping? Well, consider something like "Foo\u\LB\uarBaz". This actually gets tokenised as the moral equivalent of "Foo" . ucfirst(lc("B" . ucfirst("arBaz"))). The push state (which makes a call to sublex_push) quite literally pushes an opening bracket onto the input stream.
This in turn changes the state to LEX_INTERPCONCAT; the concatenation state uses scan_const to pull out constant strings and supplies the concatenation operator between them. If a variable to be interpolated is found, the state is changed to LEX_INTERPSTART: this means that "foo$bar" is changed into "foo".$bar and "foo@bar" is turned into "foo".join($",@bar).
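The effect of these states can be demonstrated with a toy rewriter (entirely our own construction; it assumes a single $var with no braces, subscripts or arrays):

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Sketch of what LEX_INTERPCONCAT/LEX_INTERPSTART amount to:
 * rewriting a double-quoted string into an explicit concatenation.
 * This toy handles one $var, with no braces or subscripts. */
static void rewrite_interp(const char *in, char *out)
{
    const char *dollar = strchr(in, '$');
    if (!dollar) {                        /* no interpolation at all */
        sprintf(out, "\"%s\"", in);
        return;
    }
    const char *p = dollar + 1;           /* find the end of the name */
    while (isalnum((unsigned char)*p) || *p == '_')
        p++;
    if (*p)                               /* constant . variable . constant */
        sprintf(out, "\"%.*s\" . $%.*s . \"%s\"",
                (int)(dollar - in), in,
                (int)(p - dollar - 1), dollar + 1, p);
    else                                  /* nothing after the variable */
        sprintf(out, "\"%.*s\" . $%.*s",
                (int)(dollar - in), in,
                (int)(p - dollar - 1), dollar + 1);
}
```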
There are times when Perl cannot be sure where the sublexing of an interpolated variable should end; in these cases, the horrifyingly scary function intuit_more is called to make an educated guess about the likelihood of more interpolation.
Finally, once sublexing is done, the state is set to LEX_INTERPEND which fixes up the closing brackets.