Chapter 6. Fundamental Operations

So we've seen that the job of the parsing stage is to reduce a program to a tree structure, and each node of the tree represents an operation. In this chapter, we'll look more closely at those operations: what they are, how they're coded, and how they fit together.

6.1. The basic op

Just AVs and HVs are "extensions" of the basic SV structure, there are a number of different "flavours" of ops, built on a basic OP structure; you can find this structure defined as BASEOP in op.h:

    OP*         op_next;
    OP*         op_sibling;
    OP*         (CPERLscope(*op_ppaddr))(pTHX); 
    PADOFFSET   op_targ;
    OPCODE      op_type;
    U16         op_seq;
    U8          op_flags;
   U8          op_private;
    

Some of these fields are easy to explain, so we'll deal with them now.

The op_next field is a pointer to the next op which needs to be executed. We'll see later, in Section 6.1.3, how the "thread of execution" is derived from the tree.

op_ppaddr is the address of the C function which carries out this particular operation. It's stored here so that our main execution code can simply dereference the function pointer and jump to it, instead of having to perform a lookup.

Each unique operation has a different number; this can be found in the enum in opnames.h:

typedef enum opcode {
    OP_NULL,        /* 0 */
    OP_STUB,        /* 1 */
    OP_SCALAR,      /* 2 */
    OP_PUSHMARK,    /* 3 */
    OP_WANTARRAY,   /* 4 */
    OP_CONST,       /* 5 */
    OP_GVSV,        /* 6 */
    OP_GV,          /* 7 */
    ...
};
The number of the operation to perform is stored in the op_type field. We'll examine some of the more interesting operations in Section 6.1.1.

op_flags is a set of flags generic to all ops; op_private stores flags which are specific to the type of op. For instance, the repeat op which implements the x operator has the flag OPpREPEAT_DOLIST set when it's repeating a list rather than a string. This flag only makes sense for that particular operation, so is stored in op_private. Private flags have the OPp prefix, and public flags begin with OPf.

op_seq is a sequence number allocated by the optimizer. It allows for, for instance, correct scoping of lexical variables by storing the sequence numbers of the beginning and end of scope operations inside the pad.

As for the remaining fields, we'll examine op_sibling in Section 6.1.2 and op_targ in Section 6.4

6.1.1. The different operations

Perl has currently 351 different operations, implementing all the built-in functions and operators, as well as the more structural operations required internally - entering and leaving a scope, compiling regular expressions and so on.

The array PL_op_desc in opcode.h describes each operation: it may be easier to follow the data from which this table is generated, at the end of opcode.pl. We'll take a longer look at that file later on in this chapter.

Many of the operators are familiar from Perl-space, such as concat and splice, but some are used purely internally: for instance, one of the most common, gvsv fetches a scalar variable; enter and leave are block control operators, and so on.

6.1.2. Different "flavours" of op

There are a number of different "flavours" of op structure, related to the arguments of an operator and how it fits together with other ops in the op tree. For instance, scalar is a unary operator, a UNOP. This extends the basic op structure above with a link to the argument:

struct unop {
    BASEOP
    OP *    op_first;
};
Binary operators, such as i_add, (integer addition) have both a first and a last:
struct binop {
    BASEOP
    OP *    op_first;
    OP *    op_last;
};

List operators are more interesting; they too have a first and a last, but they also have some ops in the middle, too. This is where op_sibling above comes in; it connects ops "sibling" ops on the same level in a list. For instance, look at the following code and the graph of its op tree:

open FILE, "foo";
print FILE "hi\n";
close FILE;

The dashed lines represent op_sibling connections. The root operator of every program is the list operator leave, and its children are the statements in the program, separated by nextstate (next statement) operators. open is also a list operator, as is print. The first child of print is pushmark, which puts a mark on the stack (see Section 6.2.1) so that Perl knows how many arguments on the stack belong to print. The rv2gv turns a reference to the filehandle FILE into a GV, so that print can print to it, and the final child is the constant "hi\n".

Some operators hold information about the program; these are COPs, or "code operators". Their definition is in cop.h:

struct cop {
    BASEOP
    char *  cop_label;  /* label for this construct */
#ifdef USE_ITHREADS
    char *  cop_stashpv;    /* package line was compiled in */
    char *  cop_file;   /* file name the following line # is from */
#else
    HV *    cop_stash;  /* package line was compiled in */
    GV *    cop_filegv; /* file the following line # is from */
#endif
    U32     cop_seq;    /* parse sequence number */
    I32     cop_arybase;    /* array base this line was compiled with */
    line_t      cop_line;       /* line # of this command */
    SV *    cop_warnings;   /* lexical warnings bitmask */
    SV *    cop_io;     /* lexical IO defaults */
};
	
COPs are inserted between every statement; they contain the label (for goto, next and so on) of the statement, the file name, package and line number of the statement and lexical hints such as the current value of $[, warnings and IO settings. Note that this doesn't contain the current CV or the padlist - these are kept on a special stack called the "context stack".

The final type of op is the null op: any op with type zero means that a previous op has been optimized away; we'll look at how this is done later in this chapter, but for now, you should skip over the null op when you see it in op trees.

6.1.3. Tying it all together

We've so far seen a little of how the op tree is connected together with op_first, op_last, op_sibling, and so on. Now we'll look at how the tree gets manufactured, as how it gets executed.

6.1.3.1. "Tree" order

After our investigation of the parser in the previous chapter, it should now be straightforward to see how the op tree is created. The parser calls routines in op.c which create the op structures, passing ops further "down" the parse tree as arguments. This threads together a tree as shown in the diagram above. For comparison, here is the what the example in that chapter ($a = $b + $c) really looks like as an op tree:

Again, you can see the places where an op was optimized away and became a null op. This is not so different from the simplified version we gave earlier.

6.1.3.2. Execution Order

The second thread through the op tree, indicated by the dotted line in our diagrams, is the execution order. This is the order in which Perl must actually perform the operations in order to run the program. The main loop of Perl is very, very simple, and you can see it in run.c:

    while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
        PERL_ASYNC_CHECK();
    }
That's it. That's all the Perl interpreter is. PL_op represents the op that's currently being executed. Perl calls the function pointer for that op and expects another op to be returned; this return value is then set to PL_op, which is executed in turn. Since everything apart from conditional operators (for obvious reasons) just return PL_op->op_next, the execution order through a program can be found by chasing the trail of op_next pointers from the start node to the root.

We can trace the execution order in several ways: if Perl is built with debugging, then we can say

perl -Dt -e 'open ...'

Alternatively, and perhaps more simply, the compiler module B::Terse (see Chapter 7) has an option to print the execution order, -exec. For instance, in our "open-print-close" example above, the execution order is:

% perl -MO=Terse,-exec -e 'open FILE, "foo"; ...'
OP (0x8111510) enter
COP (0x81121c8) nextstate
OP (0x8186f30) pushmark
SVOP (0x8186fe0) gv  GV (0x8111bd8) *FILE
SVOP (0x8186f10) const  PV (0x810dd98) "foo"
LISTOP (0x810a170) open [1]
COP (0x81114d0) nextstate
OP (0x81114b0) pushmark
SVOP (0x8118318) gv  GV (0x8111bd8) *FILE
UNOP (0x8111468) rv2gv
SVOP (0x8111448) const  PV (0x8111bfc) "hi\n"
LISTOP (0x8111488) print
COP (0x8111fe0) nextstate
SVOP (0x8111fc0) gv  GV (0x8111bd8) *FILE
UNOP (0x8111fa0) close
LISTOP (0x8111420) leave [1]
This program, just like every other program, starts with the enter and nextstate ops to enter a scope and begin a new statement respectively. Then a mark is placed on the argument stack: marks represent the start of a set of arguments, and a list operator can retrieve all the arguments by pushing values off the stack until it finds a mark. Hence, we're notifying Perl of the beginning of the arguments to the open operator.

The arguments in this case are merely the file handle to be opened and the file name; after operators put these two arguments on the stack, open can be called. This is the end of the first statement.

Next, the arguments to print begin. This is slightly more tricky, because while open can only take a true filehandle, print may take any sort of reference. Hence, gv returns the GV and then this is turned into the appropriate filehandle type by the rv2gv operator. After the filehandle come the arguments to be printed; in this case, a constant ("hi\n"). Now all the arguments have been placed on the stack, print can be called. This is the end of the second statement.

Finally, a filehandle is put on the stack and closed. Note that at this point, the connections between the operators - unary, binary, etc. - are not important; all manipulation of values comes not by looking at the children of the operators but by looking at the stack. The types of op are important for the construction of the tree in "tree order", but the stack and the op_next pointers are the only important things for the execution of the tree in execution order.

How is the execution order determined? The function linklist in op.c takes care of threading the op_next pointers in prefix order. It does so by recursively applying the following rule:

  • If there is a child for the current operator, visit the child first, then its siblings, then the current op.

Hence, the starting operator is always the first child of the root operator, (always enter) the second op to be executed is its sibling, nextstate, and then the children of the next op are visited. Similarly, the root itself (leave) is always the last operator to be executed. Null operators are skipped over during optimization.