Tutorial 3: Restructuring the Tree
Now that we have seen in Tutorial 1
how we can traverse the tree, and in Tutorial
2 how we can modify individual nodes in the tree, in this tutorial
we will look at modifying the structure of the tree itself.
The transform that we will be considering in this tutorial is one
that is used in phc itself. The transform is
called Remove_concat_null and can be found in
process_ast/Remove_concat_null.h. The purpose of the
transform is to remove string concatenation with the empty string.
For example,
<?php
$s = "foo" . "";
?>
is translated to
<?php
$s = "foo";
?>
The reason that this transform is implemented in phc is due to how the phc
parser deals with in-string syntax. For example, if you write
$a = "foo $b bar";
the corresponding tree generated by phc
is
$a = "foo " . $b . " bar";
In other words, the variables are pulled out of the string, and
the various components are then concatenated together. However, taken
to its logical conclusion, that means that if you write
$a = "foo $b";
the parser generates
$a = "foo " . $b . "";
Obviously, the second concatenation is unnecessary, and the
Remove_concat_null transform cleans this up. In this
tutorial we will explain how this transform can be written.
Introducing the Tree_transform API
Concatenation is a binary operator, so we are interested in nodes
of type AST_bin_op. If you check the grammar or,
alternatively, ast.h, you will find that
AST_bin_op has three attributes: a left and
a right expression (of type AST_expr) and
the operator itself (Token_op* op). Thus, we are
interested in nodes of type AST_bin_op whose
op equals the single dot (for string concatenation).
Based on the previous two tutorials, we might try something like
this:
class Remove_concat_null : public Tree_visitor
{
public:
void pre_bin_op(AST_bin_op* in)
{
// Find concat operators
if(*in->op->value == ".")
{
// ...
}
}
}
The problem is, what are we going to do inside the
if? Tree visitors can only inspect and modify
in; they cannot restructure the tree. In particular, we
cannot replace *in by a new node. For this purpose, phc offers a separate API, the tree
transformation API. It looks very similar to the tree visitor
API, but there are two important differences. First, the
pre and post methods can modify the
structure of the tree by returning new nodes. Second, there are no
“generic” methods in the tree transform API. So, it is not
possible to define a transformation that would replace all statements
by something else. (It is not clear how that would be useful, anyway.)
So, we need to write our transformation using the
Tree_transform API, defined in
generated/Tree_transform.h. Restructuring the class above
yields
class Remove_concat_null : public Tree_transform
{
public:
AST_expr* pre_bin_op(AST_bin_op* in)
{
// Find concat operators
if(*in->op->value == ".")
{
// ...
}
}
}
The differences between the previous version have been
highlighted. We inherit from a different class, and
pre_bin_op now has a return value, which is the node that
will replace *in. If you check the default implementation
of pre_bin_op in generated/Tree_transform.cpp,
you'll find:
AST_expr* Tree_transform::pre_bin_op(AST_bin_op* in)
{
return in;
}
The return in; is very important; as we
mentioned before, the return value of pre_bin_op will
replace *in in the tree. Therefore, if we don't want to
replace *in, or perhaps if we want to replace
*in only if a particular condition holds, we must return
in. This will replace *in by in
itself.
The second thing to note is that the return type of
pre_bin_op is AST_expr instead of
AST_bin_op. This means that we can replace a binary
operator node by another other expression node. The maketea theory explains exactly how
the signatures for the pre and post
methods are derived, but in most cases they are what you'd expect.
The easiest way to check is to simply look them up in
<phc/Tree_transform.h>.
We wanted to get rid of useless concatenation operators. To be
precise, if the binary operator is the concatenation operator, and the
left operand is the empty string, we want to replace the node by the
right operand; similarly, if the right operand is the empty string, we
want to replace the operator by its left operand. Here's the full
transform:
class Remove_concat_null : public Tree_transform
{
public:
AST_expr* post_bin_op(AST_bin_op* in)
{
Token_string* empty = new Token_string(new String(""), new String(""));
// Replace with right operand if left operand is the empty string
if(in->match(new AST_bin_op(empty, WILDCARD, ".")))
return in->right;
// Replace with left operand if right operand is the empty string
if(in->match(new AST_bin_op(WILDCARD, empty, ".")))
return in->left;
return in;
}
}
We already explained what match does in the previous tutorial, but we have not yet
explained the use of wildcards. If you are using a wildcard
(WILDCARD) in a pattern passed to match,
match will not take that subtree into account. Thus,
if(in->match(new AST_bin_op(empty, WILDCARD, ".")))
can be paraphrased as “is in a binary operator
with the empty string as the left operand and "." as the
operator (I don't care about the right operand)?“
Note that the constructor for Token_string has two
arugments: one corresponds to the value of the string, and one
corresponds to the representation of the string in the source (see
also the explanation of the token classes in Tutorial 2). For most strings, both of these
values are the same; however, in some cases they are different. For
example, value might be set to
“/home/joe/myscript.php, while
source_rep is set to __FILE__.
Running Transformations
Recall from the previous two tutorials that visitors are run
with a call to visit:
extern "C" void process_ast(AST_php_script* php_script)
{
SomeVisitor visitor;
php_script->visit(&visitor);
}
Likewise, transformations are run with a call to (you guessed
it :) transform:
extern "C" void process_ast(AST_php_script* php_script)
{
SomeTransform transform;
php_script->transform(&transform);
}
Note however than when invoked like this, the transform should
not replace the top-level node in the AST (the
AST_php_script node itself). A Subtlety
If you don't understand this section right now, don't worry about
it; you might find it useful to read it again after having gained some
experience with the transformation API.
We have implemented the transform as a post-transform
rather than a pre- transform. Why? Suppose we implemented the
transform as a pre-transform. Consider the following PHP expression
(bracketed explicitly for emphasis:)
("" . $a) . ""
The first binary operator we encounter is the second one (get
phc to print the tree if you don't see why.)
So, we apply the transform and replace the operator by its left
operand, which happens to be ("" . $a). We then continue
and transform the children of the that node, because that is
how the tree transform API is defined. But the children of that
node are "" and $a. So, that means that the
other binary operator itself will never be processed!
There are two solutions to this problem. The first is the one we
used above, and use a post-transform instead of a pre-transform. You
should try to reason out why this works, but a rule of thumb is that
unless there is a good reason to use a pre-transform, it's safer to
use the post-transform, because in the post-transform the children of
the node have already been transformed, so that you are looking at the
“final” version of the node.
The second solution is to use a pre-transform, but explicitly
tell phc to transform the new node in
turn. This is the less elegant solution, but sometimes this is the
only solution that will work (see for example the
Token_conversion transform in the phc source tree). To do this, you would replace
return in->right;
by
return in->right->pre_transform(this);
What's Next?
The next tutorial in this series, Tutorial 4, introduces a very important
notion in transforms: the use of state.
|