
Representing PHP

Most PHP constructs can immediately be represented in terms of the phc abstract grammar. There are a few constructs that present some difficulties. This document describes how these difficulties are resolved, and it explains some of the more difficult rules in the grammar.

Top Level Grammar Structure

The major difference between our abstract grammar for PHP and the “official” grammar (distributed in source code format with the PHP distribution) is the top-level structure. Stripped-down, the top-level of the PHP grammar looks something like
php_script ::= statement*

statement ::= class_def | method | if | while | ... other statements ...

method ::= statement*

class_def ::= member* 
member ::= method | attribute
Compare this to the top-level grammar structure that we have adopted:
php_script ::= class_def+

class_def ::= member*
member ::= method | attribute

method ::= statement*

statement ::= if | while | ... other statements ...

(This shows essentials only; see the grammar for the details).

This mismatch has two consequences. The first is that PHP allows scripts to have functions that do not belong to any class, and statements that do not belong to any function. phc introduces a special class called %MAIN% for this purpose. All functions defined outside the scope of any class get added as static methods to %MAIN%, and all statements defined outside the scope of any method get added to a special method %run% (in %MAIN%). Thus, the following simple PHP script

<?php
   function hello()
   {
      echo "Hello world!";
   }

   hello();
?>

gets represented as

<?php
   class %MAIN%
   {
      static function hello()
      {
         %STDLIB%::echo("Hello world!");
      }
   
      static function %run%()
      {
         %MAIN%::hello();
      }
   }
?>
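
The wrapping step can be sketched as follows. This is a minimal illustration in Python, not phc's actual C++ implementation; the tuple-based item encoding and the name wrap_top_level are assumptions made for this example only.

```python
# Hypothetical sketch: collect top-level functions and statements into %MAIN%.
# Items are (kind, payload) tuples; this encoding is an assumption for the
# example and does not mirror phc's real AST classes.

def wrap_top_level(items):
    """Return a list of class descriptions, with loose code moved into %MAIN%."""
    main = {"name": "%MAIN%", "methods": {}}
    run_body = []                      # statements outside any method
    classes = [main]
    for kind, payload in items:
        if kind == "class":
            classes.append(payload)    # user-defined classes are kept as-is
        elif kind == "function":
            name, body = payload
            main["methods"][name] = {"static": True, "body": body}
        else:                          # any other top-level statement
            run_body.append(payload)
    main["methods"]["%run%"] = {"static": True, "body": run_body}
    return classes


script = [
    ("function", ("hello", ['echo "Hello world!";'])),
    ("statement", "hello();"),
]
classes = wrap_top_level(script)
print(list(classes[0]["methods"]))
```

The sketch leaves the call hello(); untouched; inserting the %MAIN% target is a separate step, described under Method targets.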

The second consequence is that PHP allows scripts to have function definitions inside other function definitions (or inside if-statements, while-loops, etc.). This is not correctly supported by phc; see limitations.

Method targets

Recall the grammar rules for method invocations:

method_invocation ::= target method_name actual_parameter* ;
method_name ::= METHOD_NAME | reflection ;

As explained above, phc thinks of a PHP script as consisting of a set of classes. That means that every function call must either invoke a method on an object or invoke a static method of some class. The grammar rule for target is

target ::= expr | CLASS_NAME ;

So, a target is either an expression (for example, in $x->foo()), or a class name (in CLASS::foo()).

When the user does not explicitly specify a target (for example, the call to hello() in the example above), phc will automatically insert a target. If the method that gets invoked is defined in %MAIN% (i.e., the user provided an implementation), the target will be set to %MAIN% (for example, the call to hello()). Otherwise, the target is set to %STDLIB% (for example, the call to echo). Like %MAIN%, %STDLIB% is a special class that collects all methods defined in the PHP standard library. (Incidentally, if PHP6 implements namespaces, namespaces will probably be represented similarly.)
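
The target-insertion rule described above can be sketched in a few lines. This is an illustrative Python sketch under the assumption that the set of user-defined functions is already known; the names default_target and qualify_call are hypothetical and do not come from phc.

```python
# Hypothetical sketch of default-target selection for calls without a target.
# `user_functions` stands for the names of the methods collected in %MAIN%.

def default_target(method_name, user_functions):
    """Pick %MAIN% for user-defined functions, %STDLIB% for everything else."""
    return "%MAIN%" if method_name in user_functions else "%STDLIB%"


def qualify_call(method_name, user_functions, target=None):
    """Return (target, method_name), filling in the target only if missing."""
    if target is None:
        target = default_target(method_name, user_functions)
    return (target, method_name)


user_functions = {"hello"}
print(qualify_call("hello", user_functions))
print(qualify_call("echo", user_functions))
```

An explicitly given target (an object expression or a class name) is passed through unchanged.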

Variables

The grammar rule for variables reads
variable ::= target? variable_name array_indices:(expr?)* string_index:expr?
variable_name ::= VARIABLE_NAME | reflection 

This is probably one of the more difficult rules in the grammar, so it is worth explaining in a bit more detail. The following list describes each element of the first rule.

  • target?
    Just like function calls, variables can have a target, and just as for function calls, this target can be an expression (for an object, e.g. $x->y) or a class name (for a static class attribute, e.g. FOO::$y). Unlike function calls, however, the target of a variable is optional (indicated by the question mark). If no target is specified, the variable refers to a local variable in a method (see Global Variables for information on how we deal with the global statement).
  • variable_name
    Again, as for function calls, the name of the variable may be a literal VARIABLE_NAME ($x), or be given by an expression (which is wrapped up in an AST_reflection node). The latter possibility is referred to as “variable variables” in the PHP manual. For example, $$x is the variable whose name is currently stored in (another) variable called $x.
  • array_indices:(expr?)*
    A variable may have one or more array indices, for example $x[3][5]. The strange construct (expr?)* means: a list of (*) optional (?) expressions. For example, $x[4][] is a list of two expressions, but the second expression is not given. In PHP, this means “use the next available index”.
  • string_index:expr?
    Finally, a variable may contain one string index ($x{5}) that accesses an individual character from a string.

We illustrate the various possibilities using diagrams:

  • The simple case: $x
    [tree diagram]
    Note that the name of the variable is x, not $x.
  • Using a string index: $x{3}
    [tree diagram]
  • Using array indices: $x[1][][2]
    [tree diagram]
    Note that the empty array index means “next available” in PHP.
  • Static class attributes: X::$y
    [tree diagram]
    Note that here also the variable name is y, not $y. The distinction PHP makes between writing $x->y and X::$y disappears in the abstract syntax.
  • Variable variables: $$x
    [tree diagram]
    Note how the name of the variable (second component) is now given by another variable.
  • Object attributes: $x->y
    [tree diagram]
    Note that the target is now given by a variable.
  • Variable object attributes: $x->$y
    [tree diagram]
    Both the target and the variable name are given by (other) variables.
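
The same cases can be written down as data. The following Python sketch models the variable rule with two small classes; Variable and Reflection are stand-ins invented for this example (phc's real node classes are C++ and may be shaped differently), and class-name targets are represented as plain strings by assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Reflection:
    """A name computed at run time (stands in for an AST_reflection node)."""
    expr: "Variable"


@dataclass
class Variable:
    """Mirrors the rule: target? variable_name (expr?)* expr?"""
    name: Union[str, Reflection]
    target: Union[None, str, "Variable"] = None   # class name or expression
    array_indices: List[Optional[object]] = field(default_factory=list)
    string_index: Optional[object] = None


x = Variable("x")                                          # $x
char = Variable("x", string_index=3)                       # $x{3}
indexed = Variable("x", array_indices=[1, None, 2])        # $x[1][][2]
static_attr = Variable("y", target="X")                    # X::$y
var_var = Variable(Reflection(x))                          # $$x
attr = Variable("y", target=x)                             # $x->y
var_attr = Variable(Reflection(Variable("y")), target=x)   # $x->$y
```

The missing index in $x[1][][2] is stored as None, which keeps its position in the list of indices.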

Comments

A number of nodes in the AST are dedicated “commented nodes”. Their corresponding C++ classes inherit from AST_commented_node, which introduces a List<String*>* attribute called comments. The commented nodes are class members (AST_member), statements (AST_statement), interface and class definitions (AST_interface_def, AST_class_def), switch cases (AST_switch_case) and catches (AST_catch).

When the parser encounters a comment in the input, it attaches it either to the previous node in the AST, or to the next, according to a variable attach_to_previous. This variable is set as follows:

  • It is reset to false at the start of each line
  • It is set to true after seeing a semicolon, or either of the keywords class or function

Thus, in

foo();
// Comment
bar();

the comment gets attached to bar(); (to be precise, to the corresponding AST_eval_expr node; the function call itself is an expression and phc does not associate comments with expressions), but in

foo(); // Comment
bar();

the comment gets attached to foo(); instead. The same applies to multiple comments:

foo(); /* A */ /* B */
// C
// D
bar();

In this snippet, A and B get attached to foo();, but C and D get attached to bar();. Also, in the following snippet,

// Comment
echo /* one */ 1 + /* two */ 2;

all comments get attached to the same node. This should work most of the time, if not all the time. In particular, it should never lose any comments. If something goes wrong with comments, please send us a sample program that shows where it goes wrong. Note that whitespace in multi-line comments gets dealt with in a less than satisfactory way; see limitations for details.
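
The attach_to_previous rule can be simulated with a small scanner. The Python sketch below is a simplification made for illustration: it only recognises comments that start and end on one line, it does not skip comment markers inside string literals, and the function name classify_comments is hypothetical. It reports, for each comment, whether it would go to the previous or the next node.

```python
import re

# Comments, semicolons, and the two keywords that flip attach_to_previous.
TOKEN = re.compile(r"//.*|/\*.*?\*/|;|\b(?:class|function)\b")


def classify_comments(source):
    """Return (comment, side) pairs, where side is 'previous' or 'next'."""
    result = []
    for line in source.splitlines():
        attach_to_previous = False           # reset at the start of each line
        for match in TOKEN.finditer(line):
            token = match.group()
            if token.startswith(("//", "/*")):
                side = "previous" if attach_to_previous else "next"
                result.append((token, side))
            else:                            # ';', 'class' or 'function'
                attach_to_previous = True
    return result


example = "foo(); /* A */ /* B */\n// C\n// D\nbar();"
for comment, side in classify_comments(example):
    print(comment, "->", side)
```

For the echo snippet above, every comment is seen before the semicolon on its line, so all of them are classified as belonging to the next node.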

String parsing

Double quoted strings and those written using the HEREDOC syntax are treated specially by PHP: it parses variables used inside these strings and automatically expands them with their value. phc handles both the simple and complex syntax defined by PHP for variables in strings. We transform a string like

"Total cost is: $total (includes shipping of $shipping)"

into:

"Total cost is: " . $total . " (includes shipping of " . $shipping . ")"

which is represented in the phc abstract syntax tree by a number of strings and expressions concatenated together. Thus, as a programmer you don't need to do anything special to process variables inside strings. Any code you write for processing variables will also appropriately handle variables inside strings.
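
As a rough illustration of this splitting, the Python sketch below handles only the simple syntax with plain variable names; array indices, property access, the complex {$...} syntax and escaped dollar signs are deliberately left out. The function names are hypothetical and the code is not taken from phc.

```python
import re

# Simple syntax only: a dollar sign followed by an identifier.
SIMPLE_VAR = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")


def split_interpolated(text):
    """Split the body of a double-quoted string into literal and variable parts."""
    parts = []
    pos = 0
    for match in SIMPLE_VAR.finditer(text):
        if match.start() > pos:
            parts.append(("string", text[pos:match.start()]))
        parts.append(("variable", match.group()[1:]))   # drop the leading $
        pos = match.end()
    if pos < len(text):
        parts.append(("string", text[pos:]))
    return parts


def to_concatenation(parts):
    """Render the parts as a PHP concatenation expression."""
    rendered = ['"%s"' % value if kind == "string" else "$" + value
                for kind, value in parts]
    return " . ".join(rendered)


parts = split_interpolated("Total cost is: $total (includes shipping of $shipping)")
print(to_concatenation(parts))
```

As in the abstract syntax, the variable parts carry the name without the dollar sign.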

Currently, the unparser will not output strings of the first form, but will always output them in the second form (using concatenation). Future releases of phc may remedy this (alternatively, dedicated, more advanced pretty-printing tools for PHP could be built on top of phc; see Spinoffs).

Global Variables

Global variable declarations are not explicitly recorded in the phc AST. Instead the local variable declared global is assigned a reference to the appropriate global variable, which will be a static class attribute of %MAIN% (see above description of the Top Level Grammar Structure).

For example, the following code

<?php
   $x = 100;

   function foobar()
   {
      global $x;
      $x = 200;
   }
   
   foobar();
?>

is represented internally as

<?php
   class %MAIN%
   {
      static $x = 100;
      
      static function foobar()
      {
         $x =& %MAIN%::$x;
         $x = 200;
      }
   
      static function %run%()
      {
         %MAIN%::foobar();
      }
   }
?>

Obviously, the phc unparser will output code using global declarations.

Note that local variables in %run% are really global variables; for that reason, any “local” variable in %run% gets assigned a target of %MAIN% (if no target was specified in the program).
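
The rewriting of global declarations can be sketched on statement text. The following Python fragment is an illustration only: phc works on the AST rather than on strings, the name rewrite_globals is hypothetical, and the handling of several names in one declaration (global $a, $b;) is an assumption that the source does not spell out.

```python
import re

GLOBAL_DECL = re.compile(r"^\s*global\s+(.+?)\s*;\s*$")


def rewrite_globals(statements):
    """Replace each 'global $x;' by a reference assignment from %MAIN%::$x."""
    out = []
    for stmt in statements:
        match = GLOBAL_DECL.match(stmt)
        if match is None:
            out.append(stmt)                 # not a global declaration
            continue
        for name in match.group(1).split(","):
            name = name.strip()              # e.g. "$x"
            out.append(f"{name} =& %MAIN%::{name};")
    return out


body = ["global $x;", "$x = 200;"]
print(rewrite_globals(body))
```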

elseif

The abstract grammar does not have a construct for elseif. The following PHP code
<?php
   if($x)
      c1();
   elseif($y)
      c2();
   else
      c3();
?>
gets interpreted as
<?php
   if($x)
      c1();
   else
   {
      if($y)
         c2();
      else
         c3();
   }
?>
The higher the number of elseifs, the greater the level of nesting. This transformation is “hidden” by the unparser.
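
The nesting can be expressed as a short recursive function. This Python sketch uses plain dictionaries for if-nodes, which is an encoding chosen for the example; desugar_if and nesting_depth are hypothetical names.

```python
def desugar_if(branches, else_body=None):
    """Turn if / elseif / ... / else into nested if-else nodes.

    `branches` is a non-empty list of (condition, body) pairs in source order;
    `else_body` is the body of the final else, or None if there is none.
    """
    condition, body = branches[0]
    if len(branches) > 1:
        # The remaining elseif branches become a single if inside the else.
        else_body = [desugar_if(branches[1:], else_body)]
    return {"if": condition, "then": body, "else": else_body}


def nesting_depth(node):
    """Count how many if-nodes are nested along the else chain."""
    depth = 1
    while node["else"] and isinstance(node["else"][0], dict):
        node = node["else"][0]
        depth += 1
    return depth


tree = desugar_if([("$x", ["c1();"]), ("$y", ["c2();"])], else_body=["c3();"])
print(tree)
print(nesting_depth(tree))
```

Each additional elseif adds one entry to branches and therefore one more level to the resulting tree.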

Miscellaneous Other Changes

  • If echo has multiple (comma separated) arguments, they get translated into multiple function calls (echo a, b; becomes echo a; echo b;). Fragments of inline HTML also become arguments to a function call to echo.
  • The keywords use, require, require_once, include, include_once, isset and empty all get translated into a function call to a function with the same name as the keyword.
  • exit also becomes a call to the function exit; exit; and exit(); are interpreted as exit(0)
  • Class attribute declarations can only declare a single attribute per declaration in the abstract syntax; thus, var x, y; becomes var x; var y;. A similar comment applies to static_var
  • We do not support the += style of operators; a += 2; gets translated into a = a + 2;. It should be possible to reverse this translation in the unparser, but this is not currently implemented.
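
Three of these rewrites are sketched below on a tuple-based statement encoding. The encoding, the node labels and the function name lower_statement are assumptions made for this example and are not phc's own representation.

```python
def lower_statement(stmt):
    """Return the list of statements that replaces `stmt`."""
    kind = stmt[0]
    if kind == "echo":            # echo a, b;  becomes  echo a; echo b;
        _, args = stmt
        return [("call", "echo", [arg]) for arg in args]
    if kind == "op_assign":       # a += 2;  becomes  a = a + 2;
        _, var, op, expr = stmt
        return [("assign", var, ("bin_op", var, op, expr))]
    if kind == "exit":            # exit; and exit();  become  exit(0)
        _, args = stmt
        return [("call", "exit", args or [0])]
    return [stmt]                 # everything else is left unchanged


print(lower_statement(("echo", ["a", "b"])))
print(lower_statement(("op_assign", "a", "+", 2)))
print(lower_statement(("exit", [])))
```

Working on a tree rather than on source text means the right-hand side of a compound assignment stays a single subexpression, so no extra parentheses are needed.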

Comparison to the PHP grammar

Finally, the phc grammar is much simpler than the official grammar, and as a consequence more general. The class of programs that are valid according to the abstract grammar is larger than the class of programs actually accepted by the PHP parser. In other words, it is possible to represent a program in the abstract syntax that does not have a valid PHP equivalent. The advantage of our grammar is that it is much easier to work with.

To compare, consider the tree for

$g->greet("TACS");

Using the phc abstract syntax, this looks like

[Figure: syntax tree (phc)]

However, in the official PHP grammar, the tree would look like

[Figure: syntax tree (PHP)]

Not only is the number of concepts used in the PHP tree much larger (base_variable_with_function_calls, reference_variable, variable_property, etc.), but the concepts used in the phc tree also map directly to constructs in the PHP language; that does not hold true for the PHP tree. Moreover, the fact that this expression is a method invocation (function call) is immediately obvious from the root of the expression in the phc tree; the root of the PHP tree says that the expression is a variable, and only deeper down the tree does it become apparent that the expression is in fact a function call.

$LastChangedDate: 2006-09-08 12:24:58 +0100 (Fri, 08 Sep 2006) $. Contents © the authors.