* Why Cache?

PHP is an interpreted language.  A principal difference between compiled languages (like C) and interpreted languages is that an interpreted language must parse and compile the source code contained in a file every time it is asked to execute that file.  When PHP executes a file, it compiles the file's source code into a series of instructions, or "opcodes," and then executes those instructions. Because converting source code into executable instructions is costly in time and memory, we can benefit by storing the instructions for later use. Ideally, a file is compiled only when it is accessed for the first time, or when it changes.

APC provides such a caching mechanism by overriding PHP's compile routine, so that a cache is always consulted before compiling a file. If that file's instructions are already stored in the cache, the compilation step is skipped and the stored instructions are used. Otherwise, the compilation of the file proceeds and its instructions are inserted into the cache.  A shared memory implementation is used so that in a system where many interpreters are being run simultaneously (like a number of Apache children), they each have access to the compilation results of the other interpreters.  This is a major difference between this caching implementation and something like the Bware Cache, where each child is responsible for caching its own compile results and it is impossible for that information to be shared with other children or to persist past the death of the interpreter that generated them. 
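The consult-then-compile flow described above can be sketched in C as follows. This is only an illustration under simplifying assumptions: 'cache_entry', 'slow_compile' and 'cached_compile' are hypothetical names, the "opcodes" are a single integer, and the cache is an ordinary static array rather than a shared memory segment.

```c
#include <string.h>

#define CACHE_SLOTS 8

/* Hypothetical stand-in for a cached file; the real payload is a zend_op_array. */
typedef struct {
    char filename[64];
    int  opcodes;   /* placeholder for the compiled instructions */
    int  in_use;
} cache_entry;

static cache_entry cache[CACHE_SLOTS]; /* in APC this would live in shared memory */
static int compile_count = 0;          /* counts how often we really compiled */

/* Stand-in for the expensive parse-and-compile step. */
static int slow_compile(const char *filename) {
    compile_count++;
    return (int)strlen(filename);      /* fake "opcodes" */
}

/* Consult the cache first; compile and insert only on a miss. */
int cached_compile(const char *filename) {
    int i, free_slot = -1;
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].in_use && strcmp(cache[i].filename, filename) == 0)
            return cache[i].opcodes;   /* hit: the compilation step is skipped */
        if (!cache[i].in_use && free_slot < 0)
            free_slot = i;
    }
    int ops = slow_compile(filename);  /* miss: compile as usual */
    if (free_slot >= 0) {              /* insert only if there is room */
        cache_entry *e = &cache[free_slot];
        strncpy(e->filename, filename, sizeof e->filename - 1);
        e->filename[sizeof e->filename - 1] = '\0';
        e->opcodes = ops;
        e->in_use = 1;
    }
    return ops;
}
```

Calling 'cached_compile' twice with the same file name runs 'slow_compile' only once; the second call is answered from the cache.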

* Interfacing with the Zend Engine

When APC starts up it replaces the function pointer 'zend_compile_file' with its own compile function, 'apc_compile_file' (defined in "apc_iface.c"). On shutdown, it reverses the change. 
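The pointer swap might look roughly like the sketch below. It does not use the Zend headers: 'op_array', 'engine_compile_file', 'hooked_compile', 'hook_startup' and 'hook_shutdown' are hypothetical, simplified stand-ins for 'zend_op_array', 'zend_compile_file', 'apc_compile_file' and the APC startup and shutdown routines.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the Zend types. */
typedef struct { int n_opcodes; } op_array;
typedef op_array *(*compile_fn)(const char *filename);

static op_array dummy = { 3 };

/* Plays the role of PHP's built-in compiler. */
static op_array *default_compile(const char *filename) {
    (void)filename;
    return &dummy;
}

/* Plays the role of the engine's 'zend_compile_file' function pointer. */
compile_fn engine_compile_file = default_compile;

static compile_fn saved_compile = NULL; /* original, kept for delegation and restore */
static int hook_calls = 0;

/* Plays the role of 'apc_compile_file': runs first, then delegates. */
static op_array *hooked_compile(const char *filename) {
    hook_calls++;
    /* a real hook would consult the cache here before delegating */
    return saved_compile(filename);
}

/* On startup: remember the original pointer and install our function. */
void hook_startup(void) {
    saved_compile = engine_compile_file;
    engine_compile_file = hooked_compile;
}

/* On shutdown: reverse the change. */
void hook_shutdown(void) {
    engine_compile_file = saved_compile;
    saved_compile = NULL;
}
```

Between 'hook_startup' and 'hook_shutdown', every call made through 'engine_compile_file' passes through 'hooked_compile' first.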

* Include Statements

When the PHP compiler encounters an include statement ('include' or 'include_once') it emits an "include-file" opcode, but does not access the file. Instead, the file is opened and compiled afterward, when PHP executes the include-file opcode. This recursive process continues until all the opcodes in the including file have been executed. Therefore, included files are compiled and executed at run-time, and the PHP compile function 'zend_compile_file' is called for each of them. Because we override this function, our 'apc_compile_file' function is also called for each of them.

* Compiler Results

PHP compiles a file's top-level statements -- those that are outside of its functions and classes -- into a sequence of opcodes and stores them in a variable of type 'zend_op_array'. PHP also compiles functions into sequences of opcodes, but they are stored in a global table that maps the functions' names to their opcodes. Note that a compiled file's opcodes do not contain or reference its functions or their opcodes except by name.

PHP handles classes the same way it does functions, although classes are marginally more complex because they generally have functions defined within them (their member functions, or "methods"). In this way, classes are similar to PHP files themselves.

The PHP compile function 'zend_compile_file' takes a file name as a parameter and returns a 'zend_op_array'. It also inserts the file's functions and classes into the appropriate global tables ('CG(function_table)' and 'CG(class_table)').

Because functions and classes are often inserted into the global tables during the compilation of the file in which they are defined, it is impossible to determine which functions and classes currently in the global tables are defined in a given file by examining that file's opcodes alone, or the opcodes of the functions and classes in the global tables.

* Gathering the Compiler Results

Although the sequence of opcodes for the top-level statements in a file is readily available to us (it is returned by 'zend_compile_file'), we cannot easily determine which functions and classes in the global function and class tables were defined by the file we just compiled. To solve this problem we use two auxiliary tables, one each for functions and classes, to find the differences in the global function and class tables after compilation; we call them "accumulator tables".

Before calling 'zend_compile_file', our function accumulator table contains all user-defined functions that are currently in the global function table (at the start of a request, it is empty). After the call returns, every function in the global function table that is not in our accumulator table belongs to the file we just compiled. Now that we know which functions belong to the file, we add them to the accumulator table and continue. The corresponding process for classes is identical.
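The table-difference step can be illustrated with the following sketch. It assumes a deliberately simple "table" that is just a list of names; the real global tables are Zend hash tables, and 'name_table', 'table_has', 'table_add' and 'gather_new' are hypothetical names used only here.

```c
#include <string.h>

#define MAX_NAMES 32

/* Hypothetical minimal table: a list of function (or class) names. */
typedef struct {
    const char *names[MAX_NAMES];
    int count;
} name_table;

static int table_has(const name_table *t, const char *name) {
    int i;
    for (i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return 1;
    return 0;
}

static void table_add(name_table *t, const char *name) {
    if (t->count < MAX_NAMES && !table_has(t, name))
        t->names[t->count++] = name;
}

/*
 * Run after compiling a file. Every name in 'global' that is missing
 * from 'accum' was defined by that file: record it in 'out' and fold
 * it into 'accum' so it is not attributed to the next file.
 * Returns the number of newly found names.
 */
int gather_new(const name_table *global, name_table *accum, name_table *out) {
    int i, found = 0;
    for (i = 0; i < global->count; i++) {
        const char *name = global->names[i];
        if (!table_has(accum, name)) {
            table_add(out, name);
            table_add(accum, name);
            found++;
        }
    }
    return found;
}
```

The same routine would be run once against the function tables and once against the class tables.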

When a file is retrieved from the cache instead of compiled, its functions and classes are obviously added to both the global tables and the accumulator tables.

* Storing the Compiler Results

We store all the opcode arrays for a given file and its functions and classes in a cache shared by all the web server processes. These compiler data structures ('zend_op_array', 'zend_function', 'zend_class_entry') are serialized into opaque blocks of bytes and mapped to the file by its file name. When the file is subsequently retrieved from the cache, its data structures are deserialized and rebuilt.

* Wrinkles and Complications

APC has some capabilities that add significant complexity to the simple view of things given above. For example, when a file is cached it is assigned a "time-to-live" (TTL), and when it expires from old age, it must be recompiled from scratch (the TTL can be infinite). Files can also have individual TTLs.
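A TTL check could be as small as the sketch below. The struct layout and the convention that a TTL of 0 means "never expires" are assumptions made for illustration; the notes above do not say how APC represents an infinite TTL.

```c
#include <time.h>

/* Hypothetical per-entry bookkeeping. */
typedef struct {
    time_t cached_at; /* when the entry was inserted into the cache */
    long   ttl;       /* lifetime in seconds; 0 = never expires (assumed convention) */
} ttl_entry;

/* Returns 1 if the entry is too old and the file must be recompiled. */
int entry_expired(const ttl_entry *e, time_t now) {
    if (e->ttl == 0)
        return 0;
    return difftime(now, e->cached_at) >= (double)e->ttl;
}
```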

What happens if a file is included more than once during a single request? In general, this is not a problem, because the file will be compiled, serialized, and cached when it is encountered for the first time; the second time, when it is retrieved from the cache and deserialized, its functions and classes are simply redundant.

If the file is deserialized when it is encountered for the first time and compiled and serialized when it is encountered the second time, however, there IS a problem. This can occur if a file expires in the middle of a request in which it is included multiple times. This time, when the file is compiled, its functions are already in the global tables (they were inserted during deserialization), causing us to erroneously believe that the file defines NO functions or classes.

Our solution is to remember which files we have seen -- and compiled or deserialized -- during a particular request and remember exactly which functions and classes belonged to that file. Before we process a file we have seen before, we remove its functions and classes from the global tables and from the accumulator tables. This essentially returns the relevant parts of the compiler to the state they were in before we compiled the file for the first time.  
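As a rough sketch of that bookkeeping, assume each file seen during the request has a record of the function names it contributed. 'name_list', 'seen_file' and 'forget_file' are hypothetical names; classes would be handled the same way with a second list.

```c
#include <string.h>

#define MAX_NAMES 32

/* Hypothetical minimal table: a list of names. */
typedef struct {
    const char *names[MAX_NAMES];
    int count;
} name_list;

/* Remove 'name' from the list if present (order is not preserved). */
static void list_remove(name_list *t, const char *name) {
    int i;
    for (i = 0; i < t->count; i++) {
        if (strcmp(t->names[i], name) == 0) {
            t->names[i] = t->names[--t->count];
            return;
        }
    }
}

/* What one file contributed during the current request. */
typedef struct {
    const char *filename;
    name_list   functions;
} seen_file;

/*
 * Before processing a file we have already seen in this request,
 * take its functions back out of the global and accumulator tables,
 * restoring the state from before the file was first processed.
 */
void forget_file(const seen_file *f, name_list *global, name_list *accum) {
    int i;
    for (i = 0; i < f->functions.count; i++) {
        list_remove(global, f->functions.names[i]);
        list_remove(accum,  f->functions.names[i]);
    }
}
```

After 'forget_file' runs, recompiling the file inserts its functions into the global table again, and the accumulator comparison correctly reports them as new.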

* Class Inheritance

Another sticky issue is supporting class hierarchies. When PHP compiles a derived class (i.e., a class that extends another class), it indicates that its inheritance relationships must be resolved at run time. Thus, when the derived class declaration is encountered during execution, PHP adds the base class's functions and data members to the child class, i.e., it merges the function and properties tables. Because classes must be declared in descending order of the class hierarchy, base classes are guaranteed to be fully inherited (from their base classes) before any derived classes are declared (assuming the PHP code is correct!).

Unfortunately, we must resolve the class inheritance relationships ourselves, rather than at run time, because of the rare instance in which a file is included twice during a single request, the first time read from the cache, the second time compiled. (See "Wrinkles and Complications" above for more details on this problem.)

Our solution is to remember deferred inheritance relationships as we encounter them and resolve them on demand. When we find a derived class declaration opcode, we find the class's base class and remember this base-derived relationship. Whenever we deserialize a class from the cache, we resolve every deferred inheritance relationship in which it is the base class. Because this is performed in the reverse order of the class hierarchy, that is, bottom-up instead of top-down, we must recursively inherit not only from the base class to all its direct children, but to its grandchildren, great-grandchildren, and so on.
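The recursive, bottom-up resolution can be sketched as follows. The 'klass' struct is a hypothetical simplification that tracks only method names (not properties), records each deferred relationship as the parent's name, and assumes the hierarchy contains no cycles.

```c
#include <string.h>

#define MAX_METHODS 16

/* Hypothetical simplified class: name, parent name, method names. */
typedef struct {
    const char *name;
    const char *parent;               /* NULL for a root class */
    const char *methods[MAX_METHODS];
    int n_methods;
} klass;

static int has_method(const klass *c, const char *m) {
    int i;
    for (i = 0; i < c->n_methods; i++)
        if (strcmp(c->methods[i], m) == 0)
            return 1;
    return 0;
}

/* Merge the base's methods into the child, keeping the child's overrides. */
static void inherit(klass *child, const klass *base) {
    int i;
    for (i = 0; i < base->n_methods; i++)
        if (!has_method(child, base->methods[i]) && child->n_methods < MAX_METHODS)
            child->methods[child->n_methods++] = base->methods[i];
}

/*
 * Resolve every deferred relationship in which 'base' is the base class,
 * then recurse so grandchildren and deeper descendants also receive the
 * newly merged methods. Assumes an acyclic hierarchy.
 */
void resolve_deferred(klass *all, int n, const klass *base) {
    int i;
    for (i = 0; i < n; i++) {
        if (all[i].parent && strcmp(all[i].parent, base->name) == 0) {
            inherit(&all[i], base);
            resolve_deferred(all, n, &all[i]);
        }
    }
}
```

For a chain A, B extends A, C extends B, calling 'resolve_deferred' with A as the base gives B the methods of A and then gives C the methods of both B and A.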

This is clearly not as efficient as the normal operation of PHP, but for most practical class hierarchies, the additional overhead should not be an issue. We welcome new ideas for implementing this.

* Limitations

Blah blah blah
