This file contains an overview of the design of the compiler.

See also overall_design.html for an overview of how the different sub-systems (compiler, library, runtime, etc.) fit together.


OUTLINE

The main job of the compiler is to translate Mercury into C, although it can also translate (subsets of) Mercury to some other languages: Mercury bytecode (for a planned bytecode interpreter), MSIL (for the Microsoft .NET platform) and RL (the Aditi Relational Language).

The top-level of the compiler is in the file mercury_compile.m, which is a sub-module of the top_level.m package. The basic design is that compilation is broken into the following stages:

  1. parsing
  2. semantic analysis and error checking
  3. high-level transformations
  4. code generation
  5. low-level optimization
  6. output of the generated code

Note that in reality the separation is not quite as simple as that. Although parsing is listed as step 1 and semantic analysis is listed as step 2, the last stage of parsing actually includes some semantic checks. And although optimization is listed as steps 3 and 5, it also occurs in steps 2, 4, and 6. For example, elimination of assignments to dead variables is done in mode analysis; middle-recursion optimization and the use of static constants for ground terms is done in code generation; and a few low-level optimizations are done in llds_out.m as we are spitting out the C code.

In addition, the compiler is actually a multi-targeted compiler with several different back-ends.

The modules in the compiler are structured by being grouped into "packages". A "package" is just a meta-module, i.e. a module that contains other modules as sub-modules. (The sub-modules are almost always stored in separate files, which are named after the final component of their module name.) We have a package for the top-level, a package for each main pass, and finally there are also some packages for library modules that are used by more than one pass.
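
For example, the hlds.m package described below is (in outline) a meta-module along the following lines; the exact set of sub-modules shown here is illustrative only.

    % A sketch of a package meta-module: hlds.m simply includes its
    % sub-modules, each of which lives in its own file such as
    % hlds_data.m or hlds_goal.m.
    :- module hlds.
    :- interface.

    :- include_module hlds_data.
    :- include_module hlds_goal.
    :- include_module hlds_pred.
    :- include_module hlds_module.
    :- include_module hlds_out.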

Taking all this into account, the structure looks like this:

    top_level.m         the top level of the compiler (mercury_compile.m)
    libs.m              option handling and other general library modules
    parse_tree.m        stage 1: the parse tree and the modules that create it
    hlds.m              stage 1: the HLDS and the modules that create it
    check_hlds.m        stage 2: semantic analysis and error checking
    mode_robdd.m        stage 2: support for constraint based mode analysis
    transform_hlds.m    stage 3: high-level transformations
    ll_backend.m        stages 3a to 6a: the LLDS back-end
    ml_backend.m        stages 3b to 6b: the MLDS back-end
    aditi_backend.m     stages 3c to 6c: the Aditi-RL back-end
    bytecode_backend.m  the bytecode back-end
    backend_libs.m      library modules shared between the back-ends

In addition to the packages mentioned above, there are also packages for the build system: make.m contains the support for the `--make' option, and recompilation.m contains the support for the `--smart-recompilation' option.


DETAILED DESIGN

This section describes the role of each module in the compiler. For more information about the design of a particular module, see the documentation at the start of that module's source code.


The action is co-ordinated from mercury_compile.m or make.m (if `--make' was specified on the command line).

Option handling

Option handling is part of the libs.m package.

The command-line options are defined in the module options.m. mercury_compile.m calls library/getopt.m, passing the predicates defined in options.m as arguments, to parse the command-line arguments. It then invokes handle_options.m to postprocess the option set. The results are stored in the io__state, using the type globals defined in globals.m.

Build system

Support for `--make' is in the make.m package, which contains the following modules:

make.m
Categorizes targets passed on the command line and passes them to the appropriate module to be built.
make.program_target.m
Handles whole program `mmc --make' targets, including executables, libraries and cleanup.
make.module_target.m
Handles targets built by a compilation action associated with a single module, for example making interface files.
make.dependencies.m
Computes dependencies between targets and between modules.
make.module_dep_file.m
Records the dependency information for each module between compilations.
make.util.m
Utility predicates.
options_file.m
Reads the options files specified by the `--options-file' option. It is also used by mercury_compile.m to collect the value of DEFAULT_MCFLAGS, which contains the auto-configured flags passed to the compiler.
The build process also invokes routines in compile_target_code.m, which is part of the backend_libs.m package (see below).


FRONT END

1. Parsing

The parse_tree.m package

The first part of parsing is in the parse_tree.m package, which contains the modules listed below (except for the library/*.m modules, which are in the standard library). This part produces the parse tree data structure, which is intended to match up as closely as possible with the source code, so that it is suitable for tasks such as pretty-printing.

That's all the modules in the parse_tree.m package.

The hlds.m package

Once the stages listed above are complete, we then convert from the parse_tree data structure to a simplified data structure, which no longer attempts to maintain a one-to-one correspondence with the source code. This simplified data structure is called the High Level Data Structure (HLDS), which is defined in the hlds.m package.

The last stage of parsing is this conversion to HLDS, which is done by make_hlds.m.

The HLDS data structure itself is spread over four modules:

  1. hlds_data.m defines the parts of the HLDS concerned with function symbols, types, insts, modes and determinisms;
  2. hlds_goal.m defines the part of the HLDS concerned with the structure of goals, including the annotations on goals;
  3. hlds_pred.m defines the part of the HLDS concerning predicates and procedures;
  4. hlds_module.m defines the top-level parts of the HLDS, including the type module_info.

The module hlds_out.m contains predicates to dump the HLDS to a file.

The hlds.m package also contains some utility modules that contain various library routines which are used by other modules that manipulate the HLDS:

hlds_code_util.m
Utility routines for use during HLDS generation.
goal_form.m
Contains predicates for determining whether HLDS goals match various criteria.
goal_util.m
Contains various miscellaneous utility predicates for manipulating HLDS goals, e.g. for renaming variables.
passes_aux.m
Contains code to write progress messages, and higher-order code to traverse all the predicates defined in the current module and do something with each one.
hlds_error_util.m
Utility routines for printing nicely formatted error messages for symptoms involving HLDS data structures. For symptoms involving only structures defined in prog_data, use parse_tree__error_util.
code_model.m
Defines a type for classifying determinisms in ways useful to the various backends, and utility predicates on that type.
arg_info.m
Utility routines that the various backends use to analyze procedures' argument lists and decide on parameter passing conventions.
hhf.m
Facilities for translating the bodies of predicates to hyperhomogeneous form, for constraint based mode analysis.
inst_graph.m
Defines the inst_graph data type, which describes the structures of insts for constraint based mode analysis, as well as predicates operating on that type.

2. Semantic analysis and error checking

This is the check_hlds.m package, with support from the mode_robdd.m package for constraint based mode analysis.

Any pass which can report errors or warnings must be part of this stage, so that the compiler does the right thing for options such as `--halt-at-warn' (which turns warnings into errors) and `--error-check-only' (which makes the compiler only compile up to this stage).

implicit quantification
quantification.m (XXX which for some reason is part of the hlds.m package rather than the check_hlds.m package) handles implicit quantification and computes the set of non-local variables for each sub-goal. It also expands away bi-implication (unlike the expansion of implication and universal quantification, this expansion cannot be done until after quantification). This pass is called from the `transform' predicate in make_hlds.m.
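
As an illustration (a sketch only, with a hypothetical predicate that assumes the int module is imported), a bi-implication such as the one below is expanded into a conjunction of two negated conjunctions:

    :- pred same_parity(int::in, int::in) is semidet.

    same_parity(A, B) :-
        ( A mod 2 = 0 <=> B mod 2 = 0 ).

    % quantification.m expands the bi-implication into the conjunction
    % of the two implications, roughly:
    %
    %     not (A mod 2 = 0, not B mod 2 = 0),   % left to right
    %     not (B mod 2 = 0, not A mod 2 = 0)    % right to left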

checking typeclass instances (check_typeclass.m)
check_typeclass.m both checks that instance declarations satisfy all the appropriate superclass constraints and performs a source-to-source transformation on the methods from the instance declarations. The transformed code is checked for type, mode, uniqueness, purity and determinism correctness by the later passes, which has the effect of checking the correctness of the instance methods themselves (i.e. that the instance methods match those expected by the typeclass declaration). During the transformation, pred_ids and proc_ids are assigned to the methods for each instance.

In addition, while checking that the superclasses of a class are satisfied by the instance declaration, a set of constraint_proofs is built up for the superclass constraints. These are used by polymorphism.m when generating the base_typeclass_info for the instance.
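
For example (a sketch only; the class, instance and method names are hypothetical, and the int module is assumed to be imported), check_typeclass.m processes declarations such as these, checking both that the instance provides the methods the class declares and that any superclass constraints are satisfied:

    :- typeclass comparable(T) where [
        pred smaller(T::in, T::in) is semidet
    ].

        % check_typeclass.m checks that every instance of ordered/1
        % satisfies the superclass constraint, i.e. that there is also
        % a comparable/1 instance for the same type.
    :- typeclass ordered(T) <= comparable(T) where [
        func biggest(T, T) = T
    ].

    :- instance comparable(int) where [
        pred(smaller/2) is int_smaller
    ].

    :- pred int_smaller(int::in, int::in) is semidet.

    int_smaller(A, B) :-
        A < B.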

type checking
typecheck.m checks that the code of each predicate is well-typed and infers the types of all the variables in it. It also resolves overloading of predicates, functions and constructors where possible; the cases it cannot resolve are completed later by post_typecheck.m (which is called from purity analysis, see below).

assertions
assertion.m (XXX in the hlds.m package) is the abstract interface to the assertion table. Currently all the compiler does is type check the assertions and record, for each predicate used in an assertion, which assertions it appears in. The assertion table is set up in post_typecheck__finish_assertion.

purity analysis
purity.m is responsible for purity checking, as well as defining the purity type and a few public operations on it. It also calls post_typecheck.m to complete the handling of predicate overloading for cases which typecheck.m is unable to handle, and to check for unbound type variables. Elimination of double negation is also done here; that needs to be done after quantification analysis and before mode analysis. Calls to `private_builtin__unsafe_type_cast/2' are converted into `generic_call(unsafe_cast, ...)' goals here.
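
A small sketch of the double negation elimination mentioned above (the predicate is hypothetical and assumes the int module is imported):

    :- pred both_even(int::in, int::in) is semidet.

    both_even(A, B) :-
        not not ( A mod 2 = 0, B mod 2 = 0 ).

    % purity.m removes the double negation, leaving a body equivalent
    % to the plain conjunction
    %     A mod 2 = 0, B mod 2 = 0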

polymorphism transformation
polymorphism.m handles introduction of type_info arguments for polymorphic predicates and introduction of typeclass_info arguments for typeclass-constrained predicates. This phase needs to come before mode analysis so that mode analysis can properly reorder code involving existential types. (It also needs to come before simplification so that simplify.m's optimization of goals with no output variables doesn't do the wrong thing for goals whose only output is the type_info for an existentially quantified type parameter.)
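
As a sketch (the predicate is hypothetical and assumes the list module is imported), consider a polymorphic predicate such as the following:

    :- pred first_or_default(list(T)::in, T::in, T::out) is det.

    first_or_default([], Default, Default).
    first_or_default([X | _], _, X).

    % polymorphism.m transforms each such procedure so that it takes
    % an extra argument describing the type bound to T, conceptually
    %
    %     first_or_default(TypeInfo_for_T, List, Default, First)
    %
    % and transforms each call so that it constructs and passes the
    % appropriate type_info.  Typeclass-constrained predicates get
    % typeclass_info arguments in the same way.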

This phase also converts higher-order predicate terms into lambda expressions, and copies the clauses to the proc_infos in preparation for mode analysis.

The polymorphism.m module also exports some utility routines that are used by other modules. These include some routines for generating code to create type_infos, which are used by simplify.m and magic.m when those modules introduce new calls to polymorphic procedures.

When it has finished, polymorphism.m calls clause_to_proc.m to make duplicate copies of the clauses for each different mode of a predicate; all later stages work on procedures, not predicates.
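
For example (a sketch, assuming the list module is imported), a predicate with two mode declarations such as the one below ends up with two procedures:

    :- pred append(list(T), list(T), list(T)).
    :- mode append(in, in, out) is det.
    :- mode append(out, out, in) is multi.

    append([], Bs, Bs).
    append([A | As], Bs, [A | Cs]) :-
        append(As, Bs, Cs).

    % clause_to_proc.m gives each of the two modes its own procedure,
    % each with its own copy of the two clauses; mode analysis and all
    % later passes then work on the two procedures independently.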

mode analysis
modes.m (with help from modecheck_unify.m and modecheck_call.m) checks that each procedure obeys its declared modes: it annotates every goal with the changes it makes to the instantiation states of the variables, reorders conjunctions where necessary so that the producer of each variable comes before its consumers, and reports any mode errors.

constraint based mode analysis
This is an experimental alternative to the usual mode analysis algorithm. It works by building a system of boolean constraints about where (parts of) variables can be bound, and then solving those constraints.

indexing and determinism analysis
switch_detection.m looks for disjunctions in which each disjunct unifies the same bound variable with different function symbols, and turns such disjunctions into switches. cse_detection.m hoists common deconstruction unifications out of the arms of disjunctions, which often makes further switch detection possible. det_analysis.m then infers the determinism of each procedure and checks it against any determinism declaration; its error messages are generated with the help of det_report.m.

checking of unique modes (unique_modes.m)
unique_modes.m checks that non-backtrackable unique modes were not used in a context which might require backtracking. Note that what unique_modes.m does is quite similar to what modes.m does, and unique_modes calls lots of predicates defined in modes.m to do it.

stratification checking
The module stratify.m implements the `--warn-non-stratification' warning, which is an optional warning that checks for loops through negation.
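
For example (a sketch with hypothetical predicates), the following pair of definitions contains a loop through negation of the kind this pass warns about:

    :- pred p(int::in) is semidet.
    :- pred q(int::in) is semidet.

    p(X) :-
        not q(X).

    q(X) :-
        not p(X).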

simplification (simplify.m)
simplify.m finds and exploits opportunities for simplifying the internal form of the program, both to optimize the code and to massage the code into a form the code generator will accept. It also warns the programmer about any constructs that are so simple that they should not have been included in the program in the first place. (That's why this pass needs to be part of semantic analysis: because it can report warnings.) simplify.m converts complicated unifications into procedure calls. simplify.m calls common.m which looks for (a) construction unifications that construct a term that is the same as one that already exists, or (b) repeated calls to a predicate with the same inputs, and replaces them with assignment unifications. simplify.m also attempts to partially evaluate calls to builtin procedures if the inputs are all constants (this is const_prop.m in the transform_hlds.m package).
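
A sketch of the kind of duplicate construction that common.m detects (the type and predicate are hypothetical):

    :- type point ---> point(int, int).

    :- pred two_copies(int::in, int::in, point::out, point::out) is det.

    two_copies(X, Y, A, B) :-
        A = point(X, Y),
        B = point(X, Y).    % common.m replaces this construction
                            % with the assignment B = A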

3. High-level transformations

This is the transform_hlds.m package.

The first pass of this stage does tabling transformations (table_gen.m). This involves the insertion of several calls to tabling predicates defined in mercury_builtin.m and the addition of some scaffolding structure. Note that this pass can change the evaluation methods of some procedures to eval_table_io, so it should come before any passes that require definitive evaluation methods (e.g. inlining).
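
For example (a sketch that assumes the int module is imported), the tabling transformation is triggered by pragmas such as `memo':

    :- pred fib(int::in, int::out) is det.
    :- pragma memo(fib/2).

    fib(N, F) :-
        ( N < 2 ->
            F = 1
        ;
            fib(N - 1, F1),
            fib(N - 2, F2),
            F = F1 + F2
        ).

    % table_gen.m transforms the body of fib/2 so that it first looks
    % up the call table and returns the stored answer if this call has
    % been seen before, and otherwise computes and records the answer.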

The next pass of this stage is a code simplification, namely removal of lambda expressions (lambda.m):

(Is there any good reason why lambda.m comes after table_gen.m?)
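
For example (a sketch, assuming the list and int modules are imported), consider a procedure whose body contains a lambda expression:

    :- pred add_to_each(int::in, list(int)::in, list(int)::out) is det.

    add_to_each(N, Xs, Ys) :-
        AddN = (pred(X::in, Y::out) is det :- Y = X + N),
        list__map(AddN, Xs, Ys).

    % lambda.m replaces the lambda expression with a closure over a new
    % compiler-generated predicate, conceptually
    %
    %     :- pred add_to_each_lambda_1(int::in, int::in, int::out) is det.
    %     add_to_each_lambda_1(N, X, Y) :- Y = X + N.
    %
    % so that later stages never have to deal with lambda goals.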

Expansion of equivalence types (equiv_type_hlds.m)

Exception analysis (exception_analysis.m)

The next pass is termination analysis. The various modules involved are:

Most of the remaining HLDS-to-HLDS transformations are optimizations:

The module transform.m contains stuff that is supposed to be useful for high-level optimizations (but which is not yet used).

The last two HLDS-to-HLDS transformations implement term size profiling (size_prof.m) and deep profiling (deep_profiling.m, in the ll_backend.m package). Both passes insert into procedure bodies, among other things, calls to procedures (some of which are impure) that record profiling information.


a. LLDS BACK-END

This is the ll_backend.m package.

3a. LLDS-specific HLDS -> HLDS transformations

Before LLDS code generation, there are a few more passes which annotate the HLDS with information used for LLDS code generation, or perform LLDS-specific transformations on the HLDS:
reducing the number of variables that have to be saved across procedure calls (saved_vars.m)
We do this by putting the code that generates the value of a variable just before the use of that variable, duplicating the variable and the code that produces it if necessary, provided the cost of doing so is smaller than the cost of saving and restoring the variable would be. (See the sketch after this list.)
transforming procedure definitions to reduce the number of variables that need their own stack slots (stack_opt.m)
The main algorithm in stack_opt.m figures out when variable A can be reached from a cell pointed to by variable B, so that storing variable B on the stack obviates the need to store variable A on the stack as well. This algorithm relies on an implementation of the maximal matching algorithm in matching.m.
migration of builtins following branched structures (follow_code.m)
This transformation improves the effectiveness of follow_vars.m (see below).
simplification again (simplify.m, in the check_hlds.m package)
We run this pass a second time in case the intervening transformations have created new opportunities for simplification. It needs to be run immediately before code generation, because it enforces some invariants that the LLDS code generator relies on.
annotation of goals with liveness information (liveness.m)
This records the birth and death of each variable in the HLDS goal_info.
allocation of stack slots
This is done by stack_alloc.m, with the assistance of the following modules:
allocating the follow vars (follow_vars.m)
Traverses backwards over the HLDS, annotating some goals with information about what locations variables will be needed in next. This allows us to generate more efficient code by putting variables in the right spot directly. This module is not called from mercury_compile.m; it is called from store_alloc.m.
allocating the store map (store_alloc.m)
Annotates each branched goal with variable location information so that we can generate correct code by putting variables in the same spot at the end of each branch.
computing goal paths (goal_path.m in the check_hlds.m package)
The goal path of a goal defines its position in the procedure body. This transformation attaches its goal path to every goal, for use by the debugger.
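
As an illustration of the saved_vars.m transformation described above (a sketch only; the predicates are hypothetical):

    :- pred expensive_call(int::in) is det.
    :- pred use_value(int::in) is det.

    :- pred example(int::in, int::out) is det.

    example(A, X) :-
        X = 42,
        expensive_call(A),
        use_value(X).

    % Since X is bound to a cheap constant, saved_vars.m moves the
    % construction of X to just before its use, so that X does not have
    % to be saved on the stack across the call to expensive_call/1:
    %
    %     example(A, X) :-
    %         expensive_call(A),
    %         X = 42,
    %         use_value(X).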

4a. Code generation.

code generation
Code generation converts HLDS into LLDS. For the LLDS back-end, this is also the point at which we insert code to handle debugging and trailing, and to do heap reclamation on failure. The main code generation module is code_gen.m. It handles conjunctions and negations, but calls sub-modules to do most of the other work:

code_gen.m also calls middle_rec.m to do middle recursion optimization, which is implemented during code generation.
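
A sketch of the kind of procedure that middle_rec.m handles (it assumes the list and int modules are imported):

    :- pred my_length(list(T)::in, int::out) is det.

    my_length([], 0).
    my_length([_ | Xs], N) :-
        my_length(Xs, N0),
        N = N0 + 1.

    % The recursive call is followed by more code, so ordinary code
    % generation would create a stack frame for every element of the
    % list; middle_rec.m instead generates code that keeps track of the
    % recursion depth in a register and thus needs no stack frames.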

The code generation modules make use of:

code_info.m
The main data structure for the code generator.
var_locn.m
This defines the var_locn type, which is a sub-component of the code_info data structure; it keeps track of the values and locations of variables. It implements eager code generation.
exprn_aux.m
Various utility predicates.
code_util.m
Some miscellaneous preds used for code generation.
code_aux.m
Some miscellaneous preds which, unlike those in code_util, use code_info.
continuation_info.m
For accurate garbage collection, this module collects information about the values that are live after each call, and saves information about procedures.
trace.m
Inserts calls to the runtime debugger.
trace_params.m (in the libs.m package, since it is considered part of option handling)
Holds the parameter settings controlling the handling of execution tracing.
code generation for `pragma export' declarations (export.m)
This is handled separately from the other parts of code generation. mercury_compile.m calls the procedures `export__produce_header_file' and `export__get_pragma_exported_procs' to produce C code fragments which declare/define the C functions that are the interface stubs for procedures exported to C. (See the sketch after this list.)
generation of constants for RTTI data structures
This could also be considered a part of code generation, but for the LLDS back-end this is currently done as part of the output phase (see below).
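
A sketch of the kind of `pragma export' declaration that export.m handles (the predicate and the C name are hypothetical, and the int module is assumed to be imported):

    :- pred add_ints(int::in, int::in, int::out) is det.
    :- pragma export(add_ints(in, in, out), "MC_add_ints").

    add_ints(A, B, C) :-
        C = A + B.

    % export.m generates a C function with a prototype along the lines
    % of
    %
    %     void MC_add_ints(MR_Integer A, MR_Integer B, MR_Integer *C);
    %
    % (the exact argument passing convention may differ) which C code
    % can call to invoke the Mercury procedure.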

The result of code generation is the Low Level Data Structure (llds.m), which may also contain some data structures whose types are defined in rtti.m. The code for each procedure is generated as a tree of code fragments which is then flattened (tree.m).

5a. Low-level optimization (LLDS).

The various LLDS-to-LLDS optimizations are invoked from optimize.m. They are:

The module opt_debug.m contains utility routines used for debugging these LLDS-to-LLDS optimizations.

Several of these optimizations (frameopt and use_local_vars) also use livemap.m, a module that finds the set of locations live at each label.

The use_local_vars pass also introduces references to temporary variables in extended basic blocks in the LLDS representation of the C code. The transformation that inserts the block scopes and declares the temporary variables is performed by wrap_blocks.m.

Depending on which optimization flags are enabled, optimize.m may invoke many of these passes multiple times.

Some of the low-level optimization passes use basic_block.m, which defines predicates for converting sequences of instructions to basic block format and back, as well as opt_util.m, which contains miscellaneous predicates for LLDS-to-LLDS optimization.

6a. Output C code

The LLDS is written out as C code by llds_out.m; this is also where the constants for the RTTI data structures defined in rtti.m are output (see above).

b. MLDS BACK-END

This is the ml_backend.m package.

The original LLDS code generator generates very low-level code, since the LLDS was designed to map easily to RISC architectures. We have developed a new back-end that generates much higher-level code, suitable for generating Java, high-level C, etc. This back-end uses the Medium Level Data Structure (mlds.m) as its intermediate representation.

3b. pre-passes to annotate/transform the HLDS

Before code generation there is a pass which annotates the HLDS with information used for code generation:

For the MLDS back-end, we've tried to keep the code generator simple. So we prefer to do things as HLDS to HLDS transformations where possible, rather than complicating the HLDS to MLDS code generator. Thus we have a pass which transforms the HLDS to handle trailing:

4b. MLDS code generation

5b. MLDS transformations

6b. MLDS output

There are currently four backends that generate code from MLDS: one generates C/C++ code, one generates assembler (by interfacing with the GCC back-end), one generates Microsoft's Intermediate Language (MSIL or IL), and one generates Java.

The MLDS->asm backend is logically part of the MLDS back-ends, but it is in a module of its own (mlds_to_gcc.m), rather than being part of the ml_backend package, so that we can distribute a version of the Mercury compiler which does not include it. There is a wrapper module called maybe_mlds_to_gcc.m which is generated at configuration time so that mlds_to_gcc.m will be linked in iff the GCC back-end is available.

The MLDS->IL backend is broken into several submodules.

After the IL assembler code has been emitted, ILASM is invoked to turn the .il file into a .dll or .exe.

The MLDS->Java backend is broken into two submodules.

After the Java code has been emitted, a Java compiler (normally javac) is invoked to turn the .java file into a .class file containing Java bytecodes.

c. Aditi-RL BACK-END

This is the aditi_backend.m package.

3c. Aditi-specific HLDS -> HLDS transformations

This back-end first performs some HLDS-to-HLDS transformations that are specific to Aditi:

4c. Aditi-RL generation

5c. Aditi-RL optimization

6c. Output Aditi-RL code


d. BYTECODE BACK-END

This is the bytecode_backend.m package.

The Mercury compiler can translate Mercury programs into bytecode for interpretation by a bytecode interpreter. The intent of this is to achieve faster turn-around time during development. However, the bytecode interpreter has not yet been written.


SMART RECOMPILATION

This is the recompilation.m package.

The Mercury compiler can record program dependency information to avoid unnecessary recompilations when an imported module's interface changes in a way which does not invalidate previously compiled code.


MISCELLANEOUS

The modules special_pred.m (in the hlds.m package) and unify_proc.m (in the check_hlds.m package) contain stuff for handling the special compiler-generated predicates which are generated for each type: unify/2, compare/3, and index/1 (used in the implementation of compare/3).
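
For example, given a user-defined type such as the hypothetical one below, the compiler generates implementations of unify/2, compare/3 and index/1 that are specific to that type:

    :- type fruit
        --->    apple
        ;       orange
        ;       lemon.

    % The generated compare/3 for fruit effectively maps each
    % constructor to an integer (this is the job of index/1) and
    % compares the integers; the generated unify/2 simply switches
    % on the two arguments.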

The following module is part of the transform_hlds.m package.

dependency_graph.m:
This contains predicates to compute the call graph for a module, and to print it out to a file. (The call graph file is used by the profiler.) The call graph may eventually also be used by det_analysis.m, inlining.m, and other parts of the compiler which could benefit from traversing the predicates in a module in a bottom-up or top-down fashion with respect to the call graph.

The following modules are part of the backend_libs.m package.

builtin_ops.m
This module defines the types unary_op and binary_op which are used by several of the different back-ends: bytecode.m, llds.m, and mlds.m.
c_util.m
This module defines utility routines useful for generating C code. It is used by both llds_out.m and mlds_to_c.m.
name_mangle.m
This module defines utility routines useful for mangling names to forms acceptable as identifiers in target languages.
compile_target_code.m
Invokes C, C#, IL, Java, etc. compilers and linkers to compile and link the generated code.

The following modules are part of the libs.m package.

process_util.m:
Predicates to deal with process creation and signal handling. This module is mainly used by make.m and its sub-modules.
timestamp.m
Contains an ADT representing timestamps used by smart recompilation and `mmc --make'.
graph_color.m
Graph colouring.
This is used by the LLDS back-end for register allocation.
tree.m
A simple tree data type.
Used by the LLDS, RL, and IL back-ends for collecting together the different fragments of the generated code.

CURRENTLY UNDOCUMENTED

CURRENTLY USELESS

atsort.m (in the libs.m package)
Approximate topological sort. This was once used for traversing the call graph, but nowadays we use relation__atsort from library/relation.m.
lco.m (in the transform_hlds.m package):
This finds predicates whose implementations would benefit from last call optimization modulo constructor application. It does not apply the optimization and will not until the mode system is capable of expressing definite aliasing.

Last update was $Date: 2005/01/22 06:10:54 $ by $Author: juliensf $@cs.mu.oz.au.