commit 8221bebb559abd3137f85d4cbe75fdcdf754b6dd
parent 7fce7b06b453082e9680326f9604c3e39e888086
Author: Georges Dupéron <georges.duperon@gmail.com>
Date: Wed, 7 Jun 2017 13:19:33 +0200
Copied over some things from refs.tex, wrote more.
Diffstat:
4 files changed, 444 insertions(+), 33 deletions(-)
diff --git a/scribblings/abbreviations.rkt b/scribblings/abbreviations.rkt
@@ -1,5 +1,5 @@
#lang at-exp racket
-(provide typedracket Typedracket csharp CAML CLOS NIT CPP)
+(provide typedracket Typedracket csharp CAML CLOS NIT CPP DeBruijn)
(require scribble/base
scribble/core
@@ -25,3 +25,4 @@
(define CLOS "CLOS")
(define NIT "NIT")
(define CPP "C++")
+(define DeBruijn "De Bruijn")
diff --git a/scribblings/introduction.scrbl b/scribblings/introduction.scrbl
@@ -26,7 +26,22 @@
smalltalk-programmer-efficiency-cycle}}.
Given their broad role, the complexity of the transformations involved, and
- the stringent requirements, writing compilers is a difficult task.
+ the stringent requirements, writing compilers is a difficult task. Multiple
+ pitfalls await the compiler engineer, which we will discuss in more detail
+ below. This thesis aims to improve the compiler-writing toolkit currently
+ available, in order to help compiler developers produce compilers which are
+ closer to correctness, and easier to maintain.
+
+ @require[scribble/core scribble/html-properties scribble/latex-properties]
+ @elem[#:style (style "hrStyle"
+ (list (alt-tag "hr")
+ (css-addition
+ #".hrStyle { margin-bottom: 1em; }")
+ (tex-addition
+ (string->bytes/utf-8 #<<EOTEX
+\def\hrStyle#1{\noindent{\centerline{\rule[0.5ex]{0.5\linewidth}{0.5pt}}}}
+EOTEX
+ ))))]{}
The overall structure of a compiler will usually include a lexer and parser,
which turn the program's source into an in-memory representation. This
@@ -40,41 +55,236 @@
performed separately. Finally, code in the target language or for the target
architecture is generated.
- Some pitfalls await the compiler-writer: it is easy to reuse excessively a
- single intermediate representation; and there is a high risk associated with
- the writing of large, monolithic passes, which are hard to test, debug, and
- extend. We will discuss these pitfalls in more detail in the following
- paragraphs. Both issues are prone to manifestations of some form or another of
+ We identify three pitfalls which await the compiler-writer:
+
+ @itemlist[
+ @item{It is easy to excessively reuse a single @usetech{intermediate
+ representation}, instead of properly distinguishing the features of the input
+ and output of each pass;}
+ @item{There is a high risk
+ associated with the definition of large, monolithic passes, which are hard to
+ test, debug, and extend;}
+ @item{The fundamental structure of the program being compiled is often a
+ graph, but compilers often work on an Abstract Syntax Tree, which requires
+ explicit handling of the backward and transverse arcs. This is a source of
+ bugs which could easily be avoided by using a higher-level abstraction
+ specifically aiming to represent a graph.}]
+
+ The first two issues are prone to manifestations of some form or another of
the ``god object'' anti-pattern@note{The ``god object'' anti-pattern describes
object-oriented classes which @emph{do} too much or @emph{know} too much. The
size of these classes tends to grow out of control, and there is usually a
tight coupling between the methods of the object, which in turn means that
performing small changes may require understanding the interactions between
random parts of a very large file, in order to avoid breaking existing
- functionality.}.
-
-
- The static analysis, optimisation and code generation phases could in
- principle work on the same intermediate representation. Several issues arise
- from this situation, however. First, new information gained by the static
- analysis may be added to the existing representation via mutation, or the
- optimiser could directly alter the @tech{IR}. This means that the @tech{IR}
- will initially contain holes (e.g. represented by @racketid[null] values),
- which will get filled in gradually. Manipulating these parts is then extremely
- risky, as it is easy to accidentally attempt to retrieve a value before it was
- actually computed. Using the same @tech{IR} throughout the compiler also makes
- it difficult for later passes to assume that some constructions have been
- eliminated by previous simplification passes. One has to rely on the order of
- execution of the passes in order to know what the data structure contains,
- instead of having this information indicated by the @tech{IR}'s type.
+ functionality.}. The last issue is merely caused by the choice of an
+ abstraction which does not accurately represent the domain. We will discuss
+ each of these ailments in more detail in the following sections, and detail
+ the undesirable symptoms associated with them.
+
+ @asection{
+ @atitle{Large monolithic passes}
+
+ Large, monolithic passes, which perform many transformations in parallel,
+ have the advantage of possibly being faster than several smaller passes chained one
+ after another. Furthermore, as one begins writing a compiler, it is tempting
+ to incrementally extend an initial pass to perform more work, rather than
+ starting all over again with a new @usetech{intermediate representation}, and
+ a new scaffolding to support its traversal.
+
+ However, the drawback is that large compiler passes are harder to test (as
+ there are many more combinations of paths through the compiler's code to
+ test), harder to debug (as many unrelated concerns interact to some extent
+ with each other), and harder to extend (for example, adding a new special form
+ to the language will necessitate changes to several transformations, but if
+ these are mingled in a single pass, the changes may be scattered through it,
+ and interact with a significant amount of surrounding code). This higher
+ maintenance cost also comes with another drawback: formal verification of the
+ compiler will clearly be more difficult when large, tangled chunks of code
+ which handle different semantic aspects are involved.
+
+ @todo{Talk a bit about compcert here (one of the few/ the only formally
+ verified compilers).}
+
+ }
+
+ @asection{
+ @atitle{Overusing a single @usetech{intermediate representation}}
+
+ The static analysis, optimisation and code generation phases could in
+ principle work on the same @usetech{intermediate representation}. Several
+ issues arise from this situation, however.
+
+ In principle, new information gained by the static analysis may be added to
+ the existing representation via mutation, or the optimiser could directly
+ alter the @tech{IR}. This means that the @tech{IR} will initially contain
+ holes (e.g. represented by @racketid[null] values), which will get filled in
+ gradually. Manipulating these parts is then risky, as it is easy to
+ accidentally attempt to retrieve a value before it was actually computed.
+ Using the same @tech{IR} throughout the compiler also makes it difficult for
+ later passes to assume that some constructions have been eliminated by
+ previous simplification passes, and correctness relies on a fixed order of
+ execution of the passes; parts of the code which access data introduced or
+ modified by other passes are more brittle and may be disrupted when code is
+ refactored (for example, when moving the computation of some information to a
+ later pass).
+
+ This situation becomes worse during the maintenance phase of the compiler's
+ lifecycle: when considering the data manipulated by a small portion of code
+ (in order to fix or improve said code), it is unclear which parts are supposed
+ to be filled in at that point, as well as which parts have been eliminated by
+ prior simplification passes.
+
+ Furthermore, a mutable @tech{IR} hinders parallel execution of compiler
+ passes. Indeed, some compiler passes will perform global transformations or
+ transversal analyses, and such code may be intrinsically difficult to @htodo{
+ parallelise}. @;{is "parallelise" the right word?} Many other passes however
+ are mere local transformations, and can readily be executed on distinct parts
+ of the abstract syntax tree, as long as there is no need to synchronise
+ concurrent accesses or modifications.
+
+ Using immutable intermediate representations (and performing shallow copies
+ when updating data) can help with this second issue. However, there is more to
+ gain if, instead of having many instances of the same type, each intermediate
+ representation is given a distinct, precise type. The presence or absence of
+ computed information can be known from the input and output type of a pass,
+ instead of relying on the order of execution of the passes in order to know
+ what the input data structure may contain.
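To make this contrast concrete, here is a minimal sketch of per-pass types, shown in Python for concision (the type names `ParsedFn` and `AnalysedFn`, and the toy free-variable analysis, are illustrative assumptions, not part of any existing compiler):

```python
from dataclasses import dataclass

# Illustrative sketch: each pass gets its own IR type, so the presence of
# analysis results is visible in the type instead of in nullable fields.

@dataclass(frozen=True)
class ParsedFn:
    """Output of the parser: no analysis information yet."""
    name: str
    body: tuple

@dataclass(frozen=True)
class AnalysedFn:
    """Output of the analysis pass: free variables are always present."""
    name: str
    body: tuple
    free_vars: frozenset

def analyse(fn: ParsedFn) -> AnalysedFn:
    # Toy "analysis": treat every string in the body as a free variable.
    fv = frozenset(x for x in fn.body if isinstance(x, str))
    return AnalysedFn(fn.name, fn.body, fv)
```

A pass consuming `AnalysedFn` cannot accidentally read `free_vars` before it has been computed: a `ParsedFn` simply has no such field.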
+
+ }
+
+ @asection{
+ @atitle{Graphs}
+
+ Nontrivial programs are inherently graphs: they may contain mutually
+ recursive functions (which directly refer to each other, and therefore will
+ form a cycle in a representation of the program), circular and (possibly
+ mutually) recursive datatypes may syntactically contain (possibly indirect)
+ references to themselves, and the control flow graph of a function or method
+ may, as its name implies, contain instructions which perform conditional or
+ unconditional backwards branches.
+
+ However, nearly every compiler textbook will mention the use of
+ @tech[#:key "AST"]{Abstract Syntax Trees} (ASTs) to represent the program.
+ This means that a structure, which intrinsically has the shape of a graph, is
+ encoded as a tree.
+
+ Edges in the graph which may embody backward references can be made explicit
+ in various ways:
+
+ @itemlist[
+ @item{By using a form of unique identifier like a name bearing some semantic
+ value (e.g. the fully qualified name of the type or function that is
+ referred to), an index into an array of nodes (e.g. the offset of an
+ instruction in a function's bytecode may be used to refer to it in the
+ control flow graph), or an automatically-generated unique identifier.
+
+ Manipulation of these identifiers introduces the potential for several sorts
+ of bugs: name clashes can occur if the qualification chosen is not sufficient
+ to always distinguish nodes; @htodo{furthermore} compiler passes which
+ duplicate nodes (for example specialising functions) or merge them must be
+ careful to correctly update identifiers.}
+ @item{Alternatively, backward references may be encoded as a form of path
+ from the referring node. @DeBruijn indices can be used in such an encoding,
+ for example.
+
+ Once again, manipulating these references is risky, and @DeBruijn indices
+ are particularly brittle: for example, when adding a wrapper around a node
+ (i.e. adding an intermediate node on the path from the root), the @DeBruijn
+ indices used in some of the descendants of that node (but not all) must be
+ updated. It is understandably easy to incorrectly implement updates to these
+ indices, and a single off-by-one error can throw the graph's representation
+ into an inconsistent state.}
+ @item{The program's representation could also contain actual pointers
+ (thereby really representing the program as an ``Abstract Syntax Graph''),
+ using mutation to patch nodes after they are initially created.
+
+ @todo{Mutation: verification (two phases for invariants), generally frowned
+ upon, reference some of Roland's and others' work on freezing objects. (as
+ long as it is ensured that no improper manipulation of the objects is done
+ before freezing).}}
+ @item{The compiler could also manipulate lazy data structures, where the
+ actual value of a node in the graph is computed on the fly when that node is
+ accessed.
+
+ @todo{Lazy: harder to debug}}
+ @item{Finally, Higher-Order Abstract Syntax, or HOAS for short, is a
+ technique which encodes variable bindings as anonymous functions in the host
+ language (whose parameters reify bindings at the level of the host
+ language). Variable references are then nothing more than actual uses of the
+ variable at the host language's level. Substitution, a common operation in
+ compilers and interpreters, then becomes a simple matter of calling the
+ anonymous function with the desired substitute. HOAS has the additional
+ advantage that it enforces well-scopedness, as it is impossible to refer to
+ a variable outside of its scope in the host language.
+
+ Parametric HOAS, dubbed PHOAS, also allows encoding the type of the
+ variables in the representation. @todo{Can extra information other than the
+ type be stored?}
+
+ There are a few drawbacks with HOAS and PHOAS:
+
+ The ``target'' of a backward reference must be above all uses in the tree.
+ This might not always be the case. For example, pre/post-conditions could, in an
+ early pass in the compiler, be located outside of the normal scope of a
+ function's signature, but still refer to the function's parameters. If the
+ pre/post-condition language allows breaking encapsulation, these could even
+ refer to some temporary variables declared inside the function.
+
+ @;{
+ @; True for HOAS, not sure for PHOAS.
+ @todo{The ``target'' of a backward reference does not initially contain
+ additional data (e.g. the variable name to be used for error messages, its
+ static or concrete type and so on) although extending the encoding to
+ support this should be feasible.}
+ }
+
+ @todo{PHOAS naturally lends itself to the implementation of substitutions,
+ and therefore is well-suited to the writing of interpreters. However, the
+ representation cannot be readily traversed and accessed as would be done
+ with normal structures, and therefore the model could be counter-intuitive
+ for some programmers.}
+
+ @todo{It seems difficult to encode an arbitrary number of variables bound in
+ a single construct (e.g. to represent bound type names across the whole
+ program, or an arbitrary number of mutually-recursive functions declared
+ via @racketid[let … and … in …], with any number of @racketid[and] clauses
+ in the compiled language).}}
+ ]
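The brittleness of @DeBruijn indices mentioned above can be made concrete with a minimal sketch (in Python, with terms encoded as nested tuples; the encoding and the `shift` helper are illustrative): wrapping a term under a new binder forces a shift of every free index, and an off-by-one in the cutoff silently changes which binder a variable refers to.

```python
# Minimal De Bruijn-indexed lambda terms, encoded as:
#   int               -> variable (its index),
#   ("lam", body)     -> abstraction,
#   ("app", f, a)     -> application.

def shift(term, by, cutoff=0):
    """Add `by` to every variable index >= cutoff (the free variables).
    Forgetting to increment the cutoff under a binder would incorrectly
    shift bound variables as well."""
    if isinstance(term, int):
        return term + by if term >= cutoff else term
    tag = term[0]
    if tag == "lam":
        return ("lam", shift(term[1], by, cutoff + 1))
    if tag == "app":
        return ("app", shift(term[1], by, cutoff), shift(term[2], by, cutoff))
    raise ValueError(term)

# \x. x y  --  the binder captures index 0; y is free, with index 1.
t = ("lam", ("app", 0, 1))

# Wrapping t under one more binder: its free indices must be shifted by 1,
# or they would silently be captured by the new binder.
wrapped = ("lam", shift(t, 1))
```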
+
+ Although some of these seem like viable solutions (e.g. explicitly freezing
+ objects), they still involve low-level mechanisms to create the graph;
+ furthermore, code traversing the graph needs to deal with cycles, in order
+ to avoid running into an infinite loop (or infinite recursion).
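As a minimal illustration of that last point, any traversal of a cyclic program graph must track already-visited nodes (sketched here in Python; the call-graph encoding and the `reachable` helper are illustrative assumptions):

```python
# Sketch of a cycle-safe traversal: the `seen` set prevents the infinite
# recursion that a naive tree walk would exhibit on a cyclic graph.

def reachable(node, neighbours, seen=None):
    """Collect all nodes reachable from `node` in a possibly-cyclic graph."""
    if seen is None:
        seen = set()
    if node in seen:
        return seen
    seen.add(node)
    for n in neighbours(node):
        reachable(n, neighbours, seen)
    return seen

# Two mutually recursive "functions", each calling the other: a naive
# recursive walk over this call graph would never terminate.
calls = {"f": ["g"], "g": ["f"]}
```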
+
+
+ Additionally, when a node is updated in an immutable representation, all
+ logical pointers to the updated node must be updated too.
+
+ @htodo{Think about ensuring that nodes from two distinct graphs are not mixed
+ in unexpected ways (placing a dummy phantom type somewhere should be enough
+ to prevent it).}
+
+ }
+
+ @asection{
+ @atitle{Expressing the data dependencies of a path via row types}
+ }
+
@;{
The static analysis, optimisation and code generation phases will often work
- on that intermediate representation.
+ on that @usetech{intermediate representation}.
These transformations are often non-trivial and may require aggregating and
analysing data scattered across the program.
- triggering anti-patterns like ``god object''
+ We build upon the achievements of the Nanopass Compiler Framework project,
+ which is presented in more detail in section XYZ. Simply put, Nanopass helps
+ the programmer define a myriad of compiler passes, each performing a very small
+ amount of work (and therefore easy to test and maintain), and each with a
+ different input and output type.
+
}
}
\ No newline at end of file
diff --git a/scribblings/state-of-the-art.scrbl b/scribblings/state-of-the-art.scrbl
@@ -1,6 +1,8 @@
#lang scribble/manual
-@require["util.rkt"]
+@require["util.rkt"
+ scribble/core
+ scribble/latex-properties]
@(use-mathjax)
@title[#:style (with-html5 manual-doc-style)
@@ -409,3 +411,150 @@
}
}
}@;{Algrbraic datatypes for compilers (phc-adt)}
+
+@asection{
+ @atitle[
+ #:style (style #f
+ (list
+ (short-title "Writing compilers using many small passes")))
+ ]{Writing compilers using many small passes (a.k.a. following the Nanopass
+ Compiler Framework philosophy)}
+}
+
+@asection{
+ @atitle{Representation and transformation of graphs}
+
+ @todo{There already were a few references in my proposal for JFLA.}
+ @todo{Look for articles about graph rewriting systems.}
+
+ @asection{
+ @atitle{Cycles in intermediate representations of programs}
+ The following sections present the many ways in which cycles within the
+ AST, CFG and other intermediate representations can be represented.
+
+ @asection{
+ @atitle{Mutable data structures}
+
+ @itemlist[
+ @item{Hard to debug}
+ @item{When e.g. using lazy-loading, it is easy to mistakenly load a
+ class or method after the Intermediate Representation was
+ frozen. Furthermore, unless a @tt{.freeze()} method actually
+ enforces this conceptual change from a mutable to an immutable
+ representation, it can be unclear at which point the IR (or parts of
+ it) is guaranteed to be complete and its state frozen. This is another
+ factor making maintenance of such code difficult.}]
+ Quote from @~cite{ramsey_applicative_2006}:
+
+ @quotation{
+ We are using ML to build a compiler that does low-level optimization. To
+ support optimizations in classic imperative style, we built a control-flow
+ graph using mutable pointers and other mutable state in the nodes. This
+ decision proved unfortunate: the mutable flow graph was big and complex,
+ and it led to many bugs. We have replaced it by a smaller, simpler,
+ applicative flow graph based on Huet’s (1997) zipper. The new flow graph
+ is a success; this paper presents its design and shows how it leads to a
+ gratifyingly simple implementation of the dataflow framework developed by
+ Lerner, Grove, and Chambers (2002).}
+ }
+
+ @asection{
+ @atitle{Unique identifiers used as a replacement for pointers}
+
+ @htodo{Check that the multi-reference worked correctly here}
+ Mono uses this approach@~cite["mono-cecil-website" "mono-cecil-source"]; it
+ is very easy to use an identifier which references a missing object, or an
+ object from another version of the AST. It is also very
+ easy to get things wrong when duplicating nodes (e.g. while specializing
+ methods based on their caller), or when merging or removing nodes.
+
+ }
+
+ @asection{
+ @atitle{Explicit use of other common graph representations}
+
+ Adjacency lists, @DeBruijn indices.
+
+ @itemlist[
+ @item{Error-prone when updating the graph (moving nodes around, adding,
+ duplicating or removing nodes).}
+ @item{Needs manual @htodo{caretaking}}]
+
+ }
+
+ @asection{
+ @atitle{Using lazy programming languages}
+
+ @itemlist[
+ @item{Lazy programming is harder to debug.
+ @(linebreak)
+ Quote@~cite{nilsson1993lazy}:
+ @aquote{
+ Traditional debugging techniques are, however, not suited for lazy
+ functional languages since computations generally do not take place in the
+ order one might expect.
+ }
+
+ Quote@~cite{nilsson1993lazy}:
+ @aquote{
+ Within the field of lazy functional programming, the lack of suitable
+ debugging tools has been apparent for quite some time. We feel that
+ traditional debugging techniques (e.g. breakpoints, tracing, variable
+ watching etc.) are not particularly well suited for the class of lazy
+ languages since computations in a program generally do not take place in the
+ order one might expect from reading the source code.
+ }
+
+ Quote@~cite{wadler1998functional}:
+ @aquote{
+ To be usable, a language system must be accompanied by a debugger and a
+ profiler. Just as with interlanguage working, designing such tools is
+ straightforward for strict languages, but trickier for lazy languages.
+ }
+
+ Quote@~cite{wadler1998functional}:
+ @aquote{
+ Constructing debuggers and profilers for lazy languages is recognized as
+ difficult. Fortunately, there have been great strides in profiler research,
+ and most implementations of Haskell are now accompanied by usable time and
+ space profiling tools. But the slow rate of progress on debuggers for lazy
+ languages makes us researchers look, well, lazy.
+ }
+
+ Quote@~cite{morris1982real}:
+ @aquote{
+ How does one debug a program with a surprising evaluation order? Our
+ attempts to debug programs submitted to the lazy implementation have been
+ quite entertaining. The only thing in our experience to resemble it was
+ debugging a multi-programming system, but in this case virtually every
+ parameter to a procedure represents a new process. It was difficult to
+ predict when something was going to happen; the best strategy seems to be
+ to print out well-defined intermediate results, clearly labelled.
+ }}
+ @item{So-called ``infinite'' data structures constructed lazily have
+ problems with equality and serialization. The latter is especially
+ important for serializing and de-serializing Intermediate
+ Representations for the purpose of testing, and is also very important
+ for code generation: the backend effectively needs to turn the
+ infinite data structure into a finite one. The Revised@superscript{6} Report on
+ Scheme requires the @racket[equal?] predicate to correctly handle
+ cyclic data structures, but efficient algorithms implementing this
+ requirement are nontrivial@~cite{adams2008efficient}. Although any
+ representation of cyclic data structures will have at some point to
+ deal with equality and serialization, it is best if these concerns are
+ abstracted away as much as possible.}]
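One simple (and deliberately inefficient) way to make structural equality terminate on cyclic data is to memoise the pairs of nodes currently under comparison; this sketch, in Python and for lists only, illustrates the idea (real implementations use considerably more efficient algorithms):

```python
# Sketch: structural equality that terminates on cyclic data. When a pair
# of nodes is already being compared higher up the call stack, it is
# provisionally assumed equal, which breaks the cycle.

def cyclic_equal(a, b, pending=None):
    if pending is None:
        pending = set()
    if (id(a), id(b)) in pending:
        return True  # pair already under comparison: assume equal
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return False
        pending.add((id(a), id(b)))
        return all(cyclic_equal(x, y, pending) for x, y in zip(a, b))
    return a == b

# Two structurally identical cyclic lists; a naive recursive comparison
# (or CPython's built-in ==) would not terminate or would overflow here.
x = [1]; x.append(x)
y = [1]; y.append(y)
```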
+ }
+
+ @asection{
+ @atitle{True graph representations using immutable data structures}
+ @itemlist[
+ @item{Roslyn@~cite{overbey2013immutable}: immutable trees with ``up'' pointers}
+ @item{The Huet zipper@~cite{huet1997zipper}. Implementation in untyped Racket,
+ but not Typed
+ Racket@note{
+ @url{http://docs.racket-lang.org/zippers/}
+ @(linebreak)
+ @url{https://github.com/david-christiansen/racket-zippers}}}]
+ }
+ }
+}
+\ No newline at end of file
diff --git a/scribblings/util.rkt b/scribblings/util.rkt
@@ -23,7 +23,8 @@
include-asection
struct-update
part-style-update
- epigraph)
+ epigraph
+ usetech)
(require racket/stxparam
racket/splicing
@@ -263,12 +264,57 @@
(coloured-elem "gray" "]" (superscript "Todo"))))
(define (aquote . content)
- (nested-flow (style #f '())
- (list (paragraph (style #f '()) content))))
+ (apply nested
+ #:style (style "quote"
+ (list (css-addition
+ (string->bytes/utf-8 #<<EOCSS
+.quote {
+ background: #eee;
+ padding: 0.5em 1em;
+ margin-left: 2em;
+ margin-right: 2em;
+}
+EOCSS
+ ))))
+ content #;(list (paragraph content))))
(define (quotation . content)
- (nested-flow (style #f '())
- (list (paragraph (style #f '()) content))))
+ (apply nested
+ #:style (style "quotation"
+ (list (css-addition
+ (string->bytes/utf-8 #<<EOCSS
+.quotation {
+ background: #eee;
+ padding: 0.75em 1em;
+ margin-left: 2em;
+ margin-right: 2em;
+ quotes: "“" "”" "‘" "’";
+}
+
+.quotation > p:last-child {
+ margin-bottom: 0;
+}
+
+.quotation:before {
+ content: open-quote;
+ color:gray;
+ font-size: 200%;
+ float: left;
+ margin-left: -0.45em;
+ margin-top: -0.25em;
+}
+.quotation:after {
+ content: close-quote;
+ color:gray;
+ font-size: 200%;
+ float: right;
+ margin-right: -0.25em;
+ margin-top: -0.75em;
+}
+EOCSS
+ ))))
+ content #;(list (paragraph (style #f '())
+ content))))
(define (~cite* #:precision [precision #f] . rest)
(if precision
@@ -352,4 +398,8 @@ EOTEX
(apply nested #:style (style "epigraphStyle" epigraph-additions)
rest)
(nested #:style (style "epigraphAuthorStyle" epigraph-additions)
- author)))
-\ No newline at end of file
+ author)))
+
+;; For now, do not perform any check. Later on, we may verify automatically that
+;; a usetech always happens after the corresponding deftech.
+(define usetech list)
+\ No newline at end of file