
commit 8221bebb559abd3137f85d4cbe75fdcdf754b6dd
parent 7fce7b06b453082e9680326f9604c3e39e888086
Author: Georges Dupéron <georges.duperon@gmail.com>
Date:   Wed,  7 Jun 2017 13:19:33 +0200

Copied over some things from refs.tex, wrote more.

Diffstat:
Mscribblings/abbreviations.rkt | 3++-
Mscribblings/introduction.scrbl | 258+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
Mscribblings/state-of-the-art.scrbl | 152++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
Mscribblings/util.rkt | 64+++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
4 files changed, 444 insertions(+), 33 deletions(-)

diff --git a/scribblings/abbreviations.rkt b/scribblings/abbreviations.rkt @@ -1,5 +1,5 @@ #lang at-exp racket -(provide typedracket Typedracket csharp CAML CLOS NIT CPP) +(provide typedracket Typedracket csharp CAML CLOS NIT CPP DeBruijn) (require scribble/base scribble/core @@ -25,3 +25,4 @@ (define CLOS "CLOS") (define NIT "NIT") (define CPP "C++") +(define DeBruijn "De Bruijn") diff --git a/scribblings/introduction.scrbl b/scribblings/introduction.scrbl @@ -26,7 +26,22 @@ smalltalk-programmer-efficiency-cycle}}. Given their broad role, the complexity of the transformations involved, and - the stringent requirements, writing compilers is a difficult task. + the stringent requirements, writing compilers is a difficult task. Multiple + pitfalls await the compiler engineer, which we will discuss in more detail + below. This thesis aims to improve the compiler-writing toolkit currently + available, in order to help compiler developers produce compilers which are + closer to correctness, and easier to maintain. + + @require[scribble/core scribble/html-properties scribble/latex-properties] + @elem[#:style (style "hrStyle" + (list (alt-tag "hr") + (css-addition + #".hrStyle { margin-bottom: 1em; }") + (tex-addition + (string->bytes/utf-8 #<<EOTEX
\def\hrStyle#1{\noindent{\centerline{\rule[0.5ex]{0.5\linewidth}{0.5pt}}}}
EOTEX
 ))))]{} The overall structure of a compiler will usually include a lexer and parser, which turn the program's source into an in-memory representation. This @@ -40,41 +55,236 @@ performed separately. Finally, code in the target language or for the target architecture is generated. - Some pitfalls await the compiler-writer: it is easy to reuse excessively a - single intermediate representation; and there is a high risk associated with - the writing of large, monolithic passes, which are hard to test, debug, and - extend. We will discuss these pitfalls in more detail in the following - paragraphs. 
Both issues are prone to manifestations of some form or another of + We identify three pitfalls which await the compiler-writer: + + @itemlist[ + @item{It is easy to excessively reuse a single @usetech{intermediate + representation}, instead of properly distinguishing the features of the input + and output of each pass;} + @item{There is a high risk + associated with the definition of large, monolithic passes, which are hard to + test, debug, and extend;} + @item{The fundamental structure of the program being compiled is often a + graph, but compilers often work on an Abstract Syntax Tree, which requires + explicit handling of the backward and transversal arcs; this is a source of + bugs which could easily be avoided by using a higher-level abstraction + specifically aimed at representing graphs.}] + + The first two issues are prone to manifestations of some form or another of the ``god object'' anti-pattern@note{The ``god object'' anti-pattern describes object-oriented classes which @emph{do} too much or @emph{know} too much. The size of these classes tends to grow out of control, and there is usually a tight coupling between the methods of the object, which in turn means that performing small changes may require understanding the interactions between random parts of a very large file, in order to avoid breaking existing - functionality.}. - - - The static analysis, optimisation and code generation phases could in - principle work on the same intermediate representation. Several issues arise - from this situation, however. First, new information gained by the static - analysis may be added to the existing representation via mutation, or the - optimiser could directly alter the @tech{IR}. This means that the @tech{IR} - will initially contain holes (e.g. represented by @racketid[null] values), - which will get filled in gradually. 
Manipulating these parts is then extremely - risky, as it is easy to accidentally attempt to retrieve a value before it was - actually computed. Using the same @tech{IR} throughout the compiler also makes - it difficult for later passes to assume that some constructions have been - eliminated by previous simplification passes. One has to rely on the order of - execution of the passes in order to know what the data structure contains, - instead of having this information indicated by the @tech{IR}'s type. + functionality.}. The last issue is merely caused by the choice of an + abstraction which does not accurately represent the domain. We will discuss + each of these ailments in more detail in the following sections, and detail + the undesirable symptoms associated with them. + + @asection{ + @atitle{Large monolithic passes} + + Large, monolithic passes, which perform many transformations in parallel, have + the advantage of possibly being faster than several smaller passes chained one + after another. Furthermore, as one begins writing a compiler, it is tempting + to incrementally extend an initial pass to perform more work, rather than + starting all over again with a new @usetech{intermediate representation}, and + a new scaffolding to support its traversal. + + However, the drawback is that large compiler passes are harder to test (as + there are many more combinations of paths through the compiler's code to + test), harder to debug (as many unrelated concerns interact to some extent + with each other), and harder to extend (for example, adding a new special form + to the language will necessitate changes to several transformations, but if + these are mingled in a single pass, the changes may be scattered through it, + and interact with a significant amount of surrounding code). 
This higher + maintenance cost also comes with another drawback: formal verification of the + compiler will clearly be more difficult when large, tangled chunks of code + which handle different semantic aspects are involved. + + @todo{Talk a bit about CompCert here (one of the few / the only formally + verified compilers).} + + } + + @asection{ + @atitle{Overusing a single @usetech{intermediate representation}} + + The static analysis, optimisation and code generation phases could in + principle work on the same @usetech{intermediate representation}. Several + issues arise from this situation, however. + + In principle, new information gained by the static analysis may be added to + the existing representation via mutation, or the optimiser could directly + alter the @tech{IR}. This means that the @tech{IR} will initially contain + holes (e.g. represented by @racketid[null] values), which will get filled in + gradually. Manipulating these parts is then risky, as it is easy to + accidentally attempt to retrieve a value before it was actually computed. + Using the same @tech{IR} throughout the compiler also makes it difficult for + later passes to assume that some constructions have been eliminated by + previous simplification passes, and correctness relies on a fixed order of + execution of the passes; parts of the code which access data introduced or + modified by other passes are more brittle and may be disrupted when code is + refactored (for example, when moving the computation of some information to a + later pass). + + This situation becomes worse during the maintenance phase of the compiler's + lifecycle: when considering the data manipulated by a small portion of code + (in order to fix or improve said code), it is unclear which parts are supposed + to be filled in at that point, as well as which parts have been eliminated by + prior simplification passes. + + Furthermore, a mutable @tech{IR} hinders parallel execution of compiler + passes. 
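The hazard described above — a shared IR whose holes are filled in gradually, versus per-pass representations whose types encode what has been computed — can be sketched as follows. This is a hypothetical illustration in Python (the thesis itself uses Racket); all names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Single shared IR: the analysis result is a "hole" (None) filled in later.
@dataclass
class Var:
    name: str
    type: Optional[str] = None  # set by the type-checking pass... eventually

def codegen_single_ir(v: Var) -> str:
    # Nothing stops a pass from reading v.type before the type-checker ran:
    # the hole silently leaks into the generated code.
    return f"load_{v.type}({v.name})"

# One precise type per pass: the type-checker's *output type* guarantees
# the annotation is present, so the premature read cannot be expressed.
@dataclass
class RawVar:
    name: str

@dataclass
class TypedVar:
    name: str
    type: str  # no default: impossible to construct without a type

def typecheck(v: RawVar) -> TypedVar:
    return TypedVar(v.name, "int")  # toy "inference" for the sketch

def codegen(v: TypedVar) -> str:
    return f"load_{v.type}({v.name})"
```

Running codegen on the shared IR before type-checking yields `load_None(x)` without any error, whereas the two-representation version makes that ordering mistake a type error rather than a latent bug.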
Indeed, some compiler passes will perform global transformations or + transversal analyses, and such code may be intrinsically difficult to @htodo{ + parallelise}. @;{is "parallelise" the right word?} Many other passes however + are mere local transformations, and can readily be executed on distinct parts + of the abstract syntax tree, as long as there is no need to synchronise + concurrent accesses or modifications. + + Using immutable intermediate representations (and performing shallow copies + when updating data) can help with this second issue. However, there is more to + gain if, instead of having many instances of the same type, each intermediate + representation is given a distinct, precise type. The presence or absence of + computed information can be known from the input and output type of a pass, + instead of relying on the order of execution of the passes in order to know + what the input data structure may contain. + + } + + @asection{ + @atitle{Graphs} + + Nontrivial programs are inherently graphs: they may contain mutually + recursive functions (which directly refer to each other, and therefore will + form a cycle in a representation of the program), circular and (possibly + mutually) recursive datatypes may syntactically contain (possibly indirect) + references to themselves, and the control flow graph of a function or method + may, as its name implies, contain instructions which perform conditional or + unconditional backwards branches. + + However, nearly every compiler textbook will mention the use of + @tech[#:key "AST"]{ Abstract Syntax Trees} (ASTs) to represent the program. + This means that a structure, which intrinsically has the shape of a graph, is + encoded as a tree. + + Edges in the graph which may embody backward references can be made explicit + in various ways: + + @itemlist[ + @item{By using a form of unique identifier like a name bearing some semantic + value (e.g. 
the fully qualified name of the type or function that is + referred to), an index into an array of nodes (e.g. the offset of an + instruction in a function's bytecode may be used to refer to it in the + control flow graph), or an automatically-generated unique identifier. + + Manipulation of these identifiers introduces a potential for some sorts of + bugs: name clashes can occur if the qualification chosen is not sufficient + to always distinguish nodes; @htodo{furthermore} compiler passes which + duplicate nodes (for example specialising functions) or merge them must be + careful to correctly update identifiers.} + @item{Alternatively, backward references may be encoded as a form of path + from the referring node. @DeBruijn indices can be used in such an encoding, + for example. + + Once again, manipulating these references is risky, and @DeBruijn indices + are particularly brittle: for example, when adding a wrapper around a node + (i.e. adding an intermediate node on the path from the root), the @DeBruijn + indices used in some of the descendants of that node (but not all) must be + updated. It is understandably easy to incorrectly implement updates to these + indices, and a single off-by-one error can throw the graph's representation + into an inconsistent state.} + @item{The program's representation could also contain actual pointers + (thereby really representing the program as an ``Abstract Syntax Graph''), + using mutation to patch nodes after they are initially created. + + @todo{Mutation: verification (two phases for invariants), generally frowned + upon, reference some of Roland's and others' work on freezing objects. (as + long as it is ensured that no improper manipulation of the objects is done + before freezing).}} + @item{The compiler could also manipulate lazy data structures, where the + actual value of a node in the graph is computed on the fly when that node is + accessed. 
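The brittleness of @DeBruijn index updates can be made concrete with a small sketch (hypothetical Python, not code from this repository): wrapping a term under a new binder requires shifting exactly the free indices, and the cutoff bookkeeping is where off-by-one errors creep in.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Var:
    index: int  # De Bruijn index: 0 is the nearest enclosing binder

@dataclass
class Lam:
    body: "Term"

Term = Union[Var, Lam]

def shift(t: Term, by: int, cutoff: int = 0) -> Term:
    """Add `by` to every *free* index in t; indices below `cutoff` are bound
    inside t and must be left alone — forgetting to bump the cutoff under a
    Lam is the classic off-by-one error."""
    if isinstance(t, Var):
        return Var(t.index + by) if t.index >= cutoff else t
    return Lam(shift(t.body, by, cutoff + 1))

# A free variable must be shifted when a wrapper binder is added...
assert shift(Var(0), 1) == Var(1)
# ...but a bound variable must not: λ.0 stays the identity when wrapped.
assert Lam(shift(Lam(Var(0)), 1)) == Lam(Lam(Var(0)))
```

Dropping the `cutoff` parameter (or testing `>` instead of `>=`) still typechecks and still runs, which is precisely why such updates are easy to get wrong silently.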
+ + @todo{Lazy: harder to debug}} + @item{Finally, Higher-Order Abstract Syntax, or HOAS for short, is a + technique which encodes variable bindings as anonymous functions in the host + language (whose parameters reify bindings at the level of the host + language). Variable references are then nothing more than actual uses of the + variable at the host language's level. Substitution, a common operation in + compilers and interpreters, then becomes a simple matter of calling the + anonymous function with the desired substitute. HOAS has the additional + advantage that it enforces well-scopedness, as it is impossible to refer to + a variable outside of its scope in the host language. + + Parametric HOAS, dubbed PHOAS, also allows encoding the type of the + variables in the representation. @todo{Can extra information other than the + type be stored?} + + There are a few drawbacks with HOAS and PHOAS: + + The ``target'' of a backward reference must be above all uses in the tree. + This might not always be true. For example, pre/post-conditions could, in an + early pass in the compiler, be located outside of the normal scope of a + function's signature, but still refer to the function's parameters. If the + pre/post-condition language allows breaking encapsulation, these could even + refer to some temporary variables declared inside the function. + + @;{ + @; True for HOAS, not sure for PHOAS. + @todo{The ``target'' of a backward reference does not initially contain + additional data (e.g. the variable name to be used for error messages, its + static or concrete type and so on) although extending the encoding to + support this should be feasible.} + } + + @todo{PHOAS naturally lends itself to the implementation of substitutions, + and therefore is well-suited to the writing of interpreters. 
However, the + representation cannot be readily traversed and accessed as would be done + with normal structures, and therefore the model could be counter-intuitive + for some programmers.} + + @todo{It seems difficult to encode an arbitrary number of variables bound in + a single construct (e.g. to represent bound type names across the whole + program, or an arbitrary number of mutually-recursive functions declared + via @racketid[let … and … in …], with any number of @racketid[and] clauses + in the compiled language).}} + ] + + Although some of these seem like viable solutions (e.g. explicitly freezing + objects), they still involve low-level mechanisms to create the graph; + furthermore, code traversing the graph needs to deal with cycles, in order + to avoid running into an infinite loop (or infinite recursion). + + Anecdotally, updates are problematic as well: all logical pointers to an + updated node must be updated too. + + @htodo{Think about ensuring that nodes from two distinct graphs are not mixed + in unexpected ways (placing a dummy phantom type somewhere should be enough + to prevent it).} + + } + + @asection{ + @atitle{Expressing the data dependencies of a path via row types} + } + + @;{ The static analysis, optimisation and code generation phases will often work - on that intermediate representation. + on that @usetech{intermediate representation}. These transformations are often non-trivial and may require aggregating and analysing data scattered across the program. - triggering anti-patterns like ``god object'' + We build upon the achievements of the Nanopass Compiler Framework project, + which is presented in more detail in section XYZ. Simply put, Nanopass helps + the programmer define a myriad of compiler passes, each doing a very small + amount of work (and therefore easy to test and maintain), and each with a + different input and output type. 
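The cycle-handling burden mentioned above can be sketched as follows (a hypothetical Python illustration, not from this repository): any traversal of an ``Abstract Syntax Graph'' must carry a visited set, otherwise a pair of mutually recursive functions sends it into infinite recursion.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    successors: List["Node"] = field(default_factory=list)

def reachable(start: Node) -> List[str]:
    """Depth-first traversal over a possibly-cyclic graph."""
    seen: set = set()
    order: List[str] = []

    def visit(n: Node) -> None:
        if id(n) in seen:      # without this guard, a cycle recurses forever
            return
        seen.add(id(n))
        order.append(n.name)
        for s in n.successors:
            visit(s)

    visit(start)
    return order

# Two mutually recursive functions form a cycle in the call graph:
f, g = Node("f"), Node("g")
f.successors.append(g)
g.successors.append(f)
assert reachable(f) == ["f", "g"]
```

This is exactly the kind of low-level bookkeeping that a higher-level graph abstraction could take off the compiler-writer's hands.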
+ } + } \ No newline at end of file diff --git a/scribblings/state-of-the-art.scrbl b/scribblings/state-of-the-art.scrbl @@ -1,6 +1,8 @@ #lang scribble/manual -@require["util.rkt"] +@require["util.rkt" + scribble/core + scribble/latex-properties] @(use-mathjax) @title[#:style (with-html5 manual-doc-style) @@ -409,3 +411,150 @@ } } }@;{Algebraic datatypes for compilers (phc-adt)} + +@asection{ + @atitle[ + #:style (style #f + (list + (short-title "Writing compilers using many small passes"))) + ]{Writing compilers using many small passes (a.k.a. following the Nanopass + Compiler Framework philosophy)} +} + +@asection{ + @atitle{Representation and transformation of graphs} + + @todo{There already were a few references in my proposal for JFLA.} + @todo{Look for articles about graph rewriting systems.} + + @asection{ + @atitle{Cycles in intermediate representations of programs} + The following sections present the many ways in which cycles within the + AST, CFG and other intermediate representations can be represented. + + @asection{ + @atitle{Mutable data structures} + + @itemlist[ + @item{Hard to debug} + @item{When e.g. using lazy-loading, it is easy to mistakenly load a + class or method after the Intermediate Representation was + frozen. Furthermore, unless a @tt{.freeze()} method actually + enforces this conceptual change from a mutable to an immutable + representation, it can be unclear at which point the IR (or parts of + it) is guaranteed to be complete and its state frozen. This is another + factor making maintenance of such code difficult.}] + Quote from@~cite{ramsey_applicative_2006}: + + @quotation{ + We are using ML to build a compiler that does low-level optimization. To + support optimizations in classic imperative style, we built a control-flow + graph using mutable pointers and other mutable state in the nodes. This + decision proved unfortunate: the mutable flow graph was big and complex, + and it led to many bugs. 
We have replaced it by a smaller, simpler, + applicative flow graph based on Huet’s (1997) zipper. The new flow graph + is a success; this paper presents its design and shows how it leads to a + gratifyingly simple implementation of the dataflow framework developed by + Lerner, Grove, and Chambers (2002).} + } + + @asection{ + @atitle{Unique identifiers used as a replacement for pointers} + + @htodo{Check that the multi-reference worked correctly here} + Mono uses this approach@~cite["mono-cecil-website" "mono-cecil-source"]; it is very + easy to use an identifier which references a missing + object, or an object from another version of the AST. It is also very + easy to get things wrong when duplicating nodes (e.g. while specializing + methods based on their caller), or when merging or removing nodes. + + } + + @asection{ + @atitle{Explicit use of other common graph representations} + + Adjacency lists, @DeBruijn indices. + + @itemlist[ + @item{Error-prone when updating the graph (moving nodes around, adding, + duplicating or removing nodes).} + @item{Needs manual @htodo{caretaking}}] + + } + + @asection{ + @atitle{Using lazy programming languages} + + @itemlist[ + @item{Lazy programming is harder to debug. + @(linebreak) + Quote@~cite{nilsson1993lazy}: + @aquote{ + Traditional debugging techniques are, however, not suited for lazy + functional languages since computations generally do not take place in the + order one might expect. + } + + Quote@~cite{nilsson1993lazy}: + @aquote{ + Within the field of lazy functional programming, the lack of suitable + debugging tools has been apparent for quite some time. We feel that + traditional debugging techniques (e.g. breakpoints, tracing, variable + watching etc.) are not particularly well suited for the class of lazy + languages since computations in a program generally do not take place in the + order one might expect from reading the source code. 
+ } + + Quote@~cite{wadler1998functional}: + @aquote{ + To be usable, a language system must be accompanied by a debugger and a + profiler. Just as with interlanguage working, designing such tools is + straightforward for strict languages, but trickier for lazy languages. + } + + Quote@~cite{wadler1998functional}: + @aquote{ + Constructing debuggers and profilers for lazy languages is recognized as + difficult. Fortunately, there have been great strides in profiler research, + and most implementations of Haskell are now accompanied by usable time and + space profiling tools. But the slow rate of progress on debuggers for lazy + languages makes us researchers look, well, lazy. + } + + Quote@~cite{morris1982real}: + @aquote{ + How does one debug a program with a surprising evaluation order? Our + attempts to debug programs submitted to the lazy implementation have been + quite entertaining. The only thing in our experience to resemble it was + debugging a multi-programming system, but in this case virtually every + parameter to a procedure represents a new process. It was difficult to + predict when something was going to happen; the best strategy seems to be + to print out well-defined intermediate results, clearly labelled. + }} + @item{So-called ``infinite'' data structures constructed lazily have + problems with equality and serialization. The latter is especially + important for serializing and de-serializing Intermediate + Representations for the purpose of testing, and is also very important + for code generation: the backend effectively needs to turn the + infinite data structure into a finite one. The Revised$^6$ Report on + Scheme requires the @racket{equal?} predicate to correctly handle + cyclic data structures, but efficient algorithms implementing this + requirement are nontrivial@~cite{adams2008efficient}. 
Although any + representation of cyclic data structures will have at some point to + deal with equality and serialization, it is best if these concerns are + abstracted away as much as possible.}] + } + + @asection{ + @atitle{True graph representations using immutable data structures} + @itemlist[ + @item{Roslyn@~cite{overbey2013immutable} : immutable trees with ``up'' pointers} + @item{The huet zipper@~cite{huet1997zipper}. Implementation in untyped Racket, + but not Typed + Racket@note{ + @url{http://docs.racket-lang.org/zippers/} + @(linebreak) + @url{https://github.com/david-christiansen/racket-zippers}}}] + } + } +} +\ No newline at end of file diff --git a/scribblings/util.rkt b/scribblings/util.rkt @@ -23,7 +23,8 @@ include-asection struct-update part-style-update - epigraph) + epigraph + usetech) (require racket/stxparam racket/splicing @@ -263,12 +264,57 @@ (coloured-elem "gray" "]" (superscript "Todo")))) (define (aquote . content) - (nested-flow (style #f '()) - (list (paragraph (style #f '()) content)))) + (apply nested + #:style (style "quote" + (list (css-addition + (string->bytes/utf-8 #<<EOCSS +.quote { + background: #eee; + padding: 0.5em 1em; + margin-left: 2em; + margin-right: 2em; +} +EOCSS + )))) + content #;(list (paragraph content)))) (define (quotation . 
content) - (nested-flow (style #f '()) - (list (paragraph (style #f '()) content)))) + (apply nested + #:style (style "quotation" + (list (css-addition + (string->bytes/utf-8 #<<EOCSS +.quotation { + background: #eee; + padding: 0.75em 1em; + margin-left: 2em; + margin-right: 2em; + quotes: "“" "”" "‘" "’"; +} + +.quotation > p:last-child { + margin-bottom: 0; +} + +.quotation:before { + content: open-quote; + color:gray; + font-size: 200%; + float: left; + margin-left: -0.45em; + margin-top: -0.25em; +} +.quotation:after { + content: close-quote; + color:gray; + font-size: 200%; + float: right; + margin-right: -0.25em; + margin-top: -0.75em; +} +EOCSS + )))) + content #;(list (paragraph (style #f '()) + content)))) (define (~cite* #:precision [precision #f] . rest) (if precision @@ -352,4 +398,8 @@ EOTEX (apply nested #:style (style "epigraphStyle" epigraph-additions) rest) (nested #:style (style "epigraphAuthorStyle" epigraph-additions) - author))) -\ No newline at end of file + author))) + +;; For now, do not perform any check. Later on, we may verify automatically that +;; a usetech always happens after the corresponding deftech. +(define usetech list) +\ No newline at end of file
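The state-of-the-art section above notes that R6RS requires equal? to terminate on cyclic data, and that efficient implementations are nontrivial. The following hypothetical Python sketch (not code from this repository) shows the standard coinductive trick — assume a pair of nodes equal when it is revisited — in its naive, potentially exponential form; Adams and Dybvig describe an efficient algorithm.

```python
def cyclic_equal(a, b, assumed=None) -> bool:
    """Structural equality that terminates on cyclic lists.

    Naive version of the R6RS equal? requirement: on revisiting a pair of
    nodes, coinductively assume it equal instead of recursing forever."""
    if assumed is None:
        assumed = set()
    if a is b or (id(a), id(b)) in assumed:
        return True
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return False
        assumed = assumed | {(id(a), id(b))}
        return all(cyclic_equal(x, y, assumed) for x, y in zip(a, b))
    return a == b

# Two distinct one-element cyclic lists: plain `==` overflows the recursion
# limit here, while cyclic_equal terminates and judges them equal.
x: list = []
x.append(x)
y: list = []
y.append(y)
assert cyclic_equal(x, y)
```

This is one instance of the point made above: any representation of cyclic structures eventually has to confront equality and serialization, and it is best when the abstraction handles them once, rather than each compiler handling them ad hoc.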