commit 8221bebb559abd3137f85d4cbe75fdcdf754b6dd
parent 7fce7b06b453082e9680326f9604c3e39e888086
Author: Georges Dupéron <georges.duperon@gmail.com>
Date: Wed, 7 Jun 2017 13:19:33 +0200
Copied over some things from refs.tex, wrote more.
Diffstat:
4 files changed, 444 insertions(+), 33 deletions(-)
diff --git a/scribblings/abbreviations.rkt b/scribblings/abbreviations.rkt
@@ -1,5 +1,5 @@
#lang at-exp racket
-(provide typedracket Typedracket csharp CAML CLOS NIT CPP)
+(provide typedracket Typedracket csharp CAML CLOS NIT CPP DeBruijn)
(require scribble/base
scribble/core
@@ -25,3 +25,4 @@
(define CLOS "CLOS")
(define NIT "NIT")
(define CPP "C++")
+(define DeBruijn "De Bruijn")
diff --git a/scribblings/introduction.scrbl b/scribblings/introduction.scrbl
@@ -26,7 +26,22 @@
smalltalk-programmer-efficiency-cycle}}.
Given their broad role, the complexity of the transformations involved, and
- the stringent requirements, writing compilers is a difficult task.
+ the stringent requirements, writing compilers is a difficult task. Multiple
+ pitfalls await the compiler engineer, which we will discuss in more detail
+ below. This thesis aims to improve the compiler-writing toolkit currently
+ available, in order to help compiler developers produce compilers which are
+ closer to correctness, and easier to maintain.
+
+ @require[scribble/core scribble/html-properties scribble/latex-properties]
+ @elem[#:style (style "hrStyle"
+ (list (alt-tag "hr")
+ (css-addition
+ #".hrStyle { margin-bottom: 1em; }")
+ (tex-addition
+ (string->bytes/utf-8 #<<EOTEX
+\def\hrStyle#1{\noindent{\centerline{\rule[0.5ex]{0.5\linewidth}{0.5pt}}}}
+EOTEX
+ ))))]{}
The overall structure of a compiler will usually include a lexer and parser,
which turn the program's source into an in-memory representation. This
@@ -40,41 +55,236 @@
performed separately. Finally, code in the target language or for the target
architecture is generated.
- Some pitfalls await the compiler-writer: it is easy to reuse excessively a
- single intermediate representation; and there is a high risk associated with
- the writing of large, monolithic passes, which are hard to test, debug, and
- extend. We will discuss these pitfalls in more detail in the following
- paragraphs. Both issues are prone to manifestations of some form or another of
+ We identify three pitfalls which await the compiler-writer:
+
+ @itemlist[
+ @item{It is easy to excessively reuse a single @usetech{intermediate
+ representation}, instead of properly distinguishing the features of the input
+ and output of each pass;}
+ @item{There is a high risk
+ associated with the definition of large, monolithic passes, which are hard to
+ test, debug, and extend;}
+ @item{The fundamental structure of the program being compiled is often a
+ graph, but compilers often work on an Abstract Syntax Tree, which requires
+ explicit handling of the backward and transverse arcs. This is a source of
+ bugs which could easily be avoided by using a higher-level abstraction
+ specifically aiming to represent a graph.}]
+
+ The first two issues are prone to manifestations of some form or another of
the ``god object'' anti-pattern@note{The ``god object'' anti-pattern describes
object-oriented classes which @emph{do} too much or @emph{know} too much. The
size of these classes tends to grow out of control, and there is usually a
tight coupling between the methods of the object, which in turn means that
performing small changes may require understanding the interactions between
random parts of a very large file, in order to avoid breaking existing
- functionality.}.
-
-
- The static analysis, optimisation and code generation phases could in
- principle work on the same intermediate representation. Several issues arise
- from this situation, however. First, new information gained by the static
- analysis may be added to the existing representation via mutation, or the
- optimiser could directly alter the @tech{IR}. This means that the @tech{IR}
- will initially contain holes (e.g. represented by @racketid[null] values),
- which will get filled in gradually. Manipulating these parts is then extremely
- risky, as it is easy to accidentally attempt to retrieve a value before it was
- actually computed. Using the same @tech{IR} throughout the compiler also makes
- it difficult for later passes to assume that some constructions have been
- eliminated by previous simplification passes. One has to rely on the order of
- execution of the passes in order to know what the data structure contains,
- instead of having this information indicated by the @tech{IR}'s type.
+ functionality.}. The last issue is merely caused by the choice of an
+ abstraction which does not accurately represent the domain. We will discuss
+ each of these ailments in more detail in the following sections, and detail
+ the undesirable symptoms associated with them.
+
+ @asection{
+ @atitle{Large monolithic passes}
+
+ Large, monolithic passes, which perform many transformations in parallel,
+ have the advantage of possibly being faster than several smaller passes chained one
+ after another. Furthermore, as one begins writing a compiler, it is tempting
+ to incrementally extend an initial pass to perform more work, rather than
+ starting all over again with a new @usetech{intermediate representation}, and
+ a new scaffolding to support its traversal.
+
+ However, the drawback is that large compiler passes are harder to test (as
+ there are many more combinations of paths through the compiler's code to
+ test), harder to debug (as many unrelated concerns interact to some extent
+ with each other), and harder to extend (for example, adding a new special form
+ to the language will necessitate changes to several transformations, but if
+ these are mingled in a single pass, the changes may be scattered through it,
+ and interact with a significant amount of surrounding code). This higher
+ maintenance cost also comes with another drawback: formal verification of the
+ compiler will clearly be more difficult when large, tangled chunks of code
+ which handle different semantic aspects are involved.
+
+ @todo{Talk a bit about compcert here (one of the few/ the only formally
+ verified compilers).}
+
+ }
+
+ @asection{
+ @atitle{Overusing a single @usetech{intermediate representation}}
+
+ The static analysis, optimisation and code generation phases could in
+ principle work on the same @usetech{intermediate representation}. Several
+ issues arise from this situation, however.
+
+ In principle, new information gained by the static analysis may be added to
+ the existing representation via mutation, or the optimiser could directly
+ alter the @tech{IR}. This means that the @tech{IR} will initially contain
+ holes (e.g. represented by @racketid[null] values), which will get filled in
+ gradually. Manipulating these parts is then risky, as it is easy to
+ accidentally attempt to retrieve a value before it was actually computed.
+ Using the same @tech{IR} throughout the compiler also makes it difficult for
+ later passes to assume that some constructions have been eliminated by
+ previous simplification passes, and correctness relies on a fixed order of
+ execution of the passes; parts of the code which access data introduced or
+ modified by other passes are more brittle and may be disrupted when code is
+ refactored (for example, when moving the computation of some information to a
+ later pass).
+
+ This situation becomes worse during the maintenance phase of the compiler's
+ lifecycle: when considering the data manipulated by a small portion of code
+ (in order to fix or improve said code), it is unclear which parts are supposed
+ to be filled in at that point, as well as which parts have been eliminated by
+ prior simplification passes.
+
+ Furthermore, a mutable @tech{IR} hinders parallel execution of compiler
+ passes. Indeed, some compiler passes will perform global transformations or
+ transversal analyses, and such code may be intrinsically difficult to @htodo{
+ parallelise}. @;{is "parallelise" the right word?} Many other passes however
+ are mere local transformations, and can readily be executed on distinct parts
+ of the abstract syntax tree, as long as there is no need to synchronise
+ concurrent accesses or modifications.
+
+ Using immutable intermediate representations (and performing shallow copies
+ when updating data) can help with this second issue. However, there is more to
+ gain if, instead of having many instances of the same type, each intermediate
+ representation is given a distinct, precise type. The presence or absence of
+ computed information can be known from the input and output type of a pass,
+ instead of relying on the order of execution of the passes in order to know
+ what the input data structure may contain.
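To make this contrast concrete, here is a minimal sketch of per-pass types, shown in Python for concision (the type names `ParsedFn` and `AnalysedFn`, and the toy free-variable analysis, are illustrative assumptions, not part of any existing compiler):

```python
from dataclasses import dataclass

# Illustrative sketch: each pass gets its own IR type, so the presence of
# analysis results is visible in the type instead of in nullable fields.

@dataclass(frozen=True)
class ParsedFn:
    """Output of the parser: no analysis information yet."""
    name: str
    body: tuple

@dataclass(frozen=True)
class AnalysedFn:
    """Output of the analysis pass: free variables are always present."""
    name: str
    body: tuple
    free_vars: frozenset

def analyse(fn: ParsedFn) -> AnalysedFn:
    # Toy "analysis": treat every string in the body as a free variable.
    fv = frozenset(x for x in fn.body if isinstance(x, str))
    return AnalysedFn(fn.name, fn.body, fv)
```

A pass consuming `AnalysedFn` cannot accidentally read `free_vars` before it has been computed: a `ParsedFn` simply has no such field.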
+
+ }
+
+ @asection{
+ @atitle{Graphs}
+
+ Nontrivial programs are inherently graphs: they may contain mutually
+ recursive functions (which directly refer to each other, and therefore will
+ form a cycle in a representation of the program), circular and (possibly
+ mutually) recursive datatypes may syntactically contain (possibly indirect)
+ references to themselves, and the control flow graph of a function or method
+ may, as its name implies, contain instructions which perform conditional or
+ unconditional backwards branches.
+
+ However, nearly every compiler textbook will mention the use of
+ @tech[#:key "AST"]{Abstract Syntax Trees} (ASTs) to represent the program.
+ This means that a structure, which intrinsically has the shape of a graph, is
+ encoded as a tree.
+
+ Edges in the graph which may embody backward references can be made explicit
+ in various ways:
+
+ @itemlist[
+ @item{By using a form of unique identifier like a name bearing some semantic
+ value (e.g. the fully qualified name of the type or function that is
+ referred to), an index into an array of nodes (e.g. the offset of an
+ instruction in a function's bytecode may be used to refer to it in the
+ control flow graph), or an automatically-generated unique identifier.
+
+ Manipulation of these identifiers introduces the potential for several sorts
+ of bugs: name clashes can occur if the qualification chosen is not sufficient
+ to always distinguish nodes; @htodo{furthermore} compiler passes which
+ duplicate nodes (for example specialising functions) or merge them must be
+ careful to correctly update identifiers.}
+ @item{Alternatively, backward references may be encoded as a form of path
+ from the referring node. @DeBruijn indices can be used in such an encoding,
+ for example.
+
+ Once again, manipulating these references is risky, and @DeBruijn indices
+ are particularly brittle: for example, when adding a wrapper around a node
+ (i.e. adding an intermediate node on the path from the root), the @DeBruijn
+ indices used in some of the descendants of that node (but not all) must be
+ updated. It is understandably easy to incorrectly implement updates to these
+ indices, and a single off-by-one error can throw the graph's representation
+ into an inconsistent state.}
+ @item{The program's representation could also contain actual pointers
+ (thereby really representing the program as an ``Abstract Syntax Graph''),
+ using mutation to patch nodes after they are initially created.
+
+ @todo{Mutation: verification (two phases for invariants), generally frowned
+ upon, reference some of Roland's and others' work on freezing objects. (as
+ long as it is ensured that no improper manipulation of the objects is done
+ before freezing).}}
+ @item{The compiler could also manipulate lazy data structures, where the
+ actual value of a node in the graph is computed on the fly when that node is
+ accessed.
+
+ @todo{Lazy: harder to debug}}
+ @item{Finally, Higher-Order Abstract Syntax, or HOAS for short, is a
+ technique which encodes variable bindings as anonymous functions in the host
+ language (whose parameters reify bindings at the level of the host
+ language). Variable references are then nothing more than actual uses of the
+ variable at the host language's level. Substitution, a common operation in
+ compilers and interpreters, then becomes a simple matter of calling the
+ anonymous function with the desired substitute. HOAS has the additional
+ advantage that it enforces well-scopedness, as it is impossible to refer to
+ a variable outside of its scope in the host language.
+
+ Parametric HOAS, dubbed PHOAS, also allows encoding the type of the
+ variables in the representation. @todo{Can extra information other than the
+ type be stored?}
+
+ There are a few drawbacks with HOAS and PHOAS:
+
+ The ``target'' of a backward reference must be above all uses in the tree.
+ This might not always be the case. For example, pre/post-conditions could, in an
+ early pass in the compiler, be located outside of the normal scope of a
+ function's signature, but still refer to the function's parameters. If the
+ pre/post-condition language allows breaking encapsulation, these could even
+ refer to some temporary variables declared inside the function.
+
+ @;{
+ @; True for HOAS, not sure for PHOAS.
+ @todo{The ``target'' of a backward reference does not initially contain
+ additional data (e.g. the variable name to be used for error messages, its
+ static or concrete type and so on) although extending the encoding to
+ support this should be feasible.}
+ }
+
+ @todo{PHOAS naturally lends itself to the implementation of substitutions,
+ and therefore is well-suited to the writing of interpreters. However, the
+ representation cannot be readily traversed and accessed as would be done
+ with normal structures, and therefore the model could be counter-intuitive
+ for some programmers.}
+
+ @todo{It seems difficult to encode an arbitrary number of variables bound in
+ a single construct (e.g. to represent bound type names across the whole
+ program, or an arbitrary number of mutually-recursive functions declared
+ via @racketid[let … and … in …], with any number of @racketid[and] clauses
+ in the compiled language).}}
+ ]
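The brittleness of @DeBruijn indices mentioned above can be made concrete with a minimal sketch (in Python, with terms encoded as nested tuples; the encoding and the `shift` helper are illustrative): wrapping a term under a new binder forces a shift of every free index, and an off-by-one in the cutoff silently changes which binder a variable refers to.

```python
# Minimal De Bruijn-indexed lambda terms, encoded as:
#   int               -> variable (its index),
#   ("lam", body)     -> abstraction,
#   ("app", f, a)     -> application.

def shift(term, by, cutoff=0):
    """Add `by` to every variable index >= cutoff (the free variables).
    Forgetting to increment the cutoff under a binder would incorrectly
    shift bound variables as well."""
    if isinstance(term, int):
        return term + by if term >= cutoff else term
    tag = term[0]
    if tag == "lam":
        return ("lam", shift(term[1], by, cutoff + 1))
    if tag == "app":
        return ("app", shift(term[1], by, cutoff), shift(term[2], by, cutoff))
    raise ValueError(term)

# \x. x y  --  the binder captures index 0; y is free, with index 1.
t = ("lam", ("app", 0, 1))

# Wrapping t under one more binder: its free indices must be shifted by 1,
# or they would silently be captured by the new binder.
wrapped = ("lam", shift(t, 1))
```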
+
+ Although some of these seem like viable solutions (e.g. explicitly freezing
+ objects), they still involve low-level mechanisms to create the graph;
+ furthermore, code traversing the graph needs to deal with cycles, in order
+ to avoid running into an infinite loop (or infinite recursion).
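As a minimal illustration of that last point, any traversal of a cyclic program graph must track already-visited nodes (sketched here in Python; the call-graph encoding and the `reachable` helper are illustrative assumptions):

```python
# Sketch of a cycle-safe traversal: the `seen` set prevents the infinite
# recursion that a naive tree walk would exhibit on a cyclic graph.

def reachable(node, neighbours, seen=None):
    """Collect all nodes reachable from `node` in a possibly-cyclic graph."""
    if seen is None:
        seen = set()
    if node in seen:
        return seen
    seen.add(node)
    for n in neighbours(node):
        reachable(n, neighbours, seen)
    return seen

# Two mutually recursive "functions", each calling the other: a naive
# recursive walk over this call graph would never terminate.
calls = {"f": ["g"], "g": ["f"]}
```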
+
+
+ Additionally, when a node is updated in an immutable representation, all
+ logical pointers to the updated node must be updated too.
+
+ @htodo{Think about ensuring that nodes from two distinct graphs are not mixed
+ in unexpected ways (placing a dummy phantom type somewhere should be enough
+ to prevent it).}
+
+ }
+
+ @asection{
+ @atitle{Expressing the data dependencies of a path via row types}
+ }
+
@;{
The static analysis, optimisation and code generation phases will often work
- on that intermediate representation.
+ on that @usetech{intermediate representation}.
These transformations are often non-trivial and may require aggregating and
analysing data scattered across the program.
- triggering anti-patterns like ``god object''
+ We build upon the achievements of the Nanopass Compiler Framework project,
+ which is presented in more detail in section XYZ. Simply put, Nanopass helps
+ the programmer define a myriad of compiler passes, each performing a very small
+ amount of work (and therefore easy to test and maintain), and each with a
+ different input and output type.
+
}
}
\ No newline at end of file
diff --git a/scribblings/state-of-the-art.scrbl b/scribblings/state-of-the-art.scrbl
@@ -1,6 +1,8 @@
#lang scribble/manual
-@require["util.rkt"]
+@require["util.rkt"
+ scribble/core
+ scribble/latex-properties]
@(use-mathjax)
@title[#:style (with-html5 manual-doc-style)
@@ -409,3 +411,150 @@
}
}
}@;{Algrbraic datatypes for compilers (phc-adt)}
+
+@asection{
+ @atitle[
+ #:style (style #f
+ (list
+ (short-title "Writing compilers using many small passes")))
+ ]{Writing compilers using many small passes (a.k.a. following the Nanopass
+ Compiler Framework philosophy)}
+}
+
+@asection{
+ @atitle{Representation and transformation of graphs}
+
+ @todo{There already were a few references in my proposal for JFLA.}
+ @todo{Look for articles about graph rewriting systems.}
+
+ @asection{
+ @atitle{Cycles in intermediate representations of programs}
+ The following sections present the many ways in which cycles within the
+ AST, CFG and other intermediate representations can be represented.
+
+ @asection{
+ @atitle{Mutable data structures}
+
+ @itemlist[
+ @item{Hard to debug}
+ @item{When e.g. using lazy-loading, it is easy to mistakenly load a
+ class or method after the Intermediate Representation was
+ frozen. Furthermore, unless a @tt{.freeze()} method actually
+ enforces this conceptual change from a mutable to an immutable
+ representation, it can be unclear at which point the IR (or parts of
+ it) is guaranteed to be complete and its state frozen. This is another
+ factor making maintenance of such code difficult.}]
+ Quote from @~cite{ramsey_applicative_2006}:
+
+ @quotation{
+ We are using ML to build a compiler that does low-level optimization. To
+ support optimizations in classic imperative style, we built a control-flow
+ graph using mutable pointers and other mutable state in the nodes. This
+ decision proved unfortunate: the mutable flow graph was big and complex,
+ and it led to many bugs. We have replaced it by a smaller, simpler,
+ applicative flow graph based on Huet’s (1997) zipper. The new flow graph
+ is a success; this paper presents its design and shows how it leads to a
+ gratifyingly simple implementation of the dataflow framework developed by
+ Lerner, Grove, and Chambers (2002).}
+ }
+
+ @asection{
+ @atitle{Unique identifiers used as a replacement for pointers}
+
+ @htodo{Check that the multi-reference worked correctly here}
+ Mono uses this approach@~cite["mono-cecil-website" "mono-cecil-source"]; it
+ is very easy to use an identifier which references a missing object, or an
+ object from another version of the AST. It is also very
+ easy to get things wrong when duplicating nodes (e.g. while specializing
+ methods based on their caller), or when merging or removing nodes.
+
+ }
+
+ @asection{
+ @atitle{Explicit use of other common graph representations}
+
+ Adjacency lists, @DeBruijn indices.
+
+ @itemlist[
+ @item{Error-prone when updating the graph (moving nodes around, adding,
+ duplicating or removing nodes).}
+ @item{Needs manual @htodo{caretaking}}]
+
+ }
+
+ @asection{
+ @atitle{Using lazy programming languages}
+
+ @itemlist[
+ @item{Lazy programming is harder to debug.
+ @(linebreak)
+ Quote@~cite{nilsson1993lazy}:
+ @aquote{
+ Traditional debugging techniques are, however, not suited for lazy
+ functional languages since computations generally do not take place in the
+ order one might expect.
+ }
+
+ Quote@~cite{nilsson1993lazy}:
+ @aquote{
+ Within the field of lazy functional programming, the lack of suitable
+ debugging tools has been apparent for quite some time. We feel that
+ traditional debugging techniques (e.g. breakpoints, tracing, variable
+ watching etc.) are not particularly well suited for the class of lazy
+ languages since computations in a program generally do not take place in the
+ order one might expect from reading the source code.
+ }
+
+ Quote@~cite{wadler1998functional}:
+ @aquote{
+ To be usable, a language system must be accompanied by a debugger and a
+ profiler. Just as with interlanguage working, designing such tools is
+ straightforward for strict languages, but trickier for lazy languages.
+ }
+
+ Quote@~cite{wadler1998functional}:
+ @aquote{
+ Constructing debuggers and profilers for lazy languages is recognized as
+ difficult. Fortunately, there have been great strides in profiler research,
+ and most implementations of Haskell are now accompanied by usable time and
+ space profiling tools. But the slow rate of progress on debuggers for lazy
+ languages makes us researchers look, well, lazy.
+ }
+
+ Quote@~cite{morris1982real}:
+ @aquote{
+ How does one debug a program with a surprising evaluation order? Our
+ attempts to debug programs submitted to the lazy implementation have been
+ quite entertaining. The only thing in our experience to resemble it was
+ debugging a multi-programming system, but in this case virtually every
+ parameter to a procedure represents a new process. It was difficult to
+ predict when something was going to happen; the best strategy seems to be
+ to print out well-defined intermediate results, clearly labelled.
+ }}
+ @item{So-called ``infinite'' data structures constructed lazily have
+ problems with equality and serialization. The latter is especially
+ important for serializing and de-serializing Intermediate
+ Representations for the purpose of testing, and is also very important
+ for code generation: the backend effectively needs to turn the
+ infinite data structure into a finite one. The Revised@superscript{6} Report on
+ Scheme requires the @racket[equal?] predicate to correctly handle
+ cyclic data structures, but efficient algorithms implementing this
+ requirement are nontrivial@~cite{adams2008efficient}. Although any
+ representation of cyclic data structures will have at some point to
+ deal with equality and serialization, it is best if these concerns are
+ abstracted away as much as possible.}]
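One simple (and deliberately inefficient) way to make structural equality terminate on cyclic data is to memoise the pairs of nodes currently under comparison; this sketch, in Python and for lists only, illustrates the idea (real implementations use considerably more efficient algorithms):

```python
# Sketch: structural equality that terminates on cyclic data. When a pair
# of nodes is already being compared higher up the call stack, it is
# provisionally assumed equal, which breaks the cycle.

def cyclic_equal(a, b, pending=None):
    if pending is None:
        pending = set()
    if (id(a), id(b)) in pending:
        return True  # pair already under comparison: assume equal
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return False
        pending.add((id(a), id(b)))
        return all(cyclic_equal(x, y, pending) for x, y in zip(a, b))
    return a == b

# Two structurally identical cyclic lists; a naive recursive comparison
# (or CPython's built-in ==) would not terminate or would overflow here.
x = [1]; x.append(x)
y = [1]; y.append(y)
```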
+ }
+
+ @asection{
+ @atitle{True graph representations using immutable data structures}
+ @itemlist[
+ @item{Roslyn@~cite{overbey2013immutable}: immutable trees with ``up'' pointers}
+ @item{The Huet zipper@~cite{huet1997zipper}. Implementation in untyped Racket,
+ but not Typed
+ Racket@note{
+ @url{http://docs.racket-lang.org/zippers/}
+ @(linebreak)
+ @url{https://github.com/david-christiansen/racket-zippers}}}]
+ }
+ }
+}
+\ No newline at end of file
diff --git a/scribblings/util.rkt b/scribblings/util.rkt
@@ -23,7 +23,8 @@
include-asection
struct-update
part-style-update
- epigraph)
+ epigraph
+ usetech)
(require racket/stxparam
racket/splicing
@@ -263,12 +264,57 @@
(coloured-elem "gray" "]" (superscript "Todo"))))
(define (aquote . content)
- (nested-flow (style #f '())
- (list (paragraph (style #f '()) content))))
+ (apply nested
+ #:style (style "quote"
+ (list (css-addition
+ (string->bytes/utf-8 #<<EOCSS
+.quote {
+ background: #eee;
+ padding: 0.5em 1em;
+ margin-left: 2em;
+ margin-right: 2em;
+}
+EOCSS
+ ))))
+ content #;(list (paragraph content))))
(define (quotation . content)
- (nested-flow (style #f '())
- (list (paragraph (style #f '()) content))))
+ (apply nested
+ #:style (style "quotation"
+ (list (css-addition
+ (string->bytes/utf-8 #<<EOCSS
+.quotation {
+ background: #eee;
+ padding: 0.75em 1em;
+ margin-left: 2em;
+ margin-right: 2em;
+ quotes: "“" "”" "‘" "’";
+}
+
+.quotation > p:last-child {
+ margin-bottom: 0;
+}
+
+.quotation:before {
+ content: open-quote;
+ color:gray;
+ font-size: 200%;
+ float: left;
+ margin-left: -0.45em;
+ margin-top: -0.25em;
+}
+.quotation:after {
+ content: close-quote;
+ color:gray;
+ font-size: 200%;
+ float: right;
+ margin-right: -0.25em;
+ margin-top: -0.75em;
+}
+EOCSS
+ ))))
+ content #;(list (paragraph (style #f '())
+ content))))
(define (~cite* #:precision [precision #f] . rest)
(if precision
@@ -352,4 +398,8 @@ EOTEX
(apply nested #:style (style "epigraphStyle" epigraph-additions)
rest)
(nested #:style (style "epigraphAuthorStyle" epigraph-additions)
- author)))
-\ No newline at end of file
+ author)))
+
+;; For now, do not perform any check. Later on, we may verify automatically that
+;; a usetech always happens after the corresponding deftech.
+(define usetech list)
+\ No newline at end of file