introduction.scrbl (46127B)
#lang scribble/manual

@require["util.rkt"
         scriblib/render-cond
         racket/string
         (for-label racket
                    (only-in srfi/1 zip))]
@(use-mathjax)

@htodo{No parsers}

@title[#:style (with-html5 manual-doc-style)
       #:version (version-text)]{Introduction}

@(define (sec-number→string tag #:doc [doc #f])
   (delayed-element
    (λ (the-renderer the-part the-resolve-info)
      (define sec-tag (make-section-tag tag #:doc doc))
      (define resolved-sec-tag (resolve-get the-part the-resolve-info sec-tag))
      (if resolved-sec-tag
          (let ()
            (unless (and (vector? resolved-sec-tag)
                         (>= (vector-length resolved-sec-tag) 3))
              (error (format (string-append
                              "Resolved section tag was not a"
                              " vector of at least 3 elements: ~a")
                             resolved-sec-tag)))
            (define sec-number (vector-ref resolved-sec-tag 2))
            (unless (list? sec-number)
              (error "Third element of a resolved section tag was not a list."))
            (if (= (length sec-number) 1)
                (format "Chapter ~a" (car sec-number))
                (format "Section ~a" (string-join (map number->string
                                                       (reverse sec-number))
                                                  "."))))
          (elem #:style (style "badlink" '())
                tag " #:doc " (format "~s" doc))))
    (λ ()
      "Section ??")
    (λ ()
      "Section ??")))

@(define (sec-number-ref tag #:doc [doc #f])
   (seclink tag #:doc doc (sec-number→string tag #:doc doc)))
@(define sec-n-ref sec-number-ref)

@asection{
@atitle{Warm-up}

@todo{What does a compiler do}

@todo{This thesis aims to build a framework which helps write compilers.}

@todo{Our focus is on compilers, not virtual machines or other run-time
 systems. We are also not concerned with parsers — there are lots of existing
 approaches and libraries which help with writing parsers, although parsing in
 general is not yet a solved problem on all accounts.}

@todo{IR = the macroscopic structure of the program (i.e.
the meta-model (explain what it is)) + the code of functions and/or methods
 (statements and expressions, basic blocks of statements, or bytecode
 instructions)}}

@asection{
@atitle{The challenges of writing compilers}

@epigraph[#:width "8cm"
          @elem{@italic{That Mitchell & Webb Look, Series 3} — BBC Two}]{
 Brain surgery? — It’s not exactly rocket science, is it?}

Compilers are an essential part of today's software systems: they translate
high-level languages with complex semantics into lower-level languages. A
compiler will parse the program, transform it in various ways, perform some
more or less advanced static checks, and optimise the input program before
producing an output in the desired target language. A compiler must be
correct, reusable and fast. It must be correct because programmers are
concerned with logical errors in their own code, and should not fear that the
compiler introduces erroneous behaviour on its own. It must also be
well-architected: extensible, because the language is likely to evolve over
time; modular, in the hope that some components can be improved independently
of the rest of the compiler (e.g. replacing or improving an optimisation
phase, or changing the compiler's front-end to support another input
language); and more generally reusable, so that parts can be repurposed to
build other compilers, or other tools (analysers, IDEs, code instrumentation
and so on). Finally, a fast compiler is desirable because the programmer's
edit-build-test cycle should be as frequent as
possible@todo{@~cite["smalltalk-programmer-efficiency-cycle"]}.

Given their broad role, the complexity of the transformations involved, and
the stringent requirements, writing compilers is a difficult task. Multiple
pitfalls await the compiler engineer, which we will discuss in more detail
below.
This thesis aims to improve the compiler-writing toolkit currently available,
in order to help compiler developers produce compilers which are closer to
correctness, and easier to maintain.

@require[scribble/core scribble/html-properties scribble/latex-properties]

@hr

The overall structure of a compiler will usually include a lexer and parser,
which turn the program's source into an in-memory representation. This initial
representation will often be translated into an @deftech[#:key "IR"]{
 intermediate representation} (IR) better suited to the subsequent steps. At
some early point, the program will be analysed for syntactic or semantic
inconsistencies (ranging from missing parentheses to duplicate definitions of
the same variable), and the compiler may also perform a more thorough static
analysis. Finally, code in the target language or for the target architecture
is generated. The translation can additionally include optimisation phases at
several points: during code generation, using locally-recognisable patterns,
or earlier, using for example the results of a program-wide analysis performed
separately.
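
As a concrete (and deliberately tiny) illustration of this multi-pass
structure, the following sketch uses Python rather than Racket, with invented
node types and a toy three-token language; it only shows how each pass
consumes one precisely-typed representation and produces the next:

```python
from dataclasses import dataclass

# Hypothetical miniature pipeline: each pass has its own input and output
# IR type, so "what has been computed so far" is visible in the types.

@dataclass(frozen=True)
class Surface:            # raw program text
    source: str

@dataclass(frozen=True)
class Ast:                # after parsing
    op: str
    args: tuple

@dataclass(frozen=True)
class CheckedAst:         # after static checks: same shape, plus a type tag
    op: str
    args: tuple
    type: str

def parse(s: Surface) -> Ast:
    # toy parser: "add 1 2" becomes Ast("add", (1, 2))
    op, *args = s.source.split()
    return Ast(op, tuple(int(a) for a in args))

def check(ast: Ast) -> CheckedAst:
    # toy static check: arithmetic on integer literals yields an integer
    assert all(isinstance(a, int) for a in ast.args), "ill-typed operand"
    return CheckedAst(ast.op, ast.args, "int")

def codegen(ast: CheckedAst) -> str:
    # toy code generation for a stack machine
    pushes = "".join(f"push {a}; " for a in ast.args)
    return pushes + ast.op

print(codegen(check(parse(Surface("add 1 2")))))  # push 1; push 2; add
```

Because @racket[codegen] only accepts a @tt{CheckedAst}, it cannot be run on a
program which has not gone through the checking pass; the ordering of the
passes is enforced by the types rather than by convention.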

We identify three pitfalls which await the compiler-writer:

@itemlist[
 @item{It is easy to excessively reuse a single @usetech{intermediate
  representation} type, instead of properly distinguishing the specifics of
  the input and output type of each pass;}
 @item{There is a high risk associated with the definition of large,
  monolithic passes, which are hard to test, debug, and extend;}
 @item{The fundamental structure of the program being compiled is often a
  graph, but compilers often work on an Abstract Syntax Tree, which requires
  explicit handling of the backward arcs; this is a source of bugs which could
  be avoided by using a better abstraction.}]

The first two issues are prone to manifestations, in some form or another, of
the ``god object'' anti-pattern@note{The ``god object'' anti-pattern describes
 object-oriented classes which @emph{do} too much or @emph{know} too much. The
 size of these classes tends to grow out of control, and there is usually a
 tight coupling between the methods of the object, which in turn means that
 performing small changes may require understanding the interactions between
 seemingly unrelated parts of a very large file, in order to avoid breaking
 existing functionality.}. The last issue is merely caused by the choice of an
abstraction which does not accurately represent the domain. We will discuss
each of these ailments in more detail in the following sections, and describe
the undesirable symptoms associated with them.

@asection{
@atitle{Large monolithic passes}

Large, monolithic passes, which perform many transformations simultaneously,
have the advantage of possibly being faster than several smaller passes
chained one after another.
Furthermore, as one begins writing a compiler, it is tempting to incrementally
extend an initial pass to perform more work, rather than starting all over
again with a new @usetech{intermediate representation}, and a new scaffolding
to support its traversal.

However, the drawback is that large compiler passes are harder to test (as
there are many more combinations of paths through the compiler's code to
test), harder to debug (as many unrelated concerns interact to some extent
with each other), and harder to extend (for example, adding a new special form
to the language will necessitate changes to several transformations, but if
these are mingled in a single pass, the changes may be scattered through it,
and interact with a significant amount of surrounding code). This higher
maintenance cost also comes with another drawback: formal verification of the
compiler will clearly be more difficult when large, tangled chunks of code
which handle different semantic aspects are involved.

@todo{Talk a bit about compcert here (one of the few / the only formally
 verified compilers).}

}

@asection{
@atitle{Overusing a single @usetech{intermediate representation}}

The static analysis, optimisation and code generation phases could in
principle work on the same @usetech{intermediate representation}. However,
several issues arise from this situation.

In principle, new information gained by the static analysis may be added to
the existing representation via mutation, or the optimiser could directly
alter the @tech{IR}. This means that the @tech{IR} will initially contain
holes (e.g. represented by @racketid[null] values), which will get filled in
gradually. Manipulating these parts is then risky, as it is easy to
accidentally attempt to retrieve a value before it was actually computed.
Using the same @tech{IR} throughout the compiler also makes it difficult for
later passes to assume that some constructions have been eliminated by
previous simplification passes, and correctness relies on a fixed order of
execution of the passes; parts of the code which access data introduced or
modified by other passes are more brittle, and may be disrupted when code is
refactored (for example, when moving the computation of some information to a
later pass).

This situation becomes worse during the maintenance phase of the compiler's
lifecycle: when considering the data manipulated by a small portion of code
(in order to fix or improve said code), it is unclear which parts are supposed
to be filled in at that point, as well as which parts have been eliminated by
prior simplification passes.

Furthermore, a mutable @tech{IR} hinders parallel execution of compiler
passes. Indeed, some compiler passes will perform global transformations or
analyses, and such code may be intrinsically difficult to
parallelise. @htodo{is "parallelise" the right word?} Many other passes,
however, are mere local transformations, and can readily be executed on
distinct parts of the abstract syntax tree, as long as there is no need to
synchronise concurrent accesses or modifications.

Using immutable intermediate representations (and performing shallow copies
when updating data) can help with this second issue. However, there is more to
gain if, instead of having many instances of the same type, each intermediate
representation is given a distinct, precise type. The presence or absence of
computed information can then be known from the input and output type of a
pass, instead of relying on the order of execution of the passes in order to
know what the input data structure may contain.
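
The contrast can be illustrated with a small sketch (in Python rather than
Racket; the node shapes and the @tt{inferred_type} field are invented for the
example): a single IR with holes makes missing data observable only at run
time, whereas one precise type per stage makes its presence a guarantee of the
representation itself.

```python
from dataclasses import dataclass
from typing import Optional

# Anti-pattern: one IR type for the whole compiler, with a hole that is
# filled in later by a (hypothetical) type-inference pass.
@dataclass
class Node:
    op: str
    inferred_type: Optional[str] = None   # None until the analysis pass ran

# Reading node.inferred_type before the analysis silently yields None.

# Alternative: one precise, immutable type per stage; the presence of the
# computed information is visible in the type itself.
@dataclass(frozen=True)
class RawNode:
    op: str                               # no room for a hole

@dataclass(frozen=True)
class TypedNode:
    op: str
    inferred_type: str                    # always present, by construction

def infer(node: RawNode) -> TypedNode:
    # toy "analysis": every operation is an integer operation
    return TypedNode(node.op, "int")

typed = infer(RawNode("add"))
print(typed.inferred_type)                # prints: int
```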

}

@asection{
@atitle{Graphs}

@htodo{full stop in the middle of sentence}
Nontrivial programs are inherently graphs: they may contain mutually
recursive functions (which directly refer to each other, and therefore will
form a cycle in a representation of the program); circular and (possibly
mutually) recursive datatypes may syntactically contain (possibly indirect)
references to themselves; and the control flow graph of a function or method
can, as its name implies, contain instructions which perform conditional or
unconditional backwards branches.

However, nearly every compiler textbook will mention the use of
@tech[#:key "AST"]{Abstract Syntax Trees} (ASTs) to represent the program.
This means that a structure which intrinsically has the shape of a graph is
encoded as a tree.

Edges in the graph which may embody backward references can be made explicit
in various ways:

@itemlist[
 @item{By using a form of unique identifier, like a name bearing some semantic
  value (e.g. the fully qualified name of the type or function that is
  referred to), an index into an array of nodes (e.g. the offset of an
  instruction in a function's bytecode may be used to refer to it in the
  control flow graph), or an automatically-generated unique identifier.

  Manipulation of these identifiers introduces a potential for several sorts
  of bugs: name clashes can occur if the chosen qualification is not
  sufficient to always distinguish nodes. Thus, compiler passes which
  duplicate nodes (for example when specialising functions) or merge them must
  be careful to correctly update identifiers.

  Anecdotally, we can mention a bug in the @|monocecil| library (which allows
  easy manipulation of @|dotnet| bytecode).
When ``resolving'' a reference to a primitive type, it can happen in some
  cases that @|monocecil| returns a @tt{Type} metadata object which references
  a type with the correct name, but @todo{exported} from the wrong @|DLL|
  library.}
 @item{Alternatively, backward references may be encoded as a form of path
  from the referring node. @DeBruijn indices can be used in such an encoding,
  for example.

  Once again, manipulating these references is risky, and @DeBruijn indices
  are particularly brittle: for example, when adding a wrapper around a node
  (i.e. adding an intermediate node on the path from the root), the @DeBruijn
  indices used in some of the descendants of that node (but not all) must be
  updated. It is understandably easy to incorrectly implement updates of these
  indices, and a single off-by-one error can throw the graph's representation
  into an inconsistent state.}
 @item{The program's representation could also contain actual pointers
  (thereby really representing the program as an ``Abstract Syntax Graph''),
  using mutation to patch nodes after they are initially created.

  In order to prevent undesired mutation of the graph after it is created, it
  is possible to ``freeze'' the objects contained within@todo{references}.
  This intuitively gives guarantees similar to those obtained from a purely
  immutable representation. However, the use of mutation could obstruct
  formal verification efforts, as some invariants will need to take into
  account the two phases of an object's lifetime (during the creation of the
  containing graph, and after freezing it).
More generally speaking, it is necessary to ensure that no mutation of
  objects happens during the graph construction, with the exception of the
  mutations required to patch cycles.}
 @item{The compiler could also manipulate lazy data structures, where the
  actual value of a node in the graph is computed on the fly when that node is
  accessed.

  Lazy programs are however harder to
  debug@~cite["nilsson1993lazy" "wadler1998functional" "morris1982real"],
  as the computation of the various parts of the data manipulated does not
  occur in an intuitive order. Among other things, accidental infinite
  recursion could be triggered by totally unrelated code which merely reads a
  value.}
 @item{Finally, Higher-Order Abstract Syntax, or @|HOAS| for short, is a
  technique which encodes variable bindings as anonymous functions in the host
  language (whose parameters reify bindings at the level of the host
  language). Variable references are then nothing more than actual uses of the
  variable at the host language's level. Substitution, a common operation in
  compilers and interpreters, then becomes a simple matter of calling the
  anonymous function with the desired substitute. @|HOAS| has the additional
  advantage that it enforces well-scopedness, as it is impossible to refer to
  a variable outside of its scope in the host language.

  Parametric @|HOAS|, dubbed @|PHOAS|, also allows encoding the type of the
  variables in the representation. @todo{Can extra information other than the
  type be stored?}

  There are a few drawbacks with @|HOAS| and @|PHOAS|:

  The ``target'' of a backward reference must be above all uses in the tree
  (i.e. a node may be the target of either backward references or forward
  references, but not a mix of both). This might not always be the case.
For example, pre/post-conditions could, in an early pass in the compiler, be
  located outside of the normal scope of a function's signature, but still
  refer to the function's parameters. If the pre/post-condition language
  allows breaking encapsulation, these could even refer to some temporary
  variables declared inside the function.

  @;{
   @; True for HOAS, not sure for PHOAS.
   @todo{The ``target'' of a backward reference does not initially contain
    additional data (e.g. the variable name to be used for error messages, its
    static or concrete type and so on) although extending the encoding to
    support this should be feasible.}
  }

  @todo{@|PHOAS| naturally lends itself to the implementation of
   substitutions, and therefore is well-suited to the writing of interpreters.
   However, the representation cannot be readily traversed and accessed as
   would be done with normal structures, and therefore the model could be
   counterintuitive.}

  @todo{It seems difficult to encode an arbitrary number of variables bound in
   a single construct (e.g. to represent bound type names across the whole
   program, or an arbitrary number of mutually-recursive functions declared
   via @racketid[let … and … in …], with any number of @racketid[and] clauses
   in the compiled language).}}]

Although some of these seem like viable solutions (e.g. explicitly freezing
objects), they still involve low-level mechanisms to create the graph. When
functionally replacing a node with a new version of itself, all references to
it must be updated to point to the new version. Furthermore, code traversing
the graph to gather information or perform updates needs to deal with cycles,
in order to avoid running into an infinite loop (or infinite recursion).
Finally, backward edges are not represented in the same way as forward edges,
introducing an arbitrary distinction when fetching data from the neighbours of
a node. This last aspect reduces the extensibility of code which manipulates
graphs where access to fields is not done uniformly: if new features of the
language being compiled require turning a ``seamless'' edge into one which
must be explicitly resolved in some way (e.g. because this node, in the
updated @tech{IR}, may now be part of cycles), this change of interface will
likely break old code which relied on seamless access to that field.

We think that the compiler engineer could benefit from abstracting over these
implementation details, and think of compiler passes in terms of graph
transformations. Programmers using functional languages often write list
transformations using @htodo{higher-order} functions like @racket[map],
@racket[foldl], @racket[filter], @racket[zip] and so on, instead of explicitly
writing recursive functions.@htodo{mention FxCop and/or StyleCop, which have
 some hand-coded (?) functions to traverse all nodes of a certain type.}

The graph can be manipulated by updating some or all nodes of a given type,
using an old-node to new-node transformation function. This transformation
function could produce more than one node, by referencing the extra nodes from
the replacement one. It should furthermore be possible to locally navigate
through the graph, from its root and from the node currently being
transformed. This interface would make it possible to handle cycles
seamlessly — transformations which apply over a whole collection of nodes need
not be concerned with cycles — and still allow local navigation, without
distinguishing backward and forward edges.
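
The kind of node-by-node mapping described above can be sketched as follows
(an illustrative stand-in written in Python, not the interface of our
framework; the memoisation table is what relieves individual transformations
from worrying about cycles):

```python
# Invented graph representation: nodes carry a label and a list of
# successors, which may include back-edges and therefore form cycles.

class Node:
    def __init__(self, label):
        self.label = label
        self.successors = []

def map_nodes(root, f):
    """Rebuild the graph reachable from `root`, applying `f` to each label."""
    mapping = {}                     # old node -> new node (memoisation)
    def visit(node):
        if node in mapping:          # already copied: reuse it, closing cycles
            return mapping[node]
        new = Node(f(node.label))
        mapping[node] = new          # register BEFORE visiting successors,
        new.successors = [visit(s) for s in node.successors]  # so cycles stop
        return new
    return visit(root)

# A two-node cycle: a -> b -> a
a, b = Node("a"), Node("b")
a.successors, b.successors = [b], [a]

a2 = map_nodes(a, str.upper)
print(a2.label, a2.successors[0].label)        # A B
print(a2.successors[0].successors[0] is a2)    # True: cycle preserved
```

The transformation passed to @tt{map_nodes} never sees the cycle: it is a
plain label-to-label function, while the traversal machinery handles sharing
and back-edges uniformly.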

@htodo{Be a bit more verbose on what cool stuff it allows}

@htodo{Think about ensuring that nodes from two distinct graphs are not mixed
 in unexpected ways (placing a dummy phantom type somewhere should be enough
 to prevent it).}

}

@asection{
@atitle{Expressing the data dependencies of a pass via row types}

It is easy enough to test a compiler by feeding it sample programs and
checking that the compiled output behaves as expected. However,
@htodo{triggering} a specific set of conditions inside a given pass, in order
to achieve reasonably complete code coverage, may prove to be a harder task:
previous passes may modify the input program in unexpected ways, and obtaining
a certain data configuration at some point in the compiler requires the
developer to mentally execute the compiler's code @emph{backwards} from that
point, in order to determine the initial conditions which will produce the
desired configuration many steps later. This means that extensively testing
corner cases which may occur in principle, but are the result of unlikely
combinations of features in the input program, is a cumbersome task.
@htodo{i.e. unit testing vs other forms which test larger components.}

If the compiler consists of many small passes, whose inputs and outputs are
serializable, then it becomes possible to thoroughly test a single pass in
isolation, by supplying an artificial, crafted input, and checking some
properties of its output.

However, a compiler written following the @|nanopass| philosophy will feature
many small passes which read and update only a @htodo{small: repetition} small
part of their input.
Specifying actual values for the irrelevant parts of the data not only makes
the test cases more verbose than they need to be, but also hides from plain
sight which variations of the input matter (and may thus allow the detection
of new errors), and which parts of the input are mere placeholders whose
actual value will not influence the outcome of the pass, aside from being
copied over without changes.

It is desirable to express, in a statically verifiable way, which parts of the
input are relevant, and which parts are copied verbatim (modulo updated
sub-elements). Furthermore, it would be useful to be able to specify only the
relevant parts of tests, and omit the rest (instead of supplying dummy
values).

@htodo{embed within each graph a ``mapper'': specify one (or more?) mappings
 (not anchored to the root) and the mapper updates these nodes, keeping the
 rest intact. A pass then may expect a ``mappable'' object, regardless of the
 actual shape of the irrelevant parts (including those on the paths between
 the root and the relevant nodes).}

Row polymorphism allows us to satisfy both expectations. The irrelevant fields
of a record and the irrelevant cases of a variant type can be abstracted away
under a single row type variable. ``Row'' operations on records allow reading
and updating the relevant fields, while keeping the part abstracted by the row
type variable intact.
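
As a rough illustration of the intent (a dynamically-checked Python stand-in
with invented node shapes; in a row-polymorphic type system, the guarantee
that the rest of the record is untouched would be checked statically):

```python
# A pass declares which field it reads and writes ("name"); all other
# fields -- the "row" -- are threaded through untouched. Here this holds
# by construction; a row type would make it a static guarantee.

def rename_variable(node: dict, old: str, new: str) -> dict:
    """Update only the 'name' field; preserve the rest of the record."""
    relevant = {"name": new} if node.get("name") == old else {}
    return {**node, **relevant}   # the row (all other fields) is kept intact

# During actual compilation, nodes carry many extra fields...
full_node = {"name": "x", "type": "int", "span": (3, 14), "annotations": []}
# ...but a unit test only needs to supply the relevant field:
tiny_node = {"name": "x"}

print(rename_variable(full_node, "x", "y"))
print(rename_variable(tiny_node, "x", "y"))   # {'name': 'y'}
```

The same pass runs unchanged on the full node and on the minimal test node,
which is exactly the effect of instantiating the row type variable to the full
set of extra fields or to an empty one.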
When invoking a compiler pass, the row type variables may be instantiated to
the full set of extra fields present in the real @tech{IR}, when the pass is
called as part of the actual compilation; it is also possible, when the pass
is called during testing, to instantiate them to an empty set of fields (or to
use a single field containing a unique identifier, used to track ``object
identity'').}

@asection{
@atitle{Verification}

@todo{Needs a transition from the previous section, or this should be moved
 elsewhere.}

The implementation presented in this thesis cannot be immediately extended to
support end-to-end formal verification of the compiler being written. However,
it helps pave the way for writing formally verified compilers: firstly, the
smaller passes are easier to verify. Secondly, intermediate representations
which closely match the input and output data can be used, given a formal
semantics of each @tech{IR}, to assert that a transformation pass
systematically preserves the semantics of the input. Thirdly, the use of a
typed language, instead of the currently ``untyped'' @|nanopass| framework,
means that many properties can be ensured by relying on the type system.
@|Typedracket|'s type checker is not formally verified itself and would have
to be trusted (alternatively, the adventurous researcher could attempt to
derive well-typedness proofs automatically, by hijacking the type checker to
generate traces of the @htodo{type checking} steps involved, or manually, only
using the type checker as a helper tool to detect and weed out issues during
development).
Fourthly, the explicit specification of the dependencies of passes on their
input via row types is a form of frame specification@htodo{reference}, and can
significantly ease the verification effort, as the engineer can rely on the
fact that irrelevant parts were not modified in the output. These
@htodo{frame properties}@htodo{or rather frame specifications?} are statically
enforced by our extension of @|typedracket|'s type system, which we formalise
in chapter @todo{??}. This relies on trusting @|typedracket| itself once
again, and on the correctness of our implementation of the translation from
the extended type system to @|typedracket|'s core type system. Fifthly, we
provide the means to express graph transformations as such, instead of
working with an encoding of graphs as abstract syntax trees (or directed
acyclic graphs) with explicit backward references. We are hopeful that
eliminating this mismatch will be beneficial to the formal verification of the
transformation passes.

These advantages would be directly available to engineers attempting a formal
proof of their compiler, while trusting the correctness of @|typedracket|, as
well as that of our framework. However, the implementation of our framework is
not hardened, and @|typedracket| itself suffers from a small number of known
sources of unsoundness@note{See
 @url["https://github.com/racket/typed-racket/issues"].}. In order to perform
an end-to-end verification of a compiler, it would therefore be necessary to
port our approach to a language better suited to formal verification.
Alternatively, Racket could in principle be extended to help formalisation
efforts.
Two approaches come to mind: the first consists in proving that the macros
correctly implement the abstraction which they attempt to model; the second
would be to have the macros inject annotations and hints indicating properties
that must be proven, in the same way that type annotations are currently
inserted. These hints could then be used by the prover to generate proof
obligations, which could be solved manually or automatically.

@htodo{atitle{…}}

Mixing macros and language features which help obtain static guarantees is a
trending topic in the Racket ecosystem and in other communities.

@|Typedracket| is still in active development, and several other projects
were presented recently.

@|Hackett|@note{@url["https://github.com/lexi-lambda/hackett"]}, mainly
developed by @lastname{King}, is a recent effort to bring a @|haskell98|-like
type system to Racket.

@|Hackett| is built on the techniques developed by @lastname{Chang},
@lastname{Knauth} and @lastname{Greenman} in
@~cite["chang2017type-systems-as-macros"], which led to the @|turnstile|
library@note{@url["https://bitbucket.org/stchang/macrotypes.git"]}.
@|Turnstile| is a helper library which facilitates the creation of typed
languages in Racket. Macros are amended with typing rules, which are used to
thread type information through uses of macros, definition forms and other
special forms. Type checking is therefore performed during macro-expansion,
and does not rely on an external type checker which would work on the expanded
code. As a result, new languages built with Racket and @|turnstile| are not
limited to a pre-existing type system, but can rather devise their own from
the ground up. This approach brings a lot of flexibility, the drawback being
that more trust is put in the language designer.
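
The flavour of this approach can be conveyed by a toy sketch (written in
Python, with invented forms; this is in no way Turnstile's actual API): each
``typed macro'' expands its sub-terms and synthesises their types in the same
traversal, so that ill-typed programs are rejected during expansion rather
than by a separate checker running on the expanded code.

```python
# Toy "expander": terms are nested tuples; each expansion step returns both
# the expanded code (a string here) and the synthesised type of the term.

def expand_and_check(expr, env):
    """Expand `expr` and synthesise its type in a single pass."""
    if isinstance(expr, int):
        return str(expr), "Int"
    if isinstance(expr, str):                 # variable reference
        return expr, env[expr]
    head, *args = expr
    if head == "add":                         # a "typed macro": it checks its
        code_types = [expand_and_check(a, env) for a in args]
        for code, ty in code_types:           # sub-terms while expanding them
            if ty != "Int":
                raise TypeError(f"add expects Int, got {ty} in {code}")
        return "(" + " + ".join(c for c, _ in code_types) + ")", "Int"
    raise ValueError(f"unknown form: {head}")

code, ty = expand_and_check(("add", 1, ("add", "x", 2)), {"x": "Int"})
print(code, ":", ty)    # (1 + (x + 2)) : Int
```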

The work presented in this thesis@htodo{, which started before @|turnstile|
 was publicly announced,} aims to follow a different path from that taken by
@|turnstile|: we chose to implement our extended type system as an abstraction
over the existing type system of @|typedracket|. This means that we do not
rely as much on the correctness of our own typing rules: instead of verifying
the well-typedness of compilers written using our framework ourselves, we
inject type annotations into the expanded code, which are then verified by
@|typedracket|. We are therefore confident that type errors will be caught by
@|typedracket|, safe in the knowledge that the code obtained after
macro-expansion is type-safe@note{We actually use @racket[unsafe-cast] on a
 few occasions. Such a blunt instrument is however only used in cases where
 @|typedracket| already has the required type information, but the type
 checker fails to deduce the equivalence between two formulations of the same
 type, or does not acknowledge that the term being checked has the expected
 type. These issues are being worked on by the developers of @|typedracket|,
 however, and we hope to be able to remove these uses of @racket[unsafe-cast]
 in later versions.}. This increased serenity comes at the cost of
flexibility, as @|typedracket|'s type system was not able to express the type
constructs that we wanted to add. We therefore had to resort to a few hacks to
translate our types into constructs that could be expressed using
@|typedracket|.

The approach of building more complex type constructs atop a small trusted
kernel has been pursued by
@|cur|@note{@url["https://github.com/wilbowma/cur"]}, developed by
@lastname{Bowman}. @|Cur| is a dependently typed language which permits
theorem proving and verified programming.
It is based on a small kernel (the Curnel), which excludes language features
that can be expressed by macro transformations. Most notably, the prover's
tactics are defined using metaprogramming tools, and are not part of the core
language.

Another tool worth mentioning is
Rosette@note{@url["https://github.com/emina/rosette"]}@todo{reference}.
Rosette, mainly developed by @lastname{Torlak}, is a solver-aided language: it
tightly integrates an SMT solver with some Racket constructs, so that powerful
queries can be performed and answered, for example ``what input values to the
function f will generate outputs which satisfy the predicate p?''. It can also
generate simple functions which satisfy given conditions. These features allow
it to be used both as a helper tool during development, for engineers coming
from various domains, and as a verifier, as the solver can be used to assert
that a function will never give an erroneous output, given a set of
constraints on its input.

@htodo{Check that I'm not saying anything wrong about Turnstile here.} The
idea of expressing the type of macro transformations at some level (e.g. by
indicating the type of the resulting expression in terms of the types of the
input terms, as is allowed by @|turnstile|) is not new: in 2001, Ganz, Sabry
and Taha presented
MacroML@~cite["ganz2001macros-as-multi-stage-computations"], an experimental
language which allows type checking programs before macro-expansion. It seems,
however, that there is renewed interest, both in the Racket community and
elsewhere, in languages with powerful metaprogramming facilities coupled with
an expressive type system. We are hopeful that this will lead to innovations
concerning the formal verification of programs making heavy use of complex
macros.
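
As an illustration of the kind of query involved (a brute-force Python
stand-in over a finite domain; an SMT solver, as used by Rosette, answers the
same question symbolically and over unbounded domains):

```python
# "Which inputs to f produce outputs satisfying p?" -- the query is stated
# declaratively; the search strategy is the tool's concern, not the user's.

def f(x: int) -> int:
    return x * x - 4

def p(y: int) -> bool:
    return y == 0

def solve(f, p, domain):
    """Search `domain` for the inputs whose image under f satisfies p."""
    return [x for x in domain if p(f(x))]

print(solve(f, p, range(-10, 11)))   # [-2, 2]
```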
}

@;{
The static analysis, optimisation and code generation phases will often work
on that @usetech{intermediate representation}.

These transformations are often non-trivial and may require aggregating and
analysing data scattered across the program.

We build upon the achievements of the @|nanopass-c-f| project,
which is presented in more detail in section XYZ. Simply put, @|nanopass|
helps the programmer define a myriad of compiler passes, each doing a very
small amount of work (and therefore easy to test and maintain), and each with
a different input and output type.

}
}

@asection{
@atitle{Summary}

@epigraph[#:width "8cm"
          @elem{Unknown}]{
 Once upon a time…}

@asection{
@atitle{Extensible type system}

We implemented a different type system on top of @|typedracket|, using
macros. Macros have been used not only to extend a language's syntax (control
structures, contract annotations, and so on), but also to reduce the amount
of ``boilerplate'' code and obtain clearer syntax for common or occasional
tasks. Macros have further been used to extend the language with new
paradigms, like adding an object system
(CLOS@~cite["bobrow_common_1988"]@todo{is it really implemented using
macros?}, Racket classes@~cite["flatt_scheme_classes_2006"]) or supporting
logic programming (Racket
@cond-element[
 [html @elem{
   @seclink["top" #:doc '(lib "datalog/scribblings/datalog.scrbl")]{Datalog}
   and
   @seclink["top" #:doc '(lib "racklog/racklog.scrbl")]{Racklog}}]
 [else @elem{
   Datalog@note{@url{http://docs.racket-lang.org/datalog/}} and
   Racklog@note{@url{http://docs.racket-lang.org/racklog/}}}]],
Rosette@~cite["torlak_growing_rosette_2013" "torlak_rosette_symbolic_vm"]).
In the past, @|typedracket|@~cite["tobin-hochstadt_design_2008"] has proved
that a type system can be successfully fitted onto an existing ``untyped''
language, using macros. We implemented the @racketmodname[type-expander]
library atop @|typedracket|, without altering @|typedracket| itself. Our
@racketmodname[type-expander] library allows macros to be used in contexts
where a type is expected.

This shows that an existing type system can be made extensible using macros,
without altering the core implementation of the type system. We further use
these type expanders to build new kinds of types which were not initially
supported by @|typedracket|: non-nominal algebraic types, with row
polymorphism. @|Typedracket| has successfully been extended in the past (for
example by adding type primitives for Racket's class system, which
incidentally also support row polymorphism), but these extensions required
modifications to the trusted core of @|typedracket|. Aside from a small hack
(needed to obtain non-nominal algebraic types which remain equivalent across
multiple files), our extension integrates seamlessly with other built-in
types, and code using these types can use idiomatic @|typedracket| features.

@|Typedracket| was not initially designed with this extension in mind, nor,
to our knowledge, was it designed with the goal of being extensible. We
therefore argue that a better choice of primitive types supported by the core
implementation could enable many extensions without the need to resort to
hacks like the ones we had to use. In other words, a better design of the
core types with extensibility in mind would have made our job easier.
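
As an illustration of what ``macros in type contexts'' look like, a
type-level macro could abbreviate nested @racket[Pairof] types. This is a
hypothetical sketch, assuming the @racket[define-type-expander] form of the
@racketmodname[type-expander] library; the @racket[Pairof*] expander and its
use below are invented for this example:

@codeblock|{
#lang typed/racket
(require type-expander)

;; Hypothetical expander: (Pairof* A B C) expands to
;; (Pairof A (Pairof B C)), usable wherever a type is expected.
(define-type-expander (Pairof* stx)
  (syntax-case stx ()
    [(_ a) #'a]
    [(_ a b rest ...) #'(Pairof a (Pairof* b rest ...))]))

(: triple (Pairof* Integer String Symbol))
(define triple (cons 1 (cons "two" 'three)))
}|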
In particular, types in Typed
Clojure@~cite["practical_types_for_clojure_2016"] support fine-grained typing
of heterogeneous hash tables; this would likely allow us to build the
``strong duck typing'' primitives on which our algebraic data types are based
much more easily, without the need to resort to hacks.

In languages making heavy use of macros, it would be beneficial to design
type systems with a well-chosen set of primitive types, on top of which more
complex types can be built using macros.

Building the type system via macros atop a small kernel is an approach that
has been pursued by Cur, a dependently-typed language developed with Racket,
in which the tactics language is entirely built using macros, and does not
depend on Cur's trusted type-checking core.}

@asection{
@atitle{Compiler-writing framework}

Our goal was to introduce a compiler-writing framework which:
@itemlist[
 @item{Supports writing a compiler using many small passes (in the spirit of
  @|nanopass|)}
 @item{Allows writing the compiler in a strongly-typed language}
 @item{Uses immutable data structures for the Intermediate Representations
  (ASTs)}
 @item{Supports backwards branches in the AST, effectively making it an
  Abstract Syntax Graph (this is challenging due to the use of immutable
  data structures).}
 @item{Provides easy manipulation of the Intermediate Representations: local
  navigation from node to node, global higher-order operations over many
  nodes, easy construction, easy serialization, with the guarantee that at
  no point can an incomplete representation be manipulated.
These operations should seamlessly handle backward arcs.}
 @item{Enforces structural invariants (checked either at compile-time or at
  run-time), and ensures via the type system that unchecked values cannot be
  used where a value respecting the invariant is expected.}
 @item{Has extensive support for testing the compiler: it allows the
  generation of random ASTs @todo{(note that the user guides the random
  generation, it's not fully automatic like quickcheck)}, makes it possible
  to read and write ASTs from files and compare them, and allows compiler
  passes to consume ASTs containing only the relevant fields (using row
  polymorphism).}]}

@htodo{
testing (random stuff)

TODO:@~cite{quickcheck} and other things related to test generation
(I have something in Zotero)

TODO: and the l-sets (find something about that)

TODO: Problems with ``mocks'' and strong typing (null values everywhere,
bleargh).
}

@asection{
@atitle{Overview}

The rest of this document is structured as follows:

@(define (sec-n-ref-self tag)
   (sec-n-ref tag #:doc '(lib "phc-thesis/scribblings/phc-thesis.scrbl")))
@list{
@itemlist[
 @item{@sec-n-ref-self["State_of_the_art"] presents related work. It
  discusses approaches to extensible type systems in
  @sec-n-ref-self["related-type-expander"]. @sec-n-ref-self["related-adt"]
  considers how structures, variants and polymorphism may exhibit different
  properties in different languages, and makes the case for bounded row
  polymorphism as a well-suited tool for building compilers. The
  @|nanopass-c-f| is presented in @sec-n-ref-self["related-nanopass"], and
  techniques used to encode and manipulate graphs are studied in
  @sec-n-ref-self["related-cycles"].}
 @item{In @sec-n-ref-self["initial-examples-chap"], we give some example
  uses of the compiler framework described in this thesis.
We indicate the pitfalls, bugs and boilerplate avoided thanks to its use.
  @todo{Move this above the related work?}}
 @item{@sec-n-ref-self["tr-chap"] gives an overview of the features of
  @|typedracket|'s type system. It then formalises the features relevant to
  our work, giving the subtyping rules, as well as the builder and accessor
  primitives for values of these types.}
 @item{@sec-n-ref-self["type-expander-chap"] explains how we implemented
  support for type-level macros as an extension to @|typedracket|, without
  modifying the core implementation. We detail the reduction rules which
  translate a type making use of expanders into a plain type.}
 @item{@sec-n-ref-self["adt-chap"] discusses the flavour of algebraic
  datatypes which we chose to implement atop @|typedracket|, as well as its
  extension with row types. We explain our goals and constraints in detail,
  and present two implementations — a first attempt, which unfortunately
  required verbose annotations for some ``row operations'', and a second
  solution, which greatly reduces the number of annotations required by
  using a different implementation strategy. We give formal semantics for
  these extensions, give the reduction rules which translate them back to
  @|typedracket|'s type system, and show that this reduction preserves the
  original semantics. Finally, we define a layer of syntactic sugar which
  allows convenient use of these new type primitives.}
 @item{@sec-n-ref-self["typed-nanotrees-chap"] builds upon these algebraic
  datatypes and row types to define a typed version of the @|nanopass-c-f|
  which operates on abstract syntax trees. We explain the notions of
  @htodo{tech}@tech{tree nodes}, @htodo{tech}@tech{mapping functions} and
  @htodo{tech}@tech{cata-morphisms}, show how these interact with row
  typing, and give an overview of the implementation.
We then extend this with @tech{detached mappings}: this feature allows the
  user to map a function over all nodes of a given type in a graph,
  regardless of the structure of the graph outside of the relevant parts
  which are manipulated by the mapping function. This allows test data to
  omit not only irrelevant branches of the abstract syntax tree (by omitting
  the field pointing to these branches), but also irrelevant parts located
  above the interesting nodes. In other words, this feature allows cutting
  off the top of the abstract syntax tree and gluing together the resulting
  forest. We also formally describe the result of applying a set of detached
  or regular mapping functions to an input tree.}
 @item{@sec-n-ref-self["typed-nanodags-chap"] extends this typed version of
  @|nanopass| to handle directed acyclic graphs. We start by considering
  concerns such as equality of nodes (for which the previous chapter assumed
  a predicate existed, without actually implementing it) and hash consing.
  This prepares the ground for the extension presented in the next chapter,
  namely graphs.}
 @item{We then introduce support for cycles in
  @sec-n-ref-self["typed-nanographs-chap"]: instead of describing abstract
  syntax tree transformations, it becomes possible to describe graph
  transformations. To this end, we introduce new kinds of nodes:
  @tech{placeholders}, which are used during the construction of the graph;
  @tech{with-indices} nodes, which encode references to neighbouring nodes
  as indices into arrays containing all nodes of a given type in the current
  graph; and @tech{with-promises} nodes, which hide away this implementation
  detail by lazily resolving all references, using a uniform API. This
  allows all fields of a given node to be accessed in the same way, while
  still allowing logical cycles built atop a purely immutable
  representation.
We give an overview of how our implementation handles cycles, using
  worklists which gather and return @tech{placeholders} when the mapping
  functions perform recursive calls, and subsequently turn the results into
  @tech{with-indices} nodes, then into @tech{with-promises} ones. We then
  update our notion of equality and hash consing to account for cycles, and
  update the formal semantics accordingly.}
 @item{@;{The fourth feature…}
  Extra safety can be obtained by introducing structural invariants which
  constrain the shape of the graph. For example, it is possible to ensure
  the well-scopedness of variable references. Another desirable property
  would be the absence of cycles, either to better model the @tech{IR},
  knowing that cyclic references are not allowed at some point by the
  target language, or to detect early on changes to the @tech{IR} which may
  break code assuming the absence of cycles. A third option would be to
  ensure that ``parent'' pointers are correct, and that the node containing
  them is indeed referenced by the parent (i.e. to ensure the
  @emph{presence} of well-formed cycles).
  @sec-n-ref-self["structural-invariants-chap"] presents an extension of our
  graph manipulation framework which allows the specification of structural
  invariants. These can in some cases be checked statically; in other cases
  it may be necessary to follow a macro-enforced discipline, and as a last
  resort, a dynamic check may be performed.

  @htodo{+ singleton / all nodes of a given type referenced by a central
  point with bounded length / only one node with a given value for a given
  property.}

  We further explain how we use phantom types to reflect at the type level
  which invariants were checked on an instance of a graph.
The types used to represent that an instance satisfies an invariant are
  chosen so that instances with stronger invariants are subtypes of
  instances with weaker invariants.}
 @item{Finally, in @sec-n-ref-self["future-extensions-chap"] we succinctly
  present some extensions which could be added to the framework presented in
  the previous chapters. We discuss how it would be possible to
  garbage-collect unused parts of the graph when only a reference to an
  internal node is kept and the root is logically unreachable. Another
  useful feature would be the ability to define general-purpose graph
  algorithms (depth-first traversal, topological sort, graph colouring, and
  so on) operating on a subset of the graph's fields. This would make it
  possible to perform these common operations while considering only a
  subgraph of the one being manipulated. Finally, we mention the possibility
  of implementing an α-equivalence comparison operator.@;{α-normal form}}
 @item{In @sec-n-ref-self["results-chap"], we present more examples and
  revisit the initial examples illustrating our goals in the light of the
  previous chapters.

  We ported the most complete compiler written with @|nanopass| (a Scheme
  compiler) to our framework; we summarise our experience and compare our
  approach with @|nanopass|, indicating the changes required, in particular
  how many additional type annotations were necessary.}
 @item{Finally, we conclude and list future work directions.
  @htodo{(turnstile, improvements to Racket itself, etc.)}}]}}}