This is a very nice wiki. I am implementing an ML dialect myself, and studying the implementations of MLKit, MosML, and MLton has been very enlightening. I work with the maintainers of the former two, but my compiler has ended up being more like MLton in its design.
Oh wow, I had Professor Fluet in college; he taught me functional programming concepts in Haskell and Lisp. Cool to see his work on the front page of HN.
In addition to being a smart guy, he was a really good professor. I really enjoyed the class (a survey of many different paradigms), even if I did come away jaded, thinking that all the standard languages I have used in my career since are poor approximations of what they could be!
The highest-performing compiler for the ML family is MLton. The strongest on correctness is CakeML (https://cakeml.org), which is formally verified. Anyone looking for an interesting project in compilers and verification should consider porting some MLton techniques to CakeML.
I don't know compilers, but this seems like a lot more steps than I would have expected. Is it actually an unusually large number of transformations, or is that just how compilers are done?
Some are single-pass, like Turbo Pascal. They generate machine code directly while parsing (no AST!). Niklaus Wirth (the Pascal/Modula/Oberon guy) wrote a book following this approach [1].
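To make that concrete, here is a toy sketch of the single-pass idea (in OCaml rather than Wirth's Oberon; all names are invented): a recursive-descent parser that emits stack-machine instructions the moment it consumes tokens, never building an AST.

    type token = Num of int | Plus | Star

    let emit s = print_endline s

    (* factor := number; code is emitted as soon as the token is consumed *)
    let factor = function
      | Num n :: rest -> emit ("PUSH " ^ string_of_int n); rest
      | _ -> failwith "expected a number"

    (* term := factor ('*' factor)* *)
    let term toks =
      let rec loop = function
        | Star :: rest -> let rest = factor rest in emit "MUL"; loop rest
        | toks -> toks
      in
      loop (factor toks)

    (* expr := term ('+' term)* *)
    let expr toks =
      let rec loop = function
        | Plus :: rest -> let rest = term rest in emit "ADD"; loop rest
        | toks -> toks
      in
      loop (term toks)

    (* "1 + 2 * 3" comes out as PUSH 1; PUSH 2; PUSH 3; MUL; ADD,
       with no tree ever built in between *)
    let () = ignore (expr [Num 1; Plus; Num 2; Star; Num 3])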
Some have many passes, like Chez Scheme. It has many simple passes, each of which they call a "nanopass". Andrew Keep has a great talk on this approach [2].
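And a sketch of the nanopass flavor (toy AST and invented names, nothing from Chez itself): each pass does one tiny rewrite, and the compiler is just the composition of many such passes, each trivial to test on its own.

    type exp = Int of int | Var of string | Add of exp * exp

    (* pass 1: fold additions of two literals *)
    let rec fold_consts = function
      | Add (a, b) ->
          (match fold_consts a, fold_consts b with
           | Int x, Int y -> Int (x + y)
           | a', b' -> Add (a', b'))
      | e -> e

    (* pass 2: drop "+ 0", which pass 1 may have exposed *)
    let rec drop_zero = function
      | Add (a, Int 0) | Add (Int 0, a) -> drop_zero a
      | Add (a, b) -> Add (drop_zero a, drop_zero b)
      | e -> e

    (* the "compiler" is a pipeline of small passes *)
    let compile e = drop_zero (fold_consts e)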
In practice, most compilers today are multi-pass, though probably with not as many passes as Chez. If we look at Rust, it goes from AST -> HIR -> MIR -> LLVM IR -> machine code [3]. There is probably more going on between LLVM IR and machine code, but I'm not knowledgeable enough to comment on it.
I think the trade-off here is clear: fewer passes -> shorter compile times; more passes -> faster generated code and a more modular compiler. Martin Odersky (the Scala guy) has a paper attempting to get the best of both [4].

[1] http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf
[2] https://www.youtube.com/watch?v=Os7FE3J-U5Q
[3] https://blog.rust-lang.org/2016/04/19/MIR.html
[4] https://infoscience.epfl.ch/record/228518/files/paper.pdf
I see lots of comments about how the multiple passes have to do with limited resources like memory. In MLton's case (it's a whole-program optimizing compiler), the translation phases have nothing to do with resource contention. And in general, limited resources are typically handled by separate compilation, not by compiler passes.
The more important design choices here have to do with
1) Typed assembly languages (rather than untyped assembly languages)
2) Type directed translation (rather than syntax directed translation)
3) The ability to reason about the correctness of each step of the translation as the program is gradually lowered from SML to assembly.
For more on the typed assembly language work, check out the work of Morrisett and Harper that came out of CMU (specifically the TIL / TAL parts of the ConCert project). The XML type system was developed a little bit before that by Harper and Lillibridge. The general idea of a language being a mathematical object, with an elaboration and a semantics where you can reason about progress and preservation at each step, is a result of the vision of Robin Milner (there's a good overview here: http://homepages.inf.ed.ac.uk/stg/Milner2012/R_Harper-html4-...)
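To give a feel for what a typed intermediate language buys you, here is a tiny toy of my own (nothing like the real TIL/TAL, which type actual assembly), using OCaml GADTs. Because the lowering pass has type 'a src -> 'a ir, the metalanguage's type checker itself verifies that every translation step preserves the type of the program being rewritten.

    (* a typed source language and a typed IR, indexed by result type *)
    type 'a src =
      | SInt : int -> int src
      | SAdd : int src * int src -> int src
      | SPair : 'a src * 'b src -> ('a * 'b) src

    type 'a ir =
      | IConst : int -> int ir
      | IPrim : string * int ir list -> int ir
      | ITuple : 'a ir * 'b ir -> ('a * 'b) ir

    (* if this type-checks, the translation is type-preserving by
       construction: returning, say, an int ir for a pair would be a
       compile-time error in the compiler itself *)
    let rec lower : type a. a src -> a ir = function
      | SInt n -> IConst n
      | SAdd (a, b) -> IPrim ("add", [lower a; lower b])
      | SPair (a, b) -> ITuple (lower a, lower b)

That's the "reason about the correctness of each step" point in miniature: a bug in a pass shows up when you build the compiler, not when you run the compiled program.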
It depends on the language: how far away is it from machine code? (among other things)
C is very close to machine code, so it takes few passes to write a simple non-optimizing C compiler. I'm pretty sure that, after parsing, the original C compilers were one pass. We talked about them a bit here [1].
ML is further from the machine model, so compiling it takes more work. For example, it has closures, while C doesn't, and that's something extra that you have to deal with, probably in multiple stages of the compiler.
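As a rough illustration of the closure point (invented names, not MLton's actual scheme): closure conversion rewrites a function that captures variables into a closed function plus an explicit environment record, which is essentially what you would have to write by hand in C.

    (* source-ish program: fun add x = fn y => x + y
       -- the inner fn captures x, which a C function can't do *)

    (* after closure conversion: a code pointer plus an environment *)
    type env = { x : int }

    let inner_code (e : env) (y : int) = e.x + y  (* closed: no free vars *)

    let add x = (inner_code, { x })               (* allocate the closure *)

    (* a call applies the code pointer to its own environment *)
    let apply (code, e) y = code e y

    let () = assert (apply (add 40) 2 = 42)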
A common theme in functional languages like ML and Lisp is "desugaring passes" or "lowering passes", which basically means turning a language construct into a more basic one so that later stages can treat things more uniformly. The more language features you have, the more potential for desugaring/lowering. OCaml and Haskell in particular have a ton of features that can be treated this way.

Slide 14 here has a (partial) diagram of the Scala compiler, which is extremely deep: https://www.slideshare.net/Odersky/compilers-are-databases

[1] https://news.ycombinator.com/item?id=16610938
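To make the desugaring idea concrete, here is a toy OCaml sketch (invented AST, not any real compiler's). SML itself defines andalso/orelse as derived forms of if, so a front end can rewrite them away early and every later pass only has to know about If.

    type exp =
      | Bool of bool
      | If of exp * exp * exp
      | Andalso of exp * exp  (* surface-level sugar *)
      | Orelse of exp * exp

    (* e1 andalso e2  ==>  if e1 then e2 else false
       e1 orelse e2   ==>  if e1 then true else e2 *)
    let rec desugar = function
      | Andalso (a, b) -> If (desugar a, desugar b, Bool false)
      | Orelse (a, b) -> If (desugar a, Bool true, desugar b)
      | If (c, t, f) -> If (desugar c, desugar t, desugar f)
      | Bool _ as e -> e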
There are many different approaches. Chez Scheme does a gazillion steps using the nanopass framework, but is still fast enough to compile a metric shit-tonne of code without any noticeable delay.
I have seen some work on a real-time version of MLton. I would love to see more of this work, because nothing would please me more than being able to do embedded systems with an ML language.