top | item 14123100

Why ML/OCaml are good for writing compilers (1998)

241 points| monssoen | 9 years ago |flint.cs.yale.edu | reply

147 comments

order
[+] hongbo_zhang|9 years ago|reply
For web developers who are looking for an industrial strength functional language instead of JS, OCaml probably has the best story here.

Actually it has two OCaml->JS compilers of very high quality The first one, js_of_ocaml, could bootstrap the whole compiler several years ago(probably the first one there).

The recent one, https://github.com/bloomberg/bucklescript, push the JS compilation into next level, it generates fairly readable code, good FFI story, and its compilation is extremely fast, check out the compiler in JS version(http://bloomberg.github.io/bucklescript/js-demo/), and imagine how fast it would be for the compiler in native version. BuckleScript has a good story for Windows, and generates fairly efficient code, see benchmark here: https://github.com/neonsquare/bucklescript-benchmark BuckleScript is already used in production by big companies: for example Facebook messenger.com 25% is powered by BuckleScript, the new WebAssembly spec interpreter by Google is also partly cross compiled into JS by BuckleScript.

Disclaimer: I am one of the authors of BuckleScript

[+] atombender|9 years ago|reply
I'm optimistic about Reason, Facebook's new syntax "skin" on top of OCaml. I find OCaml's syntax to be quite gnarly; of the MLs, F# is probably the cleanest and most modern-feeling. Something like F# without the .NET stuff could have been amazing.
[+] jasim|9 years ago|reply
And the OCaml-as-JS community is small and very welcoming. How often can you get first-class support from the compiler makers as a newbie? Hongbo Zhang and Jordan Walke and Cheng Lou are all on the Discord channel and are extremely helpful.

Jump in fellas, the language is powerful, and the community is nice!

[+] mafribe|9 years ago|reply
I'd be interested to learn how web-dev in OCaml->JS compares to web-dev with Scala.js. The latter has the amazing JVM eco-system to hand.

(I'm not interested in a Scala vs Ocaml language comparison. I know both languages very well. I'd be interested in the quality of JS support.)

[+] nikofeyn|9 years ago|reply
better than clojure and f#?
[+] Dangeranger|9 years ago|reply
After learning Elm I wanted to understand ML/OCaml a bit more, so I worked through some documentation from the OCaml site and walked away pleasantly surprised.

After using it for a couple of weeks I am confused why ML/OCaml aren't more popular. They are safe, functional, stable, fast, and have great tooling. They seem poised to take over the functional domain.

While the syntax took a little getting used to ( emphasis on little) once you are used to it, it's very natural. Union types are wonderful, and the implicit type safety within programs was nice.

[+] throwaway7645|9 years ago|reply
Crappy windows support for OCaml. F# is similar to OCaml, but is very difficult for beginners and those not familiar with .NET. I also don't see a lot of beginner material for OCaml.
[+] nv-vn|9 years ago|reply
I think MLs are kind of in limbo in regards to functional vs. imperative and OO vs. procedural programming. While they offer good OO, it's utilized very little. The functional features are used a ton, but don't dare approach the complexity of Scala, Haskell, etc., which is disappointing to a lot of more advanced functional programmers. They have reasonably good facilities for imperative programming, but these are mostly frowned upon. The whole language is, to an extent, a compromise over various paradigms that are very nearly mutually exclusive outside of ML.
[+] akhilcacharya|9 years ago|reply
>After using it for a couple of weeks I am confused why ML/OCaml aren't more popular

For me, the issue is the GIL, although that is being worked on as we speak.

[+] lacampbell|9 years ago|reply
Ocaml is a great language. But the sad fact is I'm not going to be as productive in it as I would be in worse languages that have more libraries and tools.
[+] sitkack|9 years ago|reply
Because of package manager and library issues. Things aren't popular because of their inherent qualities or flaws. Popularity is driven by itself and external factors.
[+] rbehrends|9 years ago|reply
As a rule, functional programming languages tend to have trouble with gaining adoption. It's simply a programming paradigm that a great many programmers don't seem to be comfortable with.

Now, the ML family of languages (SML, OCaml, F#) is technically a family of multi-paradigm, functional-first languages, but that doesn't help with clearing the "functional" hurdle in popular perception.

[+] scythe|9 years ago|reply
Like C++ and Perl, there's a bit of caution required to avoid writing unreadable OCaml code. A disadvantage of pattern matching is that it's relatively easy to write a function where you define a variable and then first access it 500 lines later. IDE support for obscure languages is also often weak or requires extensions or idiosyncratic IDEs which are unfamiliar to many people (and often you want some kind of tool to collapse a 500-line function). OCaml has good vim plugins but not everyone uses vim.

That's my experience with it anyway.

[+] zem|9 years ago|reply
the toolchain was pretty damn bad for years. it's only very recently gotten better with the rise of opam, oasis and some decent build systems.
[+] ue_|9 years ago|reply
I've heard the concurrency situation with OCaml isn't very good, though I know nothing about that. Weirdly, this kind of bias and the idea that it doesn't have as many libraries for my particular domain (web programming) has put me off it, as much as I'd like to use it.

Using functional languages without Lisp-like macros will always be sort of weird to me, like I'm missing out on something.

[+] rbehrends|9 years ago|reply
I'll note that some of the aspects don't necessarily work out like that in practice:

1. The GC part is true, but one has to remember that this was written at a time when GC was still a bit of an unusual feature in mainstream languages.

2. Tail recursion doesn't really make much of a difference for walking trees, which is recursive, but (mostly) not tail recursive.

3. OCaml in particular uses 63/31-bit ints due to implementation details, which isn't a good fit for 64/32-bit integers. The strings and bignum part is mostly right, though.

4. ADTs can be good or bad for describing ASTs. Once you enrich ASTs with semantics shared by all variants (such as source coordinates), inheritance can become a better fit than ADTs.

8. Type inference doesn't really extend to module signatures, which you have to write out explicitly (though tooling such as `ocamlc -i` allows you to let the compiler help you write them). I also generally find it better to explicitly annotate functions with types. Not only does it make the code more readable later in its life, but you get fewer truly impenetrable type error messages because you forgot parentheses or a semicolon somewhere.

That said, there are several good points still.

[+] tom_mellior|9 years ago|reply
> 2. Tail recursion doesn't really make much of a difference for walking trees, which is recursive, but (mostly) not tail recursive.

Unless you, as the article notes, "know how to take advantage of it". Here's a fully tail-recursive binary tree traversal in OCaml:

    type 'a tree = Leaf of 'a | Branch of 'a tree * 'a tree

    let iter f tree =
      let rec iter_rec f worklist tree =
        match tree with
        | Leaf a ->
          (* Perform the action on this element. *)
          f a;
          (* Consult the worklist for more things to do. *)
          begin match worklist with
          | [] -> ()
          | next_tree::worklist' -> iter_rec f worklist' next_tree
          end
        | Branch (left, right) ->
          (* Visit the left subtree, save the right for visiting later. *)
          iter_rec f (right::worklist) left
      in
      iter_rec f [] tree
Usage example:

    let mytree =
      Branch (Branch (Leaf 1, Leaf 2),
              Branch (Leaf 3, Branch (Leaf 4, Leaf 5)))

    let () = iter (Printf.printf "%d\n") mytree
Yes, people do write traversals like this in OCaml, though with less verbosity than this example I whipped up.

> 3. OCaml in particular uses 63/31-bit ints due to implementation details, which isn't a good fit for 64/32-bit integers.

I think the article means here that you just use int for all the kinds of numerical identifiers that compilers give to things like instructions, basic blocks, pseudo-registers, etc., without doing the kind of micro-optimization that C++ programmers would do, guessing whether the number of blocks is safe to store in an unsigned short etc.

For representing constants from the program, which is what you seem to be referring to, the article does suggest using bignums, not OCaml's native ints.

[+] coolsunglasses|9 years ago|reply
>2. Tail recursion doesn't really make much of a difference for walking trees, which is recursive, but (mostly) not tail recursive.

Non-strictness helps here more than TCO in a strict language.

>4. ADTs can be good or bad for describing ASTs. Once you enrich ASTs with semantics shared by all variants (such as source coordinates), inheritance can become a better fit than ADTs.

Since this article was written we have better ways of augmenting/annotating ASTs. There's a lot of this out there, but here's one example: https://brianmckenna.org/blog/type_annotation_cofree

There are other alternatives that are like inheritance but with better reasoning properties as well. Finally tagless comes to mind.

>I also generally find it better to explicitly annotate functions with types.

This Haskeller whole-heartedly agrees for all the reasons stated.

[+] willtim|9 years ago|reply
ADTs and pattern matching are much more convenient and higher-level in practice than using OOP with inheritance. The visitor pattern, essentially just a fold, is the best one can do in an OOP language. With type-class abstractions and data type generic programming, the gap widens further.

In Haskell, my current FP language of choice, I can implement a complex transform such as Lambda lifting in a few 10's of lines of readable idiomatic code.

[+] naasking|9 years ago|reply
> 3. OCaml in particular uses 63/31-bit ints due to implementation details, which isn't a good fit for 64/32-bit integers

I'm not sure what "isn't a good fit" is supposed to mean.

[+] nv-vn|9 years ago|reply
The problem with this article is that it's missing an answer to why one would choose ML/OCaml over Haskell. Haskell has many more features, a more advanced type system, arguably superior syntax, and much better library support. However, I believe that OCaml/SML are often a better choice for a number of reasons.

First of all, OCaml/SML are the best choice in terms of example code for compilers. They're historically the choice of many compiler/interpreter/type theory texts (Types and Programming Languages, Modern Compiler Implementation in ML, and an ML is even used as a language to interpret in Essentials of Programming Languages). Andrej Bauer's PLZOO is also written in OCaml. Equally important is the fact that there are a variety of ML implementations, all of which are much more approachable than GHC. The OCaml compiler's codebase is a reasonable size that an individual could get a good idea of how it works in a few weeks or so. SMLNJ, MLKit, MLton, CakeML are all open source and on Github, and all seem to be fairly approachable in comparison to the monolith that is GHC. And that's not even mentioning other compiler in ML (Haxe, Rust's bootstrap compiler, Facebook's Hack compiler, etc.). The fact that there are real-world compilers with perfectly approachable code bases (even without great familiarity with the language; compilers in Haskell might require an in-depth understanding of many of the core type classes and language extensions available) that are open source is highly attractive to novice compiler writers.

Additionally, the feature set in MLs is a good choice for compilers. While they lack some of the cooler features of Haskell, MLs make up for it in simplicity; lots of the features in GHC's type system (especially with language extensions) mean very little for 90% of compiler writers, and getting rid of them from the get-go helps keep the code small and easy to reason about (even if you won't have as much type safety in the compiler itself). This also means that there are a lot less ways to do a single thing, which can be nice when you're not sure exactly how you're going to implement a certain feature. However, one thing I really find incredibly useful is OCaml's polymorphic variants. These are pretty much perfect for statically enforcing a nanopass-like system in your compiler and are a great way of designing your AST/data types in your compiler. I feel like this gets passed up a ton (as far as I know I'm the first person who's used them to create nanopasses), but it's quite convenient and makes OCaml a good competitor for Scheme in this regard.

[+] dleslie|9 years ago|reply
Also, Haskell is something of a soup of DSL-operators that require one spend significant time researching what they mean and what behaviour they induce. Even with such knowledge, it suffers from the Perl-ish woe of write-once and read-never.
[+] jroesch|9 years ago|reply
This article is from 1998, the landscape was much different. Haskell had a handful of the language features it has today, and was lacking many of the innovations in its runtime and libraries.
[+] throwaway7645|9 years ago|reply
Haskell may indeed be one of the most advanced languages out there in terms of raw power, but it is very complex (how many monad tutorials does it seriously take to teach one of the most core pieces of the language) and how much category theory do you need to know to be moderately effective? Also, the ecosystem could use some work. An example is the main string library isn't used in favor of a different one. Using the first and obvious one leads to performance worse than python and perl even after you compile. I'm being nitpicky, but anytime someone writes a blog post comparing a programming language to playing darksouls (game where you die thousands of times) I'd say you have an issue.
[+] ratmice|9 years ago|reply
> These are pretty much perfect for statically enforcing a nanopass-like system in your compiler and are a great way of designing your AST/data types in your compiler.

Nice, I was just asking about this on the nanopass list the other day, do you happen to have a publicly available example of this anywhere?

[+] vmasto|9 years ago|reply
For anyone interested and isn't aware yet Facebook is developing Reason, a layer over OCaml. I've been fiddling with it for the past couple of weeks and coming from JavaScript I personally found the experience generally enjoyable.

https://facebook.github.io/reason/

[+] mhink|9 years ago|reply
Hah! Thanks for the link- I discovered this gem just now in their list of comparisons to JS syntax:

  Javascript    | Reason
  --------------+----------------------------
  const x = y;  | let x = y;
  let x = y;    | reference cells
  var x = y;    | No equivalent (thankfully)
[+] adriancooney|9 years ago|reply
That was an extremely pleasant and convincing introduction, thank you for that. I'm definitely going to give this a try in future projects.
[+] StrykerKKD|9 years ago|reply
I agree that Ocaml is just extremely well suited for making new programig languages. If you are interested in Ocaml+programming languages check out the plzoo: http://plzoo.andrej.com/

I personally think that Ocaml is really good at this, because I started converting the Scheme examples from the PLAI book to Ocaml and it's just felt right(maybe because I'm not fan of the scheme syntax).

[+] chairmanwow|9 years ago|reply
Currently taking a compilers course that uses SML/NJ and it has been an absolute delight. The functional paradigm is a little strange to get used to at first, but after a while its strong suits make themselves known. The trivial type inferences and pattern matching capabilities make it easy to efficiently describe complicated and precise program situations.
[+] kornakiewicz|9 years ago|reply
I'm a relatively young developer (three years older than Java) and don't fully get the thing about exceptions (point 7). It sounds very familiar to the solution I know from Java, and it does not make safety - or even feeling about it - any better. If you can still write code that can throw exception and an explicit assurance about it is the only way to prevent the crash, it doesn't change anything, actually.

P.S. I'm not really familiar to ML/OCaml, but have decent experience with large code bases in languages that are not very keen to protect you from yourself.

[+] gizmo686|9 years ago|reply
Exceptions in ML languages are very similar to those in Java. The reason for that is simple: they are simply a good way of dealing with computations that might fail. Having said that, exceptions are best used (in any language) where you want to deal with the failure way up in the call stack. If you catch the exception right where it occurs, you should just use a safe method that reports a failure in the return value.

Speaking as a Haskell programmer, never use exceptions. You can get away with this advice because the Either monad allows you to have the behavior of exceptions (namely, at any point you can "fail" a computation and have the error automatically propagate up to the handler). However, this approach relies heavily on having a type system more advanced than OCaml's in order to be reasonable.

[+] agumonkey|9 years ago|reply
FP is based on recursion, and recursive types, inductive algorithms are essential for linguistic processing and transformation.
[+] hyperpallium|9 years ago|reply
How are simple parsers written in ML (or ocaml)?

You can't use the coding style used for recursive descent in the Dragon compiler book, without using mutable variables.

Do you have to use parser combinators, which have their own limitations?

[+] mafribe|9 years ago|reply
Parser combinators tend to work by recursive descent, so cannot handle left recursive grammars [1], and tend to be really slow. The latter is not a problem for many applications, but removing left recursion can be irritating even for small grammars. It is possible to build combinator parsers that can handle all context-free grammars [2], but I'm not sure any of Ocaml's are built that way.

In any case, Ocaml has parser generators that are fast, do bottom-up parsing (hence handle left-recursion without issue) and not based on parser combinators, e.g. ocamlyacc [3].

I'd use parser combinators for quick prototypes, and, if measurement shows a performance problem, replace them with a generated (e.g. by ocamlyacc) parser. As far as I remember the parser in Ocaml's (superfast) compiler is generated by ocamlyacc.

[1] https://en.wikipedia.org/wiki/Left_recursion

[2] T. Ridge, Simple, functional, sound and complete parsing for all context-free grammars. http://www.tom-ridge.com/resources/ridge11parsing-cpp.pdf

[3] https://caml.inria.fr/pub/docs/manual-ocaml/lexyacc.html

[+] ms013|9 years ago|reply
In production code we tend to use ocamlyacc or menhir. There is nothing about ocaml/ML that prohibits the use of the kind of parser generators one would expect in any other language.
[+] rubiquity|9 years ago|reply
You can do mutation in both Standard ML and OCaml. If I recall, Robert Harper's book on Standard ML starts off with a recursive descent parser.
[+] nv-vn|9 years ago|reply
The most common way is using a parser generator (ocamlyacc or menhir, both are basically interchangeable 99% of the time). Parser combinators are a choice, but honestly I think recursive descent parsers are more common than combinators in OCaml.
[+] jackmott|9 years ago|reply
OCaml has the potential to be great for almost everything with some work and a bigger ecosystems. F# as well (very similar). i can only imagine how great the world would be if a standard ml had accidentally become the web browser language and all that mind share had gone into evolving it and optimizing it and its tools.