Great post. Recently I wrote a Scheme interpreter in Haskell, following the guide "Write Yourself a Scheme" [1]. It was great fun, and it seems it would be fairly straightforward to write a new domain-specific language in Haskell using the same principles. That said, it was never clear to me how something like that would work in a language that isn't purely functional, so the fact that the OP used Rust is very exciting.
This is a great tutorial, although the one complaint I had was the errors in the PDF version (I can't remember the specifics); it took me a while to realise the HTML version was far superior.
Regardless of my complaints, it was an interesting experience.
I started making a language a few months ago. I found Douglas Crockford's parser from his Top Down Operator Precedence paper pretty useful, which is here: http://javascript.crockford.com/tdop/index.html
When looking for resources on making toy languages, most of them said not to bother with a parser and use a pre-built solution. I think they're probably wrong - the parser is actually pretty easy (relative to the rest of the process of getting a language working) and at a stretch, fun.
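For anyone on the fence, here is a rough sketch of how little code a Pratt/TDOP-style expression parser actually takes. This is my own toy version, not Crockford's code; the function names and token handling are invented for illustration:

```python
# Minimal Pratt (top-down operator precedence) expression parser.
# Tokens are whitespace-separated numbers, operators, and parens.

BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    ast, rest = parse_expr(tokens, 0)
    assert not rest, "trailing tokens"
    return ast

def parse_expr(tokens, min_bp):
    head, tokens = tokens[0], tokens[1:]
    if head == "(":
        left, tokens = parse_expr(tokens, 0)
        tokens = tokens[1:]          # drop the closing ")"
    else:
        left = int(head)
    while tokens and tokens[0] in BINDING_POWER:
        op = tokens[0]
        bp = BINDING_POWER[op]
        if bp < min_bp:
            break
        # consume the operator, parse the right side more tightly
        right, tokens = parse_expr(tokens[1:], bp + 1)
        left = (op, left, right)
    return left, tokens

print(parse(tokenize("1 + 2 * 3")))   # ('+', 1, ('*', 2, 3))
```

The `bp + 1` in the recursive call is what makes operators left-associative; precedence falls out of the `BINDING_POWER` table.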
I agree. The parsing step is distracting. I have TONS of questions, and all the tutorials spend their time on parsing, then stop at integer arithmetic, maybe vars, maybe if, maybe basic functions!

I'm doing one, and I'm only working on the AST. I took the idea from this blog:

http://www.trelford.com/blog/post/interpreter.aspx

and from http://www.itu.dk/people/sestoft/plc/. So I skip the parsing and go directly at "How do I do type checking?", "How do I map vars to the environment?", "How do I make the debugger?", "I can have an AST for ints and bools... now how do I let the user create their own types?", "How do I make macros possible? I'm not a Lisp!", "I'm on F#. How do I lift the stuff F# already has, so I don't build it again?", "What to do: Erlang actors or Go CSP, and how?", "How do I do pattern matching?", "How do I auto-instrument the interpreter with dtrace or similar, and why do I think this is good?", "Do I want to be a reactive language?", "I've decided the language will be relational. I strongly believe that is amazing. I don't know yet why, or how to prove it"... and a huge list like this.

I don't have a clean answer yet for some of these questions!

I also dropped the idea of making a compiler, and am doing an interpreter instead. That is another distraction. I suspect that if I later turn the interpreter into a bytecode VM, the step to full compilation will be simple; having to rewrite both the parser and the compiler before several questions about the full language are settled is not cool.
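The "skip the parser, start from the AST" approach can be surprisingly small. Here's a rough Python sketch of the idea (the node shapes and names are mine, invented for illustration); it also answers the environment question in the simplest possible way, with plain dicts:

```python
# Build ASTs directly as tuples and interpret them; no parser needed.
# Environments are plain dicts; "let" copies the dict for a new scope.

def eval_node(node, env):
    kind = node[0]
    if kind == "lit":                    # ("lit", 42)
        return node[1]
    if kind == "var":                    # ("var", "x")
        return env[node[1]]
    if kind == "let":                    # ("let", name, value, body)
        _, name, value, body = node
        inner = dict(env)                # new scope, outer env untouched
        inner[name] = eval_node(value, env)
        return eval_node(body, inner)
    if kind == "add":                    # ("add", left, right)
        return eval_node(node[1], env) + eval_node(node[2], env)
    if kind == "if":                     # ("if", cond, then, els)
        branch = node[2] if eval_node(node[1], env) else node[3]
        return eval_node(branch, env)
    raise ValueError(f"unknown node: {kind}")

# let x = 10 in (if x then x + 1 else 0)
prog = ("let", "x", ("lit", 10),
        ("if", ("var", "x"),
               ("add", ("var", "x"), ("lit", 1)),
               ("lit", 0)))
print(eval_node(prog, {}))   # 11
```

Every one of the hard questions (types, debuggers, macros) then becomes a new pass or a new node kind over this tree, with no grammar in the way.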
Yes! I write my own recursive descent parsers because my programming languages tend to have incremental execution stories to them (you can see an example at http://research.microsoft.com/en-us/people/smcdirm/managedti...). And writing a recursive descent parser really isn't that hard; plus, it gives you more flexibility in tweaking error recovery. It also turns out that auto-generated parsers are rarely used in industry (Java is an exception), given performance and error-recovery concerns.
On the topic of precedence parsing, they have some really good properties in an incremental context: you can start parsing anywhere and the results will be the same. The problem is expressiveness, and while I came up with some hacks to push them beyond operators, pure recursive descent is much easier to work with (as I learned from Martin Odersky's scalac compiler).
For me the big advantage of using a parser generator is that they let you write things a bit more declaratively. Bottom up parser generators also have the advantage of being able to recognize some grammars that top down parsers can't and they also detect grammar ambiguities at compile time.
That said, handwritten top down parsers are very tweakable and flexible which can be a big plus. In particular, error recovery and things like semicolon insertion. You also get more control over code size and parser performance.
I think it's more flexible and maintainable this way. I didn't have to wrestle with odd grammar rules. Implementing the JS "automatic semicolon insertion" with parser generator tools seems like it would have been a cluster headache.
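In a hand-rolled lexer, a Go-style version of that rule is a few lines. This is a rough Python sketch with a made-up token set, not the actual JS ASI algorithm (which is more involved): insert a ";" at a newline only when the previous token could end a statement.

```python
import re

# Go-style rule: at a newline, insert ";" if the previous token
# could end a statement (an identifier, a number, ")" or "}").
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[(){}+=;]|\n")
ENDS_STATEMENT = re.compile(r"[A-Za-z_]\w*$|\d+$|[)}]$")

def lex(src):
    tokens = []
    for tok in TOKEN.findall(src):
        if tok == "\n":
            if tokens and ENDS_STATEMENT.match(tokens[-1]):
                tokens.append(";")
        else:
            tokens.append(tok)
    return tokens

print(lex("x = 1\ny = (2 +\n3)\n"))
# ['x', '=', '1', ';', 'y', '=', '(', '2', '+', '3', ')', ';']
```

Because the newline after "+" doesn't follow a statement-ending token, the multi-line expression survives intact; try expressing that cleanly in a grammar file.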
The more you do yourself, the more you will learn. If you're just learning and not trying to build the next big programming language, try and do some of the low level stuff yourself. Write a parser, write a garbage collector and a virtual machine, re-write the compiler in the language itself. Go hog wild.
Nice article; it inspired me to invest some time in learning Rust.
I'm also working on a native JavaScript compiler, plus a new language on top of it as a subset of JavaScript: https://github.com/InfiniteFoundation/dopple (the front-page info is outdated, though).
After writing a lexer, parser, compiler and virtual machine in C++ and porting parts to Python, Java and FreeBASIC, I think this is a good starting point.

Most of the concepts are pretty simple - it's when you get into if/while/for statements that things get harder, especially nested ones. They're conceptually easy to implement using recursion, but keeping track of where you are during the compilation phase can be... difficult. I'll be the first to admit my implementation sucked, but it worked.
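The standard trick for "where am I" in nested control flow is backpatching: emit a placeholder for each forward jump and fill it in once the target address is known. A rough Python sketch with an invented toy instruction set:

```python
# Compile nested if/while ASTs to a flat instruction list,
# backpatching forward jumps once their targets are known.

def compile_node(node, code):
    kind = node[0]
    if kind == "emit":                        # ("emit", instr) leaf
        code.append(node[1])
    elif kind == "if":                        # ("if", cond, body)
        code.append(("test", node[1]))
        jump_at = len(code)
        code.append(None)                     # placeholder
        for child in node[2]:
            compile_node(child, code)
        code[jump_at] = ("jump_if_false", len(code))   # backpatch
    elif kind == "while":                     # ("while", cond, body)
        top = len(code)
        code.append(("test", node[1]))
        jump_at = len(code)
        code.append(None)
        for child in node[2]:
            compile_node(child, code)
        code.append(("jump", top))            # back to the test
        code[jump_at] = ("jump_if_false", len(code))
    return code

prog = ("while", "i < 10", [
           ("if", "i % 2", [("emit", "print_odd")]),
           ("emit", "incr_i"),
       ])
for addr, ins in enumerate(compile_node(prog, [])):
    print(addr, ins)
```

Nesting comes for free: each recursive call patches only its own placeholder, so the bookkeeping never leaks between levels.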
I've been working on a Python-like systems language with a compiler written in Python targeting LLVM IR, and it's been cool (though it's so much work to get a halfway usable language!). https://github.com/djc/runa if anyone's interested.
Did you create an interpreter for the language before you started on the compiler? I think it's not much overhead and you can run tests with it to make sure you get all the lexing and parsing done correctly. Then you just replace the evaluation with code generation and BAM! you've got a compiler.
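The "replace evaluation with code generation" step really is that mechanical: both walks have the same shape, only the leaf actions differ. A rough sketch over a toy AST of my own (the generator here just emits Python-ish expression strings, standing in for real codegen):

```python
# One AST, two back ends: a direct evaluator and a code generator.
# Once the evaluator passes your tests, the generator reuses the
# same tree shape; only what happens at each node changes.

def evaluate(node, env):
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "var":
        return env[node[1]]
    op, left, right = node            # ("+", l, r) or ("*", l, r)
    l, r = evaluate(left, env), evaluate(right, env)
    return l + r if op == "+" else l * r

def codegen(node):
    kind = node[0]
    if kind == "lit":
        return str(node[1])
    if kind == "var":
        return node[1]
    op, left, right = node
    return f"({codegen(left)} {op} {codegen(right)})"

ast = ("*", ("+", ("var", "x"), ("lit", 2)), ("lit", 3))
print(evaluate(ast, {"x": 4}))   # 18
print(codegen(ast))              # ((x + 2) * 3)
```

The interpreter doubles as an oracle: run both back ends on the same programs and check that executing the generated code agrees with the evaluator.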
I would argue that this statement is a mistake that should have been foreseen:
"negative integers isn't literals!"
Specifically, two's-complement signed integers will be a bitch if you continue down that path.
Actually, it's an incredibly good idea, because you can then just implement constant folding on the AST, so that (unary negate) 1 turns into an actual -1, along with a lot of other good optimizations.
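Concretely, the folding pass is a bottom-up tree walk; the unary-negate case is exactly where the "no negative literals" choice pays off. A rough Python sketch with invented node shapes:

```python
# Fold constant subtrees bottom-up: ("neg", ("lit", 1)) becomes
# ("lit", -1), and constant arithmetic collapses the same way.

def fold(node):
    if node[0] in ("lit", "var"):
        return node                       # leaves are already folded
    if node[0] == "neg":
        inner = fold(node[1])
        if inner[0] == "lit":
            return ("lit", -inner[1])     # the -1 the parser never made
        return ("neg", inner)
    op, left, right = node                # ("+", l, r) or ("*", l, r)
    left, right = fold(left), fold(right)
    if left[0] == "lit" and right[0] == "lit":
        value = left[1] + right[1] if op == "+" else left[1] * right[1]
        return ("lit", value)
    return (op, left, right)

print(fold(("neg", ("lit", 1))))                     # ('lit', -1)
print(fold(("+", ("neg", ("lit", 2)), ("lit", 5))))  # ('lit', 3)
print(fold(("*", ("var", "x"), ("lit", 3))))         # unchanged
```

Non-constant subtrees pass through untouched, so the same pass safely handles mixed expressions.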
The design of Swift was influenced by Rust. However, most of the resemblances are superficial or relate only to areas where Rust in turn was heavily influenced by other languages. For example, Rust's memory management model could not work in Swift (because this is an area where Swift remains compatible with Objective-C).
I started one a while ago, but I became fascinated by the "middle end" and bought a compiler textbook. Now I have SSA optimizations but still no strings.
Very interesting read. I am myself writing a series on the same subject and I was pretty inspired by your way of writing. Your article is paced a lot faster than what I have written so far and I really like the style.
I submitted my own posts to HN as well, but I did not get the same attention as you did. Congratulations! I'm looking forward to the next iteration ;)
This was a really cool read :). I'm working on my own DSL for generating code. I'll make a Show HN soon, but it was really cool to see someone else doing something so similar.
Whitespace sensitivity is a pretty big pain to deal with in the current toolkits available for building out a programming language, unfortunately.
That pain is relative to the absolute and total simplicity of defining a grammar in a whitespace-insensitive setting (deterministic grammars are easy to express in parser generators and efficient to parse).

We have a lot of good parsing tools out there, but we might be missing the right tools for building out good lexers.
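The usual fix is a small layout pass in the lexer that turns leading whitespace into explicit INDENT/DEDENT tokens, after which ordinary grammar tools work again. A rough Python sketch of the Python-style stack-of-levels approach (names are mine):

```python
# Turn leading whitespace into explicit INDENT/DEDENT tokens,
# Python-style, using a stack of indentation levels.

def layout(lines):
    tokens, stack = [], [0]
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped:
            continue                      # blank lines don't change layout
        width = len(line) - len(stripped)
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(stripped)
    while stack[-1] > 0:                  # close any open blocks at EOF
        stack.pop()
        tokens.append("DEDENT")
    return tokens

src = ["if x:", "    y = 1", "    if y:", "        z = 2", "done = 1"]
print(layout(src))
```

A real lexer would also reject dedents that don't match any open level, but this is the whole core of the technique; the parser downstream never sees whitespace at all.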
It would be cool if we could add a "Rust" tag to this entry. I almost didn't give this link a second look because I thought it was an older link. Glad I did give it a second look.
[1] http://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_H...
The exact wording is: "They must start with a letter, end with a letter or digit, and have as interior characters only letters, digits, and hyphen."
[1]: http://www.ietf.org/rfc/rfc1034.txt
I just added a custom domain/CNAME to my github pages so it should be reachable at http://blog.ppelgren.se
Are you perhaps using a browser or a DNS stack that rejects this type of domain name?
A lot of language enthusiasts just winced.
It is actually a transcompilation to Bash from a functional language, using Python for the intermediate processing.
It would be cool to have a DSL for generating low-level code like x86 assembly, LLVM IR or JVM bytecode. That would be so meta!