Great post. Recently I wrote a Scheme interpreter in Haskell, following the guide "Write Yourself a Scheme" [1]. It was great fun, and it seems it would be fairly straightforward to write a new domain-specific language in Haskell using the same principles. That said, it was never clear to me how something like that would work in a language that isn't purely functional, so the fact that the OP used Rust is very exciting.
This is a great tutorial, although the one complaint I had was the errors in the PDF version (I can't remember the specifics); it took me a while to realise the HTML version was far superior.
Regardless of my complaints, it was an interesting experience.
I started making a language a few months ago. I found Douglas Crockford's parser from his Top Down Operator Precedence paper pretty useful, which is here: http://javascript.crockford.com/tdop/index.html
When looking for resources on making toy languages, most of them said not to bother with a parser and use a pre-built solution. I think they're probably wrong - the parser is actually pretty easy (relative to the rest of the process of getting a language working) and at a stretch, fun.
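For anyone on the fence, here is a rough sketch of how little code a Pratt/TDOP-style expression parser actually takes. This is my own toy version, not Crockford's code; the function names and token handling are invented for illustration:

```python
# Minimal Pratt (top-down operator precedence) expression parser.
# Tokens are whitespace-separated numbers, operators, and parens.

BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    ast, rest = parse_expr(tokens, 0)
    assert not rest, "trailing tokens"
    return ast

def parse_expr(tokens, min_bp):
    head, tokens = tokens[0], tokens[1:]
    if head == "(":
        left, tokens = parse_expr(tokens, 0)
        tokens = tokens[1:]          # drop the closing ")"
    else:
        left = int(head)
    while tokens and tokens[0] in BINDING_POWER:
        op = tokens[0]
        bp = BINDING_POWER[op]
        if bp < min_bp:
            break
        # consume the operator, parse the right side more tightly
        right, tokens = parse_expr(tokens[1:], bp + 1)
        left = (op, left, right)
    return left, tokens

print(parse(tokenize("1 + 2 * 3")))   # ('+', 1, ('*', 2, 3))
```

The `bp + 1` in the recursive call is what makes operators left-associative; precedence falls out of the `BINDING_POWER` table.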
I agree. The parsing step is distracting. I have TONS of questions, and all the tutorials spend their time on parsing, then stop at integer arithmetic, maybe vars, maybe if, maybe basic functions!

I'm doing one, and I'm only working on the AST. I took the idea from this blog:

http://www.trelford.com/blog/post/interpreter.aspx

and from http://www.itu.dk/people/sestoft/plc/. So I skip the parsing and go directly at "How do I do type checking?", "How do I map vars to the environment?", "How do I make the debugger?", "I can have an AST for ints and bools... now how do I let the user create their own types?", "How do I make macros possible? I'm not a Lisp!", "I'm on F#. How do I lift the stuff F# already has, so I don't build it again?", "What to do: Erlang actors or Go CSP, and how?", "How do I do pattern matching?", "How do I auto-instrument the interpreter with dtrace or similar, and why do I think this is good?", "Do I want to be a reactive language?", "I've decided the language will be relational. I strongly believe that is amazing. I don't know yet why, or how to prove it"... and a huge list like this.

I don't have a clean answer yet for some of these questions!

I also dropped the idea of making a compiler, and am doing an interpreter instead. That is another distraction. I suspect that if I later turn the interpreter into a bytecode VM, the step to full compilation will be simple; having to rewrite both the parser and the compiler before several questions about the full language are settled is not cool.
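The "skip the parser, start from the AST" approach can be surprisingly small. Here's a rough Python sketch of the idea (the node shapes and names are mine, invented for illustration); it also answers the environment question in the simplest possible way, with plain dicts:

```python
# Build ASTs directly as tuples and interpret them; no parser needed.
# Environments are plain dicts; "let" copies the dict for a new scope.

def eval_node(node, env):
    kind = node[0]
    if kind == "lit":                    # ("lit", 42)
        return node[1]
    if kind == "var":                    # ("var", "x")
        return env[node[1]]
    if kind == "let":                    # ("let", name, value, body)
        _, name, value, body = node
        inner = dict(env)                # new scope, outer env untouched
        inner[name] = eval_node(value, env)
        return eval_node(body, inner)
    if kind == "add":                    # ("add", left, right)
        return eval_node(node[1], env) + eval_node(node[2], env)
    if kind == "if":                     # ("if", cond, then, els)
        branch = node[2] if eval_node(node[1], env) else node[3]
        return eval_node(branch, env)
    raise ValueError(f"unknown node: {kind}")

# let x = 10 in (if x then x + 1 else 0)
prog = ("let", "x", ("lit", 10),
        ("if", ("var", "x"),
               ("add", ("var", "x"), ("lit", 1)),
               ("lit", 0)))
print(eval_node(prog, {}))   # 11
```

Every one of the hard questions (types, debuggers, macros) then becomes a new pass or a new node kind over this tree, with no grammar in the way.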
Yes! I write my own recursive descent parsers because my programming languages tend to have incremental execution stories to them (you can see an example at http://research.microsoft.com/en-us/people/smcdirm/managedti...). And writing a recursive descent parser really isn't that hard; plus, it gives you more flexibility in tweaking error recovery. It also turns out that auto-generated parsers are rarely used in industry (Java is an exception), given performance and error-recovery concerns.
On the topic of precedence parsing, they have some really good properties in an incremental context: you can start parsing anywhere and the results will be the same. The problem is expressiveness, and while I came up with some hacks to push them beyond operators, pure recursive descent is much easier to work with (as I learned from Martin Odersky's scalac compiler).
For me the big advantage of using a parser generator is that they let you write things a bit more declaratively. Bottom up parser generators also have the advantage of being able to recognize some grammars that top down parsers can't and they also detect grammar ambiguities at compile time.
That said, handwritten top down parsers are very tweakable and flexible which can be a big plus. In particular, error recovery and things like semicolon insertion. You also get more control over code size and parser performance.
I think it's more flexible and maintainable this way. I didn't have to wrestle with odd grammar rules. Implementing the JS "automatic semicolon insertion" with parser generator tools seems like it would have been a cluster headache.
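In a hand-rolled lexer, a Go-style version of that rule is a few lines. This is a rough Python sketch with a made-up token set, not the actual JS ASI algorithm (which is more involved): insert a ";" at a newline only when the previous token could end a statement.

```python
import re

# Go-style rule: at a newline, insert ";" if the previous token
# could end a statement (an identifier, a number, ")" or "}").
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[(){}+=;]|\n")
ENDS_STATEMENT = re.compile(r"[A-Za-z_]\w*$|\d+$|[)}]$")

def lex(src):
    tokens = []
    for tok in TOKEN.findall(src):
        if tok == "\n":
            if tokens and ENDS_STATEMENT.match(tokens[-1]):
                tokens.append(";")
        else:
            tokens.append(tok)
    return tokens

print(lex("x = 1\ny = (2 +\n3)\n"))
# ['x', '=', '1', ';', 'y', '=', '(', '2', '+', '3', ')', ';']
```

Because the newline after "+" doesn't follow a statement-ending token, the multi-line expression survives intact; try expressing that cleanly in a grammar file.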
The more you do yourself, the more you will learn. If you're just learning and not trying to build the next big programming language, try and do some of the low level stuff yourself. Write a parser, write a garbage collector and a virtual machine, re-write the compiler in the language itself. Go hog wild.
Nice article; it inspired me to invest some time in learning Rust.
I'm also working on a native JavaScript compiler, plus a new language on top of it as a subset of JavaScript: https://github.com/InfiniteFoundation/dopple (the front-page info is outdated, though).
After writing a lexer, parser, compiler and virtual machine in C++ and porting parts to Python, Java and FreeBASIC, I think this is a good starting point.

Most of the concepts are pretty simple - it's when you get into if/while/for statements that things get harder, especially nested ones. They're conceptually easy to implement using recursion, but keeping track of where you are during the compilation phase can be... difficult. I'll be the first to admit my implementation sucked, but it worked.
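The standard trick for "where am I" in nested control flow is backpatching: emit a placeholder for each forward jump and fill it in once the target address is known. A rough Python sketch with an invented toy instruction set:

```python
# Compile nested if/while ASTs to a flat instruction list,
# backpatching forward jumps once their targets are known.

def compile_node(node, code):
    kind = node[0]
    if kind == "emit":                        # ("emit", instr) leaf
        code.append(node[1])
    elif kind == "if":                        # ("if", cond, body)
        code.append(("test", node[1]))
        jump_at = len(code)
        code.append(None)                     # placeholder
        for child in node[2]:
            compile_node(child, code)
        code[jump_at] = ("jump_if_false", len(code))   # backpatch
    elif kind == "while":                     # ("while", cond, body)
        top = len(code)
        code.append(("test", node[1]))
        jump_at = len(code)
        code.append(None)
        for child in node[2]:
            compile_node(child, code)
        code.append(("jump", top))            # back to the test
        code[jump_at] = ("jump_if_false", len(code))
    return code

prog = ("while", "i < 10", [
           ("if", "i % 2", [("emit", "print_odd")]),
           ("emit", "incr_i"),
       ])
for addr, ins in enumerate(compile_node(prog, [])):
    print(addr, ins)
```

Nesting comes for free: each recursive call patches only its own placeholder, so the bookkeeping never leaks between levels.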
I've been working on a Python-like systems language with a compiler written in Python targeting LLVM IR, and it's been cool (though it's so much work to get a halfway usable language!). https://github.com/djc/runa if anyone's interested.
Did you create an interpreter for the language before you started on the compiler? I think it's not much overhead and you can run tests with it to make sure you get all the lexing and parsing done correctly. Then you just replace the evaluation with code generation and BAM! you've got a compiler.
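The "replace evaluation with code generation" step really is that mechanical: both walks have the same shape, only the leaf actions differ. A rough sketch over a toy AST of my own (the generator here just emits Python-ish expression strings, standing in for real codegen):

```python
# One AST, two back ends: a direct evaluator and a code generator.
# Once the evaluator passes your tests, the generator reuses the
# same tree shape; only what happens at each node changes.

def evaluate(node, env):
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "var":
        return env[node[1]]
    op, left, right = node            # ("+", l, r) or ("*", l, r)
    l, r = evaluate(left, env), evaluate(right, env)
    return l + r if op == "+" else l * r

def codegen(node):
    kind = node[0]
    if kind == "lit":
        return str(node[1])
    if kind == "var":
        return node[1]
    op, left, right = node
    return f"({codegen(left)} {op} {codegen(right)})"

ast = ("*", ("+", ("var", "x"), ("lit", 2)), ("lit", 3))
print(evaluate(ast, {"x": 4}))   # 18
print(codegen(ast))              # ((x + 2) * 3)
```

The interpreter doubles as an oracle: run both back ends on the same programs and check that executing the generated code agrees with the evaluator.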
I would argue that this statement is a mistake that should have been foreseen:
"negative integers isn't literals!"
Specifically, two's-complement signed integers will be a bitch if you continue down that path.
Actually, it's an incredibly good idea, because you can then just implement constant folding on the AST, so that (unary negate) 1 turns into an actual -1, along with a lot of other good optimizations.
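Concretely, the folding pass is a bottom-up tree walk; the unary-negate case is exactly where the "no negative literals" choice pays off. A rough Python sketch with invented node shapes:

```python
# Fold constant subtrees bottom-up: ("neg", ("lit", 1)) becomes
# ("lit", -1), and constant arithmetic collapses the same way.

def fold(node):
    if node[0] in ("lit", "var"):
        return node                       # leaves are already folded
    if node[0] == "neg":
        inner = fold(node[1])
        if inner[0] == "lit":
            return ("lit", -inner[1])     # the -1 the parser never made
        return ("neg", inner)
    op, left, right = node                # ("+", l, r) or ("*", l, r)
    left, right = fold(left), fold(right)
    if left[0] == "lit" and right[0] == "lit":
        value = left[1] + right[1] if op == "+" else left[1] * right[1]
        return ("lit", value)
    return (op, left, right)

print(fold(("neg", ("lit", 1))))                     # ('lit', -1)
print(fold(("+", ("neg", ("lit", 2)), ("lit", 5))))  # ('lit', 3)
print(fold(("*", ("var", "x"), ("lit", 3))))         # unchanged
```

Non-constant subtrees pass through untouched, so the same pass safely handles mixed expressions.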
The design of Swift was influenced by Rust. However, most of the resemblances are superficial or relate only to areas where Rust in turn was heavily influenced by other languages. For example, Rust's memory management model could not work in Swift (because this is an area where Swift remains compatible with Objective-C).
I started one a while ago, but I became fascinated by the "middle end" and bought a compiler textbook. Now I have SSA optimizations but still no strings.
Very interesting read. I am myself writing a series on the same subject and I was pretty inspired by your way of writing. Your article is paced a lot faster than what I have written so far and I really like the style.
I submitted my own posts to HN as well, but I did not get the same attention as you did. Congratulations! I'm looking forward to the next iteration ;)
This was a really cool read :). I'm working on my own DSL for generating code. I'll make a Show HN soon, but it was really cool to see someone else doing something so similar.
Whitespace sensitivity is a pretty big pain to deal with in the current toolkits available for building out a programming language, unfortunately.
That pain is relative to the absolute and total simplicity of defining a grammar in a whitespace-insensitive setting (deterministic grammars are easy to express in parser generators and efficient to parse).

We have a lot of good parsing tools out there, but we might be missing the right tools for building out good lexers.
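The usual fix is a small layout pass in the lexer that turns leading whitespace into explicit INDENT/DEDENT tokens, after which ordinary grammar tools work again. A rough Python sketch of the Python-style stack-of-levels approach (names are mine):

```python
# Turn leading whitespace into explicit INDENT/DEDENT tokens,
# Python-style, using a stack of indentation levels.

def layout(lines):
    tokens, stack = [], [0]
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped:
            continue                      # blank lines don't change layout
        width = len(line) - len(stripped)
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(stripped)
    while stack[-1] > 0:                  # close any open blocks at EOF
        stack.pop()
        tokens.append("DEDENT")
    return tokens

src = ["if x:", "    y = 1", "    if y:", "        z = 2", "done = 1"]
print(layout(src))
```

A real lexer would also reject dedents that don't match any open level, but this is the whole core of the technique; the parser downstream never sees whitespace at all.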
It would be cool if we could add a "Rust" tag to this entry. I almost didn't give this link a second look because I thought it was an older link. Glad I did give it a second look.
[1] http://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_H...
The exact wording is: "They must start with a letter, end with a letter or digit, and have as interior characters only letters, digits, and hyphen."
[1]: http://www.ietf.org/rfc/rfc1034.txt
I just added a custom domain/CNAME to my github pages so it should be reachable at http://blog.ppelgren.se
Are you perhaps using a browser or a DNS stack that rejects this type of domain name?
A lot of language enthusiasts just winced.
It is actually a transcompilation to Bash from a functional language, using Python for the intermediate processing.
It would be cool to have a DSL for generating low-level code like x86 assembly, LLVM IR or JVM bytecode. That would be so meta!