My Most Important Project Was a Bytecode Interpreter

robertelder|9 years ago

One of the moments where I really started to feel like I was starting to 'see the matrix' was when I was working on a regex engine to try to make my compiler faster (it didn't, but that's another story). The asymptotically fast way to approach regex processing actually involves writing a parser to process the regex, so in order to write a fast compiler, you need to write another fast compiler to process the regexes that will process the actual programs that you write. But, if your regexes get complex, you should really write a parser to parse the regexes that will parse the actual program. This is where you realize that it's parsers all the way down.

When you think more about regexes this way, you realize that a regex is just a tiny description of a virtual machine (or emulator) that can process the simplest of instructions (check for 'a', accept '0-9', etc.). Each step in the regex is just a piece of bytecode that can execute, and if you turn a regex on its side you can visualize it as just a simple assembly program.

wyager|9 years ago

I mean, the whole point of regexes (not to be confused with PCREs) is that any given regex is isomorphic to some canonical finite state machine. It is, specifically speaking, a tiny description of an FSM over the alphabet of ASCII characters (or whatever charset you're using).

Interestingly, regexes/FSMs are (IIRC) the most powerful class of machines for which equivalence is decidable. So if you give me any two regexes, I can tell you if they match on all the same strings, but this is not true for any more powerful grammar.

signa11|9 years ago

> When you think more about regexes this way, you realize that a regex is just a tiny description of a virtual machine (or emulator) that can process the simplest of instructions (check for 'a', accept '0-9', etc.). Each step in the regex is just a piece of bytecode that can execute, and if you turn a regex on its side you can visualize it as just a simple assembly program.

which is _exactly_ what bpf does :)

posterboy|9 years ago

A regular expression is by definition a regular grammar, the least mighty language in the chomsky hierarchy and realizable by finite state automaton.

You can get something similar to a virtual equivalent of a turing complete language, due to the fact that the tape of the turing machine is, contrary to the hypothetical theory, practically limit by recourses. CPUs are finite state automatons, yet we use them to solve problems of sufficiently low complexity in NP space.

I am not an expert, please correct me if I am wrong.

chrisseaton|9 years ago

I don't get it - what do regexps have to do with compilers and how do they make compilers faster?

Lanzaa|9 years ago

If you spent much time looking at regexs you might enjoy this set of articles about improving regex engine performance:

https://swtch.com/~rsc/regexp/regexp1.html

sillysaurus3|9 years ago

Also: a software rasterizer.

Most people refuse to write one because it's so easy not to. Why bother?

It will make you a better coder for the rest of your life.

Let's make a list of "power projects" like this. A bytecode interpreter, a software rasterizer... What else?

Jasper_|9 years ago

Tracks I've done and suggested to friends and colleagues as learning experiences:

* Compression (lossless, lossy, image, audio, texture, video)

* Languages (bytecode interpreter, AST interpreter, parser/lexer for a simple language, simple JIT, understanding instruction scheduling)

* DSP programming (writing programs for fast, branchless math)

* Comfort with binary and binary formats (start with packfiles .zip/.tar, move onto reverse engineering simple formats for e.g. games)

* Understanding the difference between RAM and address spaces (e.g. understanding virtual memory, mmap, memory-mapped IO, dynamic linking, the VDSO, page faulting, shared memory)

* Device drivers (easier on Linux, understanding userspace/kernel interaction, ioctls, how hardware and registers work, how to read spec sheets and hardware manuals)

* Graphics (modern software rasterizer that's not scanline-based, understanding 3D and projective transforms, GPU programming and shaders, basic lighting (reflection and illumination) models, what "GPU memory" is, what scanout is, how full scenes are accumulated all along the stack)

I could go heavily into depth for any one of these. Ask me questions if you're interested! They're all fun and I always have more to learn.

Also, the longer you get into any given track, the more you realize they all connect in the end.

shermanyo|9 years ago

A basic game engine! I was exposed to so many concepts over time, building on a code base I understood from the ground up. It was the first time my code (c++ no less!) felt completely deterministic, that I understood every piece of data down to the byte level, heap/stack/gpu locality at any point of execution, and especially the lifecycle and memory management.

If anyone is interested in creating an indie/hobby game engine, I'd recommend the following:

- game loop with independent render FPS and a constant simulation tick delta (otherwise it runs non-deterministic due to floating point) - "entities, components, systems" architecture (ECS) - data-driven content (load level and game object data from JSON or similar, and programmatically build your scene) - basic event or messaging system

Honestly, I can't think of a field that covers as wide a range of CS topics at the depth I've seen in lower level game development.

If you focus on understanding and implementing the popular patterns and algorithms recommended by the indie community, its not the most daunting of projects. There's so much great information, and a few good books that break down and walk through a working implementation you can build on once you understand it from the ground up.

zeta0134|9 years ago

I really recommend a raytracer, especially to anyone interested in graphics. It's straightforward, powerful, infinitely expandable with optional features, and opens up a ton of discussion about performance, code complexity, and general organization. Plus it's fun, in an instant gratification kind of way.

richdougherty|9 years ago

* A database (transaction, lock managers, buffer/IO management, etc).

* An MVC web framework.

* A GUI validation library.

* A JavaScript UI library.

* An iteratee implementation.

* An Erlang-style actors library in another language.

* Implementing for-yield, async-await, etc on top of delimited continuations.

* Interpreters, typecheckers, code generators.

* Some of an ECMAScript implementation.

* Concurrent data structures: futures/promises, queues, etc.

* A simple roguelike game.

spotman|9 years ago

A distributed database.

Same category of probably not a great idea to use and claim is better than what's out there ( it won't be ) but boy howdy will it teach you about distributed systems and pitfalls that will absolutely help you when you are working on one at your day job.

kovrik|9 years ago

Your own Lisp-like language (or Scheme implementation). Really recommend it.

keithnz|9 years ago

Your own multi tasking OS. Very common to roll your own everything on embedded systems. Especially back in the early 80s/90s

c-smile|9 years ago

Hmm, rasterizer was just a starting point for me on the way to understand "how browser works".

Having just a rasterizer is like naked VM without bytecode compiler.

So I decided to add HTML/CSS engines to it.

But if you have HTML/CSS then why not to add scripting to it? So added VM executing bytecodes, GC and compiler producing those bytecodes. Having VM I thought that it would be cool to have built-in persistence in it - you need to persist UI state somehow, right? So it got integrated NoSQL database on board.

That's pretty much how http://sciter.com was born.

usrusr|9 years ago

Reading from a raw filesystem dump was both interesting and rewarding as it works on a problem domain that exists in reality, not just on some simplified playground. If you target something as primitive as FAT, the challenge isn't terribly high.

0xcde4c3db|9 years ago

A screen-oriented text editor with undo and search/replace. Can you make it handle a 10MB file without "simple" operations having annoying delays? How about 100MB? 1GB? 100GB? With no line breaks in the file?

tptacek|9 years ago

Strong agree on emulator and particularly stack vm. Would add: TCP/IP stack.

bonoboTP|9 years ago

* Implementing Internet protocols (TCP, IP, SMTP etc)

* Floating point manipulation and numerical computation

* Rolling some own crypto (strictly for fun use)

skybrian|9 years ago

A simple search engine for a directory of documents. Create the index yourself.

voltagex_|9 years ago

I've messed around with a DNS client and DNS server for the same reasons, but I don't think they quite meet the bar for a "power" project.

skybrian|9 years ago

A chat server. One chat room. Make sure you can handle fast and slow clients, and disconnects.

skybrian|9 years ago

A web server. Use a subset of HTTP 1.0 and make the browser serve pages.

Johnny_Brahms|9 years ago

By far the project that teaches me the most was writing an inliner for a toy language we developed as a part of a course at uni.

Most people don't really realise that the more complex an inliner gets, the more similarities it gets to a rabid animal you can just barely restrain on a leash.

My prof said he had laughed at my "how hard can it be?"-attitude . I worked my ass off and ended up being asked whether he could use the inliner in a language often mentioned on HN today :)

The inliner was superseded by a much better one in 2003 written by a much smarter person than I, though.

prirun|9 years ago

An emulator for a real chip, like a 68000. I wrote one for a Prime minicomputer (Prime died in the early 90's). Telnet to em.prirun.com on port 8001 to try it out!

The advantage of emulating a real system/chip is that if you can find old software for it, you avoid the step of having to invent your own programs: real, running programs already exist.

titzer|9 years ago

A virtual memory system for a kernel.

Johnny_Brahms|9 years ago

[deleted]

tominous|9 years ago

I love the author's meta-idea of refusing to accept that unfamiliar things are black boxes full of magic that can't be touched.

A great example of this mindset is the guy who bought a mainframe. [1]

Refuse to be placed in a silo. Work your way up and down the stack and you'll be much better placed to solve problems and learn from the patterns that repeat at all levels.

[1] https://news.ycombinator.com/item?id=11376711

stephengillie|9 years ago

Everything is made from smaller components. Understand each of those components better and you'll understand the entire system better.

Sometimes, you can use end-errors to tell which component has the issue. For instance, if a web site gives a 502 error, the problem is likely with the load balancer or lower network stack on the web server. 404 would often be a file system level issue on the web server. 500 is frequently a network issue between web server and database server. 400 is a problem with the site presentation code, or maybe database malforming addresses.

wahern|9 years ago

Two approaches are severely underused in the software world:

1) Domain-specific languages (DSLs)

2) Virtual machines (or just explicit state machines more generally)

What I mean is, alot of problems could be solved cleanly, elegantly, more safely, and more powerfully by using one (or both) of the above. The problem is that when people think DSL or VM, they think big (Scheme or JVM) instead of thinking small (printf). A DSL or VM doesn't need to be complex; it could be incredibly simple but still be immensely more powerful than coding a solution directly in an existing language using its constructs and APIs.

Case in point: the BSD hexdump(1) utility. POSIX defines the od(1) utility for formatting binary data as text, and it takes a long list of complex command-line arguments. The hexdump utility, by contrast, uses a simple DSL to specify how to format output. hexdump can implement almost every conceivable output format of od and then some using its DSL. The DSL is basically printf format specifiers combined with looping declarations.

I got bored one day and decided to implement hexdump as a library (i.e. "one hexdump to rule them all"), with a thin command-line wrapper that emulates the BSD utility version. Unlike BSD hexdump(1) or POSIX od(1), which implement everything in C in the typical manner, I decided to translate the hexdump DSL into bytecode for a simple virtual machine.

  http://25thandclement.com/~william/projects/hexdump.c.html

The end result was that my implementation was about the same size as either of those, but 1) could built as a shared library, command-line utility, or Lua module, 2) is more performant (formats almost 30% faster for the common outputs, thanks to a couple of obvious, easy, single-line optimizations the approach opened up) than either of the others, and 3) is arguably easier to read and hack on.

Granted, my little hexdump utility doesn't have much value. I still tend to rewrite a simple dumper in a couple dozen lines of code for different projects (I'm big on avoiding dependencies), and not many other people use it. But I really liked the experience and the end result. I've used simple DSLs, VMs, and especially explicit state machines many times before and after, but this one was one of the largest and most satisfying.

The only more complex VM I've written was for an asynchronous I/O SPF C library, but that one is more difficult to explain and justify, though I will if pressed.

bachback|9 years ago

Yes, absolutely agree. In LISP its obvious.

1. Every function is a tiny VM already. Every macro is a layer on top of a "compiler" to let you redesign a language. LISP gives much more power, precisely because every program is its own DSL, and all power of the language within that DSL is available. http://www.paulgraham.com/avg.html

2. In SICP they show how to build anything from LISP constructs. The elegant thing is that LISP actually only needs very few machine primitives. https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-30.htm...

The most powerful constructs I've seen is a combination of these 2: Clojure/async is a macro which transforms any program to state machines. I think that kind of power gives you 20-30x advantage in business type applications. In fact I've seen C++ programmers spending a decade on discovering similar things by themselves. I strongly believe everyone should know at least a little about LISP.

biot|9 years ago

Clickable version, as I had a hell of a time selecting and copying that link on mobile: http://25thandclement.com/~william/projects/hexdump.c.html

linkregister|9 years ago

Yes, can you describe your SPF library? I'd love to learn more about how using a VM / state machine paradigm could help me approach CS problems.

gopalv|9 years ago

The project that affected my thinking the most was a bytecode interpreter[1].

I've had use for that knowledge, nearly fifteen years later - most of the interesting learnings about building one has been about the inner loop.

The way you build a good interpreter is upside-down in tech - the system which is simpler often works faster than anything more complicated.

Because of working on that, then writing my final paper about the JVM, contributing to Perl6/Parrot and then moving onto working on the PHP bytecode with APC, my career went down a particular funnel (still with the JVM now, but a logical level above it).

Building interpreters makes you an under-techtitect, if that's a word. It creates systems from the inner loop outwards rather than leaving the innards of the system for someone else to build - it produces a sort of double-vision between the details and the actual goals of the user.

[1] - "Design of the Portable.net interpreter"

_RPM|9 years ago

Interesting. One thing that I still haven't solved yet is the "break" and "continue" statement inside loops. For a break statement, it seems like it would just be the same a JMP with an address as the operand, but there would need to be some sort of registration of "The VM is in the loop now, and the break address is X", and continue would also be a JMP with an address to the top of the code for the loop.

I haven't implemented those in my system yet, and also have no idea how Python or PHP does it.

Is PHP's VM a stack based one? I do read the Zend/ directory of PHP's source, but it is really hard to follow and there is virtually no documentation on the VM

_RPM|9 years ago

I saw the matrix after I first implemented a virtual machine. I recommend everyone does it because it will teach you a lot about how code is executed and transformed from the syntax to the actual assembly/bytecode. A stack based virtual machine is so simple it takes a lot of thinking to understand how they work. (or maybe I'm just not that smart).

It's interesting that he implemented function calls via a jump. In my VM a function is just mapped to a name (variable), so functions are first class. When the VM gets to a CALL instruction, it loads the bytecode from the hash table (via a lookup of the name).

Since this is a procedural language where statements can be executed outside of a function, implementing the functions as a jump would be difficult because there would need to be multiple jumps between the function definition and statements that aren't in a function.

I really wish my CS program had a compilers class, but unfortunately they don't, so I had to learn everything on my own.

chii|9 years ago

A CS education is incomplete without a semester on writing a simple compiler, and a corresponding emulator for the output for said compiler.

briansteffens|9 years ago

Nice post! I really enjoy playing around with things like this. It's amazing how little is needed to make a language/interpreter capable of doing virtually anything, even if not elegantly or safely. As long as you can perform calculations, jump around, and implement some kind of stack your language can do just about anything.

I recently threw something together sort of like this, just for fun (I like your interpreter's name better though): https://github.com/briansteffens/bemu

It's crazy how much these little projects can clarify your understanding of concepts that seem more complicated or magical than they really are.

anaccountwow|9 years ago

This is a required hw assignment for a freshmen class @ cmu. https://www.cs.cmu.edu/~fp/courses/15122-s11/lectures/23-c0v...

Given it has some parts already written in the interest of time...

Arcten|9 years ago

What's even cooler, is that after building this VM (for the C0 language) as a freshman, you can come back as a junior/senior and write a compiler for that language in 15-411. It's a very cool way of going full circle.

_RPM|9 years ago

I am jealous of those students. I would love to take that class.

oops|9 years ago

Nice read! Reminds me of nand2tetris that was posted not too long ago https://news.ycombinator.com/item?id=12333508

(You basically implement every layer starting with the CPU and finishing with a working Tetris game)

douche|9 years ago

This reminds me a little bit of my computer architecture class. We started at logic gates in a simulator[1], and worked our way up from there to flip-flops and adders, memory chips, a simple ALU, and eventually a whole 8-bit CPU in the simulator. I want to think that we were even writing assembly for it, loading the programs into the simulated memory, and executing it. It was a great way to get a sense of how everything works, and I think it's when C-style pointers really clicked for me.

[1] this one, IIRC https://sourceforge.net/projects/circuit/

foobarge|9 years ago

I've done something similar 21 years ago: a C interpreter targeting a virtual machine. The runtime had a dynamic equivalent of libffi to call into native code and use existing native libraries. I added extensions to run code blocks in threads so that the dinning philosopher problem solution was very elegant. Back in the days, not having libffi meant generating assembly on the fly for Sparc, MIPS, PA-Risc, i386. Fun times. That C interpreter was used to extend a CAD package.

reidrac|9 years ago

I wrote a VM for the 6502 for fun and it was one of most interesting and satisfying projects I've ever made in my free time.

It is very close to a bytecode interpreter, only that it comes with a specification that is actually the opcode list for the MOS 6502 (and few details you need to take into account when implementing that CPU).

Besides there are cross-compilers that allows you to generate 6502 code from C for your specific VM (see cc65).

memsom|9 years ago

I did this in C#. It was a lunch time project at work a couple of years ago. It was fun. I still want to do a V2 and remove all of the shortcuts I put in because I didn't want to write code for the stack and stuff like that. At the end of the day, my solution was spookily similar to this - the 32bit instructions - well, yeah, I was the same! It was just simpler. I did have a few general purpose registers (V1, V2 and V3 I think) and I did have routines to handle bytes, words and such like. So stuff like this (as a random example I pulled from the source):

ORG START

START: ST_B 10

LOOP: ST_B 10

ADD_B ;;value will go back on stack

LD_B V1

SM_B V1 ;;value we use next loop

SM_B V1 ;;value we compare

SM_B V1 ;;value we echo to console

TRP 21 ;;writes to the console

ST_S '',13,10,$

TRP 21 ;;writes to the console

CMP_B 50 ;;compares stack to the constant

JNE LOOP

ST_S 'The End',13,10,$

TRP 21 ;;writes to the console

END

pka|9 years ago

I'm thinking a lot of the complexity of writing a compiler stems from the usage of inappropriate tools. I.e. I would rather kill myself than write a lexer in C (without yacc / bison), but using parser combinators it's a rather trivial task.

Similarly, annotating, transforming, folding, pattern matching on, CPS transforming etc. the produced AST is pretty trivial in a language that supports these constructs. And again, a nightmare in C.

That leaves codegen, but using the right abstractions it turns into a very manageable task as well.

Here's a compiler written in Haskell for LLVM [0].

[0] http://www.stephendiehl.com/llvm

TazeTSchnitzel|9 years ago

> I would rather kill myself than write a lexer in C

I've written several lexers in C-like languages, it's not that painful. I wouldn't dare write a parser though.

philippeback|9 years ago

Parsers made easy and pretty much interactive:

http://www.lukas-renggli.ch/blog/petitparser-1

http://www.themoosebook.org/book/internals/petit-parser

This include the dynamic generation of blocks and arrows style things...

philippeback|9 years ago

Soulmate of yours here: https://clementbera.wordpress.com

Lots of optimizations going on for OpenVM.

https://github.com/OpenSmalltalk/opensmalltalk-vm

Interesting bit: VM is written in Slang and transformed into C then compiled.

So you can livecode your VM. In the VM simulator.

elcct|9 years ago

I did something similar in the distant past, that is I wrote subset of C compiler (functions, standard types, pointers) to imaginary assembler and then bytecode interpreter. It was awesome fun, but also I got so into it my - then - girlfriend started to question my commitment to the relationship. So be careful, this is really interesting thing to do :)

rosstex|9 years ago

In this same vein, I recommend coding an emulator! It can be an excellent experience.

http://www.multigesture.net/articles/how-to-write-an-emulato...

curtfoo|9 years ago

Yes I wrote a parser/compiler and interpreter for a custom domain specific language and it had a similar effect on my career. Lots of fun!

Okay I guess technically I used a parser generator that I then modified to build an AST and convert it into assembly-like code that fed the interpreter.

reacweb|9 years ago

Bill gates also started with an interpreter (basic interpreter). Many parts of early windows applications were developed in p-code and visual basic is an important part of Microsoft success.

loeg|9 years ago

I like implementing emulators, because the toolchain and architecture specification are all there already. You get to implement what is basically a little embedded CPU.

dpratt|9 years ago

I'd add a driver for a non trivial binary protocol - I ended up implementing a JVM driver for Cassandra a few years ago, and it was a blast.

voltagex_|9 years ago

Working with data as binary is a good test of a high level language skills. When I was playing around with DNS, I wrote terrible code like https://github.com/voltagex/junkcode/blob/master/CSharp/DNS/.... A better way to do it is https://github.com/kapetan/dns/blob/master/DNS/Protocol/Head... - structs, of course.

I'd add reading and implementing a protocol from RFC - it's a great way to start thinking about design, especially if you read the original RFCs and work forward through the revisions and see what was kept vs deprecated.

Chanyow|9 years ago

[deleted]

149 comments