top | item 8533843

Arthur Whitney's One-page Interpreter (1992)

59 points| chrisdew | 11 years ago |jsoftware.com

50 comments

[+] stiff|11 years ago|reply

I am at a vacation and bored, so I decided to try to actually understand this. I do not know much APL/K/J, but for a start, here is a version with more traditional line breaking:

https://gist.github.com/anonymous/f72e5c4a432492abce59

Some basic observations:

- He uses all the obscure C features, like the ability to not declare the type of an argument or return value and let the compiler fill in, and also the old school style of declaring functions:

  foo(x,y) int x, y { return 0; }

Instead of:

  int foo(int x, int y) { return 0; }

Since the compiler will infer the int, and since he uses R for return this example eventually becomes:

  foo(x,y){R(0);}

- Variables, or really registers, are only accessible as letters from a to z, and the "st" array stores the values of all registers. Numbers entered in the REPL have to be between 0 and 9, and hence he avoids the dirty work of making a proper lexer. It's also super easy to trigger a segfault since any error handling is non-existent.

- DO(n,x) is a C macro that evaluates the given expression "x" for all numbers between 0 and "n"

- V1 is a C macro that defines unary operators for the interpreted language, and V2 defines binary operators. In V1 definitions the operand is called "w", in V2 the operands are called "a" and "w".

- For example ",", which calls the cat function, is a binary operator that creates vectors:

  1,2,3,4
  4 
  1 2 3 4

- The vt, vd and vm arrays map ascii symbols to the functions defined with V1 and V2. { is the second symbol in vt, so when used as a unary operator it calls "size" (second non-null element of vm):

  {5,6,7,8
  
  4

and when used as a binary operator it calls "from" (second non-null element of vd).

- wd is a parser that goes from the original input string to a weird intermediate form that is an array of longs. Each input character gets mapped one-to-one to an item in this intermediate form.

  If the input character was a number between 0 and 9:
    Value type instance gets allocated
    Intermediate form for this input character consists of the address of the allocated instance

  If the character is a letter between "a" and "z":
    Intermediate form consists simply of this character

  If the character represents an operator
    Intermediate form consists of the index of the operator in the vt array

In other words, the intermediate form is an array where some elements are ascii characters, others are memory addresses and yet other indices into some array. This part is really something.

- The ex function executes the intermediate form. Since everything in the input is fixed length, and there is no syntax checking, it just indexes into the intermediate form assuming everything is well formed, while the parser did not check that so it's not really guaranteed - again a source of easy segfaults. The execution goes from left to right and consists of looking at the first position in the intermediate form and then making recurrent calls if necessary (let X be the current item in the intermediate form):

  If X is a character
    Lookahead one item
    If it is a '=' char
      Assign the result of executing everything after the '=' to the register indicated by X
    Assign to X the value of the register named by X
  If X is not a character and is a small integer
    We are applying a binary operator
    X is the index into the "vm" array
    Fetch the function from "vm", apply it to the result of executing the rest of the intermediate form
  Otherwise:
    If there is any more input remaining other than the current item, we are applying a binary operator
    Lookup the function in "vt", apply it to the result of executing the intermediate form to the left and to the right of the operator

- I have the biggest problem with understanding that "a" struct, that represents all values in the interpreted language, which are arrays. ga is clearly the basic allocation function for it, "plus" obviously adds two arrays, so it's clear the "p" field holds the actual contents, but that's where things get very shady.

[+] cplease|11 years ago|reply

"He uses all the obscure C features, like the ability to not declare the type of an argument or return value and let the compiler fill in"

This isn't an obscure feature, the default C89 type is "int". Leaving out "int" has long been considered bad form, and since C99 omitting return type has not been allowed (but generally supported in nonstrict modes).

[+] avmich|11 years ago|reply

Didn't look at this piece, but more recent J interpreter is here - https://github.com/openj/core . I think the main idea for "a" struct (it is here - https://github.com/openj/core/blob/master/jt.h) is that it has "rank" - the single integer which is the number of dimensions, then array of dimensions - "shape" of length equal to "rank", then "data" part which has the size equal to multiplication of all elements in "shape".

Edit: the structure is at https://github.com/openj/core/blob/master/jtype.h .

[+] jpfr|11 years ago|reply

Uuh. This is ancient K&R C. Here's a version that is somewhat closer to todays C. https://gist.github.com/jpfr/560c861cd3eb76700a54

Can we replace some of the one-letter codes with better namings?

[+] nine_k|11 years ago|reply

This is a perfect illustration why normal people (like me, unlike Arthur) have problems working with his apparently brilliant creations, such as the k language. The amount of information required to efficiently read (let alone write) such texts just does not fit into normal people's working memory. One-letter names and the absence of comments don't help either.

(WRT weird names: IIRC, some time ago kdb featured a number of functions named with digits, like '2' being the function to connect to a socket or something.)

[+] yang_guo|11 years ago|reply

As a guy with a terrible memory, I think of it the opposite way. For Q (and more so K), the actual number of distinct items to remember is very small, as basically anything complex is created by composition of the simple functions/variables. This way you don't need to remember nearly as many libraries/variables/function names, and also the function itself becomes its own description.

That being said, for more production-heavy applications and interfacing with other languages, variable/function naming could be more descriptive :)

[+] golemotron|11 years ago|reply

I want to offer a bounty for a version that compiles with a contemporary C compiler. There's some magic going on with longs being treated as pointers here.

[+] avmich|11 years ago|reply

That's a rather normal situation - when your long is 32 bits long and the pointer is also 32 bits long, to satisfy a strict C compiler you just need a cast - which will be a no-op in machine code. This approach was used quite a bit...

[+] FullyFunctional|11 years ago|reply

This transcript of Roger Hui's talk casts more light on the general idea: http://archive.vector.org.uk/trad/v094/hui094_85.pdf

It's interesting that what he calls "Parsing" is more typically known as term reduction by pattern matching in the functional world.

[+] marktangotango|11 years ago|reply

He "cheats" by putting multiple statements on one line:

noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);z->p=c-'0';R z;}

would be

  noun(c){
   A z;
   if(c<'0'||c>'9')
     R 0;
   z=ga(0,0,0);
   *z->p=c-'0';
   R z;
  }

In this forms, it's still cryptic, but not quiet as inscrutable.

[+] [-] phpnode|11 years ago|reply

Except this isn't him "cheating", this is his preferred coding style, for readability!

[+] [-] icsa|11 years ago|reply

What you refer to as "cheating" is simply writing C like k/APL/J (i.e. functional versus imperative). Language implementations have this tendency. I've seen the same effect with Lisp implementations.
N.B. - k/APL/J functions are often written on one line because that are often very short.

[+] [-] chipsy|11 years ago|reply

When code is bulky, you can excuse your ego because your visibility is limited in a literal sense. But when it's dense like this you actively know, as you try and fail to skim it, that you can only usefully focus on a small segment at a time. So you blame the code instead of yourself.

[+] [-] ejk314|11 years ago|reply

I think it's fair to blame the code in this case. It's not written at all for readability (by anyone other than the author while he was writing it). Of course he knew what those 1 or 2 letter variable names meant when he came up with them, so it's a lot easier to load into memory. But I doubt Whitney could easily coming back to this code after any length of time.

[+] [-] ha292|11 years ago|reply

Isn't it wonderful that many wall st firms use k/q to write some of the most complex analytics ?
Of course, it is understandable. NOT.

[+] [-] avmich|11 years ago|reply

http://kaleidic.com/2009/composing-j-in-my-mind/
"J is conducive to thinking about computation because a lot is expressed by combining even a few components. With J it is natural to think in code, not just about code."

[+] [-] PhasmaFelis|11 years ago|reply

Interpreter of what? "J", apparently, from context and inference, but that should have been in the post title.
What's an "interpreter fragment"? Most of the Google results for that term point to the same story.
I know this is Hacker News and everyone is supposed to know everything about everything already, but on obscure topics it's nice to take a moment and explain what you're talking about.

[+] [-] cplease|11 years ago|reply

Complete shit. C was a mature language in 1989. Even for banging out a quick hack, and forgiving the gets into a 99 byte stack array, this is crap by any standard.
I can compile and run pretty much any K&R code from the C Programming Language in 1978 with no problem. Finding the appropriate compiler, I can do the same with BCPL, the Language and its Compiler (1981). And the code is actually understandable.
This, on the other hand, is an obfuscated mess that segfaults on any input, J or otherwise. I'll be damned if I spend 5 minutes debugging it, ex() segfaults with infinite recursion. It's not worth the trouble.
Not saying this isn't worth sharing as a curiousity or an artifact, but this is not a work of genius and it is not defensible or to be emulated. Do your colleagues a favor and don't write like this.

[+] [-] jpfr|11 years ago|reply

It's different. But it is definitely not shit. The majority of use shouldn't write code like this. But those who can create amazing things.
Here's a working adaptation that is quite close to the original. https://github.com/tangentstorm/j-incunabulum/blob/master/mj...
A current version of the k language still has traces of it inside. And there are still people on Wall Street running mission critical systems with it. http://kx.com/q/c/c/k.h
The kona project attempts an open source implementation. And they give some rationale on the paricular coding style. https://github.com/kevinlawler/kona/wiki/Coding-Guidelines

[+] [-] avmich|11 years ago|reply

> Do your colleagues a favor and don't write like this.
I'd second this - in any normal situation.
If you're really aspiring to form the future, strive to be extremely productive, demands that from your colleagues - that's another thing.
Sigh - industry as a whole isn't nearly on the level to use those advanced tools.