top | item 10111991

LLVM Meets the Truly Alien: Mill CPU Architecture [video]

117 points | pohl | 10 years ago | youtube.com | reply

85 comments

[+] state|10 years ago|reply
It's so unusual to find projects like this. I have so much respect for these guys. It's always fun to see what they're working on. Lots of the technical specifics are over my head, but I really appreciate their tone and approach.
[+] igravious|10 years ago|reply
Previous discussion (with a member of their team chiming in): https://news.ycombinator.com/item?id=9856334

Seems like a _fascinating_ architecture. I don't get the double program counter bit; it also seems like too much cognitive overhead to me. But the belt (a queue of un-named slots) idea, and function calls for free with multiple return results so that regular ops and user-defined functions work the same way, that sure is clever.

Godard mentions that the CPU space has seen little to no innovation. But what about Transmeta? They were doing some cool stuff, no? And the DEC Alpha, that was super interesting. Probably lots more besides. I'm sure if I bothered to check, Wikipedia would have a CPU graveyard page :) Thing is, x86 has been the only game in town for a long time, with Motorola challenging in the past and ARM challenging now. Maybe it's time for a three-horse race?

[+] Someone|10 years ago|reply
As to finding instruction boundaries with a variable-length instruction format: you could reuse the trick they use to start two instruction streams from a single address. Run a stream of fixed-size instruction lengths downward from that address, and use it to find the boundaries in the stream of variable-length instructions running in the ’normal’ direction.

You could have:

  offset  : -----000000000111111
  offset  : 54321012345678901234
  contents: 64212iijkkllllmmmmmm (digits are instruction lengths, letters are instructions)
So, reading offsets -1, -2, -3, etc. you read off the sizes of the instructions you can read in offsets 1, 2, 3, etc. If you have 4 decode units, you can replace the sizes of the instructions by their offsets, resetting the offset every 4th instruction.
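This scheme is easy to sketch. Below is a toy model (my own illustration, not anything the Mill actually does) reusing the lengths 2, 1, 2, 4, 6 from the example above: the size stream read at negative offsets tells you how to carve up the instruction bytes at positive offsets.

```python
# Toy model of the backward-running size stream: size_stream[i] is the
# value stored at offset -(i+1); instr_bytes holds offsets 0, 1, 2, ...
# This is a sketch of the commenter's scheme, not the Mill's encoding.

size_stream = [2, 1, 2, 4, 6]      # "21246" read at offsets -1, -2, ...
instr_bytes = "iijkkllllmmmmmm"    # variable-length instructions

def split_instructions(sizes, data):
    """Carve the instruction stream into pieces using the size stream."""
    out, pos = [], 0
    for s in sizes:
        out.append(data[pos:pos + s])
        pos += s
    return out

print(split_instructions(size_stream, instr_bytes))
# ['ii', 'j', 'kk', 'llll', 'mmmmmm']
```

Note that the carving is still a serial prefix-sum over the sizes, even though each size is now trivial to fetch.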

With instruction sizes of 1-4 bytes, you could store instruction sizes in 2 bits per instruction. That would lower the pressure on your cache for the extra information you store. There also would be more space in your instruction set because you no longer would have to worry about creating an instruction that has a prefix that also is a valid instruction. For example, you could use 0x01 both for INC and ADD IMMEDIATE VALUE, where it means the former if the instruction length is 1, the latter if it is longer, with the size of the immediate determined by the size of the instruction.
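A minimal sketch of that opcode-reuse idea, assuming the instruction length arrives from the separate size stream. The 0x01 meanings follow the example above; everything else (names, immediate layout) is made up for illustration.

```python
# Hypothetical decoder where the same opcode byte means different things
# at different lengths, because length comes from the size stream.

def decode(instr_bytes: bytes) -> str:
    """Decode one instruction whose length is already known externally."""
    opcode = instr_bytes[0]
    if opcode == 0x01:
        if len(instr_bytes) == 1:
            return "INC"                        # 1-byte form
        # longer forms: remaining bytes are a little-endian immediate,
        # so the immediate's size is implied by the instruction's size
        imm = int.from_bytes(instr_bytes[1:], "little")
        return f"ADD_IMM {imm}"
    return "UNKNOWN"

print(decode(bytes([0x01])))              # INC
print(decode(bytes([0x01, 0x05, 0x00])))  # ADD_IMM 5
```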

Would that be a net win? I wouldn’t know, but it seems simpler (but ’simpler’ looks like a negative in their worldview) than creating those two streams that somehow must work together to do what a single statement sequence in the source code does.

[+] CarVac|10 years ago|reply
The twin instruction streams have the advantage that the instructions are split between two completely separate caches, which can have reduced latency relative to (total) capacity, bringing benefits in the case of a branch mispredict.
[+] darkmighty|10 years ago|reply
I don't understand why they don't simply use a prefix-free code, for example a Huffman code. You can have arbitrary-length instructions at entropy-optimal efficiency that way, and decoding in binary-tree fashion should be linear and very fast in hardware.
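For illustration, a prefix-free decode can be sketched like this. The codebook is invented, but because no code word is a prefix of another, instruction boundaries fall out of the decode itself.

```python
# Sketch of decoding a prefix-free (Huffman-style) opcode stream bit by
# bit. The codebook is made up for illustration.

CODEBOOK = {"0": "ADD", "10": "LOAD", "110": "STORE", "111": "BRANCH"}

def decode_bits(bits: str):
    """Walk the bits; emit an op each time the buffer matches a code word."""
    ops, cur = [], ""
    for b in bits:
        cur += b
        if cur in CODEBOOK:          # reached a leaf of the code tree
            ops.append(CODEBOOK[cur])
            cur = ""
    return ops

print(decode_bits("010110111"))
# ['ADD', 'LOAD', 'STORE', 'BRANCH']
```

One caveat this sketch makes visible: the decode is inherently bit-serial, since each symbol's start depends on where the previous one ended, which limits how many instruction boundaries you can find per cycle.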
[+] igravious|10 years ago|reply
I don't really understand your solution, but then again I did not understand the Mill's solution. Can you explain it again?

All I will say is that they seem to have really tried to simplify and regularize the Mill _except_ in the opcode stream / decode step, where they say that two streams is optimal! Seems like it would be a nightmare and introduce all sorts of complexities. It was the only bit of the talk I didn't grok, whereas most other bits had me nodding my head and thinking "oh, neat".

[+] Veedrac|10 years ago|reply
That doesn't really solve the problem, and has several deficiencies. A split stream might seem complicated, but IMHO it's an amazingly clean way of simplifying the problem.

---

The problem is not figuring out what sizes things are, but the routing. When you have 30+ instructions a cycle, you need to route a lot of data to a lot of different places. You also can't do a parallel read, since the size of each instruction affects the placement of the next; you can eventually fix this with routing, but it's not easy.

Consider a 30-wide instruction. Decoding it involves knowing where in the instruction stream it starts. This can only be known once the prior instruction is partially parsed.

Consider its latter half: this can only be read after the first half is read. In your case you actually only need to read the 2-bit prefixes, but the difficulty remains.

With a split instruction stream, there are two independent (at least until a branch) program counters, so you have twice the contextual information. You already know where the second half starts from the prior instruction.

One more great thing about this is that the contextual information is preserved across jumps, without increasing the size of jump instructions. Even if prior instructions carried metadata about splitting the next instruction for parsing (at the cost of yet more space), this would be lost across jumps. That would worsen jump latency, which is a common bottleneck.

The above point is way more important if you expect a flow instruction almost every bundle, due to the massive ILP the Mill has.
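The serial dependency can be made concrete with a toy prefix-sum. This is my own sketch, and the even/odd split below is a simplification (the Mill actually splits by instruction half and runs the streams from a shared entry point), but it shows why two streams halve the chain each decoder must walk.

```python
# Finding where instruction N starts requires summing all prior lengths,
# so one stream gives a dependency chain N long; two independent streams
# each carry half the instructions, and their chains run in parallel.

def start_offsets(lengths):
    """Serial prefix-sum: each start depends on every earlier length."""
    starts, pos = [], 0
    for n in lengths:
        starts.append(pos)
        pos += n
    return starts

lengths = [3, 1, 4, 2, 2, 3, 1, 4]
one_stream = start_offsets(lengths)      # one chain of 8 additions
half_a = start_offsets(lengths[0::2])    # two chains of 4 additions,
half_b = start_offsets(lengths[1::2])    # walked in parallel
print(one_stream)   # [0, 3, 4, 8, 10, 12, 15, 16]
print(half_a, half_b)
```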

---

> With instruction sizes of 1-4 bytes, you could store instruction sizes in 2 bits per instruction.

However, the Mill does not constrain its instruction sizes to multiples of 8 bits. This means either you'd lose that bonus or you'd need even more bits.

The Mill also splits which instructions are allowed in each half. Not only does this give you a free implicit bit, it again halves the amount of work for each decoder. This could probably happen with interleaved halves like you describe, but not without difficulty.

> For example, you could use 0x01 both for INC and ADD IMMEDIATE VALUE, where it means the former if the instruction length is 1, the latter if it is longer, with the size of the immediate determined by the size of the instruction. Would that be a net win?

The answer is no. The instruction codes are generated to be space-optimal. I'm not sure how this is done, but it has been stated. They also leave it open for metadata like you mention to be included in each instruction stream, and in fact their blocks do have headers.

> creating those two streams that somehow must work together to do what a single statement sequence in the source code does

Given the instructions are disjoint and they only need synchronization at jumps, I would be very surprised if this was problematic.

[+] aidenn0|10 years ago|reply
Does anybody know The Mill's plan for commercialization?
[+] nly|10 years ago|reply
Sell Ivan Godard's hair off as a genuine wizard beard
[+] analognoise|10 years ago|reply
Generate buzz, get bought, never produce anything.
[+] luckydude|10 years ago|reply
The Mill CPU has gone on and on about how it's going to be awesome some day. I spoke with one of their guys at Greg Chesson's wake; it seemed open ended.

Yup, you have good ideas. How about you ship? Then we can see if the ideas meet reality and work.

I'm friends with one of the guys who did the P6, which became the basis for pretty much every x86 chip since. He shipped stuff.

Ideas are easy, execution is hard, I have yet to see the Mill people execute.

I'd be crazy happy if they stepped up and made me look stupid because they shipped a kick ass chip.

[+] uxcn|10 years ago|reply
As I understand, they aren't seeking venture capital backing, so progress is slow. I know they have (had) plans to implement a Mill on an FPGA, but even going to a usable ASIC will probably be a challenge.

It's a major architectural change as well. Most existing systems rely on virtual memory, which the Mill doesn't support.

[+] javajosh|10 years ago|reply
I know, right? Even a crappy CPU based on the Mill concept would be good. You could at least run computational-efficiency (ops/watt) benchmarks. I'm sort of curious what the endgame is, myself.
[+] quux|10 years ago|reply
I'm a little confused about how the "pointerness" of pointers is getting lost in LLVM. The Mill can't be the only architecture that encodes extra data in the high bits of pointers, right? The 64-bit Objective-C runtime and the Xbox 360 come to mind as systems where this is done. Clearly LLVM can generate good code for 64-bit Objective-C... Am I missing something?
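For reference, high-bit pointer tagging usually looks something like this sketch. The bit positions and tag values here are made up for illustration; they are not the actual Objective-C runtime layout.

```python
# Rough sketch of stashing metadata in the high bits of a 64-bit
# pointer, on the assumption that only the low 48 bits address memory.

TAG_SHIFT = 48
ADDR_MASK = (1 << TAG_SHIFT) - 1   # low 48 bits hold the real address

def make_tagged(addr: int, tag: int) -> int:
    """Pack a tag into the otherwise-unused high bits."""
    return (tag << TAG_SHIFT) | (addr & ADDR_MASK)

def untag(p: int) -> int:
    """Strip the tag; this must happen before any dereference."""
    return p & ADDR_MASK

p = make_tagged(0x7FFF_DEAD_BEEF, tag=0x5)
print(hex(untag(p)))   # 0x7fffdeadbeef
print(p >> TAG_SHIFT)  # 5
```

The compiler's job, and presumably the worry with LLVM, is knowing which integer operations on `p` are safe to reorder or fold once the "pointerness" has been laundered through arithmetic like this.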
[+] faragon|10 years ago|reply
Those CPUs look like they have huge potential, but there is no really working compiler for them. In my opinion, they could try to engage Fabrice Bellard, the genius behind QEMU and TCC, or people with experience with the Mali GPU architecture, which shares many points with the Mill CPU approach.
[+] Joky|10 years ago|reply
Can you elaborate on the common points with the Mali? You may not know that Ivan has a compiler background; he is not really a newbie here! And the main problem right now is not a crappy compiler but no silicon...
[+] andrewchambers|10 years ago|reply
I have no idea how TCC could help the Mill. It's completely against all their goals.
[+] TazeTSchnitzel|10 years ago|reply
They'll get there. They aren't burning through money, they don't have any: they're not a properly formed company, just volunteers. It'll get done when it gets done.