Mill CPU Inter-Process Communication

[+] cwzwarich|8 years ago|reply

Overall, this seems like one of the weaker Mill talks. Since they apparently don't yet have a real OS running real software in simulation, they probably haven't had the ability to test the ideas that affect how software is structured at a higher level.

They don't provide nearly enough ways to transitively grant permissions. Using the mechanisms discussed in the talk, it doesn't seem like you can implement a simple asynchronous queue of units of work to perform, each having their own permissions. The belt architecture encourages these sorts of second-class mechanisms that have to be used in a rigid way, because the details can be hidden in the belt and not be exposed architecturally.

Unless there's something else not mentioned in the talk, it seems like you still need to trust the OS, because when the OS is asked to allocate a page for a spillet there is nothing stopping it from creating a virtual alias of that page elsewhere and allowing another thread to read its data.

The mechanism to support fork() is a total kludge. Why have a single address space if you're just going to add segmentation in such an ad-hoc way for a single use case? Just run the original binary in emulation until exec() or something like that.

[+] jacquesm|8 years ago|reply

> The mechanism to support fork() is a total kludge.

Fork is a kludge. It also happened to be easy to implement on the hardware available at the time and we've been stuck with it ever since. So I happily forgive the mill team that their fork() implementation looks like it is a total kludge, it would be highly surprising if it were not.

[+] igodard|8 years ago|reply

>Unless there's something else not mentioned in the talk, it seems like you still need to trust the OS, because when the OS is asked to allocate a page for a spillet there is nothing stopping it from creating a virtual alias of that page elsewhere and allowing another thread to read its data.

The Mill is hardware and architecture, not policy. If you want to use such an OS then you are free to do so. The Mill is designed to efficiently support micro-kernel OSs. Note: micro-kernel, not no-kernel. There always will be a Resource Service that owns the machine. It will be a couple hundred LOC, small enough to be correct by eyeball or proof. Contrast your choice of monolith.

The OS is not involved in allocating spillets. Spillet space is a large statically-allocated matrix in the address space. It is not allocated in memory, only in the address space. As soon as you allocate a turf id and a thread id you have implicitly allocated a spillet. Only on spillet overflow is allocation necessary. Whether allocating turf or thread ids requires OS involvement depends on the policies and models chosen by the OS designer.

When first created the spillet data lives only in backless cache - no memory is allocated. Only if the spillet lives long enough to get evicted from cache is actual memory allocated, using the Backless Memory mechanism described in our Memory talk. The root spillets of apps will live that long; transient spillets from portal calls will likely live only in cache. Consequently truly secire IPC/RPC using Mill portals has overhead, both app and system combined, of the same magnitude of an ordinary function call.

> They don't provide nearly enough ways to transitively grant permissions. Using the mechanisms discussed in the talk, it doesn't seem like you can implement a simple asynchronous queue of units of work to perform, each having their own permissions.

There is a "session" notion that addresses such things. Unfortunately the talks are far enough into details that they must contain background and introduction slides for the viewers who have not already done (and retailed) all the other talks. This limits the amount of new material that can be covered in a single talk, and sessions didn't make the cut this time. We'll get to them.

> The mechanism to support fork() is a total kludge.

Agreed; there seems to be a Law of Conservation of Kludgery. We had as a minimum requirement that the architecture must support Unix shell. The only real problem is fork(). Would that we could issue an edict banning it.

[+] phs2501|8 years ago|reply

They've repeatedly said they want to run current software well, presumably including software that forks and does not follow up with exec (regardless of how ill-advised that may be; obviously opinions vary on this). Trapping to emulation on fork would seem to fly in the face of this.

That being said I'm still not sure how you can get all of fork's semantics out of this mechanism...

(For that matter various sneaky VM aliasing tricks ["magic" circular buffers mapped twice in a row so that any size and position block is contiguous, same file mapped in multiple processes with different (non-contiguous) memory mappings] are going to fail miserably with the virtual-addressed cache. It may well be worth not being able to do those things for the benefit of moving the TLBs, but it also flies in the face of "will run current software.")

[+] convolvatron|8 years ago|reply

the fork exposition was weak. admittedly fork() was a mistake and constrains alot of implementations in strange ways. i still dont understand how local/global exactly matches the semantics of cow.

transitive permissions are capabilities.

while i'm sympathetic to the lack of market appeal to a capability based system, doesn't it seem like you could implement posix on top of one by compromising it? fd transfer over unix domain is already halfway there.

seems like a better alternative.

[+] willvarfar|8 years ago|reply

(team Mill)

> They don't provide nearly enough ways to transitively grant permissions.

Portals are synchronous and transient permissions can only be used for the duration of the call and by the same thread. Asynchronous isn't so 'simple' because its about lifetime. With the synchronous portal the caller knows that the callee cannot retain any access to the buffers that were passed, and can reuse them safely. If those buffers were put on an asynchronous queue, when would the memory be safely reused and when would the owner know that? If you want asynchronous queues, you either have to have a buffered model like Unix pipes etc or you have to have a global hardware-implemented GC that somehow spans turfs and becomes part of the Trusted Computing Base (TCB) (shudder).

> it seems like you still need to trust the OS, because when the OS is asked to allocate a page for a spillet

Well there's plenty not mentioned in the talk and you touch on one aspect :) When a spillet overflows the extension cannot be in the reserved space, so space must be 'carved out' of the part of the address space where programs also have their needs carved out. Someone has to do the carving, whether its for spillets or for programs, and that someone has to be trusted. The problem isn't aliasing (we're Single Address Space), its that they can simply give as many permissions to it to as many turfs as they choose. There is a possibility that the carver is in the BIOS, but any which way there has to be a turf that can do this. This turf is obviously part of the TCB.

> Why have a single address space if you're just going to add segmentation in such an ad-hoc way for a single use case?

Unfortunately there isn't much market for general purpose CPUs if they are fundamentally unable to run Linux ;)

As you will almost certainly be running a Unix, and as you almost certainly will be using libraries that may fork, then all your normal heap and data stack pointers are going to be Local. Shared mmap pointers will be Global, as will your code. We can hope there is a flag this Linux lets you set that says that you forego the ability to fork() and in return all your pointers can be Global, because the Local puts some constraints on your address space use which become clearer if I explain them:

The local bit works like this:

If a pointer has the bit set, then before use it is mangled with a special register called the Local Space register. The hardware does this every time it uses a pointer. Each turf has its own Local Space value, and you can think of it as a simple offset, so if the address is 0x1...10 and the local space is 2 then the effective address in the global space is 0x1...100.

Now 64-bit adds are relatively slow for this use-case because we really want the effective address asap in the FU so a probable implementation of the Local Space is XORing it into the pointer instead of adding.

When a process forks the OS has to find a position in the global address space where the allocated ranges used by the program are free for the child and then set the new Local Space register for the child appropriately.

> Just run the original binary in emulation until exec() or something like that.

All modern OS don't actually COW until there's a page fault, and I'd expect them to use that trick on the Mill too. So the local bit makes it possible to fork(), but the hole-searching is lazy and only happens if you actually use it.

[+] PhilWright|8 years ago|reply

The Mill design is fascinating because it is genuinely very different to anything else. But it seems that the entire team might die from old age before they actually have any working silicon produced. Which would be a shame.

[+] thatswrong0|8 years ago|reply

Innovative CPU architectures and popular fantasy novel series both..

[+] monk_e_boy|8 years ago|reply

Yeah, I'd love to see some more money or resources poured into this.

[+] CyberDildonics|8 years ago|reply

They aren't trying to produce silicon, they are trying to produce patents.

[+] kev009|8 years ago|reply

Looking around the Mill Computing, Inc website, this feels like an (accidental?) sweat equity scam. I realize that is a very loaded charge, but this is NOT how real companies are run: "In the beginning we were a sweat equity organization; no one received a salary; instead, contributors received units that converted to stock when we incorporated. At incorporation 45 people had worked on the Mill and became shareholders. After incorporation we are still a sweat equity organization; we now use a stock option system for sweat equity, and we still pay no salaries. Reward for work today is comparable to what it was before incorporation."

I was involved in something similar around 10 years ago where we were working on a revolutionary EDA suite for analog and mixed signal circuit design. The owner was quite technically competent but kept upping the ante and theatrics to the point that no customers or suitors took the company seriously, only the desperate employees. They never closed sales nor sold the IP.

I advise extreme caution in dealing with the business side of this.

[+] igodard|8 years ago|reply

Caution is always warranted when you aren't getting cash on the barrel. The sweat equity documents are available - ask on the site (and now that I think about it, I suppose we should just put them on the site directly). There's no "owner": we all work on the same deal, me included. As it happens I have the largest chunk of equity. You can call that a scam after you have worked full time for over a decade with no paycheck :-)

And yes, Mill Computing, Inc. is not how real companies are run. Is that a bug or a feature?

[+] posterboy|8 years ago|reply

I understand the sentiment but the wording is indeed too loaded. A scam would mean the scammer gains a benefit.

[+] WhitneyLand|8 years ago|reply

On a regular basis Mill pops up here and generates some interest, I just can’t understand why.

They haven’t produced an FPGA proof of concept, after claiming they would have one ready last year. They now say they need investors to finish it, yet they previously claimed to not even be looking for funding.

They claim to have angel investors, but they are all secret ones. Of course it’s an investors right to stay private, but the reason you often see investors and companies shouting from roof tops is because the funding event itself can help a company. Publicizing it generates PR, gives the company credibility in dealing with other companies, and is a signal that can generate demand for more investors.

Even putting that aside, the biggest issue is they haven’t made a compelling case for how their ideas will outperform existing CPUs in practical usage scenarios. Yes a running FPGA would be nice, that’s not the only way to show potential.

They could do quantitative analysis, modeling, or start adding a lot more detail to their talks and papers (which tend to sound about as deep as you get in an undergrad architecture classroom), and argue very specifically and comparatively against today’s standards, even for just a few key scenarios.

Maybe they believe even those ways would still have capital/labor/opportunity costs that are prohibitive or for a startup? Another option could be small meetings with a few well respected hardware architects, who will have the best chance of understanding the potential value. Once convinced, they will probably be glad to write about it or just provide a reference, which will make funding, partnerships, hiring, etc all easier.

I dislike being critical of people swinging for the fences, because it’s what many of us here are trying to do, and it’s important that people keep doing it. However in this case it’s not just about long odds. Because of the reasons above and a few other details, things just don’t add up. I don’t believe the FPGA will ever demonstrate anything compelling, and don’t think any investor in their own backyard on sand hill road will bite.

It’s all conjecture of course, I’d be happy to be proven wrong.

[+] ema|8 years ago|reply

You brought up the question of why people are interested in the mill and then discussed the question of whether the mill is viable. They're not completely unrelated but still distinct.

Being a software guy I can't say much about the viability. Watching Ivan's lectures and thinking it over however tickles the same part of my brain that enjoys learning a new programming language. It is just fun to see how some problem could be solved differently.

[+] igodard|8 years ago|reply

All sound points.

Most of what you'd like to see are things we'd like to see too. At the beginning we decided to bootstrap rather than follow the usual funding model, at least to the point at which we could demonstrate what we had to people who would understand it in detail. We choose bootstrap in large part because most of us were old enough to have had actual experience with other business models. Yes, it has taken far longer to get this far than we wanted, but we have gotten this far.

About evaluation: it has been our experience that the more senior/skilled a hardware (and software) guy is the more they fall in love with the Mill. You don't hear much of that - we want the tech to be judged on its merits, not on some luminariy's say-so. And of course those senior guys tend to work for potential competitors and don't want to say much publicly.

But you are right: the proof will be running code, and we're starting to do that. We'll be doing more talks like the switches talk, with actual code comparisons. Eventually we will put our tool chain and sim on the cloud for you to play with. Patience, waiting is.

[+] jacquesm|8 years ago|reply

As a seed investor into several companies I'm pretty happy to keep my investments private.

[+] rjsw|8 years ago|reply

A closed source software emulation would be fine for me, ARM provide this for AArch64.

[+] convolvatron|8 years ago|reply

watching the talk. does he not compare this to a classic segment/call gate architecture because he doesn't expect it to be a familiar reference? i'm certain he's seen it before :)

edit: i thought they managed to do all of this without segments, but at the end we hear about a special local segment with offset addressing apparently introduced just to handle children of fork()..i lost how cow can be expressed losslessly as local/global

redit: question around 1:00:00 explicitly ask this question, and he said, erroneously i think that while semantically similar this is the first time a direct hardware implementation of a call gate has been proposed

[+] Taniwha|8 years ago|reply

I can think of a couple systems with hardware call gates

He also doesn't mention the word 'capability' anywhere either - this is all 1980s stuff

[+] infogulch|8 years ago|reply

Of all of the Mill subjects, the pointer kludge to support fork (itself a kludge, yes) seems to me to be the biggest offender of the "sufficiently smart compiler" red flag.

I just have a sinking feeling about hoping a compiler can correctly identify and track all pointers to know how to flag them. The "pointer is a native-word-sized int" assumption may be so ingrained -- from compilers to stdlibs to the wide variety and age of programs -- that it will be nigh impossible to rid existing codebases of it completely.

But I'm not a compiler guy (or hardware, or assembly, or C for that matter) so I could be quite mistaken. Perhaps it's enough to fix the compiler and make it capable of emitting warnings/errors when it detects a violation.

As far as the talk itself goes, I'm a little sad that there was so little new information though I understand that we're quite deep in the technical details and there's a lot of prerequisite background that you can't reasonably expect from a random tech audience. If there are more than a few more talks you might need to reevaluate this method altogether and use a different format.

I'm very glad that you've decided to change the wording to refer to it as an "SSA machine" as opposed to "belt". I think many more people are familiar with SSA or can be convinced that it works ("your current compiler uses it right now" probably helps) by describing it as "SSA where you can only reference the last N results" as opposed to building a whole model based on a "conceptual giant shift register" from before. I've been following the Mill talks since the first few videos and recently I wonder if even the asm programming model should be writing raw SSA instead of belt numbers, especially since genasm assumes an infinite belt anyways.

[+] neerajsi|8 years ago|reply

Unrelated to protection: Ivan mentioned that this is an SSA-like architecture.

How does the compiler implement PHIs connecting expressions with different latencies? Let's say I have:

`if (cond) { x = a + b; } else { x = a * b; }`

The MUL is may take a bit longer than the ADD, but the user needs to accept the argument at a given belt position. How do you avoid having the pay the latency cost for the MUL if `cond` is usually true?

[+] igodard|8 years ago|reply

The tool chain does hoisting and if-conversion with wild abandon. That code becomes {x = cond ? a+b : a*b}, and both expressions are evaluated in parallel. The conversion is a heuristic; if you have tracing data for the branch then it might not convert. However, a miss-predict is a lot more expensive than a multiply so the tracing has to be pretty skewed to be worth the branch.

The conversion does increase the latency of getting the value of x. If there's nothing else to do then the tool chain will insert explicit nops to wait for the expression. The same stalls will exist on other architectures for the same code, just not visibly in the code. It happens that making the nops explicit is faster than a stall; you can idle through a nop with no added overhead, but you can't restart a stall instantaneously.

[+] dirkt|8 years ago|reply

It's an interesting question, maybe ask at the Mill forum: https://millcomputing.com/forum/the-mill/

Based on my limited understanding of the Mill I'd say (1) (simple_ PHIs are implemented by the pick-phase, (2) it's an exposed pipeline with known latencies, so you can't really have the "I want to avoid the longer latency inside a single basic block", and it's probably not worth doing that for a single mult vs. a single add, (3) for cases where it matters because the latency difference is much better, you'd use two basic blocks, and rearrange belt positions at the end.

[+] marcosdumay|8 years ago|reply

The talk about the belt explains this.

If I remember it right, ADD and MUL take the same time, but when instructions differ in time the compiler is expected to reorder them. If it can't reorder, than yes, it has to wait.

[+] taliesinb|8 years ago|reply

Just wanted to say, it's always a real treat to watch Mill talks, and I thank Ivan for putting in the hard work of making them so good! (The negativity I see here on HN really disappoints me).

Also, is the thread talk close at hand? I feel I learned less from this talk than usual; most of the material was already discussed in the security talk.

[+] burner|8 years ago|reply

The emperor has no clothes. The guy who claims to have written 12 compilers hasn't turned out one in a decade. How are microarchitectural decisions being driven without a compiler?

[+] silisili|8 years ago|reply

I like the idea behind Mill, and the openness of the talks, etc. That said, it's been a -very- long time without so much as a real demo. What gives?

[+] neerajsi|8 years ago|reply

I wonder how the PLB can be fast. You have a dictionary from byte range to permission. This is harder than TLBs, which map a relatively large granule where you can form a search key by just extracting the top bits from the virtual address.

Intel MPX has a similar protection model, and that introduces a lot of overhead (of course it is bolted onto an existing arch and it wasn't a high priority feature).

[+] willvarfar|8 years ago|reply

The protection entries have ranges. The bounds are in bytes, but the range can be massive.

Imagine you load a 7MP image which takes, say, 21MB of RAM. That would be 5184 4K pages in a classic TLB. In the Mill's PLB, that whole part of the address space can be in a single protection entry.

Then, there's a big difference from how things can be organised in software vs hardware. The hardware PLB has some number of entries, and it will check all those entries in parallel.

[+] _ooqq|8 years ago|reply

I work with a lot of OO code (ORM) that regularly contains references on objects to other objects. How would that "security model" behave wrt the map of reachable objects in relation to the object passed... let's assume by reference. I figure this scenario would be somewhat similar to the problem of "/.." paths in URLs on web servers.

[+] igodard|8 years ago|reply

The grant model requires you to grant each object individually that you want to pass. That is annoying if you have many objects. In both the caps and grant models you can cut the overhead by thinking of the whole graph as "the object". A typical approach is to allocate graph nodes in an arena and pass the whole arena.

Fine granularity is expensive, which is why the monoliths have one process-granularity. If you have 100,000 graph nodes and want to pass all of them except this one then you will have to pay for the privilege, in any protection model. The Mill lets you pay less.

76 comments