igodard's comments
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
Fine granularity is expensive, which is why the monolithic kernels offer only one granularity: the process. If you have 100,000 graph nodes and want to pass all of them except one, you will have to pay for the privilege in any protection model. The Mill just lets you pay less.
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
Like any cache, the optimal PLB size is determined by the working set. In the typical code we are seeing, the program has a couple of open files, half a dozen mmaps where the heap grew itself, and portal blocks for assorted libraries. The working sets are much smaller than a conventional TLB, and with SAS we have several cycles available in parallel with the caches.
The upshot is that a PLB can be large, cool, and slow. As for the range compares, the PLB permits the same sort of address sub-setting as is done in mixed-size TLBs. Think about how many bits in a typical address range actually differ between the lower and upper bounds.
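A rough sketch of that observation, for the special case of a naturally aligned power-of-two region (the addresses and the 64 KiB size are invented; real Mill regions are byte-granular, so this only illustrates the mixed-size-TLB-style subset):

```python
def differing_bits(lo, hi):
    """Count the low-order bits that can vary inside [lo, hi)."""
    return (lo ^ (hi - 1)).bit_length()

def in_region(addr, lo, size_bits):
    """Range check for an aligned power-of-two region: a single prefix
    compare, just like a mixed-size TLB entry, not two full-width compares."""
    return (addr >> size_bits) == (lo >> size_bits)

# A made-up 64 KiB region: its bounds differ only in the low 16 bits.
lo, hi = 0x7F3A00000000, 0x7F3A00010000
```

Since the bounds share all but 16 bits, the entry needs only a prefix match plus a 16-bit mask rather than two 48-bit magnitude compares.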
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
The conversion does increase the latency of getting the value of x. If there's nothing else to do then the tool chain will insert explicit nops to wait for the expression. The same stalls will exist on other architectures for the same code, just not visibly in the code. It happens that making the nops explicit is faster than a stall; you can idle through a nop with no added overhead, but you can't restart a stall instantaneously.
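A toy cycle model of that claim; the 2-cycle restart penalty is invented for illustration, not a Mill or competitor figure:

```python
def nop_cost(wait_cycles):
    """An explicit no-op costs exactly the cycles it occupies."""
    return wait_cycles

def stall_cost(wait_cycles, restart_penalty=2):
    """A hardware stall costs the wait plus a restart penalty, because the
    pipeline cannot resume instantaneously (penalty value is hypothetical)."""
    return wait_cycles + restart_penalty
```

Under any positive restart penalty, idling through explicit no-ops is cheaper than stalling for the same wait.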
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
Of course, modern peripherals don't look like that, so there will be adaptors. IBM 360 channels and CDC6600 PPs also haven't been architecturally revisited in a while.
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
And yes, Mill Computing, Inc. is not how real companies are run. Is that a bug or a feature?
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
Most of what you'd like to see are things we'd like to see too. At the beginning we decided to bootstrap rather than follow the usual funding model, at least to the point at which we could demonstrate what we had to people who would understand it in detail. We chose to bootstrap in large part because most of us were old enough to have had actual experience with other business models. Yes, it has taken far longer to get this far than we wanted, but we have gotten this far.
About evaluation: it has been our experience that the more senior/skilled a hardware (and software) guy is, the more they fall in love with the Mill. You don't hear much of that - we want the tech to be judged on its merits, not on some luminary's say-so. And of course those senior guys tend to work for potential competitors and don't want to say much publicly.
But you are right: the proof will be running code, and we're starting to do that. We'll be doing more talks like the switches talk, with actual code comparisons. Eventually we will put our tool chain and sim on the cloud for you to play with. Patience, waiting is.
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
The difference between the two models is visible when you pass a graph structure across a protection boundary. With caps it is easy to pass the whole graph, and hard to pass only one node. With grants it is vice versa.
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
The Mill is hardware and architecture, not policy. If you want to use such an OS then you are free to do so. The Mill is designed to efficiently support micro-kernel OSs. Note: micro-kernel, not no-kernel. There always will be a Resource Service that owns the machine. It will be a couple hundred LOC, small enough to be correct by eyeball or proof. Contrast your choice of monolith.
The OS is not involved in allocating spillets. Spillet space is a large statically-allocated matrix in the address space. It is not allocated in memory, only in the address space. As soon as you allocate a turf id and a thread id you have implicitly allocated a spillet. Only on spillet overflow is allocation necessary. Whether allocating turf or thread ids requires OS involvement depends on the policies and models chosen by the OS designer.
When first created, the spillet data lives only in backless cache - no memory is allocated. Only if the spillet lives long enough to get evicted from cache is actual memory allocated, using the Backless Memory mechanism described in our Memory talk. The root spillets of apps will live that long; transient spillets from portal calls will likely live only in cache. Consequently truly secure IPC/RPC using Mill portals has overhead, app and system combined, on the same order as an ordinary function call.
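The implicit allocation above can be sketched as pure address arithmetic. All the layout parameters here are invented round numbers, not actual Mill values:

```python
# Invented layout parameters -- not actual Mill sizes or addresses.
SPILLET_BASE = 0x010000000000  # start of the spillet matrix in the address space
SPILLET_SIZE = 4096            # bytes of address space per (turf, thread) pair
MAX_THREADS  = 1 << 20         # one matrix column per thread id

def spillet_address(turf_id, thread_id):
    """A spillet's slot is a pure function of the two ids: allocating a
    turf id and a thread id implicitly names it, so no allocator runs,
    and (with backless cache) no memory is touched until eviction."""
    return SPILLET_BASE + (turf_id * MAX_THREADS + thread_id) * SPILLET_SIZE
```

The point of the sketch is that the matrix consumes only address space; rows and columns that are never evicted from cache never consume DRAM.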
> They don't provide nearly enough ways to transitively grant permissions. Using the mechanisms discussed in the talk, it doesn't seem like you can implement a simple asynchronous queue of units of work to perform, each having their own permissions.
There is a "session" notion that addresses such things. Unfortunately the talks are far enough into the details that they must contain background and introduction slides for the viewers who have not already seen (and retained) all the other talks. This limits the amount of new material that can be covered in a single talk, and sessions didn't make the cut this time. We'll get to them.
> The mechanism to support fork() is a total kludge.
Agreed; there seems to be a Law of Conservation of Kludgery. We had as a minimum requirement that the architecture must support Unix shell. The only real problem is fork(). Would that we could issue an edict banning it.
igodard | 8 years ago | on: Mill CPU Inter-Process Communication
The guys at Cambridge have running caps systems that store the extra info in outboard data structures. We judge that the overhead is too great for commercial success. Customers buy benchmarks, and there are no security benchmarks.
'Tis true 'tis, 'tis pity. 'Tis pity 'tis, 'tis true.
igodard | 8 years ago | on: The Mill CPU Architecture: Switches [video]
Prefetch chaining is to get code out of DRAM, and it runs DRAM-latency ahead of execution. Separately, fetch chaining is to get the code up to the L1, and it runs L2/L3-latency ahead of execution. Load chaining gets the lines from the L1 to the L0 micro-caches and the decoders, and runs decode-latency (3 cycles on the Mill) ahead of execution.
The Mill stages instructions like this because the further ahead in time an action is the more likely that the action is down a path that execution will not take. We don't prefetch to the L1 because we don't have enough confidence that we will need the code to be willing to spam a scarce resource like the L1. But avoiding a full DRAM hit is important too, so we stage the fetching. It doesn't matter at all in small codes that spend all their time in a five-line loop, but that's not the only kind of codes there are :-)
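The staging rule above boils down to: each chain must run far enough ahead of execution to hide the miss latency of the level it feeds. In this sketch, only the 3-cycle decode figure comes from the text; the DRAM and L2/L3 latencies are invented round numbers:

```python
# Illustrative latencies in cycles. Only the 3-cycle decode figure is
# from the text; the other numbers are invented for illustration.
LATENCY = {"DRAM": 200, "L2/L3": 20, "decode": 3}

# Which level's miss each chain exists to hide.
FEEDS = {"prefetch": "DRAM", "fetch": "L2/L3", "load": "decode"}

def lead_cycles(chain):
    """How far ahead of execution a chain must run to hide its level's miss."""
    return LATENCY[FEEDS[chain]]
```

The ordering is the point: the further ahead a chain runs, the cheaper the storage it targets must be, because the prediction it acts on is less likely to be right.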
igodard | 8 years ago | on: The Mill CPU Architecture: Switches [video]
Table reload is part of our support for the micro-process coding style. It is essentially irrelevant for typical long running benchmarks, especially those that discard the first billion instructions or so to "warm up the caches".
Table reload provides faster response for cold code (either new processes or running processes at a program phase boundary) than simply letting the predictor accumulate history experience. There are heuristics that decide whether execution has entered a new phase and should table-load; the table is not reloaded on every miss. Like any heuristics, these may be better or worse for a given code.
The historical prediction information is in the load module file and is mapped into DRAM at process load time, just like the code and static data sections. Table-load is memory-to-predictor in hardware and is no more difficult than any of the other memory-to-cache-like-structure loading that all cores use, such as loading translation-table entries to a TLB.
While a newly-compiled load module file gets a prediction table from the compiler, purely as being better than nothing, the memory image from the file is continually updated during execution based on execution experience. When the process terminates, this newly-augmented history is retained in the file, so a subsequent run of the same load module is in effect self-profiling to take advantage of actual execution history. Of course, programs behave differently from run to run and the saved profile experience may be inapt for the next run; there are heuristics that try to cope with that too, although we have insufficient experience as yet to know how well those work. However, we are reasonably confident that even inapt history will be better than the random predictions made by a conventional predictor on cold code.
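The run-to-run flow can be caricatured with a majority predictor whose per-site counts persist between runs. The predictor and the shape of the saved history here are invented stand-ins for the real load-module prediction table:

```python
import collections

def run(trace, history=None):
    """Execute a (site, taken) branch trace with a toy majority predictor.
    Returns the hit count and the updated history -- the 'prediction
    table' that would be written back to the load module at exit."""
    counts = collections.defaultdict(lambda: [0, 0])
    for site, pair in (history or {}).items():
        counts[site] = list(pair)          # seed from the previous run
    hits = 0
    for site, taken in trace:
        predicted = counts[site][1] > counts[site][0]  # cold sites: not-taken
        hits += predicted == taken
        counts[site][taken] += 1           # learn from actual behavior
    return hits, dict(counts)

# Nine taken loop iterations, then the exit.
trace = [("loop", True)] * 9 + [("loop", False)]
cold_hits, history = run(trace)       # first run: no history, misses cold
warm_hits, _ = run(trace, history)    # second run: seeded, predicts from cycle one
```

The warm run out-predicts the cold one on the very first iteration, which is the self-profiling effect described above: even crude persisted history beats a cold predictor's guesses.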
As always in the Mill, we welcome in the Forum (millcomputing.com/forum) posts of the form "I don't understand how <feature> works - don't you have trouble with <problem>?". Unfortunately, time and audience constraints don't let us go as deep into the details in our talks as we'd like, but the details are available for you. If, after you have understood what a feature is for and how it works, you still see a problem that we have overlooked (as has happened a lot over the years; part of the reason it's been years) then we'd really welcome your ideas about what to do about it, too.
igodard | 9 years ago | on: Mill Computing in 2017
Security in architecture is the history of a race to the bottom, driven by newbie customers not knowing there was such a thing and the economics of chip-making. We may hope that there are fewer newbies now. That leaves economics. To a large extent the Mill has been an effort to make old ideas economically viable to today's customers.