The main issue AMD seems to be solving here is yields at 7nm and smaller process nodes.
Smaller chips mean higher yields in the presence of defects. Intel builds relatively large ~600mm^2 chips (like the XCC die behind the 28-core Xeon), but AMD thinks the future is to build networks of ~200mm^2 chips, like what they've done with Zen / Threadripper / EPYC.
The advantage for AMD is that they've built a single design: the Zeppelin die. Ryzen is simply one Zeppelin. Threadripper is two Zeppelins. And EPYC is four Zeppelins.
That's it. One singular chip design, mass produced over and over again, to handle AMD's entire consumer and high-end line. Keep this one design small to help yields and maybe AMD can get a process advantage over Intel's larger designs.
AMD's "mobile" or "APU" line is Raven Ridge (a 2nd design at 193mm^2) that doesn't use this.
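The yield argument can be put in rough numbers with the classic Poisson defect-yield model. The defect density below is purely illustrative, not a real foundry figure:

```python
import math

def die_yield(area_mm2, defects_per_mm2=0.001):
    """Poisson yield model: fraction of dice with zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)

# Illustrative defect density of 0.1 defects/cm^2 (= 0.001/mm^2)
big = die_yield(600)    # one large ~600mm^2 monolithic die
small = die_yield(200)  # one ~200mm^2 chiplet

print(f"600mm^2 die yield: {big:.1%}")   # 54.9%
print(f"200mm^2 die yield: {small:.1%}") # 81.9%
```

At the same defect density, the small die yields dramatically better, and four good small dies can be binned independently rather than one defect killing (or down-binning) a whole 600mm^2 part.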
----------------
The above is the current status quo. This "active interposer" that AMD is developing in the article would go above and beyond in terms of integration.
Note that HBM2 (next-generation high-bandwidth RAM) requires the interposer. PCB is not good enough for HBM's protocol. Ditto with Hybrid-Memory Cube (a competing standard). So it seems like the future of computer parts will be the interposer.
The interposer isn't necessary for AMD's CPU strategy, however, so this network won't show up on the CPU roadmap until 2020 or later. Unless AMD is building this network for their GPU line? (But that roadmap is also past 2020.) I bet this is all research and development, and may never ship as a commercial product.
> The advantage for AMD is that they've built a single design: the Zeppelin die. Ryzen is simply one Zeppelin. Threadripper is two Zeppelins. And EPYC is four Zeppelins.
While a popular meme, this is not actually true. Epyc is actually a totally different die, stepping B2 vs the B1 die used in Ryzen+TR.

https://en.wikichip.org/wiki/amd/ryzen_7/1800x

https://en.wikichip.org/wiki/amd/epyc/7601
The 2700X is actually on a different die as well, and Raven Ridge on another. There will probably be another die for Banded Kestrel, if AMD ever gets around to releasing that. Presumably, their embedded SOC products are their own die as well.
So, about 5 dies per generation, across AMD's lineup (Epyc, Ryzen, APU, Atom, and embedded SOC). They're using about half as many dies as Intel is - still a significant difference, but far from the "one die for the whole lineup!" meme.
The big difference is that they're serving the whole server market with one small die, vs the three that Intel uses.
Of course, a small die isn't all roses either - both mfrs limit you to 8 dies per system, so right now with an 8-core die AMD systems are limited to 64-core systems (dual-socket Epyc) versus the 224-core systems that you can do with octo-socket 28-core Xeons. But, not everybody needs a million-dollar octo-socket system either.
Once AMD takes a node advantage that advantage will be diminished somewhat, but Intel's 10nm woes are a whole different story ;)
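For reference, the die and socket arithmetic behind those core counts:

```python
# Max cores per system = sockets * dies_per_socket * cores_per_die
amd_epyc  = 2 * 4 * 8    # dual-socket Epyc: four 8-core Zeppelin dies per socket
intel_xcc = 8 * 1 * 28   # octo-socket Xeon: one monolithic 28-core XCC die each

print(amd_epyc)   # 64
print(intel_xcc)  # 224
```

Both systems top out at 8 dies total (2x4 vs 8x1), which is the "8 dies per system" limit mentioned above; Intel simply packs far more cores into each die.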
Writing code to run in parallel across a large number of small "chips" is probably the future of programming. In this sense, new programming languages like Julia might prove useful?
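One way to picture that programming model: explicit scatter/gather over many small independent workers instead of one big shared memory. A sketch using Python's multiprocessing as a stand-in (the `node_work` kernel is made up for illustration):

```python
from multiprocessing import Pool

def node_work(chunk):
    """Each 'chiplet' works only on its own local slice of the data."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1000))
    # Scatter: one interleaved slice per node, no shared state between them
    chunks = [data[i::8] for i in range(8)]
    with Pool(8) as pool:
        partials = pool.map(node_work, chunks)
    # Gather: reduce the per-node partial results
    print(sum(partials))  # 332833500
```

Languages that make this data-partitioning explicit (Julia, or message-passing frameworks generally) map more naturally onto networks of small chips than shared-memory threading does.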
In the beginning, a sea of discrete components made up a system. Investment in fab technology shrank process nodes every 18 months, so these discrete components gave way to the System-on-Chip, where a board of chips was replaced by a single chip.
Now physics is harder to overcome, the cost of development at the bleeding edge of technology is higher than ever, and the continued desire for larger and larger systems caused the SoC to break apart again. It’ll be interesting to see if this is what the future looks like for silicon-based chips, or if this is a temporary shortcut.
I'm not sure the SoC is actually breaking apart. I think chipbuilders are figuring out that it's more efficient to combine some dies together at the package level.
Consider that EPYC is basically a miniaturized multi-socket design. Infinity Fabric is really AMD's new protocol built on top of HyperTransport (their multi-socket protocol from the past). AMD used to support 8 sockets; today, AMD stitches 4 chips together in a package and supports only 2 sockets.
From a software perspective (i.e. NUMA), a dual-socket EPYC system looks like an 8-socket machine of old. In effect, AMD has miniaturized the 4-socket setup in the form of EPYC, and the 2-socket setup in the form of Threadripper.
----------------------
These Threadripper / EPYC chips have the same downsides as the old 2x, 4x, and 8x NUMA designs of the past: high latency and poor communication between cores.
The thing is: the modern environment is a highly virtualized, highly independent set of systems. Running 8x NUMA efficiently today is as simple as spinning up 8x VMs, one for each NUMA node.
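The same trick works per process on bare metal: bind each worker's CPUs and memory to one node so its traffic stays on-die. A sketch of building those numactl invocations (the `./worker` binary is a placeholder):

```python
NODES = 8  # e.g. dual-socket EPYC: one NUMA node per Zeppelin die

def pinned_cmd(node, argv):
    """Build a numactl invocation binding both CPUs and memory
    to a single NUMA node, so the worker never crosses the fabric."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *argv]

# Print rather than run, since ./worker is hypothetical
for n in range(NODES):
    print(" ".join(pinned_cmd(n, ["./worker", f"--id={n}"])))
```

Each worker then sees only local-latency memory, which is exactly what the one-VM-per-node setup achieves at the hypervisor level.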
IIRC, people are finding that Intel's 28-core design is far more effective in, say, unified database performance. Intel's design has a true L3 cache shared by all 28 cores, while AMD's L3 is split between dies (in 8MB slices per CCX). Separate 8MB caches cannot function as one large cache in a large-scale database application.
But there are enough situations (e.g. VMs, multitasking, render farms) where AMD's NUMA + Infinity Fabric is good enough. And at anywhere from a quarter to half the price of Intel, AMD's chips these days are certainly worth considering.
What is old is new again. We'll probably go in this direction for a decade or three, breaking things up. Then we'll go the other way again, integrating everything into one chip again.
After having programmed for some unusual architectures (CM2, others) I have to say that the GreenArrays chip (http://www.greenarraychips.com/) looked to be... impressively difficult to program for.
Then I played the game "TIS-100" and found out that my intuition was very likely correct.
It certainly does seem like chip design is marching grudgingly from Core i7 to Cell BE and eventually to Connection Machine. Physics doesn't really care about ease of programming.

https://en.m.wikipedia.org/wiki/Cell_(microprocessor)

https://en.m.wikipedia.org/wiki/Connection_Machine
I think it would be interesting to see someone like Mellanox make a chiplet with their tech that could be fully integrated into an AMD SoC or APU or whatever they're calling them now.
Sounds like it's just the next step in the chain from discrete components -> ICs on a circuit board -> this. The active interposer is filling the role that the circuit board currently fills, with devices etched into the interposer filling the role of discrete components, making everything more compact.
Any hardware gurus out there care to talk about how this helps? I guess having a flat pool of heterogeneous resources is nice. As long as there's a decent SDK that abstracts the hard stuff away, I'm all for it.
It won't be user-facing, so there won't be an SDK. It's a way of building chips better, like AMD's Infinity Fabric. You could integrate GPUs, multiple CPU dies (like Epyc), and DRAM in a single package and tie them all together with interposers, which would look to the user like a CPU with an integrated GPU and a big L4 cache.
Using multiple small dies and tying them together has several advantages. Small dies yield better, so sometimes several small dies are cheaper than one large one. There's also versatility because you can mix and match components.
As I see it, this is just a network on interposer instead of a network on chip (NoC). NoCs already have simple routing rules that also prevent deadlocks, so I'm not sure the idea here is that significant. Active interposers are a fairly new idea, though, and I haven't followed them closely. Maybe the journalist found the routing rules more exciting than the active-interposer idea itself.
Either way, the research is one of the many small steps forward to better chips.
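The "simple rules" NoCs use are typically dimension-order routing. A minimal sketch of XY routing on a 2D mesh (illustrative only, not the actual scheme in the article):

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh: move along X first,
    then along Y. Because a packet never turns from the Y dimension
    back into X, the channel dependency graph is acyclic, which is
    what rules out deadlock."""
    x, y = src
    dx, dy = dst
    hops = []
    while x != dx:
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

print(xy_route((0, 0), (2, 1)))  # [(1, 0), (2, 0), (2, 1)]
```

The open question for an active interposer is presumably whether equally simple, provably deadlock-free rules exist for its less regular topology.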
A more modular package brings a variety of benefits. Small chips are cheaper per unit area, because yields are higher. Reuse in different products can also lower development costs. Different IP can live on the most appropriate technology node: mature & cheap, new & fast, etc. IP can be developed on different schedules. IP from completely different companies can be integrated more easily (see the Intel-AMD partnership and its "chiplet").
It's really not that complicated an idea- modular packages are more flexible! What's new is making it work within a compelling power, price, and performance envelope.
Is the big deal here that what used to be different cards/chips now shares a CPU-style, insanely fast bus, instead of trickling data over PCIe or DRAM channels? If so, the advantage will depend on how much bus saturation gets eliminated. Should be interesting. Everything works on Infinity Fabric!
That’s the advantage over having multiple packaged chip on a traditional PCB. The advantage over an SoC is that you can have different subsystems on decoupled development schedules and different process nodes all come together on the same “chip.”
With HBM2 being used in AMD's (and Nvidia's) high-end products, it seems like the DRAM channels are going to require an expensive interposer.
But "what else" can benefit from an interposer? If your RAM requires it, are there cheaper or more efficient designs that are built out of a network of chiplets on an interposer, as opposed to building out huge chips all the time?
AMD is already forced to build an interposer for Vega64. Might as well research other uses of it.
I wonder if FPGAs could be a node too. That would allow us to mix programmable, highly parallel analog modeled acceleration onto the same high speed / direct connect bus as all the other fixed, traditional computing components. I really like the approach AMD is suggesting here, treating it like nodes on a network.
You've been able to buy FPGAs tightly coupled with AMD CPUs since 2006 or so. The tech back then was to either plug them into an HTX Hypertransport slot, or in a cpu socket. Very few customers actually wanted to buy these things and I think all of the makers lost money.
Hypothetically, yes. But FPGAs are very area-hungry; a big, powerful one isn't going to fit nicely into a modestly sized package alongside lots of other chips.
My guess is modular will still be an option, if not the only option. x86 CPU packages have really been little PCBs for over 20 years. You are now starting to see a move toward SoMs take off in the embedded world, which combine the CPU and RAM on a module to simplify board layout (e.g. https://octavosystems.com/app_notes/osd335x-design-tutorial/). In the past, these would have been separate components that customers laid down themselves on their custom PCB.
That’s obviously not a hardware development, but I feel like the motivation may be similar: make components more modular; stabilize, standardize and align their interfaces.
By making these components “plug and play” the distance between a logical flow chart and the actual implementation is somewhat reduced, making the development of custom components more efficient and agile.
>AMD's "mobile" or "APU" line is Raven Ridge (a 2nd design at 193mm^2) that doesn't use this.
My guess is that future APUs will be chiplet-based designs as well.
Does this mean it would be easier for companies to cooperate, so that one chip gets the best parts from each leader? Or does IP licensing already solve that?
I'd be very skeptical of that. See: Cell Broadband Engine.
I would like to know how they solved that problem. Is there any public paper or patent explaining that?