
The End of Moore’s Law and Faster General Purpose Computing, and a Road Forward [pdf]

80 points | banjo_milkman | 6 years ago | p4.org

70 comments

[+] AtlasBarfed|6 years ago|reply
We've built up layers and layers and layers of inefficiencies in the entire OS and software stack since the gigahertz wars took us from 66 MHz to multiple GHz in the 90s.

The software industry is awful at conserving code and approaches through the every-five-years total redo of programming languages and frameworks. Or less for Javascript.

That churn also means optimization from hardware --> program execution doesn't happen. Instead we plow through layers upon layers of both conceptual abstraction layers and actual software execution barriers.

Also, why the hell aren't standard libraries more ... standardized? I get that lots of languages differ in mechanics and syntax... But a standardized library set could be optimized behind the interface repeatedly, optimized at the hardware/software level, etc.

Why haven't ruby, python, javascript, c#, java, rust, C++, etc etc etc evolved toward an efficient common underpinning and design? Linux, windows, android, and iOS need to converge on this too. It would mean less wasted space in memory, less wasted OS complexity, less wasted app complexity and size. I guess ARM/Intel/AMD would also need to get in the game to optimize down to the chip level.

Maybe that's what he means with "DSLs", but to me "DSLs" are an order of magnitude more complex in infrastructure and coordination if we are talking about dedicated hardware for dedicated processing tasks while still having general task ability. DSLs just seem to constrain too much freedom.

[+] wayoutthere|6 years ago|reply
Correct me if I'm wrong, but isn't this exactly the problem LLVM was designed to tackle?
[+] omarhaneef|6 years ago|reply
For those who have not looked yet: this is a John Hennessy presentation. It argues -- in a lot of detail -- that Moore's law has closed out, that energy efficiency is the next key metric, and that specialized hardware (like the TPU) might be the future.

When I buy a machine, I am now perfectly happy buying an old CPU, and I think this shows you why. You can buy something from as far back as 2012, and you're okay.

However, I do look for fast memory. SSDs at least, and I wish he had added a slide about memory speed. Am I at the inflection point?

Perhaps the future is: you buy an old laptop with specs like today and then you buy one additional piece of hardware (TPU, ASIC, Graphics for gaming etc).

[+] woliveirajr|6 years ago|reply
(for the average user) CPU speed isn't relevant anymore, nor even the number of cores. Internet speed is the biggest factor when watching a movie, opening a web page, or using some cloud-based app. Then memory speed and GPU speed (even in cell phones) come second: how fast your device can grab that data and process it.

In niches, of course, CPU speed matters. I want to compile something in half the time. I want to train my AI model faster. But what really bugs me is when I have to wait for a refresh on some site; this really makes me lose focus (and then I come to HN to see some news and time goes by).

[+] gambler|6 years ago|reply
>energy efficiency is the next key metric, and that specialized hardware (like TPU) might be the future.

This is nonsense pushed forward by large corporations who want to own all your data and computational capacity.

[+] seph-reed|6 years ago|reply
> "that energy efficiency is the next key metric"

What limits cpu speed? => https://electronics.stackexchange.com/questions/122050/what-...

More efficiency means more overclocking. Also, better cooling/dissipation methods and less gate delay.

We may be hitting the limits of wire thinness, but I get a strong feeling (ie not an expert) we've got a decent ways to go before we hit the limits of clock speed.
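
To put rough numbers on the efficiency/clock link: CMOS dynamic power scales roughly as P ≈ C·V²·f, and higher clocks usually need higher voltage. A back-of-envelope sketch (all values illustrative, not measured from any real chip):

```python
# Back-of-envelope CMOS switching power: P = C * V^2 * f
# (activity factor folded into C; all numbers below are illustrative)

def dynamic_power(c_eff_farads, v_volts, f_hz):
    """Switching power in watts for effective capacitance C, supply V, clock f."""
    return c_eff_farads * v_volts ** 2 * f_hz

base = dynamic_power(1e-9, 1.2, 4e9)        # ~1 nF effective, 1.2 V, 4 GHz
# Raising only the clock scales power linearly (4 -> 5 GHz = +25%)...
fast = dynamic_power(1e-9, 1.2, 5e9)
# ...but the higher bin usually needs more voltage, so real power grows
# super-linearly -- which is why efficiency gains buy overclocking headroom.
fast_real = dynamic_power(1e-9, 1.32, 5e9)  # +10% V for the 5 GHz bin

assert abs(fast / base - 1.25) < 1e-9
assert fast_real > fast
```

So shaving V (or C, via smaller/more efficient gates) is what frees up thermal budget for higher clocks.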

[+] robmaister|6 years ago|reply
My i7-2600k @ 4.4GHz agrees with you. The only reason I would upgrade would be for better USB 3+ support (I have a hard time with anything more complex than a SuperSpeed flash drive).

SSDs are by far the most cost-effective upgrade you can get nowadays; HDDs tend to be the bottleneck for boot times and general "snappiness".

[+] joshlegs|6 years ago|reply
I seem to recall something about some carbon technology recently that is supposedly the next big revolution in semiconductors, and might revive Moore's Law? I want to say it was about how carbon nanotubes can be used as an excellent semiconductor, and because their width is at the atomic scale, they can result in even smaller transistors.

or something to that effect. Not sure where I read it tho. Maybe on here

[+] romaniv|6 years ago|reply
>Perhaps the future is: you buy an old laptop with specs like today and then you buy one additional piece of hardware (TPU, ASIC, Graphics for gaming etc).

If Google has its way, the future will be: you buy a Chromebook and do all the real work on Google Cloud. Note that John Hennessy is the chairman of Alphabet. And in case someone here forgot, TPUs were developed by Google, and so was TensorFlow.

[+] banjo_milkman|6 years ago|reply
This ties in nicely with chiplets: https://semiengineering.com/the-chiplet-race-begins/ - a way to integrate dies in a package, where the dies can use specialized processes for different functions - e.g. analog or digital or memory or accelerators or CPUs or networking etc. This would make it easier to iterate memory/CPU/GPU/FPGA/accelerator designs at different rates, and reduce development costs (don't need to support/have IP for every function, just an accelerated set of operations on an optimized process within each chiplet). But it will need progress on inter-chiplet PHY/interface standardization.
[+] deepnotderp|6 years ago|reply
So yes, if you compare matrix multiply in Python vs SIMD instructions, you will find a big improvement. Much harder to do that for more general purpose workloads.

And it doesn't scale: https://spectrum.ieee.org/nanoclast/semiconductors/processor...

And in many cases, if you normalize all the metrics (precision, process node, etc.), you'll find that the advantage of ASICs is greatly exaggerated and is often within ~2-4x of the more general-purpose processor. E.g. the small GEMM cores in the Volta GPU actually beat the TPUv2 on a per-chip basis. Anton 2, normalized for process, is within ~5x of manycore MIMD processors in energy efficiency.

In other cases, e.g. the marquee example of bitcoin ASICs, that only works because of extremely low memory and memory bandwidth requirements.

[+] prvc|6 years ago|reply
A possibly stupid question from a neophyte: what was the driving force behind Moore's law when it was in operation? Did it become a self-fulfilling prophecy by becoming a performance goal after becoming enshrined in folklore, or is there an underlying physical reason?
[+] dredmorbius|6 years ago|reply
Moore's law is part of a general set of principles regarding learning curves, see generally:

https://en.wikipedia.org/wiki/Experience_curve_effects

During WWII, it was observed that each doubling of output reduced labour costs (or increased labour efficiency) by 20%.

Moore's law is about the density of transistors (the count doubles for a given cost every two years). Increased density => increased computing power, efficiency, and speed.

Chip design depends on numerous factors: feature size (e.g., 14nm vs. 9nm photolithography), silicon purity, fab cleanliness (much as with cascade refrigeration, chip fabs now have multiple concentric zones of increasing cleanliness), and the power and capacity of the software used in chip modelling itself.

The law is also not entirely exogenous as it relies on market forces and demand: need for increased computing power tends to proceed at a predictable rate, and the ability to make use of more capacity is also constrained by existing practices, software, programmer skill, etc.

Then there are the other non-CPU bottlenecks. Disk and memory have long been the chief ones; increasingly it's networking. The tendency of old technology and layers not to die, but to be buried under ever deeper levels of encapsulation, means that efficiencies which might be gained aren't, lost to multiple transitions and translations -- the reason a 1980 CP/M or Apple II system had faster keyboard response than today's wireless Bluetooth keyboard talking to a rendered graphical display. Bufferbloat, in the network stack, is another example.

But: the main driver for Moore's law is increased density leading to increased efficiency (the same centralising tendency present in virtually all networks), bound and limited by the ability to get power in and heat out (Amdahl's observation that all problems ultimately break down to plumbing).
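
Both rules of thumb above are just compounding; a small sketch (the 4004 transistor count is from memory, the rest illustrative):

```python
# Two compounding rules from the comment above, as arithmetic.

def moore_density(years, doubling_period=2.0):
    """Relative transistor density after `years`, doubling every 2 years."""
    return 2.0 ** (years / doubling_period)

def experience_cost(doublings, learning=0.20):
    """Relative unit cost after N doublings of cumulative output
    (the WWII observation: each doubling cuts labour cost ~20%)."""
    return (1.0 - learning) ** doublings

# 1971 (Intel 4004, ~2,300 transistors) to 2011 is 40 years:
assert moore_density(40) == 2.0 ** 20                  # about a million-fold
# Ten doublings of cumulative output at a 20% learning rate:
assert abs(experience_cost(10) - 0.8 ** 10) < 1e-12    # ~11% of original cost
```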

[+] wmf|6 years ago|reply
Dennard scaling explains the physics of why smaller transistors are faster and more efficient, but overall Moore's Law was closer to a self-fulfilling prophecy. There's no intrinsic reason why each generation targeted 2x density, and 18/24-month cycles were probably convenient from a business perspective but not essential.
[+] aiCeivi9|6 years ago|reply
https://en.wikipedia.org/wiki/Transistor_count

The transistor can only get so small before it stops working. There are many issues with the required extreme ultraviolet light sources (lasers) and the allowed amount of impurities in the silicon wafer. And the R&D cost for each iteration of lithography keeps getting higher while bringing less benefit.

[+] sifar|6 years ago|reply
Slide 36 compares the TPU with a CPU/GPU. This is an apples-to-oranges comparison: one uses an 8-bit integer multiply while the other uses a 32-bit floating-point multiply, which inherently uses at least 4x more energy [1]. If you scale the TPU's numbers by 4, it is no longer an order of magnitude better. The proper comparison would be between the TPU and an equivalent DSP doing 8-bit computations; that would show whether eliminating the energy consumed by register-file accesses is significant. I suspect most of the energy saving comes from having a huge on-chip memory.

[1] From slide 21

Function               Energy (pJ)
8-bit add              0.03
32-bit add             0.1
16-bit FP multiply     1.1
32-bit FP multiply     3.7
Register file access   6
L1 cache access        10
L2 cache access        20
L3 cache access        100
Off-chip DRAM access   1,300-2,600
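
A quick check of the argument directly from those slide-21 numbers (the dict just transcribes the table):

```python
# Per-operation energies from slide 21, in picojoules.
ENERGY_PJ = {
    "add_8b": 0.03, "add_32b": 0.1,
    "fp_mul_16b": 1.1, "fp_mul_32b": 3.7,
    "register_file": 6.0,
    "l1": 10.0, "l2": 20.0, "l3": 100.0,
    "dram": 1300.0,       # lower bound of the 1,300-2,600 range
}

# A register-file access costs more than the FP32 multiply it feeds,
# so cutting RF traffic is a plausible win for a systolic design:
assert ENERGY_PJ["register_file"] > ENERGY_PJ["fp_mul_32b"]

# And going off-chip dwarfs everything on-chip -- the likely source of
# the saving from the TPU's large on-chip buffers:
assert ENERGY_PJ["dram"] / ENERGY_PJ["l2"] == 65.0
print(f"DRAM is ~{ENERGY_PJ['dram'] / ENERGY_PJ['fp_mul_32b']:.0f}x an FP32 multiply")
```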

[+] SemiTom|6 years ago|reply
Big chipmakers are turning to architectural improvements such as chiplets, faster throughput both on-chip and off-chip, and concentrating more work per operation or cycle, in order to ramp up processing speeds and efficiency https://semiengineering.com/chiplets-faster-interconnects-an...

Scaling certainly isn’t dead. There will still be chips developed at 5nm and 3nm, primarily because you need to put more and different types of processors/accelerators and memories on a die. But this isn’t just about scaling of logic and memory for power, performance and area reasons, as defined by Moore’s Law.

The big problem now is that some of the new AI/ML chips are larger than reticle size, which means you have to stitch multiple die together. Shrinking allows you to put all of this on a single die. These are basically massively parallel architectures on a chip. Scaling provides the means to make this happen, but by itself it is a small part of the total power/performance improvement. At 3nm, you’d be lucky to get 20% P/P improvements, and even that will require new materials like cobalt and a new transistor structure like gate-all-around FETs.

A lot of these new chips are promising orders-of-magnitude improvements -- 100 to 1,000x -- and you can’t achieve that with scaling alone. That requires other chips, like HBM memory, with a high-speed interconnect like an interposer or a bridge, as well as more efficient/sparser algorithms. So scaling is still important, but not for the same reasons it used to be.

[+] DSingularity|6 years ago|reply
It is not that I disagree with Hennessy, but I think it is premature to conclude that general-purpose processors have reached the end of the road. There is a healthy middle in between specialized and general-purpose design. Exploiting that middle is what I think will deliver the next generation of growth. That is exactly what naturally occurred with SoC and mobile design.

The raw computational capabilities of the TPU don't really prove anything. Of course co-design wins. Whether it is vision or NLP, NN training has dominant characteristics. The arithmetic is known: GEMM. The control is known: SGD. Tailoring control and the memory hierarchy to this is a no-brainer, and of course the economic incentives at Google push them in this direction, and of course the expertise available at Google powered this success. For other applications it is not so clear.

Finding similar dominance in other applications is trickier. To accelerate an application with a specialized architecture you need dominating characteristics in the app's memory-access, computational, and control profiles.

[+] yogthos|6 years ago|reply
It's odd that the presentation doesn't discuss alternatives to using silicon. Ultimately, this is akin to saying that there are limits on how small a vacuum tube we can make. We already know of a number of other potential computing platforms such as graphene, photonics, memristors, and so on. These things have already been discovered, and they have been shown to work in the lab. It's really just a matter of putting the effort into producing these technologies at scale.

Another interesting aspect of moving to a more efficient substrate would be that power requirements for the devices will also lower as per Koomey's law https://en.wikipedia.org/wiki/Koomey%27s_law

[+] brennanpeterson|6 years ago|reply
Well... no. What it says is that there are limits on how small a wire we can make, and how thin a layer can be while the material remains functional (about 5nm).

Wires can't get smaller without compromising RC (and thus speed). Quite horrifically, this is far more of an issue than the transistor.

Graphene and photonics don't help this. At all. It isn't a matter of how small a tube. You physically need 5nm to insulate, and 5nm for a functional material. So a 5nm device with a 5nm spacer and a 5nm space to the next device is about it. The smallest pitch of any physical device is 20nm. The critical pitches in wafer are about 30nm and 40nm, so in an ideal world, we can reach 3x, ever. It doesn't matter which material you choose.

And yeah, you can stack up, but not quite in the way you dream, and thermal and processing issues make this hard in most domains. When I build, I deposit at elevated temperatures, which affects underlying layers. So stacking doesn't quite work as you might expect. Again, real materials in a real flow behave differently, and not in a trivial 'just make it work' reducible fashion.

Memristors may not really exist as such, and are mainly useful in the context of high-speed memory. That has real physical challenges, and people have spent billions over decades on this problem.

Anyway, this is missing some background, but the presentation is great.

[+] rrss|6 years ago|reply
> It's odd that the presentation doesn't discuss alternatives to using silicon.

You must have missed slide 41, which has a "beyond silicon" bullet.

[+] toasterlovin|6 years ago|reply
That we are exploring other computing substrates does not mean that those substrates will be economical or practical to use. They either will be or they won't (a determination which is ultimately dependent on the laws of physics). Our exploration of other substrates is a necessary but insufficient precondition to actually putting other computing substrates into production.
[+] dragontamer|6 years ago|reply
"WASTED WORK ON THE INTEL CORE I7", slide#12 (page 13 in pdf) is fascinating to me. But I want to know how the data was collected, and what the % wasted work actually means.

40% wasted work, does that mean that they checked the branch-predictor and found that 40% of the time was spent on (wrongfully) speculated branches?

It also suggests that, for all the power-efficiency faults of branch predictors (i.e. running power-consuming computations that turn out to be unnecessary), the best you could do is maybe a 40% reduction in power consumption (no workload on the slide seems to be more than ~40% inefficient).
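
A back-of-envelope of what a ~40% figure could mean, under my own assumption (not stated on the slide) that "wasted work" counts instructions fetched and executed down a mispredicted path and then squashed:

```python
# Hypothetical model of speculative waste -- not how the slide's data
# was actually collected.

def wasted_work_fraction(miss_per_kilo_instr, flushed_per_miss):
    """Fraction of issued work that gets squashed.

    miss_per_kilo_instr: branch mispredictions per 1000 retired instructions
    flushed_per_miss: speculative instructions discarded per misprediction
                      (roughly the in-flight window when the branch resolves)
    """
    retired = 1000.0
    squashed = miss_per_kilo_instr * flushed_per_miss
    return squashed / (retired + squashed)

# e.g. 10 MPKI with ~70 in-flight instructions squashed per flush:
frac = wasted_work_fraction(10, 70)
print(f"~{frac:.0%} of issued work squashed")   # lands near the slide's 40%
```

On a real machine you could estimate the inputs with `perf stat -e branches,branch-misses,instructions`.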

[+] vardump|6 years ago|reply
> ... INTEL CORE I7

When someone says Intel i5 or i7, I immediately wonder if they're talking about 2008 i7 or 2019 model.

Intel would be smart to retire the whole i3/i5/i7/i9 branding. People seem to think every i5 or i7 is the same.

[+] roenxi|6 years ago|reply
Still, it's too early to call the end of the march of microprocessors.

https://www.scienceabc.com/humans/the-human-brain-vs-superco...

The limits they are running up against are indeed crises, but they'll probably find that they can copy whatever it is that biology is doing and squeeze out quite a bit more. The tradeoffs will get a lot weirder though.

[+] rrss|6 years ago|reply
Humans are not good at general purpose computation. Your linked article states the brain achieves 1 exaflops, and cites http://people.uwplatt.edu/~yangq/csse411/csse411-materials/s... for this number. That document states the value with no citation or rationale.

I can do far less than 0.0001 single precision floating point operations per second, so whatever the context for "1 exaflops" is, it isn't general purpose computation.

EDIT: this seems sort of like saying that throwing a brick through a window achieves many exaflops because simulating the physics in real time would require that performance. I'd like to read more about this value and how someone came up with it, but googling just gives me that same scienceabc article and stuff referencing it.

[+] mikewarot|6 years ago|reply
I'm amazed that it's less than a picojoule to do an 8 bit add.
[+] scottlocklin|6 years ago|reply
The Landauer limit (kT ln 2, about 3x10^-21 J per bit erased at room temperature) is roughly ten million times smaller than this, so there's plenty of room for power savings before we hit any fundamental physical limits.
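
For scale, comparing the slide-21 8-bit add against kT·ln 2 at room temperature (a loose comparison, since an 8-bit add erases more than one bit):

```python
import math

K_B = 1.380649e-23                 # Boltzmann constant, J/K
T = 300.0                          # room temperature, K

landauer_j = K_B * T * math.log(2)   # ~2.87e-21 J minimum per bit erased
add8_j = 0.03e-12                    # 8-bit add from slide 21: 0.03 pJ

ratio = add8_j / landauer_j
print(f"0.03 pJ is ~{ratio:.1e}x the Landauer limit")   # ~1e7
assert 1e6 < ratio < 1e8
```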
[+] singularity2001|6 years ago|reply
So what's the name of the metric flops/sec/USD? Because that one keeps growing exponentially thanks to GPUs/TPUs -- a paradigm shift predicted by Ray Kurzweil.
[+] yalogin|6 years ago|reply
Is there a video of this talk available somewhere?

Also can someone tell me what p4 is? Looks like almost every company and a bunch of universities are "contributors" there.

[+] musicale|6 years ago|reply
P4 is a domain-specific language for specifying packet-forwarding pipelines, i.e. the hardware that takes packets in one port, decodes their headers (e.g. destination MAC or IP address), munges them somehow (e.g. updating the TTL, destination MAC, and checksum), and sends them out another port. This lets you build all sorts of network devices, from Ethernet switches to IP routers to RDMA fabrics. You can compile P4 onto a CPU, a smart NIC, an NPU, a programmable ASIC, an FPGA, etc. It can also be used a bit like eBPF and compiled into a pipeline in the Linux kernel.

Basically P4 allows you to (re)program your network data plane to do whatever you want, and you can create new network protocols or change the way existing ones work without having to change your hardware and without losing line rate performance.

It's also somewhat like eBPF, but it compiles to hardware as well as software.

[+] musicale|6 years ago|reply
One interesting example of a switch company using P4 is Arista, which has rolled out multi-function programmable switches (the 7170 series) that can be repurposed/reprogrammed with different personality profiles/operational modes as needed. Some of the profiles are things like stateful firewalls/ACLs (up to 100k), large-scale NAT (again 100k), large-scale tunnel termination (up to 192k), packet inspection/telemetry (first 128 bytes), and segment routing (basically source routing over network segments). And it is also user-programmable.
[+] almost_usual|6 years ago|reply
One of the more interesting things I’ve read on HN in awhile. Seems like this will result in a large paradigm shift for the computing industry.
[+] SkyPuncher|6 years ago|reply
I think we've already seen the shift with cellphones.

I think consumer facing performance processors will fade.

Data centers will continue to push for more performance. It could mean less rack space, less power consumption, and less to manage.

Cell phone/tablet focused processors will become powerful enough to handle the majority of daily tasks while enjoying extended battery life.

[+] Accujack|6 years ago|reply
There's an internet meme about "Imminent death of Moore's law predicted".

All Moore's law talks about is the density of transistors on a chip, and it's never been a neat linear progression. Recently I've seen news articles about research into 5nm processes and other methods for increasing the density of components on silicon, so it seems Moore's law (really Moore's rule of thumb, or Moore's casual observation) isn't done yet.