> most workloads could actually run more efficiently if they had more memory bandwidth and lower latency access to memory
Turns out memory access speed is more or less the entire game for everything except scientific computing or insanely optimized code. In the real world, CPU frequency seems to matter much less than DRAM timings, for example, in everything but extremely well engineered games. It'll be interesting to learn (if we ever do) how much of the "real-world" 25% performance gain is solely due to DDR5.
I remember getting my AMD K8 Opteron around 2003 or 2004 with the first on-die memory controller. Absolutely demolished Intel chips at the time in non-synthetic benchmarks.
> everything except scientific computing or insanely optimized code
for insanely unoptimized code, such as accidentally ending up writing something compute-intensive in pure Python, it's very plausible for it to be compute constrained -- but less because of the hardware and more because 99% or 99.9% of the operations you're asking the CPU to perform are effectively waste.
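A toy sketch (mine, not from the thread) of how much of that waste is interpreter overhead rather than hardware: the same reduction done once as a pure-Python loop and once pushed down into C via the builtin `sum()`. The list size and iteration count are arbitrary choices for illustration.

```python
import timeit

# The arithmetic is identical in both versions; the gap is almost
# entirely per-operation interpreter bookkeeping, not DRAM speed.
data = list(range(1_000_000))

def py_loop():
    total = 0
    for x in data:        # one bytecode dispatch + boxed int per element
        total += x
    return total

t_loop = timeit.timeit(py_loop, number=5)
t_builtin = timeit.timeit(lambda: sum(data), number=5)

assert py_loop() == sum(data)  # same result, wildly different cost
print(f"pure-Python loop: {t_loop:.3f}s  builtin sum: {t_builtin:.3f}s")
```

On a typical CPython build the loop version is several times slower, which is the "waste" the comment is describing: most cycles go to the interpreter, not the actual work.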
The team that designed the original Arm CPU in 1985 came to the conclusion that bandwidth was the most important factor influencing performance - they even approached Intel for a 286 with more memory bandwidth!
In the '90s there were people trying to solve this problem by putting a small CPU on the chip with the memory and running some operations there. I routinely wonder why memory hasn't gotten smarter over time.
I find this claim hard to believe, honestly. Could you point to examples where performance is limited by DRAM speed and not by the CPU / caches? They must be applications with extremely bad design causing super low cache hit rates.
Well, I disagree with pretty much everything in the claims.
First, most real unoptimised code faces many issues before memory bandwidth. During my PhD, the optimisation guys doing spiral.net sat next door and they produced beautiful plots of what limits performance for a bunch of tasks and how each optimisation they do removes an upper-bound line until at last they get to some bandwidth limitation. Real code will likely have false IPC dependencies, memory latency problems due to pointer chasing, or branch mispredictions well before memory bandwidth.
Then the database workload is something I would consider insanely optimized. Most engines are in fierce performance competition, and normally they do hit the memory bandwidth limit in the end. This probably explains why the author is not comparing to EPYC instances that have the memory bandwidth to compete with Graviton.
Then there are the claims about choosing not to implement SMT and choosing DDR5 -- both of those decisions really come from their upstream providers.
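The latency-before-bandwidth point above can be sketched with a toy microbenchmark (my own illustration, with arbitrary sizes): streaming sequentially through an array versus chasing pointers through a random permutation, where each load depends on the previous one. In CPython the interpreter overhead dominates both paths, so treat this as the shape of the experiment rather than a precise measurement.

```python
import random
import time

N = 1_000_000
perm = list(range(N))
random.shuffle(perm)  # random permutation = worst case for the prefetcher

def sequential_sum(xs):
    # Predictable, streaming access: prefetch- and bandwidth-friendly.
    total = 0
    for x in xs:
        total += x
    return total

def pointer_chase(next_idx, steps):
    # Follow the permutation like a linked list: i -> next_idx[i].
    # Each load's address depends on the previous load (latency-bound).
    i = 0
    for _ in range(steps):
        i = next_idx[i]
    return i

t0 = time.perf_counter()
sequential_sum(perm)
t1 = time.perf_counter()
pointer_chase(perm, N)
t2 = time.perf_counter()
print(f"stream: {t1 - t0:.3f}s  chase: {t2 - t1:.3f}s")
```

In a compiled language the gap between these two patterns is dramatic, which is why real code tends to hit pointer-chasing latency long before it saturates DRAM bandwidth.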
Wouldn't SMT be a feature that you are free to use when designing your own cores? I'm assuming Amazon has an architectural license (Annapurna acquisition probably had them, this team is likely the Graviton design team at AWS). So who is the upstream provider? ARM?
And if they designed the CPU wouldn't they decide which memory controller is appropriate? Seems like AWS should get as much credit for their CPUs as Apple gets for theirs.
Bottom line for Graviton is that a lot of AWS customers rely on open source software that already works well on ARM. And the AWS customers themselves often write their code in a language that will work just as well on ARM. So AWS can offer its customers tremendous value with minimal transition pain. But sure, if you have a CPU-bound workload, it'll do better on EPYC or Xeon than Graviton.
> I can't escape the feeling that AWS is taking credit for industry trends (DDR5) and Arm's decisions (Neoverse).
ARM is just a design. AWS brought it to market. ARM-based server processors are still thin on the ground. IIRC Equinix Metal and Oracle Cloud offer them (Ampere chips) but not GCP or Azure.
We've tested Graviton2 for data warehouse workloads and the price/performance was compelling: about 25% cheaper and about 25% faster than comparable Intel-based VMs. Still crunching the numbers but that's the approximate shape of the results.
Yeah, the tone of these talks is kind of weird. They talk about how "we decided to do foo" when the reality is "we updated to the latest tech from our upstream providers which got us foo".
A recurring theme is "build a processor that performs well on real workloads".
It occurs to me that AWS might have far more insight into "real workloads" than any CPU designer out there. Do they track things like L1 cache misses across all of EC2?
Reality varies. It's a truism in optimization that the only valid benchmark is the task you are trying to accomplish. These chips have been optimized for an average of the tasks run on AWS (which is entirely sensible for them), but that doesn't mean they'll be the best for your specific job.
They'll definitely have information that traditional CPU designers won't. Check out this talk from Brendan Gregg (he's probably lurking), where he specifically calls this out:
Don't forget Ampere's A1. I found them really, really impressive for SAT solving, and the fact that you can get them at 1 ct/core/hour at Oracle makes them really financially attractive.
5 or 6 years ago Marc Andreessen was saying this would happen eventually. I was skeptical when I first heard the claim, but it's seeming more and more likely.
It's memory latency.
https://www.brendangregg.com/blog/2021-07-05/computing-perfo...
See slide 26 (and the rest ofc :)).