> most workloads could actually run more efficiently if they had more memory bandwidth and lower latency access to memory
Turns out memory access speed is more or less the entire game for everything except scientific computing or insanely optimized code. In the real world, CPU frequency seems to matter much less than DRAM timings, for example, in everything but extremely well engineered games. It'll be interesting to learn (if we ever do) how much of the "real-world" 25% performance gain is solely due to DDR5.
I remember getting my AMD K8 Opteron around 2003 or 2004 with the first on-die memory controller. Absolutely demolished Intel chips at the time in non-synthetic benchmarks.
> everything except scientific computing or insanely optimized code
for insanely unoptimized code, such as accidentally ending up writing something compute-intensive in pure Python, it's very plausible for it to be compute constrained -- but less because of the hardware and more because 99% or 99.9% of the operations you're asking the CPU to perform are effectively waste.
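A toy sketch (mine, not from the thread) of how much of that waste is interpreter overhead rather than hardware: the same reduction done once as a pure-Python loop and once pushed down into C via the builtin `sum()`. The list size and iteration count are arbitrary choices for illustration.

```python
import timeit

# The arithmetic is identical in both versions; the gap is almost
# entirely per-operation interpreter bookkeeping, not DRAM speed.
data = list(range(1_000_000))

def py_loop():
    total = 0
    for x in data:        # one bytecode dispatch + boxed int per element
        total += x
    return total

t_loop = timeit.timeit(py_loop, number=5)
t_builtin = timeit.timeit(lambda: sum(data), number=5)

assert py_loop() == sum(data)  # same result, wildly different cost
print(f"pure-Python loop: {t_loop:.3f}s  builtin sum: {t_builtin:.3f}s")
```

On a typical CPython build the loop version is several times slower, which is the "waste" the comment is describing: most cycles go to the interpreter, not the actual work.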
The team that designed the original Arm CPU in 1985 came to the conclusion that bandwidth was the most important factor influencing performance - they even approached Intel for a 286 with more memory bandwidth!
In the '90s there were people trying to solve this problem by putting a small CPU on the chip with the memory and running some operations there. I routinely wonder why memory hasn't gotten smarter over time.
I find this claim hard to believe, honestly. Could you point to examples where performance is limited by DRAM speed and not by the CPU / caches? They must be applications with extremely bad design causing super low cache hit rates.
Well, I disagree with pretty much everything in the claims.
First, most real unoptimised code faces many issues before memory bandwidth. During my PhD, the optimisation guys doing spiral.net sat next door and they produced beautiful plots of what limits performance for a bunch of tasks and how each optimisation they do removes an upper-bound line until at last they get to some bandwidth limitation. Real code will likely have false IPC dependencies, memory latency problems due to pointer chasing, or branch mispredictions well before memory bandwidth.
Then the database workload is something I would consider insanely optimized. Most engines are in fierce performance competition, and normally they do hit the memory bandwidth limit in the end. This probably explains why the author is not comparing to EPYC instances that have the memory bandwidth to compete with Graviton.
Then there are the claims about choosing not to implement SMT and choosing DDR5 -- both of those decisions really come from their upstream providers.
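The latency-before-bandwidth point above can be sketched with a toy microbenchmark (my own illustration, with arbitrary sizes): streaming sequentially through an array versus chasing pointers through a random permutation, where each load depends on the previous one. In CPython the interpreter overhead dominates both paths, so treat this as the shape of the experiment rather than a precise measurement.

```python
import random
import time

N = 1_000_000
perm = list(range(N))
random.shuffle(perm)  # random permutation = worst case for the prefetcher

def sequential_sum(xs):
    # Predictable, streaming access: prefetch- and bandwidth-friendly.
    total = 0
    for x in xs:
        total += x
    return total

def pointer_chase(next_idx, steps):
    # Follow the permutation like a linked list: i -> next_idx[i].
    # Each load's address depends on the previous load (latency-bound).
    i = 0
    for _ in range(steps):
        i = next_idx[i]
    return i

t0 = time.perf_counter()
sequential_sum(perm)
t1 = time.perf_counter()
pointer_chase(perm, N)
t2 = time.perf_counter()
print(f"stream: {t1 - t0:.3f}s  chase: {t2 - t1:.3f}s")
```

In a compiled language the gap between these two patterns is dramatic, which is why real code tends to hit pointer-chasing latency long before it saturates DRAM bandwidth.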
Wouldn't SMT be a feature that you are free to use when designing your own cores? I'm assuming Amazon has an architectural license (Annapurna acquisition probably had them, this team is likely the Graviton design team at AWS). So who is the upstream provider? ARM?
And if they designed the CPU wouldn't they decide which memory controller is appropriate? Seems like AWS should get as much credit for their CPUs as Apple gets for theirs.
Bottom line for Graviton is that a lot of AWS customers rely on open source software that already works well on ARM. And the AWS customers themselves often write their code in a language that will work just as well on ARM. So AWS can offer its customers tremendous value with minimal transition pain. But sure, if you have a CPU-bound workload, it'll do better on EPYC or Xeon than Graviton.
> I can't escape the feeling that AWS is taking credit for industry trends (DDR5) and Arm's decisions (Neoverse).
ARM is just a design. AWS brought it to market. ARM-based server processors are still thin on the ground. IIRC Equinix Metal and Oracle Cloud offer them (Ampere chips) but not GCP or Azure.
We've tested Graviton2 for data warehouse workloads and the price/performance was compelling: about 25% cheaper and about 25% faster than comparable Intel-based VMs. Still crunching the numbers but that's the approximate shape of the results.
Yeah, the tone of these talks is kind of weird. They talk about how "we decided to do foo" when the reality is "we updated to the latest tech from our upstream providers which got us foo".
A recurring theme is "build a processor that performs well on real workloads".
It occurs to me that AWS might have far more insight into "real workloads" than any CPU designer out there. Do they track things like L1 cache misses across all of EC2?
Reality varies. It's a truism in optimization that the only valid benchmark is the task you are trying to accomplish. These chips have been optimized for an average of the tasks run on AWS (which is entirely sensible for them), but that doesn't mean they'll be the best for your specific job.
They'll definitely have information that traditional CPU designers won't. Check out this talk from Brendan Gregg (he's probably lurking), where he specifically calls this out:
Don't forget Ampere's A1. I found them really, really impressive for SAT solving, and the fact that you can get them at 1 ct/core/hour at Oracle makes them really financially attractive.
5 or 6 years ago Marc Andreessen was saying this would happen eventually. I was skeptical when I first heard the claim, but it's seeming more and more likely.
It's memory latency.
https://www.brendangregg.com/blog/2021-07-05/computing-perfo...
See slide 26 (and the rest ofc :)).