item 45399366

Waraqa | 5 months ago

This is great news for the entire ARM ecosystem. The fact that ARM is now exceeding the best x86 CPUs marks a historical turning point, and other manufacturers are sure to follow suit.

freedomben|5 months ago

> The fact that ARM is now exceeding the best x86 CPUs marks a historical turning point, and other manufacturers are sure to follow suit.

Haven't they been playing leapfrog for years now? I avoid the ARM ecosystem now because of how non-standardized the BIOS situation is (especially after being burned by several different SoC purchases), and I prefer compatibility over performance, but I think there have been high-performance ARM chips for quite some time.

mattbillenstein|5 months ago

Seeing how much fast RAM gets soldered onto the boards of newer laptops and phones, I've come to think maybe it's not the instruction set that matters all that much.

Modern Apple hardware has so much more memory bandwidth than the x86 systems they're being compared to - I'm not sure it's apples to apples.

hajile|5 months ago

The A19 has WAY less bandwidth on its 64-bit bus than desktop chips with 128-bit buses. AMD's Strix Halo is also slower despite a 256-bit bus.

Pushing this point further, x86 chips are also slower when the entire task fits in cache.

The real observation is how this isn’t some Apple black magic. All three of the big ARM core designers (Apple, ARM, and Qualcomm) are now beating x86 in raw performance and stomping them in performance per watt (and in performance per watt per area).

It’s not just Apple’s deep pockets either. AMD spent more on R&D than ARM’s entire gross profit last I checked. Either AMD sucks or x86 has more technical roadblocks than some people like to believe.

lordnacho|5 months ago

I talked to a guy who'd worked at Apple on the chips. He said more or less the same thing: it's the memory that makes all the difference.

This makes a lot of sense. If the calculations are fast, they need to be fed quickly. You don't want to spend a bunch of time shuffling various caches.
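
The "fed quickly" point can be eyeballed with a crude copy-bandwidth microbenchmark. This is a stdlib-only sketch, not a rigorous STREAM-style measurement; the number it prints is entirely machine-dependent:

```python
import time

def copy_bandwidth_gbps(size_mb: int = 64, reps: int = 3) -> float:
    """Very rough memory-copy bandwidth in GB/s (best of `reps` runs)."""
    buf = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        copy = bytes(buf)          # one full read plus one full write of the buffer
        best = min(best, time.perf_counter() - start)
        del copy
    gb_moved = 2 * size_mb / 1024  # bytes read + bytes written, in GB
    return gb_moved / best

if __name__ == "__main__":
    print(f"~{copy_bandwidth_gbps():.1f} GB/s")
```

Interpreter and allocator overhead mean this will undershoot the hardware's real bandwidth, but the relative gap between machines still shows up.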

zamadatix|5 months ago

Memory bandwidth/latency is helpful in certain scenarios, but it can be easily oversold in the performance portion of the story. E.g. the 9950X and 9950X3D are within less than 1/20th of a percentage point of each other in PassMark single-thread (feeding a single core is dead easy) but have a spread of ~6.4% (in favor of the 9950X3D) in multi-thread (where the cache starts to help on the one CCD). It could just as easily have gone the other direction, or been 10 times as large, depending on what the benchmark was trying to do. For most day-to-day user workloads, though, the performance difference from memory bandwidth/latency is in the "nil to some" range.
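
For reference, the percentage spreads quoted above are just relative differences; a quick sketch (the scores below are made-up placeholders, not actual PassMark results):

```python
def pct_spread(baseline: float, other: float) -> float:
    """Relative difference of `other` vs `baseline`, in percent."""
    return (other - baseline) / baseline * 100

# Hypothetical multi-thread scores (placeholders, not real PassMark data)
score_9950x = 66_000
score_9950x3d = 70_224  # ~6.4% higher in this made-up example

print(round(pct_spread(score_9950x, score_9950x3d), 1))  # 6.4
```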

Meanwhile the AI Max+ 395 has at least twice the bandwidth and the same number of cores, yet comes in at more like a ~15% loss on single-thread and a ~30% loss on multi-thread due to other "traditional" reasons for performance differences. I still like my 395, though, more for the following reason.

The more practical advantage of soldered memory on mobile devices is the power/heat reduction, same as increasing the cache on e-cores to get something out of every possible cycle you power rather than trying to increase overall computation with more wattage (i.e. transistors or clocks). Better bandwidth/latency is a cool bonus, though.

For a hard number, the iPhone 17 Pro Max is supposed to be around 76 GB/s, yet my iPhone 17 Pro Max posts a higher PassMark single-core score than my 9800X3D with its larger L3 cache and RAM operating at >100 GB/s. The iPhone does have a TSMC node advantage to consider as well, but I still think it comes out ahead due to "better overall engineering".

hamandcheese|5 months ago

It's very possible I am misinterpreting, but the A19 seems to have less total memory bandwidth than, say, a 9800X (though not by much), and far less than the Max and Ultra chips that go into MacBooks.

So I think there's more to it than memory bandwidth.

omikun|5 months ago

x86 competed on clock speed for the longest time, so those designs use cell libraries built for higher frequencies. This means the transistors are larger and less dense. ARM cores target energy efficiency first, so they use denser cells that don't clock as fast. The trade-off is that they can have larger reorder buffers and larger scheduling windows to squeeze out better IPC. As frequency scaling slows while density scaling slows less, you get better results going the slower, denser route.
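
The trade described above boils down to throughput being the product of IPC and clock frequency, so a wider-but-slower core can match a narrower-but-faster one. A toy sketch with purely illustrative numbers (not measurements of any real core):

```python
def perf_gips(ipc: float, freq_ghz: float) -> float:
    """Billions of instructions per second = IPC x clock frequency (GHz)."""
    return ipc * freq_ghz

# Hypothetical cores: a dense, wide, low-clock design vs a narrow, high-clock one
wide_slow = perf_gips(ipc=8.0, freq_ghz=3.5)    # 28.0
narrow_fast = perf_gips(ipc=7.0, freq_ghz=4.0)  # 28.0

print(wide_slow == narrow_fast)  # True
```

Same headline throughput, but the wide/slow design typically gets there at lower voltage, which is where the perf-per-watt gap comes from.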