This article has no measurements; in fact, it doesn't tell us anything we couldn't have gotten by looking at the product specifications. They even took their table from Phoronix, so they didn't even do the legwork of comparing the marketing material for the products!
Given that this low-effort article picked a metric which everyone knows will benefit the ARM processor, I can only assume it is marketing for Ampere. And yet, the first sentence starts with:
> Ampere's flagship 128-core Altra Max M128-30 may not be the world's highest-performing processor
Ampere is cool. It is really awesome that somebody is putting up a fight in CPU design without Intel/AMD's legacy advantages, or Apple/Amazon's infinite money. I really hope they didn't pay much for this fluff; that would be pretty embarrassing.
--
Edit: It is neat to see that they've got a chip under $1k. I wonder if a Q32-17 workstation could be put together for cheaper than whatever the cheapest Apple M1 pro device is, to experiment with computationally crunchy Arm codes.
> Edit: It is neat to see that they've got a chip under $1k. I wonder if a Q32-17 workstation could be put together for cheaper than whatever the cheapest Apple M1 pro device is, to experiment with computationally crunchy Arm codes.
The cheapest M1P device is currently rather expensive ($2k for the 8-core 14") but there'll almost certainly be an M1P Mini for about the same price as the current (still on Intel) high-end model: $1100.
A Q32-17 leaves you with $300 for a bespoke box around the CPU. For such a CPU class I'd expect the mainboard alone to exceed that budget. Even if the Mini is price-bumped to, say, 1500 (which would be somewhat in-line with the 13" -> 14" price differential) I don't think you can get even just the guts of an Altra-based workstation for less than the price of the processor.
The Q32-17 may have 32 cores, a 45W TDP, and a whopping 128 PCIe 4.0 lanes, but it still runs at only 1.7GHz.
What this means in practice is that it will heavily depend on the type of load you are running. A lot of workstation-type loads just can't make use of 32 threads, and on this CPU they will have to, just to offset the slower single-core performance.
It's also quite a niche use case... an application fine with low single-thread performance, highly parallelizable, requiring hundreds of threads, but with sufficiently branchy execution that CUDA/GPU doesn't work out... Oh, and it also can't depend on binary blobs, or you won't be able to port it to ARM.
So? The real metric isn't the price or the number of cores. It's how much performance you get per dollar. (And sometimes also perf per watt or perf per RU.)
The headline may as well be “Slow CPU cheaper than fast CPU”. This is not newsworthy.
This should be the link of the post. HN should ban links to sites like tomshardware.com, where the scroll is hijacked to force-play videos and which provide no meaningful information over the same Phoronix articles.
On a socket-to-socket basis it scores significantly better than Intel's and AMD's offerings on some SPECint tests, significantly worse on others, and about the same on average as an AMD 7763, which has 64 cores with 128 threads. On SPECfp it is a bit behind AMD but does better than Intel, which sort of surprised me given AVX-512.
This is great! I've waited for chips approaching something like 256 to 1024 cores since the 90s. These CPUs still leave a lot to be desired, but I sense that there's been a shift towards better uses of transistor count than just single-threaded performance.
These are still roughly 10 times more expensive than they should be because of their memory architecture. I'd vote to drop the idea of busses to external memories and switch to local memories, then have the cores self-organize by using web metaphors like content-addressable memory (CAM) to handle caching. Basically get rid of all of the cache coherence hardware and treat each core-memory as its own computer. The hardware that wasn't scalable could go to hardware-accelerated hashing for the CAM.
And a somewhat controversial opinion - I'd probably drop 64 bit also and either emulate 64 bit math on an 8/16/32 bit processor, or switch to arbitrary precision. That's because the number of cores is scalable and quickly dwarfs bits calculated per cycle. So we'd take say a 10% performance hit for a 100% increase in the number of cores, something like that. This would probably need to be tested in simulation to know where the threshold is, maybe 64 cores or something. Similar arguments could be used for clock speed and bus width, etc.
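The trade-off above, word width for core count, rests on how cheaply wide arithmetic can be emulated with narrower units. As an illustrative sketch only, not a claim about any real design, here is 64-bit unsigned addition built from 32-bit halves with explicit carry propagation:

```python
MASK32 = 0xFFFFFFFF

def add64_via_32(a, b):
    """Emulate a 64-bit unsigned add using only 32-bit adds plus a carry."""
    lo = (a & MASK32) + (b & MASK32)               # low-half add, may exceed 32 bits
    carry = lo >> 32                                # 1 if the low half overflowed
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32   # high-half add with carry in
    return (hi << 32) | (lo & MASK32)

# Wraps around at 2**64 exactly like native 64-bit addition:
print(add64_via_32(2**63 + 5, 2**63 + 7))  # -> 12
```

Two dependent narrow adds per wide add is the kind of constant-factor cost the comment is proposing to trade against a larger core count.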
> These are still roughly 10 times more expensive than they should be because of their memory architecture. I'd vote to drop the idea of busses to external memories and switch to local memories, then have the cores self-organize by using web metaphors like content-addressable memory (CAM) to handle caching. Basically get rid of all of the cache coherence hardware and treat each core-memory as its own computer. The hardware that wasn't scalable could go to hardware-accelerated hashing for the CAM.
If you want "many cores" and to "get rid of cache-coherence hardware", it's called a GPU.
Yes, a lot of those "cores" are SIMD-lanes, at least by NVidia / AMD naming conventions. But GPU SIMD-lanes have memory-fetching hardware that operates per-lane, so you approximate the effects of a many-many core computer.
-------
Japanese companies are experimenting with more CPUs though. PEZY "villages" are all proper CPUs IIRC, but this architecture isn't very popular outside of Japan. In terms of the global market, your best bet is in fact a GPU.
The Fujitsu supercomputer was also ARM-based + HBM2. But once again, that's a specific Japanese supercomputer and not very popular outside of Japan. It is available though.
There are a lot of other nice things that would have to go along with that. A 32-bit linear address space is not enough for a lot of the things we do today, especially not in servers.
Having some memory dedicated to a given core is clever, provided we have the required affinity settings to match (moving a task to a different core would imply copying the scratchpad to the new core and would be extremely costly - much more than the cache misses we account for in current kernels).
What I would drop immediately is ISA compatibility. I have no use for it provided I can compile my code on the new ISA.
Number of cores is just another metric to optimize for. What counts in the end is whether it can efficiently and quickly deal with the load it is expected to handle.
Many cores are great for mostly independent tasks, but performance will suffer as soon as communication is required. Making chip architectures more distributed seems to be the state of the art at the moment, but this doesn't mean we will suddenly be able to escape Amdahl's Law. To be specific, for inherently serial applications where we are absolutely interested in getting the result ASAP, single-thread performance remains crucial.
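The Amdahl's Law point above is easy to make concrete. A minimal sketch: the achievable speedup is 1 / (serial + parallel/n), so the serial fraction puts a hard ceiling on what any number of cores can buy you:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: overall speedup is capped by the serial fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even a 95%-parallel workload gains little past a few dozen cores:
print(round(amdahl_speedup(0.95, 128), 1))     # -> 17.4
print(round(amdahl_speedup(0.95, 10**6), 1))   # -> 20.0 (the cap, 1/serial)
```

With a 5% serial fraction, 128 cores deliver about 17x, and a million cores can never exceed 20x, which is the point about single-thread performance remaining crucial.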
There was a company about a decade back that did this. I seem to remember it was useful for web serving. Bought by AMD. Not sure what happened next. Look them up. The name was SeaMicro.
These ARM CPUs/boards as a cheaper replacement for x86 in real-world general computing are becoming the "Year of the Linux desktop" by now.
I've been hearing this news for more than a decade, and it still hasn't materialized into anything meaningful if you take into account the "fraction of the cost" at which they are advertised.
Edit: I was referring to a regular person or SME buying something like a couple of "ATX" boards or "regular servers". Since it's a "fraction of the cost", I don't get why it hasn't spread like wildfire yet. I wasn't talking about giant cloud companies, which place orders in the hundreds of thousands at least, and many of which design their own hardware by now. Nor was I talking about a CPU that's attached to $1000+ of gray aluminum.
Raspberry Pi and its "clones" are closer to what I was talking about, but not really.
Apple’s entire Mac lineup is going to be Arm-powered by year end too.
Cost comparison is almost useless in this market because there simply are not enough wafers to meet demand. The price per core per watt is the real world comparison, and my inaudible MacBook fan demonstrates that beautifully.
The CPU costs a fraction of the equivalent Xeon, but the CPU is not the only part in a server BoM, nor is it the most expensive subsystem of the box. When you add a terabyte of RAM, extra networking, and a bunch of SAS SSDs and HDDs, the CPU cost is almost negligible.
Most companies that buy x86 servers have no desire to recompile their software for a new architecture - they want to run PowerBI or SharePoint. They don't really benefit from a machine like this.
Not counting cloud computing (and, presumably, Apple computers) is akin to the Linux naysayers not counting Android after the very essence of desktop computing was uprooted.
Standardisation and critical mass is the hard part of the puzzle for Arm64 desktops. But it was also the hard part of the puzzle for supercomputing and cloud servers, where it now has a firm foothold. Personally, I work in an industry where everyone is moving away from x86 and developing on physical Arm64 machines because x86 simply can't fit the power budget in production.
Not sure if it's 'real world' enough, but for many workloads Graviton2 (AWS ARM) is a drop-in replacement for x86. Last year I moved a lot of workloads over with very little effort.
Oracle’s cloud servers are OK. Their physical servers have 160 cores (2x Ampere Altra Q80-30, 80 cores/each), 1TB RAM, and 100 Gbps network bandwidth (2x 50Gbps cards). They can also cut these servers into VMs and offer these smaller VMs.
The software story is OK by now. I had little to no issues with aarch64 Linux in their VMs. I didn't need a lot though, only MySQL, the ASP.NET Core runtime, and related OS setup (SELinux, the built-in firewall, etc.).
ARM has been pretty weak for more than a decade, and only started getting decent CPU performance and RAM sizes with reasonable pricing very recently. Even x86 computing took a while to develop into something useful.
2 watts per core at 3GHz is pretty impressive. What process node are they on?
But memory bandwidth is still going to restrain those 128 cores from doing anything jointly parallel; you might actually be better off with many smaller 4-8 core machines.
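A rough way to see the bandwidth concern is to divide aggregate memory bandwidth across all cores. The figures below are assumptions (8 channels of DDR4-3200 is the commonly cited Altra configuration; check the actual SKU), so treat this as a back-of-envelope sketch:

```python
# Back-of-envelope bandwidth-per-core estimate. Channel count and speed are
# assumptions: 8x DDR4-3200 at 25.6 GB/s per channel (3200 MT/s * 8 bytes).
channels = 8
gb_per_s_per_channel = 25.6
cores = 128

aggregate = channels * gb_per_s_per_channel
print(aggregate)                    # -> 204.8 (GB/s total)
print(round(aggregate / cores, 2))  # -> 1.6 (GB/s per core if all stream at once)
```

Roughly 1.6 GB/s per streaming core is far below what a handful of cores on a small desktop machine can each get, which is the gist of the "many smaller machines" argument.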
I think Ampere doesn't support hyperthreading. That means these 128 cores are comparable to 64 cores on EPYC/Xeon. There is also the L2/L3 cache, which is important, and of course the architecture. Arm still has low adoption because code has to be recompiled. While "web" targets are easy, things like financial software that benefit from AVX-512 instructions might be harder, because Neoverse doesn't have those instructions.
On the other side, massively concurrent solutions might benefit much more from these new 128/256-core Arm chips. So for sure there is room for this type of solution, and I'm happy we are adding on top of x86/amd64.
Last but not least, x86/amd64 (unless something has changed) is locked to AMD and Intel, so regions like the EU can't rely on it if they want to be independent in terms of silicon production/design. So Arm and maybe RISC-V are the only real paths right now.
My impression of Arm-world SMT is that it was added under duress because people kept asking for it, despite the word of God being that it was better to just add additional cores (I wonder if their license structure influences that claim?). Today SMT Arm cores are still very much the minority, so either the fashion sustained itself or their customers/implementors agree that more cores are better than fewer cores with SMT.
There are no AVX-512 instructions. But that's the x86 branding of the vector instructions that you can only implement with the right x86 licence. So it's tautological. Arm can have vector instructions and languages are even beginning to make portable interfaces for vectors on multiple architectures.
During an Arm HPC User Group meetup I got to play with a 160-core machine from Ampere and got really, really impressive performance (not to mention performance per dollar) for SAT solving. I pitched buying one to my local HPC cluster.
Is it? Most EPYC processors have a TDP of 180/200 watts, and there are cheap mobos for them that can host even two sockets. So I don't think that would be a big issue.
Also, we don't even know how they calculate TDP. Let's not forget that every single company (Intel, AMD, Nvidia, etc.) has its own weird formula for calculating TDP. Your Intel 12900K has a TDP on paper of 125W but can easily jump to 300W of power consumed. Without knowing the TDP formula from every manufacturer, this type of comparison is only a guessing game.
Is it though? My desktop CPU regularly pulls 150W. 250W is pretty normal for a server. That said, I agree that motherboards will be really expensive since we’re not gonna see hundreds/thousands of models competing on price like we see with every new x86 socket.
The new Intel i9-12900K's have a published power usage of 241 W in turbo mode. Modern high-end Intel desktop motherboards are designed to sustain this indefinitely, treating it as a TDP.
The first "bestselling" consumer CPU I checked costs ~$50/core (a 6-core AMD Ryzen 5600X). Scaled to 128 cores, that Ryzen would cost $6400. Considering how many motherboards, PSUs, fans, etc. you save by having one computer with a 128-core CPU compared to 8.3 computers with a 6-core CPU each, the price premium pays for itself (for workloads where this CPU performs at least as well as 8.3 Ryzen 5600Xs).
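The scaling arithmetic in the comment above, made explicit. The Ryzen street price is the comment's own assumption, not a quoted figure:

```python
# Per-core price scaling. The Ryzen price is an assumed ~$300 street price
# for a 6-core 5600X, matching the ~$50/core figure in the comment.
ryzen_price_usd = 300
ryzen_cores = 6
target_cores = 128

per_core = ryzen_price_usd / ryzen_cores
print(per_core)                 # -> 50.0 (dollars per core)
print(per_core * target_cores)  # -> 6400.0 (consumer pricing scaled to 128 cores)
```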
It’s not for consumers. It’s for special-purpose servers and workstations.
Many problems don’t scale well to more nodes. In many cases, it’s worth spending a lot on a single, very expensive server to avoid having to rewrite the software to be distributed across multiple machines.
I somehow was lucky enough to get a used Gigabyte E252-P30 as a home server. It has the 80-core version of their CPU in it, and it's been great. Seriously polished server-grade hardware with remote management, tons of memory channels and PCIe lanes, surprisingly low idle power consumption. Installation of a Linux distro was quite straightforward too.
floatboth | 4 years ago
Nah. You can't just get a chip & mainboard at retail, since there is basically no market for that.
About the only option for an Altra workstation is a prebuilt for "as low as $7661" :(
https://store.avantek.co.uk/ampere-altra-64bit-arm-workstati...
berkut | 4 years ago
https://www.anandtech.com/show/16979/the-ampere-altra-max-re...
jagger27 | 4 years ago
https://www.fujitsu.com/global/about/innovation/fugaku/
nine_k | 4 years ago
Not dramatically less, like 15% of x64 instances, but still.
childintime | 4 years ago
The P650 has only 16 cores, but it looks like it should be able to compete with the $800 32-core Ampere running at 1.7GHz.
tehbeard | 4 years ago
But for something packing a similar core count to EPYC/Threadripper, it's in the right ballpark.
jeffbee | 4 years ago
I don't think 250W is outrageous for a chip with this much logic on it.
OJFord | 4 years ago
128 cores is too though, I imagine it's pretty linear?
zeroping | 4 years ago
Happy to answer any questions I can.
8K832d7tNmiQ | 4 years ago
> Ampere positions its Altra and Altra Max processors with up to 128 core largely for hyperscale providers of cloud services.
> That leaves the company with a fairly limited number of potential customers.
Even the article itself admits that this is a niche product.