This article has no measurements; in fact, it doesn't tell us anything we couldn't have gotten by looking at the product specifications. They even took their table from Phoronix, so they didn't even do the legwork of comparing the marketing material for the products!
Given that this low-effort article picked a metric which everyone knows will benefit the ARM processor, I can only assume it is marketing for Ampere. And yet, the first sentence starts with:
> Ampere's flagship 128-core Altra Max M128-30 may not be the world's highest-performing processor
Ampere is cool. It is really awesome that somebody is putting up a fight in CPU design without Intel/AMD's legacy advantages, or Apple/Amazon's infinite money. I really hope they didn't pay much for this fluff; that would be pretty embarrassing.
--
Edit: It is neat to see that they've got a chip under $1k. I wonder if a Q32-17 workstation could be put together for cheaper than whatever the cheapest Apple M1 pro device is, to experiment with computationally crunchy Arm codes.
> Edit: It is neat to see that they've got a chip under $1k. I wonder if a Q32-17 workstation could be put together for cheaper than whatever the cheapest Apple M1 pro device is, to experiment with computationally crunchy Arm codes.
The cheapest M1P device is currently rather expensive ($2k for the 8-core 14") but there'll almost certainly be an M1P Mini for about the same price as the current (still on Intel) high-end model: $1100.
A Q32-17 leaves you with $300 for a bespoke box around the CPU. For such a CPU class I'd expect the mainboard alone to exceed that budget. Even if the Mini is price-bumped to, say, 1500 (which would be somewhat in-line with the 13" -> 14" price differential) I don't think you can get even just the guts of an Altra-based workstation for less than the price of the processor.
The Q32-17 may have 32 cores, a 45W TDP, and a whopping 128 PCIe 4.0 lanes, but it still runs at only 1.7GHz.
What this means in practice is that it will heavily depend on the type of load you are running. A lot of workstation-type loads just can't make use of 32 threads, and on this CPU they will have to, just to offset the slower single-core performance.
It's also quite a niche use case... an application fine with low single-thread performance, highly parallelizable, requiring hundreds of threads, but with sufficiently branchy execution that CUDA/GPU doesn't work out... Oh, and it also can't depend on binary blobs, or you won't be able to port it to ARM.
So? The real metric isn't the price or the number of cores. It's how much performance you get per dollar. (And sometimes also perf per watt or perf per RU.)
The headline may as well be “Slow CPU cheaper than fast CPU”. This is not newsworthy.
This should be the link of the post. HN should ban links to sites like tomshardware.com, where the scroll is hijacked to force-play videos and which provide no meaningful information over the same Phoronix articles.
On a socket-to-socket basis it scores significantly better than Intel's and AMD's offerings on some SPECint tests, significantly worse on others, and about the same on average as an AMD 7763, which has 64 cores with 128 threads. On SPECfp it is a bit behind AMD but does better than Intel, which sort of surprised me given AVX-512.
This is great! I've waited for chips approaching something like 256 to 1024 cores since the 90s. These CPUs still leave a lot to be desired, but I sense that there's been a shift towards better uses of transistor count than just single-threaded performance.
These are still roughly 10 times more expensive than they should be because of their memory architecture. I'd vote to drop the idea of busses to external memories and switch to local memories, then have the cores self-organize by using web metaphors like content-addressable memory (CAM) to handle caching. Basically get rid of all of the cache coherence hardware and treat each core-memory as its own computer. The hardware that wasn't scalable could go to hardware-accelerated hashing for the CAM.
And a somewhat controversial opinion - I'd probably drop 64 bit also and either emulate 64 bit math on an 8/16/32 bit processor, or switch to arbitrary precision. That's because the number of cores is scalable and quickly dwarfs bits calculated per cycle. So we'd take say a 10% performance hit for a 100% increase in the number of cores, something like that. This would probably need to be tested in simulation to know where the threshold is, maybe 64 cores or something. Similar arguments could be used for clock speed and bus width, etc.
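The trade-off above, word width for core count, rests on how cheaply wide arithmetic can be emulated with narrower units. As an illustrative sketch only, not a claim about any real design, here is 64-bit unsigned addition built from 32-bit halves with explicit carry propagation:

```python
MASK32 = 0xFFFFFFFF

def add64_via_32(a, b):
    """Emulate a 64-bit unsigned add using only 32-bit adds plus a carry."""
    lo = (a & MASK32) + (b & MASK32)               # low-half add, may exceed 32 bits
    carry = lo >> 32                                # 1 if the low half overflowed
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32   # high-half add with carry in
    return (hi << 32) | (lo & MASK32)

# Wraps around at 2**64 exactly like native 64-bit addition:
print(add64_via_32(2**63 + 5, 2**63 + 7))  # -> 12
```

Two dependent narrow adds per wide add is the kind of constant-factor cost the comment is proposing to trade against a larger core count.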
> These are still roughly 10 times more expensive than they should be because of their memory architecture. I'd vote to drop the idea of busses to external memories and switch to local memories, then have the cores self-organize by using web metaphors like content-addressable memory (CAM) to handle caching. Basically get rid of all of the cache coherence hardware and treat each core-memory as its own computer. The hardware that wasn't scalable could go to hardware-accelerated hashing for the CAM.
If you want "many cores" and to "get rid of cache-coherence hardware", it's called a GPU.
Yes, a lot of those "cores" are SIMD-lanes, at least by NVidia / AMD naming conventions. But GPU SIMD-lanes have memory-fetching hardware that operates per-lane, so you approximate the effects of a many-many core computer.
-------
Japanese companies are experimenting with more CPUs though. PEZY "villages" are all proper CPUs IIRC, but this architecture isn't very popular outside of Japan. In terms of the global market, your best bet is in fact a GPU.
The Fujitsu supercomputer was also ARM-based + HBM2. But once again, that's a specific Japanese supercomputer and not very popular outside of Japan. It is available though.
There are a lot of other nice things that would have to go along with that. A 32-bit linear address space is not enough for a lot of the things we do today, especially not in servers.
Having some memory dedicated to a given core is clever, provided we have the required affinity settings to match (moving a task to a different core would imply copying the scratchpad to the new core and would be extremely costly - much more than the cache misses we account for in current kernels).
What I would drop immediately is ISA compatibility. I have no use for it provided I can compile my code on the new ISA.
Number of cores is just another metric to optimize for. What counts in the end is whether it can efficiently and quickly deal with the load it is expected to handle.
Many cores are great for mostly independent tasks, but performance will suffer as soon as communication is required. Making chip architectures more distributed seems to be the state of the art at the moment, but this doesn't mean we will suddenly be able to escape Amdahl's Law. To be specific, for inherently serial applications where we are absolutely interested in getting the result ASAP, single-thread performance remains crucial.
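The Amdahl's Law point above is easy to make concrete. A minimal sketch: the achievable speedup is 1 / (serial + parallel/n), so the serial fraction puts a hard ceiling on what any number of cores can buy you:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: overall speedup is capped by the serial fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even a 95%-parallel workload gains little past a few dozen cores:
print(round(amdahl_speedup(0.95, 128), 1))     # -> 17.4
print(round(amdahl_speedup(0.95, 10**6), 1))   # -> 20.0 (the cap, 1/serial)
```

With a 5% serial fraction, 128 cores deliver about 17x, and a million cores can never exceed 20x, which is the point about single-thread performance remaining crucial.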
There was a company about a decade back that did this. I seem to remember it was useful for web serving. Bought by AMD. Not sure what happened next. Look them up. The name was SeaMicro.
These ARM CPUs/boards as a cheaper replacement for x86 in real-world general computing are becoming the "Year of the Linux desktop" by now.
I've been hearing this news for more than a decade, and it still hasn't materialized into anything meaningful if you take into account the "fraction of the cost" at which they are advertised.
Edit: I was referring to a regular person or SME buying something like a couple of "ATX" boards or "regular servers". Since it's a "fraction of the cost", I don't get why it hasn't spread like wildfire yet. I wasn't talking about giant cloud companies, which place orders in the hundreds of thousands at least, and many of which design their own hardware by now. Nor was I talking about a CPU that's attached to $1000+ of gray aluminum.
Raspberry Pi and its "clones" are closer to what I was talking about, but not really.
Apple’s entire Mac lineup is going to be Arm-powered by year end too.
Cost comparison is almost useless in this market because there simply are not enough wafers to meet demand. The price per core per watt is the real world comparison, and my inaudible MacBook fan demonstrates that beautifully.
The CPU costs a fraction of the equivalent Xeon, but the CPU is not the only part in a server BoM, nor is it the most expensive subsystem of the box. When you add a terabyte of RAM, extra networking, and a bunch of SAS SSDs and HDDs, the CPU cost is almost negligible.
Most companies that buy x86 servers have no desire to recompile their software for a new architecture - they want to run PowerBI or SharePoint. They don't really benefit from a machine like this.
Not counting cloud computing (and, presumably, Apple computers) is akin to the Linux naysayers not counting Android after the very essence of desktop computing was uprooted.
Standardisation and critical mass is the hard part of the puzzle for Arm64 desktops. But it was also the hard part of the puzzle for supercomputing and cloud servers, where it now has a firm foothold. Personally, I work in an industry where everyone is moving away from x86 and developing on physical Arm64 machines because x86 simply can't fit the power budget in production.
Not sure if it's 'real world' enough, but for many workloads Graviton2 (AWS ARM) is a drop-in replacement for x86. Last year I moved a lot of workloads over with very little effort.
Oracle’s cloud servers are OK. Their physical servers have 160 cores (2x Ampere Altra Q80-30, 80 cores/each), 1TB RAM, and 100 Gbps network bandwidth (2x 50Gbps cards). They can also cut these servers into VMs and offer these smaller VMs.
The software story is OK by now. I had little to no issues with aarch64 Linux in their VMs. I didn't need a lot though, only MySQL, the ASP.NET Core runtime, and related OS setup (SELinux, the built-in firewall, etc.).
ARM has been pretty weak for more than a decade, and only started getting decent CPU performance and RAM sizes with reasonable pricing very recently. Even x86 computing took a while to develop into something useful.
2 watts per core at 3GHz is pretty impressive. What process node are they on?
But memory bandwidth is still going to restrain those 128 cores from doing anything jointly parallel; you might actually be better off with many smaller 4-8 core machines.
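A rough way to see the bandwidth concern is to divide aggregate memory bandwidth across all cores. The figures below are assumptions (8 channels of DDR4-3200 is the commonly cited Altra configuration; check the actual SKU), so treat this as a back-of-envelope sketch:

```python
# Back-of-envelope bandwidth-per-core estimate. Channel count and speed are
# assumptions: 8x DDR4-3200 at 25.6 GB/s per channel (3200 MT/s * 8 bytes).
channels = 8
gb_per_s_per_channel = 25.6
cores = 128

aggregate = channels * gb_per_s_per_channel
print(aggregate)                    # -> 204.8 (GB/s total)
print(round(aggregate / cores, 2))  # -> 1.6 (GB/s per core if all stream at once)
```

Roughly 1.6 GB/s per streaming core is far below what a handful of cores on a small desktop machine can each get, which is the gist of the "many smaller machines" argument.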
I think Ampere doesn't support hyperthreading. That means these 128 cores are comparable to 64 cores on EPYC/Xeon. There is also the L2/L3 cache, which is important, and of course the architecture. Arm still has low adoption because code has to be recompiled. While "web" targets are easy, things like financial software that benefit from AVX-512 instructions might be harder, because Neoverse doesn't have those instructions.
On the other side, massively concurrent solutions might benefit much more from these new 128/256-core Arm chips. So for sure there is room for this type of solution, and I'm happy we are adding on top of x86/amd64.
Last but not least, x86/amd64 (unless something has changed) is locked to AMD and Intel, so regions like the EU can't rely on it if they want to be independent in terms of silicon production/design. So Arm and maybe RISC-V are the only real paths right now.
My impression of Arm-world SMT is that it was added under duress because people kept asking for it, despite the word of God being that it was better to just add additional cores (I wonder if their license structure influences that claim?). Today SMT Arm cores are still very much the minority, so either the fashion sustained itself or their customers/implementors agree that more cores are better than fewer cores with SMT.
There are no AVX-512 instructions. But that's the x86 branding of the vector instructions that you can only implement with the right x86 licence. So it's tautological. Arm can have vector instructions and languages are even beginning to make portable interfaces for vectors on multiple architectures.
During an Arm HPC User Group meetup I got to play with a 160-core machine from Ampere and got really, really impressive performance (not to mention performance per dollar) for SAT solving. I pitched buying one to my local HPC cluster.
Is it? Most EPYC processors have a TDP of 180/200 watts, and there are cheap mobos for them that can host even two sockets. So I don't think that would be a big issue.
Also, we don't even know how they calculate TDP. Let's not forget that every single company (Intel, AMD, Nvidia, etc.) has its own weird formula for calculating TDP. Your Intel 12900K has a TDP on paper of 125W but can easily jump to 300W of power consumed. Without knowing the TDP formula from every manufacturer, this type of comparison is only a guessing game.
Is it though? My desktop CPU regularly pulls 150W. 250W is pretty normal for a server. That said, I agree that motherboards will be really expensive since we’re not gonna see hundreds/thousands of models competing on price like we see with every new x86 socket.
The new Intel i9-12900K's have a published power usage of 241 W in turbo mode. Modern high-end Intel desktop motherboards are designed to sustain this indefinitely, treating it as a TDP.
The first "bestselling" consumer CPU I checked costs ~$50/core (a 6-core AMD Ryzen 5600X). Scaled to 128 cores, that Ryzen would cost $6400. Considering how many motherboards, PSUs, fans, etc. you save by having one computer with a 128-core CPU compared to 8.3 computers with a 6-core CPU each, the price premium pays for itself (for workloads where this CPU performs at least as well as 8.3 Ryzen 5600Xs).
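The scaling arithmetic in the comment above, made explicit. The Ryzen street price is the comment's own assumption, not a quoted figure:

```python
# Per-core price scaling. The Ryzen price is an assumed ~$300 street price
# for a 6-core 5600X, matching the ~$50/core figure in the comment.
ryzen_price_usd = 300
ryzen_cores = 6
target_cores = 128

per_core = ryzen_price_usd / ryzen_cores
print(per_core)                 # -> 50.0 (dollars per core)
print(per_core * target_cores)  # -> 6400.0 (consumer pricing scaled to 128 cores)
```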
It’s not for consumers. It’s for special-purpose servers and workstations.
Many problems don’t scale well to more nodes. In many cases, it’s worth spending a lot on a single, very expensive server to avoid having to rewrite the software to be distributed across multiple machines.
I somehow was lucky enough to get a used Gigabyte E252-P30 as a home server. It has the 80-core version of their CPU in it, and it's been great. Seriously polished server-grade hardware with remote management, tons of memory channels and PCIe lanes, surprisingly low idle power consumption. Installation of a Linux distro was quite straightforward too.
floatboth | 4 years ago
Nah. You can't just get a chip & mainboard at retail, since there is basically no market for that.
About the only option for an Altra workstation is a prebuilt for "as low as $7661" :(
https://store.avantek.co.uk/ampere-altra-64bit-arm-workstati...
berkut | 4 years ago
https://www.anandtech.com/show/16979/the-ampere-altra-max-re...
jagger27 | 4 years ago
https://www.fujitsu.com/global/about/innovation/fugaku/
nine_k | 4 years ago
Not dramatically less, like 15% of x64 instances, but still.
childintime | 4 years ago
The P650 has only 16 cores, but it looks like it should be able to compete with the $800 32-core Ampere running at 1.7GHz.
tehbeard | 4 years ago
But for something packing a similar core count to EPYC/Threadripper, it's in the right ballpark.
jeffbee | 4 years ago
I don't think 250W is outrageous for a chip with this much logic on it.
OJFord | 4 years ago
128 cores is too though, I imagine it's pretty linear?
zeroping | 4 years ago
Happy to answer any questions I can.
8K832d7tNmiQ | 4 years ago
> Ampere positions its Altra and Altra Max processors with up to 128 core largely for hyperscale providers of cloud services.
> That leaves the company with a fairly limited number of potential customers.
Even the article itself admits that this is a niche product.