
Arm announces its new premium CPU and GPU designs

317 points | ChuckMcM | 6 years ago | techcrunch.com

163 comments

[+] ChuckMcM|6 years ago|reply
This is interesting to me, lots of noise around "machine learning" in the GPU rather than graphics which kind of validates Google's TPU, Apple's AI co-processors, and Nvidia's Jetson. Saw a system at Makerfaire that does voice recognition without the cloud (MOVI) and I keep thinking having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.
[+] hyperpallium|6 years ago|reply
The chart has 4 data points, but only 3 CPUs. The Cortex-A76 is in the middle. Can anyone explain?

Also, in the wider socioeconomic picture, everyone eventually got the same computational technology, because better was also cheaper.

Now, computation is stratifying, like any normal industry, where you get what you pay for. This ending of egalitarianism is bad.

[+] 0815test|6 years ago|reply
Computing technology is more egalitarian today than it ever was. Unless you're gaming, doing software development or serious science-related work, you can get away with something baseline and dirt-cheap, and it will work just fine. Maybe slightly less so in mobile but still, usable compute really isn't that niche even in that space.
[+] clouddrover|6 years ago|reply
> where you get what you pay for

I think that's always been true for CPUs and GPUs. The faster ones were always the more expensive ones.

[+] oflordal|6 years ago|reply
Supposedly the step in the middle is SW optimizations?
[+] blu42|6 years ago|reply
One guess re the second CA76 datapoint would be N1.
[+] Causality1|6 years ago|reply
>the company argues that 85 percent of smartphones today run machine learning workloads with only a CPU or a CPU+GPU combo.

Gotta admit I'm not real clear on what my phone does that needs on-device machine learning.

[+] bsaul|6 years ago|reply
Was about to write the same comment. Plus, I must say, every time I hear some function is going to be performed via a NN, I think my device is becoming more and more random in how it behaves. Which I really don't like.

At least when some classical algorithm fails at performing a certain task, people talk of the failure as a bug, not as something that's "probably going to be solved with more training data".

[+] antpls|6 years ago|reply
If you need actual examples running right now :

  Recent Snapchat and Instagram selfie filters
  Google Keyboard's translation and prediction
  Google Translate's Lens
  Google Assistant's voice recognition
  Google Message's instant replies
Almost all of them are inference workloads. I believe only Google Keyboard does on-device training in the background when the phone is charging.
[+] PhilippGille|6 years ago|reply
Automatic enhancements when taking a photo / video (digital zoom, anti shake, "night mode") for example.
[+] michaelt|6 years ago|reply
So it turns out the hardware needed to run a pretrained model is pretty much the same as the hardware needed to train a model. In both cases, it means lots of matrix multiplication.

Of course, training a model takes longer given the same amount of processing power - but for applications like video processing, just applying the model can be pretty demanding.
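A minimal sketch of why both directions are the same kind of work: the forward (inference) pass and the backward (training) pass of a toy dense layer each reduce to the same multiply-accumulate loops. All names and numbers here are illustrative, not from any framework.

```python
# Toy dense layer: forward (inference) and backward (training) passes.
# Both boil down to the same matrix product, which is why hardware that
# accelerates one also accelerates the other.

def matmul(a, b):
    """Plain matrix product: (n x k) @ (k x m) -> (n x m)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def forward(x, w):
    """Inference: activations = inputs @ weights."""
    return matmul(x, w)

def backward(x, grad_out):
    """Training step's core: weight gradient = inputs^T @ output gradient."""
    xt = [list(col) for col in zip(*x)]  # transpose the inputs
    return matmul(xt, grad_out)

x = [[1.0, 2.0]]             # one input sample with two features
w = [[0.5], [-1.0]]          # 2x1 weight matrix
print(forward(x, w))         # [[-1.5]]
print(backward(x, [[1.0]]))  # [[1.0], [2.0]]
```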

[+] bufferoverflow|6 years ago|reply
2019 Google I/O presenters talked about it: much, much better latency, and no need to send the audio to a server for speech recognition.
[+] TazeTSchnitzel|6 years ago|reply
I assume it's what Apple uses to automatically categorise photos by “moment” and recognised face.
[+] cheerlessbog|6 years ago|reply
I assume they mean inference. Voice assistant or photo facial recognition without network access?
[+] iamnothere|6 years ago|reply
Determining whether or not nearby ambient sounds indicate potentially illegal activity. Or, perhaps, mapping the content and emotional valence of conversations to understand whether or not you deserve a lower social credit score.
[+] sigmonsays|6 years ago|reply
can someone explain to me why I want ML specific processor features or chips in my phone?

I thought ML required massive amounts of data to be taught, most of which makes more sense in the cloud.

Am I way off here?

[+] btown|6 years ago|reply
Training a model does require massive data and compute, but evaluating/using an already-trained model (e.g. running a Hot Dog/Not Hot Dog classifier) can be done on mobile hardware. Accelerating this could, for instance, allow it to run in real time on a video feed.
[+] joshvm|6 years ago|reply
Aside from the runtime difference between training and inference, having on-device ML makes a lot of sense for other reasons.

There can be more guarantees over data privacy, since your data can stay on-device. It also reduces bandwidth as there's no need to upload data for classification to the cloud. And that also may mean it's faster, potentially real time, since you don't have that round trip latency.

This is not necessarily for phones. Lots of (virtually all?) low power IoT devices have ARM cores. There are plenty of environments where the cloud or compute power isn't available.

[+] jopsen|6 years ago|reply
From what little I understand, models are built in the cloud, then compressed and evaluated on your phone.

Better hardware probably means less power drain and larger models.

There was some cool stuff about this in Google I/O keynote.
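One common form that "compressed" step takes is post-training quantization: shrinking float weights to small integers. This is only an illustrative sketch of the core idea; real frameworks (e.g. TensorFlow Lite) do this with considerably more care.

```python
# Illustrative post-training quantization: map float32 weights into the
# int8 range so each weight takes 1 byte instead of 4, and integer
# multiply-accumulate hardware can be used at inference time.

def quantize(weights):
    """Map floats into [-127, 127] with a single shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]

w = [0.82, -1.27, 0.004, 0.51]
q, scale = quantize(w)
print(q)                     # [82, -127, 0, 51]
print(dequantize(q, scale))  # approximately the original weights
```

Note the tiny weight 0.004 rounds to 0: quantization trades a bounded amount of precision for size and speed.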

[+] Traster|6 years ago|reply
Machine learning is like any learning: there's the learning stage and the putting-it-into-practice stage. ML in the cloud is like researchers coming up with a new way to slice bread; ML on your phone is like your local baker following the instructions in a new cookbook. You still need a skilled baker to follow the instructions.
[+] NegatioN|6 years ago|reply
It doesn't have to be for training the models. You can run versions of trained models locally on your phone. Having a dedicated chip will allow for more snappy calculation of those fancy snapchat filters, language translation, image recognition etc
[+] mruts|6 years ago|reply
The way neural networks work is that each perceptron computes a linear function of the form w1*x1 + w2*x2 + ... + wn*xn. Even if you train the model somewhere else, you still need to hold the weights for each perceptron locally in order to evaluate the model.

This requires hardware in which you can multiply these huge matrices quickly, even if the weights are downloaded from the cloud.
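A minimal sketch of that weighted sum, and of how a whole layer of perceptrons batches into one matrix-vector product (the operation the dedicated hardware is built to speed up). Names and numbers are illustrative.

```python
# One perceptron's output is the weighted sum w1*x1 + w2*x2 + ... + wn*xn.
# Evaluating every perceptron in a layer at once is a matrix-vector
# product: one row of weights per perceptron.

def perceptron(w, x):
    """Weighted sum of inputs: the linear part of a single neuron."""
    return sum(wi * xi for wi, xi in zip(w, x))

def layer(weight_matrix, x):
    """Evaluate a whole layer in one go: one perceptron per weight row."""
    return [perceptron(row, x) for row in weight_matrix]

x = [1.0, 0.5, -2.0]
w_matrix = [[0.2, 0.4, 0.1],   # weights of perceptron 1
            [1.0, -1.0, 0.5]]  # weights of perceptron 2
print(layer(w_matrix, x))      # [0.2, -0.5]
```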

[+] mensetmanusman|6 years ago|reply
E.g. accurate edge-detection algorithms are better done with ML-based mathematics and accompanying optimized transistor layouts. This would enable better augmented-reality graphics, from things like Snapchat selfies to visualizing free-space CAD.
[+] fxj|6 years ago|reply
How does this NNPU compare to the other vendors' offerings (Google TPU (3 TOPS?), Intel NCS2 (1 TOPS), Kendryte RISC-V (0.5 TOPS), Nvidia Jetson (4 TOPS))? Can you use TensorFlow networks out of the box like the others allow?
[+] KirinDave|6 years ago|reply
Does someone have a mirror of this article that doesn't seize up if you refuse to display ads?
[+] ksec|6 years ago|reply
I think it is worth pointing out that this new CPU, possibly landing in flagship Androids in late 2019 or early 2020, would still only be equal to or slower than the Apple A10 used in the iPhone 8.

Assuming Apple continues as they have in the past, dropping the iPhone 7 and moving the iPhone 8 down to its price range, they would have an entry-level iPhone that is faster than 95% of all Android smartphones on the market.

[+] Liquid_Fire|6 years ago|reply
Isn't the entry level iPhone also significantly more expensive than most Android phones?
[+] narnianal|6 years ago|reply
What even is an "AI chip"? What's the difference from a GPU? As long as nobody can explain that, I have big doubts that it would be more than a GPU plus marketing. So no big deal if they don't provide one.
[+] fxj|6 years ago|reply
This link gives a nice explanation of the google nn-processor aka TPU:

https://cloud.google.com/blog/products/gcp/an-in-depth-look-...

It boils down to:

CPUs: 10s of cores

GPUs: 1000s of cores

NNs: 100000s of cores

NNs have very simple cores (fused multiply-add and look-up table functions) but can run many of them in one cycle.

FTA: Because general-purpose processors such as CPUs and GPUs must provide good performance across a wide range of applications, they have evolved myriad sophisticated, performance-oriented mechanisms. As a side effect, the behavior of those processors can be difficult to predict, which makes it hard to guarantee a certain latency limit on neural network inference. In contrast, TPU design is strictly minimal and deterministic as it has to run only one task at a time: neural network prediction. You can see its simplicity in the floor plan of the TPU die.
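The two primitives the quoted article names (fused multiply-add and lookup-table activation functions) can be sketched in plain Python. This is only a software illustration of the idea, not how the TPU is actually wired; table size and value ranges are made up for the example.

```python
import math

# Primitive 1: fused multiply-add, the core step of matrix multiplication.
def mac(acc, a, b):
    """One multiply and one add per step: acc + a * b."""
    return acc + a * b

# Primitive 2: a lookup-table activation. The table is precomputed once;
# at inference time there is no exp(), just clamp, scale, and index.
TABLE_SIZE = 256  # table covers inputs in [-4, 4)
SIGMOID = [1.0 / (1.0 + math.exp(-(-4.0 + 8.0 * i / TABLE_SIZE)))
           for i in range(TABLE_SIZE)]

def activate(x):
    """Approximate sigmoid via table lookup."""
    i = int((min(max(x, -4.0), 3.999) + 4.0) * TABLE_SIZE / 8.0)
    return SIGMOID[i]

# A dot product is just repeated MAC steps:
acc = 0.0
for a, b in [(1.0, 2.0), (3.0, -0.5)]:
    acc = mac(acc, a, b)
print(acc, activate(acc))  # 0.5 and roughly sigmoid(0.5) ~ 0.62
```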

[+] acidbaseextract|6 years ago|reply
Modern GPUs are extremely programmable, but this flexibility isn't that heavily used by neural networks. NN inference is pretty much just a huge amount of matrix multiplication.

Especially for mobile applications (most Arm customers), you pay extra energy for all that pipeline flexibility that isn't being used. A dedicated chip will save a bunch of power.

[+] endorphone|6 years ago|reply
Neural network learning and inference primarily use matrix multiplication and addition, usually at lower bit depths. You can do this on GPUs with great success and massive parallelism, but the GPU is more generalized than this, so it takes more silicon and more power. With a TPU/neural processor you optimize the silicon for a very, very specific problem: generally multiplying large matrices and then adding the result to another matrix. On a GPU we decompose this into a large number of scalar calculations, and it parallelizes massively and does a good job; on a TPU we feed it the matrices and that's all it's made to do, with so much silicon dedicated to matrix operations that it often finishes in a single cycle.

Another comment mentioned cores, and I don't think that's a good way of looking at it, as in most ways a TPU is back to very "few" but hyper-specialized "cores". There is essentially no parallelism to manage in a TPU or neural processor -- you feed it three matrices and it gives you the result. You move on to the next one.
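The "feed it three matrices" contrast can be sketched like this: instead of many scalar multiply/adds scheduled across GPU threads, a TPU-style unit exposes the whole fused operation as a single primitive. The function name and shapes below are illustrative only.

```python
# Miniature version of a fused TPU-style primitive: Y = A @ B + C
# exposed as one "instruction", rather than decomposed into scalar ops.

def fused_matmul_add(a, b, c):
    """Single call computing A @ B + C for small list-of-list matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) + c[i][j]
             for j in range(m)] for i in range(n)]

A = [[1.0, 2.0],
     [0.0, 1.0]]
B = [[3.0, 0.0],
     [1.0, 1.0]]
C = [[0.5, 0.5],   # e.g. a bias term added in the same pass
     [0.5, 0.5]]
print(fused_matmul_add(A, B, C))  # [[5.5, 2.5], [1.5, 1.5]]
```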

[+] mruts|6 years ago|reply
An AI chip would essentially just be a chip that can do matrix multiplication very quickly, plus some addition. Each perceptron (neuron) is just fitting a linear equation, so if we had a chip that could support millions of perceptrons all fitting linear equations (i.e. with matrix multiplication), then it would be a huge win compared to GPUs, which are more general and less efficient for this specific task than dedicated silicon.
[+] lrem|6 years ago|reply
I would guess a chip optimised for addition, multiplication and ReLU, not burdened too much by all those other rarely used opcodes.
[+] lone_haxx0r|6 years ago|reply
Unreadable. As soon as I click on the (X) button, the article closes and the browser goes back to the root page of https://techcrunch.com/.

There's no way to avoid clicking on it, as it follows you around and grabs your attention, not letting you read.

[+] founderling|6 years ago|reply
Even when you fight that annoyance with a content blocker, the page itself is aggressive to no end.

I scrolled down to see how long the article is. That somehow also triggered a redirect to the root.

How is it possible that the most user hostile news sites often get the most visibility? Even here on HN which is so user friendly. What are the mechanics behind this?

The url should be changed to a more user friendly news source. How about one of these?

https://liliputing.com/2019/05/arm-launches-cortex-a77-cpu-m...

https://hexus.net/tech/news/cpu/130757-arm-releases-cortex-a...

https://www.xda-developers.com/arm-cortex-a77-cpu-announceme...

https://www.theregister.co.uk/2019/05/27/arm_cortex_a77/

https://venturebeat.com/2019/05/26/arm-reveals-new-cpu-gpu-a...

[+] narnianal|6 years ago|reply
And when you open the page every time it wants to go through all the cookie and partner options instead of just saying "yup, we remember you deactivated everything. [change settings] [continue reading]"
[+] holstvoogd|6 years ago|reply
Hmm, I feel ARM performance is a bit like nuclear fusion: it's always the next generation that will deliver an order-of-magnitude performance increase. Yet somehow ARM single-core performance is still shit compared to x86. (No matter how much I hope and pray for that to change, because x86 needs to die.)
[+] The_rationalist|6 years ago|reply
When will deep learning frameworks get a reality check and decide to support OpenCL/SYCL? All this hardware is useless silicon until then.
[+] thomasfl|6 years ago|reply
The day Apple releases their first ARM-powered laptop, it will be a turning point. This comment written on an ipad, the best product to come out of apple to yhis day.
[+] xvector|6 years ago|reply
> out of apple to yhis day.

I suppose keyboards haven't really been a strong point!

[+] imperialdrive|6 years ago|reply
While I don't agree with the first part, I do agree that the iPad is unique and worthy of being the only Apple device my hard-earned money has ever gone towards.