volta87 | 5 years ago

When developing ML models, you rarely train "just one".

The article mentions that they explored a not-so-large hyper-parameter space (i.e. they trained multiple models, each with different parameters).

It would be interesting to know how long the whole process takes on the M1 vs the V100.

For the small models covered in the article, I'd guess that the V100 can train them all concurrently using MPS (Multi-Process Service, which lets multiple processes use the GPU concurrently).

In particular, it would be interesting to know whether the V100 trains all the models in the same time it trains one, whether the M1 does the same, or whether the M1 takes N times longer to train N models.

This could paint a completely different picture, particularly from a user's perspective. When I leave for lunch, coffee, or home, I usually spawn jobs training a large number of models, so that they are all trained by the time I get back.
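
Concretely, I mean something like the sketch below (a hypothetical train.py that takes hyper-parameters as flags; the MPS daemon is started once beforehand with "nvidia-cuda-mps-control -d", so all the processes share the GPU and their kernels can overlap):

    # Sketch of a sweep launcher: every hyper-parameter combination becomes its
    # own process on GPU 0. With MPS running, the small models' kernels can
    # execute concurrently instead of time-slicing the whole GPU.
    # (train.py and its flags are hypothetical here.)
    import itertools
    import os
    import subprocess

    learning_rates = [1e-2, 1e-3, 1e-4]
    widths = [64, 128, 256]

    procs = []
    for lr, width in itertools.product(learning_rates, widths):
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}
        procs.append(subprocess.Popen(
            ["python", "train.py", "--lr", str(lr), "--width", str(width)],
            env=env))

    for p in procs:
        p.wait()

If the M1 has to run those nine jobs back to back while the V100 runs them concurrently, the wall-clock comparison looks very different.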

I only train a small number of models in the later phases of development, once I have already explored a large part of the model space.

---

To make an analogy: what this article does is similar to benchmarking a 64-core CPU against a 1-core CPU using a single-threaded benchmark. The 64-core CPU happens to be slightly beefier and faster than the 1-core CPU, but it is more expensive and consumes more power because... it has 64x more cores. So to put things in perspective, it would make sense to also show a benchmark that can use all 64 cores, which is the reason somebody would buy a 64-core CPU in the first place, and see how the single-core one compares (typically ~64x slower).

---

To me, the only news here is that Apple's GPU cores are not very far behind NVIDIA's for ML training, but there is much more to a GPGPU than the performance you get for small models on a small number of cores. Apple would still need to (1) catch up, and (2) massively scale up their design. They probably can do both if they set their eyes on it. Exciting times.

sdenton4|5 years ago

The low GPU utilization rate in the first graph is kind of a tell... Seems like the M1 is a little bit worse than 40% of a V100?

volta87|5 years ago

If that's the case, that would be very good. One can buy a lot of M1 Mac minis for the price of a V100.

nightcracker|5 years ago

> When developing ML models, you rarely train "just one".

Depends on your field. In reinforcement learning you often really do train just one, at least on the same data set (since the data set is often generated dynamically from the behavior of the previous iteration of the model).

volta87|5 years ago

Even in reinforcement learning you can train multiple models on different data sets concurrently and combine them for the next iteration.
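
One way to do the combining step (just a sketch, assuming PyTorch models with identical architectures; averaging the weights is only one of several options):

    # Sketch: element-wise average of the state_dicts of models trained
    # concurrently on different data sets, used to seed the next iteration.
    # (Integer buffers such as BatchNorm counters would need special handling.)
    import torch

    def average_state_dicts(state_dicts):
        avg = {}
        for key in state_dicts[0]:
            avg[key] = torch.stack(
                [sd[key].float() for sd in state_dicts]).mean(dim=0)
        return avg

    # next_model.load_state_dict(
    #     average_state_dicts([m.state_dict() for m in models]))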

lukas|5 years ago

Do you really train more than one model at the same time on a single GPU? In my experience that's pretty unusual.

I completely agree with your conclusion here.

volta87|5 years ago

Depends on the model size, but if the model is small enough that I actually do the training on a PCIe board, I do. I partition an A100 into 8 and train 8 models at a time, or just use MPS on a V100 board. The bigger A100 boards can fit multiple copies of a model that fits on a single V100.
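
Roughly like this (a sketch, assuming the board has already been split into MIG instances and using a hypothetical train.py; each job is pinned to one slice by putting that slice's UUID into CUDA_VISIBLE_DEVICES):

    # Sketch: list the MIG instances reported by `nvidia-smi -L` and launch one
    # training job per instance, pinned via CUDA_VISIBLE_DEVICES.
    # (train.py and its --run-id flag are hypothetical.)
    import os
    import subprocess

    listing = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True).stdout
    # MIG instance lines look like "... (UUID: MIG-xxxx...)".
    mig_uuids = [line.split("UUID: ")[1].rstrip(")")
                 for line in listing.splitlines() if "MIG-" in line]

    procs = []
    for i, uuid in enumerate(mig_uuids):
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": uuid}
        procs.append(subprocess.Popen(
            ["python", "train.py", "--run-id", str(i)], env=env))

    for p in procs:
        p.wait()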

Also, I tend to do this early on, when I am exploring the hyperparameter space, for which I use more but smaller models.

I find that using big models initially is just a waste of time. You want to try many things as quickly as possible.

junipertea|5 years ago

I found that training multiple models on the same GPU quickly hits other bottlenecks (mainly memory capacity/bandwidth). I tend to train one model per GPU and just scale the number of machines. Also, if nothing else, we tend to push the models to fill the GPU memory.