It's not surprising that TF is the slowest in many cases. It has been widely, sometimes harshly, criticized in the past for that reason. On the other hand, despite its speed disadvantage, TF appears to be the only tool that doesn't have to sit out any of the tests due to incompatibilities or missing features.
Other tools like MXNet deserve a shoutout as well, and it would be interesting to see how a wider group compares. MXNet also integrates seamlessly into R, something of a rarity in deep learning tools (excepting the also excellent h2o package).
Yes, unfortunate that MXNet wasn't covered. It's in the happy Venn place of (fully cross-platform) ∩ (easy to embed) ∩ (flexible) ∩ (hackable).
* Cross platform: Windows, MacOS, Linux; CPU and CUDA. Though their CMake needs work.
* Easy to embed: straightforward C FFI, JSON for metadata and parameter serialization, no weird runtime.
* Flexible: not too specialized to vision. Static unrolling of RNNs is possible now (with mirroring this can still be very memory efficient [0]), and there is basic support for the fast new cuDNN 5 RNN layers [1] (contributed by a colleague of mine). Dynamic unrolling is on the horizon, I hear.
* Hackable: once you're familiar with the codebase, custom elementwise unary or binary ops take a few minutes, custom layers an hour or more (depending on complexity). And if you can leverage mshadow primitives for your layer implementation, you don't even have to touch CUDA. The project is also fairly active on GitHub and responsive to PRs.
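To give a sense of what "custom elementwise ops in a few minutes" means in practice, here is a minimal sketch of the forward/backward pair you would write for such an op. This is plain NumPy, not MXNet's actual registration API, and the choice of softsign is purely illustrative:

```python
import numpy as np

# Forward pass of an elementwise unary op: softsign(x) = x / (1 + |x|).
def softsign_forward(x):
    return x / (1.0 + np.abs(x))

# Backward pass: d/dx softsign(x) = 1 / (1 + |x|)^2, chained with the
# incoming gradient as reverse-mode autodiff requires.
def softsign_backward(x, grad_out):
    return grad_out / (1.0 + np.abs(x)) ** 2

x = np.array([-2.0, 0.0, 2.0])
y = softsign_forward(x)                     # [-2/3, 0, 2/3]
g = softsign_backward(x, np.ones_like(x))   # [1/9, 1, 1/9]
```

The framework-specific part is mostly boilerplate around a pair like this; with mshadow expression templates the same elementwise math runs on GPU without hand-written CUDA.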
I concur about missing MXNet, but it's noteworthy that TF is the fastest for the convolutional nets on GPU -- a case many people care about, and one covered best by Soumith's benchmarks.
(Full disclosure: I help develop TF part time).
Clearly room for optimizing the CPU versions of things. That may be Eigen. Intel now has a preview out of their DNN toolkit -- I wonder if we'll see the same speed convergence as we did with CuDNN.
When properly configured, most of these libraries use NVIDIA's cuDNN package under the hood. The only thing you're really measuring here is overhead, not the actual computation.
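One way to see the overhead point: if two frameworks dispatch the same cuDNN kernel, any per-call time difference is fixed framework overhead, and the measured gap is just that overhead amortized over the kernel time. A toy model of the arithmetic (all numbers made up for illustration):

```python
def measured_time(kernel_ms, overhead_ms, n_calls):
    """Total wall time when every kernel call pays a fixed dispatch overhead."""
    return n_calls * (kernel_ms + overhead_ms)

# Two hypothetical frameworks sharing one cuDNN kernel (5 ms per call)
# but with different per-call dispatch overheads.
fast = measured_time(kernel_ms=5.0, overhead_ms=0.1, n_calls=1000)
slow = measured_time(kernel_ms=5.0, overhead_ms=1.0, n_calls=1000)
ratio = slow / fast   # ~1.18: the whole gap comes from overhead, not compute
```

The corollary is that benchmark gaps shrink as layers get bigger, since the shared kernel time grows while the dispatch overhead stays fixed.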
Realistically, most people barely get to multiple GPUs, let alone multiple machines. You're more likely to do hyperparameter tuning across machines before you do distributed training.
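The "hyperparameter tuning across machines" pattern is embarrassingly parallel, which is why it comes first: each machine just draws independent configurations, with no cross-machine communication at all. A framework-agnostic sketch (the parameter names and ranges are invented for illustration):

```python
import random

def sample_config(rng):
    # Each worker samples independently; no coordination needed.
    return {
        "lr": 10 ** rng.uniform(-4, -1),          # log-uniform learning rate
        "batch_size": rng.choice([32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
    }

# A distinct seed per machine gives each worker its own search trajectory.
worker_id = 0
rng = random.Random(worker_id)
trials = [sample_config(rng) for _ in range(4)]
```

Distributed SGD, by contrast, needs tightly synchronized gradient exchange every step, which is why it is the harder problem.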
What do you mean "so slow"? It's by far the fastest framework covered by the paper in scenarios where threads don't outnumber CPU cores.
Taken from the article itself:
"However, Torch still achieves the best performance in our experiments in which Torch has nearly 12x speed up compared with TensorFlow under 4-thread setting."
[0] https://arxiv.org/pdf/1606.03401.pdf
[1] https://devblogs.nvidia.com/parallelforall/optimizing-recurr...