This is such a great post. It really shows how much room for improvement there is in all released deep learning code. Almost none of the open source work is really production ready for fast inference, and tuning the systems requires a good working knowledge of the GPU.
The article does skip the most important step for getting great inference speeds: Drop Python and move fully into C++.
I'd alter your conclusion that open source work isn't production ready. As long as it works as described, it is production ready for at least some subset of use cases. There's just a lot of low-hanging fruit regarding performance improvement.
It's entirely valid to trade off either a more straightforward design or minimized development time against performance, and just throw hardware at the problem as needed... companies do it all the time.
Funny how blaming the GIL for being a bottleneck is the one part of the article that isn't researched or backed by before/after performance measurements. Everyone loves to hate the GIL. Maybe there should be T-shirts made for this for the C++ loving folks out there.
> The solution to Python’s GIL bottleneck is not some trick, it is to stop using Python for data-path code.
At least for the PyTorch bits of it, using the PyTorch JIT works well. When you run PyTorch code through Python, the intermediate results are created as Python objects (GIL and all), while when you run it in TorchScript, the intermediates live only in C++ PyTorch tensors, all without the GIL.
We have a small comment about it in our PyTorch book in the section on what improvements to expect from the PyTorch JIT and it seems rather relevant in practice.
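As a minimal sketch of the idea (not taken from the book, and the module here is a made-up toy), scripting a model moves its intermediate results out of Python:

```python
import torch

class AddAndScale(torch.nn.Module):
    def forward(self, x, y):
        z = x + y          # in eager mode, `z` is a Python-visible tensor object
        return z * 0.5     # under TorchScript, intermediates stay in C++

# Compile to TorchScript; the forward pass no longer creates Python objects
# for intermediates, so it doesn't contend on the GIL for them.
scripted = torch.jit.script(AddAndScale())
out = scripted(torch.ones(3), torch.ones(3))
```

The same `scripted` object can also be serialized with `.save()` and run from C++ without a Python interpreter at all.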
The JIT is hands down the best feature of PyTorch, especially compared to the somewhat neglected suite of native inference tools for TensorFlow. Just recently I was trying to get a TensorFlow 2 model to work nicely in C++. Basically, TensorFlow's external API is the C API, but it does not have proper support for `SavedModel` yet. Linking against the C++ library is a pain, and neither can do eager execution at all if your model was trained in Python code :(
PyTorch will happily let you export your model, even with Python code in it, and run it in C++ :)
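The Python side of that export can be as small as this (a sketch with a toy stand-in model; the file name is arbitrary). The resulting `.pt` file is what the C++ side opens with `torch::jit::load`:

```python
import torch

# Toy network standing in for a real model trained in Python
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()

# Trace with a representative input and serialize. The saved archive is
# self-contained: C++ loads it via torch::jit::load("model.pt"), no Python.
traced = torch.jit.trace(model, torch.rand(1, 3, 32, 32))
traced.save("model.pt")
```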
How do you keep track of the shutter clock in this kind of system? For example, the camera clocks at 60fps but the image processing is a few frames late, the gyroscope clocks at 4kHz, the accelerometer is way slower, the lidar is a slug, etc. Then you have to get all that stuff into your Kalman filter to estimate the state, and the central question is: “when did you collect this data?” I guess “no clue, it came from USB then disappeared into a GPU pipeline” is not a scientifically sound answer; you want to know whether it came before or after sample no. 3864 of the gyroscope.
Long story short, that’s good, you’ve used a neural net to avoid using a human or an animal as a pose estimation datum, how do you correlate that to the rest of the sensor suite?
I've been trying to coax better performance out of a Jetson Nano camera, currently using Python's OpenCV lib with some threading, and can manage at best about 29fps.
I would love an alternative that is reasonably simple to implement. I dislike having to handle raw bits.
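One relatively simple option on the Nano is to hand OpenCV a GStreamer pipeline string, so capture and colour conversion run in the Jetson's hardware blocks and Python only receives finished BGR frames. A sketch; the exact element names and caps are assumptions about a CSI camera setup and may need adjusting:

```python
def jetson_csi_pipeline(width=1280, height=720, fps=60):
    # nvarguscamerasrc + nvvidconv keep capture and format conversion on
    # the Jetson's ISP/GPU; appsink hands completed BGR frames to OpenCV.
    return (
        f"nvarguscamerasrc ! "
        f"video/x-raw(memory:NVMM),width={width},height={height},"
        f"framerate={fps}/1 ! nvvidconv ! video/x-raw,format=BGRx ! "
        f"videoconvert ! video/x-raw,format=BGR ! appsink drop=true"
    )

# Usage (requires OpenCV built with GStreamer support):
# cap = cv2.VideoCapture(jetson_csi_pipeline(), cv2.CAP_GSTREAMER)
```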
Author here. As other commenters are saying, the PyTorch JIT and TorchScript might be your friend here.
Alternatively, there are some quite fast OSS libraries for object detection. Nvidia's retinanet will export to a TensorRT engine which can be used with DeepStream.
Good job digging into all of this, Paul! At my company (onspecta.com) we solve similar problems (and more!) to accelerate AI/deep learning/computer vision workloads across CPUs, GPUs, and other types of chips.
This is a fascinating space, and there are tons of speed-up opportunities. Depending on the type of workload you're running, you might be able to ditch the GPU entirely and run everything on the CPU, greatly reducing cost and deployment complexity. Or, at the very least, improve SLAs and cut the GPU (or CPU) cost 10x.
I've seen this over and over again. Glad someone's documenting this publicly :-) If any of you readers have more questions about this, I'm happy to discuss in the comments here. Or you can reach out to me at victor at onspecta dot com.
I think this is a great explanation. Are these kinds of manual optimisations still needed when using the higher-level frameworks? Or at least those should make it clear in the types when a pipeline moves from CPU to GPU and vice versa.
How would one accelerate object tracking on a video stream where each frame depends on the result of the previous one? Batching and multi-threading don't work here.
Are there some CNN-libraries that have way less overhead for small batch sizes? Tensorflow (GPU accelerated) seems to go down from 10000 fps on large batches to 200 fps for single frames for a small CNN.
It depends on the algorithm you're using, but here are some places to start:
1. How many times is the data being copied, or moved between devices?
2. Are you recomputing data from previous frames that you could just be saving? For example, some tracking algorithms apply the same CNN tower to the last 3-5 images, and you could just save the results from the last frame instead of recomputing. (Of course, you also want to follow hint #1 and keep these results on the GPU).
3. Change the algorithm or network you're using.
Really, you should read the original article carefully. It shows the steps for profiling which part of the runtime is slow. Typically, once you profile a little, you'll be surprised to find that time is being wasted somewhere unexpected.
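For hint #2, the caching can be as simple as a ring buffer of feature tensors. A sketch, where `backbone` stands in for whatever CNN tower the tracker runs per frame (names and structure are illustrative, not from any particular tracker):

```python
import torch

class FrameFeatureCache:
    """Keep the last `window` frames' CNN features so the backbone only runs
    on the newest frame (hint #2), and keep every result on one device to
    avoid copies between host and GPU (hint #1)."""

    def __init__(self, backbone, window=3, device="cpu"):
        self.backbone = backbone.to(device)
        self.window = window
        self.device = device
        self.features = []  # most recent last, all resident on `device`

    @torch.no_grad()
    def push(self, frame):
        # frame: CxHxW tensor; compute features for this new frame only
        feat = self.backbone(frame.to(self.device).unsqueeze(0))
        self.features = (self.features + [feat])[-self.window:]
        return self.features
```

On a real pipeline you'd pass `device="cuda"` and hand the cached features straight to the tracker without ever copying them back to the host.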
Great point - dependencies between frames are inherently problematic for many of these techniques.
Everything lostdog says. I've had experience speeding up tracking immensely using the same big hammer I talk about in the article - moving the larger parts of tracking compute to GPU.
Also, in a tracking pipeline you'll generally have the big compute on pixels done up front. Object detection and ReID take the bulk of the compute and can be easily batched and run in parallel. The results (metadata) can then be fed into a more serial process (but still doing the N<->N ReID comparisons on GPU).
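That N<->N comparison boils down to one batched matrix product, so the GPU finishes all pairs in a single kernel launch. A sketch with random embeddings (shown on CPU for brevity; the same code runs unchanged on CUDA tensors):

```python
import torch
import torch.nn.functional as F

# N x D appearance embeddings for current detections and existing tracks
detections = F.normalize(torch.randn(8, 128), dim=1)
tracks = F.normalize(torch.randn(5, 128), dim=1)

# All 8 x 5 cosine similarities in one batched op: on the GPU this is a
# single matmul kernel instead of 40 Python-level comparisons.
similarity = detections @ tracks.T
best_track = similarity.argmax(dim=1)  # greedy match per detection
```

A real tracker would typically feed `similarity` into a proper assignment step (e.g. Hungarian matching) rather than a greedy argmax, but the expensive part stays on the GPU either way.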
I can't attest to the usefulness of PyTorch's multiprocessing module, but using Python's multiprocessing module feels like low-level programming: serializing, packing, and unpacking data structures, etc., where you'd hope the environment would handle it for you.
Processing separate video streams works well with separate processes. There is some cost related to starting the other processes and sometimes libraries may stumble (e.g. several instances of ML libraries allocating all the GPU memory) but once it's running it's literally two separate processes that can do their work independently.
Multiprocessing could be a pain if you need to pass frames of a single video stream. Traditionally you'd need to pickle/unpickle them to pass them between processes.
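Since Python 3.8, `multiprocessing.shared_memory` sidesteps the pickling: the pixel buffer lives in shared memory and only its name crosses the process boundary. A sketch (frame shape and the trivial consumer are illustrative):

```python
import numpy as np
from multiprocessing import Process, shared_memory

SHAPE, DTYPE = (720, 1280, 3), np.uint8  # one BGR video frame

def consumer(shm_name):
    # Attach to the producer's buffer by name: no pixel copy, no pickling
    shm = shared_memory.SharedMemory(name=shm_name)
    frame = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    print(frame.mean())  # ... run detection/tracking on `frame` here ...
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(SHAPE)))
    frame = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    frame[:] = 128  # stand-in for a decoded frame from the capture process
    p = Process(target=consumer, args=(shm.name,))
    p.start()
    p.join()
    shm.close()
    shm.unlink()
```

A production version would reuse a ring of such buffers rather than one, so the producer never waits for the consumer.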
Yes, I have done non-trivial implementations of a number of SoTA models in Julia. The framework I've used is Flux[1], which I love for its simplicity; it is very much like the DarkNet[2] framework in that regard, which is refreshing after using TensorFlow. PyTorch is much better about not having unnecessary complexity and has a sensible API, but Flux is certainly better still.
The ability for Julia to compile directly to PTX assembly[3][4] means that you can even write the GPU kernels in Julia and eliminate the C/C++ CUDA code. Unfortunately, there is still a lot of work to be done to make it as reliably fast and easy as TensorFlow/PyTorch so I don't think it is usable for production yet.
I hope it will be production ready soon, but it will likely take some time to highly tune the compute stacks. They are already working on AMD GPU support with AMDGPU.jl[5], and with the latest NVIDIA GPU release having IMHO purposefully decreased performance (onboard RAM, power) for scientific compute applications, I would love to be able to develop on my AMD GPU workstation and deploy easily on whatever infrastructure, all in the same language.
I do have some gripes with Julia but the biggest of them are mostly cosmetic.
Has any company tried putting the GPU and CPU in the same chip, sharing the same data caches? That could greatly increase the performance of the CPU-GPU data transfers.
lostdog | 5 years ago
blihp | 5 years ago
briggers | 5 years ago
Completely agree that almost none of the SoTA GitHub repos are really ready for production, and making this stuff work can be pretty hard.
Getting this done in C++ and moving up to the next level of performance is the focus of my next article :)
gameswithgo | 5 years ago
Too bad such great ecosystems evolved around a language that can’t fully utilize the amazing hardware we have today.
threatripper | 5 years ago
Do you have any experience with that?
mzakharo1 | 5 years ago
whimsicalism | 5 years ago
I know a number of Python frameworks (e.g. Detectron) that are fast.
I'd like to see the evidence that the performance bottleneck is Python, especially when asynchronous dispatch exists.
t-vi | 5 years ago
g_airborne | 5 years ago
nraynaud | 5 years ago
NikolaeVarius | 5 years ago
briggers | 5 years ago
ilaksh | 5 years ago
Seems like the Xavier NX is more realistic for my needs right now, personally. Of course it's much more expensive etc.
vj44 | 5 years ago
spockz | 5 years ago
threatripper | 5 years ago
lostdog | 5 years ago
briggers | 5 years ago
O5vYtytb | 5 years ago
What about using pytorch multiprocessing[1]?
[1] https://pytorch.org/docs/stable/notes/multiprocessing.html
amelius | 5 years ago
threatripper | 5 years ago
andrewbridger | 5 years ago
Datenstrom | 5 years ago
[1]: https://fluxml.ai/
[2]: https://github.com/pjreddie/darknet
[3]: https://developer.nvidia.com/blog/gpu-computing-julia-progra...
[4]: http://blog.maleadt.net/2015/01/15/julia-cuda/
[5]: https://github.com/JuliaGPU/AMDGPU.jl
mleonhard | 5 years ago
egberts1 | 5 years ago
https://github.com/streamlit/demo-self-driving
It uses Streamlit
https://github.com/streamlit/streamlit
minimaxir | 5 years ago