If you read the paper (https://dl.acm.org/doi/10.1145/3575693.3575702), you can find more performance comparisons. There, from a latency/throughput PoV, they are on par with existing tools like TVM/Ansor: sometimes faster, sometimes slower.
What is more interesting is this: They have a very GPU-specific auto-tuning routine that drastically reduces the optimization space compared to TVM/Ansor. They go from ~10^6 possible implementations for an operator to a "few hundred", which enables much faster time-to-solution. This is achieved with a GPU-centric problem formulation and search space. In essence, they trade how widely applicable their approach is (from "any" kind of hardware to only GPU-style architectures) for retrieval speed.
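To make the search-space point concrete, here is a toy Python sketch (my own illustration, not the paper's actual algorithm; all tile sizes and hardware limits are made-up illustrative values) of how hardware-centric constraints prune a tuning space: enumerate candidate matmul tile configurations, then keep only those that satisfy GPU-style resource limits.

```python
from itertools import product

# Candidate tile sizes for an M x N x K matmul (illustrative values only).
BLOCK_M = [16, 32, 64, 128, 256]
BLOCK_N = [16, 32, 64, 128, 256]
BLOCK_K = [8, 16, 32, 64]
WARPS   = [1, 2, 4, 8, 16]

SMEM_BYTES  = 48 * 1024  # shared memory per block, typical on NVIDIA GPUs
MAX_THREADS = 1024       # threads-per-block limit

def smem_usage(bm, bn, bk, dtype_bytes=4):
    # Double-buffered A and B tiles staged in shared memory.
    return 2 * (bm * bk + bk * bn) * dtype_bytes

full_space = list(product(BLOCK_M, BLOCK_N, BLOCK_K, WARPS))

pruned = [
    (bm, bn, bk, w)
    for bm, bn, bk, w in full_space
    if smem_usage(bm, bn, bk) <= SMEM_BYTES  # tiles must fit in shared memory
    and w * 32 <= MAX_THREADS                # 32 threads per warp
    and (bm * bn) % (w * 32) == 0            # work divides evenly over threads
]

print(len(full_space), "candidates before pruning,", len(pruned), "after")
```

A schedule-generic tuner must search something like the full cross product (and real spaces also include loop orders, vectorization, etc., hence ~10^6); a GPU-centric formulation bakes the hardware constraints into the space itself, so only the survivors ever get benchmarked.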
Nice work! It is interesting to read the comparison between Hidet and Triton in this blog:
> Hidet Script vs. Triton: Triton greatly simplifies the CUDA programming by introducing the tile-based programming model where the parallel execution unit is thread blocks instead of threads. However, this simplification also prevents the tensor program developers from manipulating the fine-grained computation and memory resources (e.g., warps, shared memory) in their preferred ways. It would be challenging to implement an optimization that requires fine-grained control of these resources using Triton if it has not been implemented by the Triton compiler itself. Hidet Script, on the other hand, simplifies tensor programming while still enabling users to implement their own optimizations with extensive flexibility. It’s worth noting that the more granular control of Hidet Script also brings added complexity compared to Triton.
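For readers unfamiliar with the tile-based model the quote refers to: in Triton you write the logic for one "program" (one tile of work), and the compiler decides how warps and shared memory are used within it. This is not real Triton code, just a NumPy simulation of what one tile-based vector-add program instance computes (block size and names are my own):

```python
import numpy as np

BLOCK = 128  # tile size: each "program" instance processes one block

def add_kernel_sim(x, y, out, pid):
    # Simulates one program instance: compute the block's offsets, mask
    # out-of-bounds lanes, then load/compute/store the whole tile at once.
    # How this maps to warps/shared memory is the compiler's business.
    offsets = pid * BLOCK + np.arange(BLOCK)
    mask = offsets < x.size
    idx = offsets[mask]
    out[idx] = x[idx] + y[idx]

n = 1000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.empty_like(x)

grid = (n + BLOCK - 1) // BLOCK  # number of program instances
for pid in range(grid):          # on a GPU these run in parallel
    add_kernel_sim(x, y, out, pid)

assert np.allclose(out, x + y)
```

The upside is that you never index individual threads; the downside, as the quote notes, is that you also cannot reach below the tile abstraction when an optimization would need to.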
Generally, Hidet outperforms other inference compilers - PyTorch Eager, ORT, TRT, TVM. For example: PyTorch Eager - too much framework overhead. ORT - doesn't do operator fusion. TRT - closed-source and hard to fix if a model cannot run. TVM - tuning time is too long, and expressiveness in optimization is limited.
Additionally, Hidet comes with Hidet Script, a brand-new domain-specific language for writing tensor programs in Python, with enough flexibility to express optimizations that could otherwise only be done in CUDA C++ code. Hidet Script also supports operator tuning and automatic fusion.
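To make the fusion point concrete, here is a toy NumPy illustration (my own sketch, not Hidet output) of why fusing an elementwise chain matters: unfused, every op materializes a full-size temporary in memory; fused, a compiler emits one kernel that streams through the data once.

```python
import numpy as np

x = np.random.rand(1_000_000).astype(np.float32)

# Unfused: each op reads and writes a full-size array (2 temporaries).
t1 = x * 2.0
t2 = t1 + 1.0
unfused = np.maximum(t2, 0.0)

# "Fused" stand-in: one output buffer reused in place. A fusing compiler
# goes further and emits a single kernel, so the chain costs one read of
# x and one write of the result instead of three round trips to memory.
fused = np.empty_like(x)
np.multiply(x, 2.0, out=fused)
np.add(fused, 1.0, out=fused)
np.maximum(fused, 0.0, out=fused)

assert np.allclose(unfused, fused)
```

For memory-bound elementwise chains like this, the saved traffic, not the arithmetic, is where the speedup comes from.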
This is really cool. I sometimes would like custom operators that are more or less trivial, but the amount of work to create them by hand is just not worth it.
[+] [-] lamchob|2 years ago|reply
[+] [-] junrushao1994|2 years ago|reply
[+] [-] kookamamie|2 years ago|reply
[+] [-] lamchob|2 years ago|reply
[+] [-] pavelstoev|2 years ago|reply
[+] [-] LeanderK|2 years ago|reply
[+] [-] brucethemoose2|2 years ago|reply
But inductor/triton didn't work in the 2.0 nightlies either, and now it works fine for SD.
[+] [-] mlazos|2 years ago|reply
[+] [-] akbarnur|2 years ago|reply
[+] [-] vrglvrglvrgl|2 years ago|reply
[deleted]
[+] [-] Maclennan|2 years ago|reply
[deleted]
[+] [-] psuedo_uuh|2 years ago|reply