If you read the paper (https://dl.acm.org/doi/10.1145/3575693.3575702), you can find more performance comparisons. There, from a latency/throughput PoV, they are on par with existing tools like TVM/Ansor: sometimes faster, sometimes slower.
What is more interesting is this: They have a very GPU-specific auto-tuning routine that drastically reduces the optimization space compared to TVM/Ansor. They go from ~10^6 possible implementations for an operator to a "few hundred", which enables much faster time-to-solution. This is achieved with a GPU-centric problem formulation and search space. In essence, they trade how widely applicable their approach is (from "any" kind of hardware to only GPU-style architectures) for retrieval speed.
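To make the search-space point concrete, here is a toy Python sketch (my own illustration, not the paper's actual algorithm; all tile sizes and hardware limits are made-up illustrative values) of how hardware-centric constraints prune a tuning space: enumerate candidate matmul tile configurations, then keep only those that satisfy GPU-style resource limits.

```python
from itertools import product

# Candidate tile sizes for an M x N x K matmul (illustrative values only).
BLOCK_M = [16, 32, 64, 128, 256]
BLOCK_N = [16, 32, 64, 128, 256]
BLOCK_K = [8, 16, 32, 64]
WARPS   = [1, 2, 4, 8, 16]

SMEM_BYTES  = 48 * 1024  # shared memory per block, typical on NVIDIA GPUs
MAX_THREADS = 1024       # threads-per-block limit

def smem_usage(bm, bn, bk, dtype_bytes=4):
    # Double-buffered A and B tiles staged in shared memory.
    return 2 * (bm * bk + bk * bn) * dtype_bytes

full_space = list(product(BLOCK_M, BLOCK_N, BLOCK_K, WARPS))

pruned = [
    (bm, bn, bk, w)
    for bm, bn, bk, w in full_space
    if smem_usage(bm, bn, bk) <= SMEM_BYTES  # tiles must fit in shared memory
    and w * 32 <= MAX_THREADS                # 32 threads per warp
    and (bm * bn) % (w * 32) == 0            # work divides evenly over threads
]

print(len(full_space), "candidates before pruning,", len(pruned), "after")
```

A schedule-generic tuner must search something like the full cross product (and real spaces also include loop orders, vectorization, etc., hence ~10^6); a GPU-centric formulation bakes the hardware constraints into the space itself, so only the survivors ever get benchmarked.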
Nice work! It is interesting to read the comparison between Hidet and Triton in this blog:
> Hidet Script vs. Triton: Triton greatly simplifies the CUDA programming by introducing the tile-based programming model where the parallel execution unit is thread blocks instead of threads. However, this simplification also prevents the tensor program developers from manipulating the fine-grained computation and memory resources (e.g., warps, shared memory) in their preferred ways. It would be challenging to implement an optimization that requires fine-grained control of these resources using Triton if it has not been implemented by the Triton compiler itself. Hidet Script, on the other hand, simplifies tensor programming while still enabling users to implement their own optimizations with extensive flexibility. It’s worth noting that the more granular control of Hidet Script also brings added complexity compared to Triton.
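For readers unfamiliar with the tile-based model the quote refers to: in Triton you write the logic for one "program" (one tile of work), and the compiler decides how warps and shared memory are used within it. This is not real Triton code, just a NumPy simulation of what one tile-based vector-add program instance computes (block size and names are my own):

```python
import numpy as np

BLOCK = 128  # tile size: each "program" instance processes one block

def add_kernel_sim(x, y, out, pid):
    # Simulates one program instance: compute the block's offsets, mask
    # out-of-bounds lanes, then load/compute/store the whole tile at once.
    # How this maps to warps/shared memory is the compiler's business.
    offsets = pid * BLOCK + np.arange(BLOCK)
    mask = offsets < x.size
    idx = offsets[mask]
    out[idx] = x[idx] + y[idx]

n = 1000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.empty_like(x)

grid = (n + BLOCK - 1) // BLOCK  # number of program instances
for pid in range(grid):          # on a GPU these run in parallel
    add_kernel_sim(x, y, out, pid)

assert np.allclose(out, x + y)
```

The upside is that you never index individual threads; the downside, as the quote notes, is that you also cannot reach below the tile abstraction when an optimization would need to.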
Generally, Hidet outperforms other inference compilers - PyTorch Eager, ORT, TRT, TVM. For example: PyTorch Eager - too much framework overhead. ORT - doesn't do operator fusion. TRT - closed-source and hard to fix if a model cannot run. TVM - tuning time is too long, and expressiveness in optimization is limited.
Additionally, Hidet comes with Hidet Script, a brand-new domain-specific language for writing tensor programs in Python, with enough flexibility to express optimizations that could otherwise only be done in CUDA C++ code. Hidet Script also supports operator tuning and automatic fusion.
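To make the fusion point concrete, here is a toy NumPy illustration (my own sketch, not Hidet output) of why fusing an elementwise chain matters: unfused, every op materializes a full-size temporary in memory; fused, a compiler emits one kernel that streams through the data once.

```python
import numpy as np

x = np.random.rand(1_000_000).astype(np.float32)

# Unfused: each op reads and writes a full-size array (2 temporaries).
t1 = x * 2.0
t2 = t1 + 1.0
unfused = np.maximum(t2, 0.0)

# "Fused" stand-in: one output buffer reused in place. A fusing compiler
# goes further and emits a single kernel, so the chain costs one read of
# x and one write of the result instead of three round trips to memory.
fused = np.empty_like(x)
np.multiply(x, 2.0, out=fused)
np.add(fused, 1.0, out=fused)
np.maximum(fused, 0.0, out=fused)

assert np.allclose(unfused, fused)
```

For memory-bound elementwise chains like this, the saved traffic, not the arithmetic, is where the speedup comes from.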
This is really cool. I sometimes would like custom operators that are more or less trivial, but the amount of work to create them by hand is just not worth it.
[+] [-] lamchob|2 years ago|reply
[+] [-] junrushao1994|2 years ago|reply
[+] [-] kookamamie|2 years ago|reply
[+] [-] lamchob|2 years ago|reply
[+] [-] pavelstoev|2 years ago|reply
[+] [-] LeanderK|2 years ago|reply
[+] [-] brucethemoose2|2 years ago|reply
But inductor/triton didn't work in the 2.0 nightlies either, and now it works fine for SD.
[+] [-] mlazos|2 years ago|reply
[+] [-] akbarnur|2 years ago|reply
[+] [-] vrglvrglvrgl|2 years ago|reply
[deleted]
[+] [-] Maclennan|2 years ago|reply
[deleted]
[+] [-] psuedo_uuh|2 years ago|reply