animaomnium | 1 year ago
I ask because a while back I was messing around with compiling interaction nets to C after reducing as much of the program as possible (without reducing the inputs), as a form of whole program optimization. Wouldn't be too much harder to target a shader language.
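To make the idea concrete with a toy stand-in (expression trees instead of real interaction nets; all names here are illustrative, not from any actual compiler): fold every subtree that does not depend on an input at compile time, and emit only the residual program.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for "reduce as much as possible without reducing the
 * inputs": a tiny expression language with literals, one opaque input,
 * and addition.  Real interaction-net reduction is far richer; this
 * only illustrates the whole-program-optimization shape of the idea. */

typedef enum { LIT, INPUT, ADD } Tag;
typedef struct Expr { Tag tag; int val; struct Expr *l, *r; } Expr;

static Expr *mk_lit(int v)  { Expr *e = calloc(1, sizeof *e); e->tag = LIT; e->val = v; return e; }
static Expr *mk_input(void) { Expr *e = calloc(1, sizeof *e); e->tag = INPUT; return e; }
static Expr *mk_add(Expr *l, Expr *r) {
    Expr *e = calloc(1, sizeof *e); e->tag = ADD; e->l = l; e->r = r; return e;
}

/* Reduce every input-independent subtree; leave the rest residual. */
static Expr *specialize(Expr *e) {
    if (e->tag != ADD) return e;
    e->l = specialize(e->l);
    e->r = specialize(e->r);
    if (e->l->tag == LIT && e->r->tag == LIT)
        return mk_lit(e->l->val + e->r->val);  /* constant-fold at compile time */
    return e;
}
```

Specializing `(1 + 2) + input` leaves the residual `3 + input`, which is what you would then lower to C (or a shader).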
Edit: Oh I see...
> This repository provides a low-level IR language for specifying the HVM2 nets, and a compiler from that language to C and CUDA
Will have to look at the code then!
https://github.com/HigherOrderCO/HVM
Edit: Wait, nvm. It looks like the HVM2 CUDA runtime is an interpreter that traverses an in-memory graph and applies reductions.
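For intuition, a reduction loop of that shape can be sketched in plain C. Everything below (the node layout, tags, and the single annihilation rule, in one wiring convention) is invented for illustration and is much simpler than HVM2's actual representation.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64
#define NIL 0xFF

/* A net node: a tag plus three ports; port 0 is the principal port. */
typedef struct {
    uint8_t tag;
    uint8_t port[3];  /* index of the peer node on each port */
    uint8_t slot[3];  /* which port on that peer we connect to */
    int     alive;
} Node;

static Node net[MAX_NODES];

static uint8_t mk(uint8_t tag) {
    static uint8_t next = 0;
    net[next] = (Node){ tag, { NIL, NIL, NIL }, { 0, 0, 0 }, 1 };
    return next++;
}

/* Wire port (a, sa) to port (b, sb), in both directions. */
static void link_ports(uint8_t a, uint8_t sa, uint8_t b, uint8_t sb) {
    net[a].port[sa] = b; net[a].slot[sa] = sb;
    net[b].port[sb] = a; net[b].slot[sb] = sa;
}

/* Annihilation: two same-tag nodes meeting on their principal ports are
 * erased and their aux ports wired through.  (Commutation is omitted,
 * and aux ports are assumed not to point back at the redex itself.) */
static int annihilate(uint8_t a, uint8_t b) {
    if (net[a].tag != net[b].tag) return 0;
    for (int i = 1; i <= 2; i++)
        link_ports(net[a].port[i], net[a].slot[i],
                   net[b].port[i], net[b].slot[i]);
    net[a].alive = 0; net[b].alive = 0;
    return 1;
}

/* One pass of the interpreter: scan for principal-principal redexes. */
static int reduce_pass(void) {
    int fired = 0;
    for (uint8_t i = 0; i < MAX_NODES; i++) {
        if (!net[i].alive) continue;
        uint8_t j = net[i].port[0];
        if (j == NIL || j <= i) continue;   /* visit each pair once */
        if (net[i].slot[0] == 0 && net[j].alive)
            fired += annihilate(i, j);
    }
    return fired;
}
```

A real runtime replaces the scan with a redex stack (or, on CUDA, per-thread redex bags), but the fetch-pair, match-tags, rewrite-wires loop is the same.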
https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...
I was talking about traversing an interaction net to recover a lambda-calculus-like term, which can then be lowered to C in small pieces, à la Lisp, with minimal runtime overhead.
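As a sketch of what that lowering could look like (the `Val`/`Clo` representation and closure-per-lambda scheme are my own toy choices here, not HVM output): a recovered term like (\f.\x. f (f x)) (\y. y + 1) 0 becomes a handful of small C functions, with closures as the only runtime machinery.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative lowering of a lambda term to C "in small pieces".
 * Values are a union of machine int and closure pointer; each lambda
 * becomes one C function plus a one-slot environment. */

typedef struct Clo Clo;
typedef union { int n; Clo *c; } Val;
typedef Val (*Fn)(Clo *self, Val arg);
struct Clo { Fn fn; Val env; };

static Val apply(Val f, Val x) { return f.c->fn(f.c, x); }

/* \y. y + 1 */
static Val inc_fn(Clo *self, Val y) { (void)self; return (Val){ .n = y.n + 1 }; }

/* \x. f (f x), with f captured in the environment */
static Val twice_body(Clo *self, Val x) {
    Val f = self->env;
    return apply(f, apply(f, x));
}

/* \f. \x. f (f x) — allocates the inner closure capturing f */
static Val twice_fn(Clo *self, Val f) {
    (void)self;
    Clo *c = malloc(sizeof *c);
    c->fn = twice_body; c->env = f;
    return (Val){ .c = c };
}

static Clo twice = { twice_fn, { 0 } };
static Clo inc   = { inc_fn,   { 0 } };
```

The point is that after readback there is no graph left at runtime: a modern C compiler can inline `apply` and the closure bodies outright.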
Honestly, the motivation is: you are unlikely to outperform a hand-written GPU kernel for ML-style workloads using Bend. In theory, HVM could act as glue, stitching together and parallelizing the dispatch order of compute kernels, but you need a good FFI to do that, and interaction nets are hard to translate across FFI boundaries. But if you compile nets to C, keeping track of the FFI compute-kernel nodes embedded in the interaction net, you can recover a sensible FFI with no translation overhead.
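A minimal sketch of that last idea, with invented names throughout (`FfiNode`, and `saxpy_cpu` standing in for a real hand-written GPU kernel): the net carries opaque kernel nodes as plain function pointers, so compiled-net code can invoke them directly, with no marshalling at the boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical FFI-node scheme: a foreign kernel is embedded in the net
 * as an opaque node holding a plain function pointer.  The net compiler
 * never looks inside it; it just emits a direct call. */

typedef void (*KernelFn)(float *out, const float *in, size_t n);

typedef struct {
    KernelFn fn;   /* the foreign compute kernel */
    size_t   n;    /* problem size fixed at net-construction time */
} FfiNode;

/* Stand-in for a hand-written kernel: out[i] = 2*in[i] + 1. */
static void saxpy_cpu(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = 2.0f * in[i] + 1.0f;
}

/* What compiled-net code would emit at an FFI node: a direct call,
 * with HVM's role reduced to sequencing and parallelizing dispatch. */
static void fire_ffi(const FfiNode *node, float *out, const float *in) {
    node->fn(out, in, node->n);
}
```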
The other option is implementing HVM in hardware, which I've been messing around with on a spare FPGA.
LightMachine | 1 year ago
animaomnium | 1 year ago
Edit: nvm, I read through the rest of the codebase. I see that HVM compiles the inet to a large static term and then links against the runtime.
https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...
Will have to play around with this and look at the generated assembly, to see how much of the runtime a modern C/CUDA compiler can inline.
Btw, nice code: very compact and clean, well-organized, easy to read. Rooting for you!