I'm more excited about StableHLO and IREE than about their integration into Pytorch, Tensorflow, etc.
I want to see a DSL that can be used to describe models elegantly and then export them either to a shared object or to something that can be run with a runtime (in this case IREE). Things like ONNX and TorchScript promised this, but I've had little luck getting them to work well enough to trust in large-scale production deployments.
I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.
> I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.
You need to write some infrastructure around PyTorch to make it work. Something like a key/mapping in each checkpoint that says which architecture to choose with which parameters.
It sure could be easier, but is saving the model's code into the checkpoint enough? Things like the data pre-processing expected by the model would also have to be included for it to really be self-contained.
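To make the "key/mapping in each checkpoint" idea concrete, here is a minimal sketch in plain Python. All names (`MODEL_REGISTRY`, `"mlp-v2"`, the hyperparameters) are illustrative, not any real framework's API; a real version would serialize the dict with `torch.save` or similar. The point is that the checkpoint names its architecture, its constructor parameters, and the pre-processing it expects, rather than being a bare weight blob.

```python
# Sketch of a self-describing checkpoint, assuming a hypothetical
# registry mapping architecture keys to model constructors.

MODEL_REGISTRY = {}

def register(name):
    """Register a model class under an architecture key."""
    def deco(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return deco

@register("mlp-v2")  # illustrative architecture key
class MLP:
    def __init__(self, hidden=128, layers=2):
        self.hidden = hidden
        self.layers = layers

def save_checkpoint(arch, params, weights, preprocessing):
    # A real version would torch.save/pickle this dict; the checkpoint
    # carries architecture, hyperparameters, and expected pre-processing.
    return {
        "arch": arch,
        "params": params,
        "weights": weights,
        "preprocessing": preprocessing,  # e.g. normalization constants
    }

def load_checkpoint(ckpt):
    model = MODEL_REGISTRY[ckpt["arch"]](**ckpt["params"])
    # ...restore ckpt["weights"] into model here...
    return model, ckpt["preprocessing"]

ckpt = save_checkpoint("mlp-v2", {"hidden": 64}, weights=[],
                       preprocessing={"mean": 0.5, "std": 0.25})
model, prep = load_checkpoint(ckpt)
```

Even with this, the pre-processing spec only travels with the checkpoint as data; the code that interprets it still has to live somewhere, which is exactly the self-containment problem raised above.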
I'm curious about your view on ONNX. At work we did a few prototypes and it seemed to work well enough for our use cases, and we're moving to it. What is it that we haven't seen yet that gave you trouble?
Admittedly we're in a reasonably easy situation: we just have to deploy models (some from scikit-learn, some from Keras, some from PyTorch) to various users who mainly run a specific version of Python under Windows and Linux, with CPU and GPU support.
I had a bit of a chuckle; the only member of the "AI/ML industry leaders" who didn't provide a quote was Apple (not that this is an indictment of Siri).
OpenXLA is an optimizing compiler... Its main purpose is to optimize stuff...
So why are there no published metrics showing the performance of common ML models on common hardware with OpenXLA vs. other frameworks/compilers?
StableHLO seems like a good candidate for an abstraction layer for a Web ML API. Has the web machine learning working group looked at that yet? I haven't been following what they've been doing for a while.
The WebML WG has at least looked at StableHLO (and various other MLIR dialects), yeah. StableHLO is one of the first dialects in that ecosystem to focus directly on stability (and not just being a compiler IR), so it could be a good choice for runtimes / APIs that want to consume graphs of high level ML ops.
In IREE, we have prototypes targeting Wasm and WebGPU with ahead-of-time compilation, and we'd like to see more hardware exposed as compute devices via Vulkan/WebGPU (possibly leveraging extensions for computations like matrix multiplication).
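To illustrate what "runtimes / APIs that want to consume graphs of high level ML ops" could look like, here is a toy interpreter in plain Python. The graph encoding, op names, and dispatch table are all made up for illustration (the op names are loosely modeled on StableHLO's element-wise ops); a real runtime would consume an actual StableHLO artifact and dispatch to hardware backends.

```python
# Toy illustration (not the real StableHLO/WebML API) of a runtime
# consuming a graph of high-level ops: each node names an op and its
# inputs, and the runtime dispatches the op to whatever backend it has.

def run_graph(graph, inputs):
    """graph: list of (output_name, op_name, input_names) tuples."""
    env = dict(inputs)
    ops = {
        "add":      lambda a, b: [x + y for x, y in zip(a, b)],
        "multiply": lambda a, b: [x * y for x, y in zip(a, b)],
        "maximum":  lambda a, b: [max(x, y) for x, y in zip(a, b)],
    }
    for out, op, args in graph:
        env[out] = ops[op](*(env[a] for a in args))
    return env

# y = max(x * w + b, 0) expressed as three high-level ops
graph = [
    ("t0", "multiply", ["x", "w"]),
    ("t1", "add",      ["t0", "b"]),
    ("y",  "maximum",  ["t1", "zero"]),
]
env = run_graph(graph, {"x": [1.0, -2.0], "w": [3.0, 3.0],
                        "b": [0.5, 0.5], "zero": [0.0, 0.0]})
```

The design point is that the consumer only needs to understand a stable, high-level opset; how each op is implemented (Wasm, WebGPU, a native accelerator) is the runtime's business.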
Sounds promising indeed. So many cellphone and other chips have ML accelerators, and neither WebGPU nor Wasm is a great fit for that hardware. If this tech proves to be high quality, using it as an intermediary layer on the web could open up a lot of potential uses for this hardware!
Also worth pointing out that IREE intends to target Wasm and Vulkan/SPIR-V (WebGPU's WGSL isn't entirely unrelated, but would take work). So if you don't have an ML accelerator on your system, you still have good targeting options. If the web platform gets support, the browser could internally target Vulkan as a baseline.
I'm curious how different the training vs inference needs are, and whether these tools can adequately serve both.
This seems like a great step toward moving DL away from Nvidia's chokehold. Given that large LLMs can cost up to a few cents per token on cloud Nvidia GPUs, this looks like a great way to bring costs down.
Lots of DL is done on custom accelerators - things like Google's TPU. They generally work out far cheaper per FLOP than Nvidia hardware, but the hardware isn't widely available for the public to buy (yet).
They're solving the same "high level problem", but with very different approaches.
TensorRT is proprietary to Nvidia and Nvidia hardware. You take a {PyTorch, TensorFlow, <insert some other ML framework>} model and "export / convert" it into essentially a binary. Assuming all goes well (and in practice it rarely does, at least on the first try - more on this later), you now automatically leverage Nvidia card features such as Tensor cores and can serve a model that runs significantly faster.
The problem is TensorRT being exclusive to Nvidia. The APIs for more advanced techniques like deep learning optimization require significant lock-in to Nvidia's APIs, if they are even available in the first place. And all this assumes they work as documented.
OpenXLA (and other players in the ecosystem like TVM) aim to "democratize" this so there is more support both upstream (number of supported ML frameworks) and downstream (number of hardware accelerators other than Nvidia). It's yet another layer or two that ML compiler engineers will need to stitch together, but once implemented, they can in theory apply a lot of optimization techniques largely independently of the hardware targets underneath.
Note that further down in the article they mention other compiler frameworks like MLIR. You can then hypothetically lower (compiler terminology) it to a TensorRT MLIR dialect that then in turn runs on the Nvidia GPU.
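The lowering idea above can be sketched schematically. This is purely illustrative: real MLIR passes transform structured IR, not strings, and the "dialect" names and rewrite rules below are made up for the example. The point is the shape of the pipeline: each stage rewrites the program into a lower-level form until a specific backend can execute it.

```python
# Schematic of compiler "lowering": a pipeline of rewrites from a
# framework-level program down to a backend-specific one. Dialects
# are modeled as plain strings here, which real compilers do not do.

def lower(program, pipeline):
    for stage in pipeline:
        program = stage(program)
    return program

# Hypothetical stages; real passes operate on MLIR, not text.
framework_to_stablehlo = lambda p: p.replace("torch.matmul", "stablehlo.dot")
stablehlo_to_tensorrt  = lambda p: p.replace("stablehlo.dot", "tensorrt.gemm")

prog = "torch.matmul(a, b)"
lowered = lower(prog, [framework_to_stablehlo, stablehlo_to_tensorrt])
```

Swapping the last stage for a different backend's lowering is what makes the upper layers hardware-independent.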
OpenXLA is an open-source library for accelerating linear algebra computations on a variety of hardware platforms, while TensorRT is a proprietary library from NVIDIA that's specifically designed for optimizing neural network inference performance on NVIDIA GPUs.
OpenXLA is an ML-ish compiler ecosystem built primarily around MLIR that can target Nvidia devices (through the NVPTX backend in LLVM) and run on them (via IREE). TensorRT is a runtime for CUDA programs. They certainly have features in common as a reflection of their common goals ("fast NN program training/inference"), but the scope of TensorRT is much narrower.
This is much broader than ONNX; it's closer to ONNX Runtime + ONNX, but with some important advantages. StableHLO is an IR already supported by most hardware accelerators, including Inferentia/Trainium and TPU.
Much of this code is not "new" in the sense that much of the OpenXLA effort has been extracting the existing XLA representations and compiler from the TensorFlow codebase so it can be more modularly used by the ecosystem (including PyTorch).
A better frame is TensorFlow exporting its stable representation that many vendors have already built around, more than a "new" standard.
A replacement for the ONNX IR, perhaps, but as far as I can see there is not (yet?) a file format for StableHLO (ONNX has a standardized on-disk format specified in Protobuf).
Maybe! I'd like to think of ONNX as the first standardization wave. That said, there are lots of technical limitations, such as:
1) It's a protobuf with a 2GB hard limit on file size, which makes things really hard and painful for large ML models.
2) Graph rewriting on these protobuf messages is extremely painful - it takes significant engineering effort to productionize an ML model.
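To make limitation 1 concrete: a serialized protobuf message is capped at 2^31 - 1 bytes, so a multi-gigabyte model cannot be stored inline. The usual workaround (which ONNX supports via external data) is to keep large weights outside the proto and reference them. Below is an illustrative planning sketch; the sizes and function names are made up, and this is not the ONNX API itself.

```python
# Sketch of why the ~2 GiB protobuf cap bites, and the usual
# workaround: store oversized weights as external data referenced
# by the proto instead of embedding them inline.

PROTO_HARD_LIMIT = 2**31 - 1  # serialized-message size cap, in bytes

def plan_export(tensors, limit=PROTO_HARD_LIMIT):
    """tensors: {name: nbytes}. Returns (inline, external) name lists."""
    inline, external, budget = [], [], limit
    for name, nbytes in sorted(tensors.items(), key=lambda kv: kv[1]):
        if nbytes <= budget:
            inline.append(name)      # small enough to embed in the proto
            budget -= nbytes
        else:
            external.append(name)    # stored as external data on disk

    return inline, external

# Illustrative sizes: a 3 GiB embedding table cannot go inline.
tensors = {"embedding": 3 * 2**30, "head": 16 * 2**20, "bias": 4096}
inline, external = plan_export(tensors)
```

Every tool in the pipeline then has to understand the external-data convention, which is part of what makes large models painful in practice.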
Lots of innovation here. Time for a proper DL compiler.
Will this make it easier to ship ML models to consumers? Let's say I'm making a photo editor, can I ship trained models for various image effects and generation using this, and it will run on the client's best available hardware on Windows, Mac OS, Android, etc?
Sadly, I don't think so. Android already has NNAPI, but this post doesn't mention NNAPI at all. This seems focused on running on servers, instead of running on user devices.
That's a very good question and definitely something of interest. Note that the compiler is only part of the story (as Mika also mentioned here). With OpenXLA we want to be able to take advantage of the best of what each platform can provide, and opsets like StableHLO are meant to provide a portability layer while being expressive enough that targeting specialized hardware efficiently is possible. If you look inside the openxla/iree repo (as well as the iree/iree-samples and iree/iree-jax repos, the paper Scott cited, or users of IREE like SHARK (https://github.com/nod-ai/SHARK#quick-start-for-shark-stable...)) you'll see some examples.
I want to applaud the transition out of TensorFlow and into a new org / community. I’ve been following the community org issues and have enjoyed watching the governance, etc. unfold. It’s cool to see processes like these actually happen!
Also, as someone interested in MLIR - I’m excited that (perhaps sometime in the future), I’ll be able to read the op semantics outside of the TensorFlow docs :)
You can today, though we're still narrowing some performance and feature set gaps. There's a downstream distribution of IREE called SHARK that runs Stable Diffusion and other models on AMD GPUs via Vulkan: https://nod.ai/sd-rdna3-ces2023/
Does OpenXLA allow automatic placement of tensors? Eg. if my GPU doesn't have enough RAM for every tensor in my model, can it decide which ones to shuffle off to system RAM, or recompute?
Can a large tensor be split into several small ones?
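For illustration, here is a toy sketch of the two decisions being asked about: choosing which tensors fit in a fixed GPU memory budget (spilling the rest to host RAM), and splitting one large tensor into smaller chunks. The greedy policy and all names are made up for the example; this is not how OpenXLA actually schedules memory.

```python
# Toy sketch of tensor placement under a memory budget, plus
# chunking one large tensor. Illustrative only, not OpenXLA behavior.

def place_tensors(tensors, gpu_budget):
    """tensors: {name: nbytes}. Largest-first greedy placement."""
    gpu, host = [], []
    free = gpu_budget
    for name, nbytes in sorted(tensors.items(), key=lambda kv: -kv[1]):
        if nbytes <= free:
            gpu.append(name)
            free -= nbytes
        else:
            host.append(name)  # spilled to system RAM (or recomputed)
    return gpu, host

def split_tensor(flat, num_chunks):
    """Split a flat buffer into num_chunks nearly-equal pieces."""
    step = -(-len(flat) // num_chunks)  # ceiling division
    return [flat[i:i + step] for i in range(0, len(flat), step)]

gpu, host = place_tensors({"w1": 800, "w2": 300, "act": 500},
                          gpu_budget=1000)
chunks = split_tensor(list(range(10)), 3)
```

A real compiler makes these choices using liveness analysis and cost models rather than a one-pass greedy rule, which is why it can also choose to recompute a tensor instead of storing it.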
In the StableHLO spec, we are talking about this in more abstract terms - "StableHLO opset" - to be able to unambiguously reason about the semantics of StableHLO programs. However, in practice the StableHLO dialect is the primary implementation of the opset at the moment.
I wrote "primary implementation" because e.g. there is also ongoing work on adding StableHLO support to the TFLite flatbuffer schema: https://github.com/tensorflow/tensorflow/blob/master/tensorf.... Having an abstract notion of the StableHLO opset enables us to have a source of truth that all the implementations correspond to.
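The "abstract opset as source of truth" idea can be sketched in miniature: write the op's semantics once as a reference function, then check each concrete implementation (dialect, flatbuffer schema, ...) against it. The functions below are illustrative, loosely modeled on element-wise `stablehlo.add`; they are not the actual spec machinery.

```python
# Sketch: an abstract op spec as executable reference semantics,
# which multiple concrete implementations can be validated against.

def spec_add(lhs, rhs):
    """Reference semantics: element-wise addition, shapes must match."""
    assert len(lhs) == len(rhs), "operands must have the same shape"
    return [a + b for a, b in zip(lhs, rhs)]

def impl_add(lhs, rhs):
    """Some concrete implementation being validated."""
    out = []
    for i in range(len(lhs)):
        out.append(lhs[i] + rhs[i])
    return out

def conforms(impl, spec, cases):
    """Check an implementation against the spec on sample inputs."""
    return all(impl(*c) == spec(*c) for c in cases)

ok = conforms(impl_add, spec_add, [([1, 2], [3, 4]), ([0], [5])])
```

With a single reference semantics, the MLIR dialect and the TFLite flatbuffer encoding can diverge in representation while staying provably in agreement on meaning.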
Triton is lower level than this. The post actually mentions Triton, search for it.
> Extension mechanisms such as Custom-call enable users to write deep learning primitives with CUDA, HIP, SYCL, Triton and other kernel languages so they can take full advantage of hardware features.
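The custom-call mechanism quoted above can be sketched as a name-based kernel registry: the compiled program leaves a named hole, and the user supplies a hand-written kernel under that name for the runtime to dispatch to. Everything below is illustrative (the registry, the kernel name, and the plain-Python "kernel" standing in for real CUDA/HIP/SYCL/Triton code).

```python
# Toy sketch of the custom-call idea: user-registered kernels
# dispatched by name. In reality the kernel would be CUDA/HIP/
# SYCL/Triton code, not Python.

CUSTOM_CALLS = {}

def register_custom_call(name):
    def deco(fn):
        CUSTOM_CALLS[name] = fn
        return fn
    return deco

@register_custom_call("my_fused_gelu")  # hypothetical kernel name
def fused_gelu(xs):
    # Stand-in for a hand-tuned kernel; crude relu-like approximation.
    return [x if x > 0 else 0.0 for x in xs]

def run_custom_call(name, *args):
    """What the runtime does when it hits a custom_call op."""
    if name not in CUSTOM_CALLS:
        raise KeyError(f"no kernel registered for custom_call '{name}'")
    return CUSTOM_CALLS[name](*args)

out = run_custom_call("my_fused_gelu", [1.5, -2.0])
```

The compiler never needs to understand the kernel's internals; it only needs the name, the operand types, and the calling convention.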
ioedward | 3 years ago:
Some of the largest deployments of ML are using PyTorch models, e.g. OpenAI, Meta, Microsoft.
wiradikusuma | 3 years ago:
But for someone who wants to jump on the bandwagon, does anyone have a "guide/map"? To put it simply: "How do I start with AI/ML in 2023? And then what?"
The "2023" part is important. If you're bringing someone new into the JS world, you'd probably show them Vue/React and not jQuery/Prototype.
burmako | 3 years ago:
Also, OpenXLA is one of the external organizations in the Coordination section of the working group charter: https://w3c.github.io/machine-learning-charter/charter.html. We're looking forward to collaborating with WebML folks!
Havoc | 3 years ago:
Definitely feels like critical mass
mikaraento | 3 years ago:
(I work for Google and I work on client-side StableHLO, but I don't speak for Google).