I'm more excited about StableHLO and IREE than about their integration into Pytorch, Tensorflow, etc.
I want to see a DSL that can be used to describe models elegantly and then export them either to a shared object or to something that can be run with a runtime (in this case IREE). Things like ONNX and TorchScript promised this, but I've had little luck getting them to work well enough to trust in large-scale production deployments.
I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.
> I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.
You need to write some infrastructure around PyTorch to make it work. Something like a key/mapping in each checkpoint that says which architecture to choose with which parameters.
It sure could be easier, but is saving the model's code into the checkpoint enough? Things like the data pre-processing expected by the model would also have to be included for it to really be self-contained.
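To make the "key/mapping in each checkpoint" idea concrete, here is a minimal sketch in plain Python. All names (`MODEL_REGISTRY`, `"mlp-v2"`, the hyperparameters) are illustrative, not any real framework's API; a real version would serialize the dict with `torch.save` or similar. The point is that the checkpoint names its architecture, its constructor parameters, and the pre-processing it expects, rather than being a bare weight blob.

```python
# Sketch of a self-describing checkpoint, assuming a hypothetical
# registry mapping architecture keys to model constructors.

MODEL_REGISTRY = {}

def register(name):
    """Register a model class under an architecture key."""
    def deco(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return deco

@register("mlp-v2")  # illustrative architecture key
class MLP:
    def __init__(self, hidden=128, layers=2):
        self.hidden = hidden
        self.layers = layers

def save_checkpoint(arch, params, weights, preprocessing):
    # A real version would torch.save/pickle this dict; the checkpoint
    # carries architecture, hyperparameters, and expected pre-processing.
    return {
        "arch": arch,
        "params": params,
        "weights": weights,
        "preprocessing": preprocessing,  # e.g. normalization constants
    }

def load_checkpoint(ckpt):
    model = MODEL_REGISTRY[ckpt["arch"]](**ckpt["params"])
    # ...restore ckpt["weights"] into model here...
    return model, ckpt["preprocessing"]

ckpt = save_checkpoint("mlp-v2", {"hidden": 64}, weights=[],
                       preprocessing={"mean": 0.5, "std": 0.25})
model, prep = load_checkpoint(ckpt)
```

Even with this, the pre-processing spec only travels with the checkpoint as data; the code that interprets it still has to live somewhere, which is exactly the self-containment problem raised above.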
I'm curious about your view on ONNX. At work we did a few prototypes and it seemed to work well enough for our use cases, and we're moving to it. What is it that we haven't seen yet that gave you trouble?
Admittedly we're in a reasonably easy situation: we just have to deploy models (some from scikit-learn, some from Keras, some from PyTorch) to various users who mainly run a specific version of Python under Windows and Linux, with CPU and GPU support.
I had a bit of a chuckle; the only member of the "AI/ML industry leaders" who didn't provide a quote was Apple (not that this is an indictment of Siri).
OpenXLA is an optimizing compiler... Its main purpose is to optimize stuff...
So why are there no published metrics showing the performance of common ML models on common hardware with OpenXLA vs. other frameworks/compilers?
StableHLO seems like a good candidate for an abstraction layer for a Web ML API. Has the web machine learning working group looked at that yet? I haven't been following what they've been doing for a while.
The WebML WG has at least looked at StableHLO (and various other MLIR dialects), yeah. StableHLO is one of the first dialects in that ecosystem to focus directly on stability (and not just being a compiler IR), so it could be a good choice for runtimes / APIs that want to consume graphs of high level ML ops.
In IREE, we have prototypes targeting Wasm and WebGPU with ahead-of-time compilation, and we'd like to see more hardware exposed as compute devices via Vulkan/WebGPU (possibly leveraging extensions for computations like matrix multiplication).
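To illustrate what "runtimes / APIs that want to consume graphs of high level ML ops" could look like, here is a toy interpreter in plain Python. The graph encoding, op names, and dispatch table are all made up for illustration (the op names are loosely modeled on StableHLO's element-wise ops); a real runtime would consume an actual StableHLO artifact and dispatch to hardware backends.

```python
# Toy illustration (not the real StableHLO/WebML API) of a runtime
# consuming a graph of high-level ops: each node names an op and its
# inputs, and the runtime dispatches the op to whatever backend it has.

def run_graph(graph, inputs):
    """graph: list of (output_name, op_name, input_names) tuples."""
    env = dict(inputs)
    ops = {
        "add":      lambda a, b: [x + y for x, y in zip(a, b)],
        "multiply": lambda a, b: [x * y for x, y in zip(a, b)],
        "maximum":  lambda a, b: [max(x, y) for x, y in zip(a, b)],
    }
    for out, op, args in graph:
        env[out] = ops[op](*(env[a] for a in args))
    return env

# y = max(x * w + b, 0) expressed as three high-level ops
graph = [
    ("t0", "multiply", ["x", "w"]),
    ("t1", "add",      ["t0", "b"]),
    ("y",  "maximum",  ["t1", "zero"]),
]
env = run_graph(graph, {"x": [1.0, -2.0], "w": [3.0, 3.0],
                        "b": [0.5, 0.5], "zero": [0.0, 0.0]})
```

The design point is that the consumer only needs to understand a stable, high-level opset; how each op is implemented (Wasm, WebGPU, a native accelerator) is the runtime's business.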
Sounds promising indeed. So many cellphone and other chips have ML accelerators, and neither WebGPU nor Wasm is a great fit for that hardware. If this tech proves to be high quality, using it as an intermediary layer on the web could open up a lot of potential uses for this hardware!
Also worth pointing out that IREE intends to target Wasm and Vulkan/SPIR-V (WebGPU's WGSL isn't entirely unrelated, but would take work). So if you don't have an ML accelerator on your system, you still have good targeting options. If the web platform gets support, the browser could internally target Vulkan as a baseline.
I'm curious how different the training vs inference needs are, and whether these tools can adequately serve both.
This seems like a great step toward moving DL away from Nvidia's chokehold. Given that large LLMs can cost up to a few cents per token on cloud Nvidia GPUs, this looks like a great way to bring costs down.
Lots of DL is done on custom accelerators - things like Google's TPU. They generally work out far cheaper per FLOP than Nvidia hardware, but the hardware isn't widely available for the public to buy (yet).
They're solving the same "high level problem", but with very different approaches.
TensorRT is proprietary to Nvidia and Nvidia hardware. You take a {PyTorch, TensorFlow, <insert some other ML framework>} model and "export / convert" it into essentially a binary. Assuming all goes well (and in practice it rarely does, at least on the first try - more on this later), you now automatically leverage Nvidia card features such as Tensor cores and can serve a model that runs significantly faster.
The problem is TensorRT being exclusive to Nvidia. The APIs for more advanced techniques like deep learning optimization require significant lock-in to Nvidia's APIs, if they are even available in the first place. And all this assumes they work as documented.
OpenXLA (and other players in the ecosystem like TVM) aim to "democratize" this so there is more support both upstream (number of supported ML frameworks) and downstream (number of hardware accelerators other than Nvidia). It's yet another layer or two that ML compiler engineers will need to stitch together, but once implemented, they can in theory apply a lot of optimization techniques largely independently of the hardware targets underneath.
Note that further down in the article they mention other compiler frameworks like MLIR. You can then hypothetically lower (compiler terminology) it to a TensorRT MLIR dialect that then in turn runs on the Nvidia GPU.
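The lowering idea above can be sketched schematically. This is purely illustrative: real MLIR passes transform structured IR, not strings, and the "dialect" names and rewrite rules below are made up for the example. The point is the shape of the pipeline: each stage rewrites the program into a lower-level form until a specific backend can execute it.

```python
# Schematic of compiler "lowering": a pipeline of rewrites from a
# framework-level program down to a backend-specific one. Dialects
# are modeled as plain strings here, which real compilers do not do.

def lower(program, pipeline):
    for stage in pipeline:
        program = stage(program)
    return program

# Hypothetical stages; real passes operate on MLIR, not text.
framework_to_stablehlo = lambda p: p.replace("torch.matmul", "stablehlo.dot")
stablehlo_to_tensorrt  = lambda p: p.replace("stablehlo.dot", "tensorrt.gemm")

prog = "torch.matmul(a, b)"
lowered = lower(prog, [framework_to_stablehlo, stablehlo_to_tensorrt])
```

Swapping the last stage for a different backend's lowering is what makes the upper layers hardware-independent.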
OpenXLA is an open-source library for accelerating linear algebra computations on a variety of hardware platforms, while TensorRT is a proprietary library from NVIDIA that's specifically designed for optimizing neural network inference performance on NVIDIA GPUs.
OpenXLA is an ML-ish compiler ecosystem built primarily around MLIR that can target Nvidia devices (through the NVPTX backend in LLVM) and run on them (via IREE). TensorRT is a runtime for CUDA programs. They certainly have features in common as a reflection of their common goals ("fast NN program training/inference"), but the scope of TensorRT is much narrower.
This is much broader than ONNX; it's closer to ONNX Runtime + ONNX, but with some important advantages. StableHLO is an IR already supported by most hardware accelerators, including Inferentia/Trainium and TPU.
Much of this code is not "new" in the sense that much of the OpenXLA effort has been extracting the existing XLA representations and compiler from the TensorFlow codebase so it can be more modularly used by the ecosystem (including PyTorch).
A better frame is TensorFlow exporting its stable representation that many vendors have already built around, more than a "new" standard.
A replacement for the ONNX IR, perhaps, but as far as I can see there is not (yet?) a file format for StableHLO (ONNX has a standardized on-disk format specified in Protobuf).
Maybe! I'd like to think of ONNX as the first standardization wave. That said, there are lots of technical limitations, such as:
1) It's a protobuf with a 2GB hard limit on file size, which makes things really hard and painful for large ML models.
2) Graph rewriting on these protobuf messages is extremely painful - it takes significant engineering effort to productionize an ML model.
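To make limitation 1 concrete: a serialized protobuf message is capped at 2^31 - 1 bytes, so a multi-gigabyte model cannot be stored inline. The usual workaround (which ONNX supports via external data) is to keep large weights outside the proto and reference them. Below is an illustrative planning sketch; the sizes and function names are made up, and this is not the ONNX API itself.

```python
# Sketch of why the ~2 GiB protobuf cap bites, and the usual
# workaround: store oversized weights as external data referenced
# by the proto instead of embedding them inline.

PROTO_HARD_LIMIT = 2**31 - 1  # serialized-message size cap, in bytes

def plan_export(tensors, limit=PROTO_HARD_LIMIT):
    """tensors: {name: nbytes}. Returns (inline, external) name lists."""
    inline, external, budget = [], [], limit
    for name, nbytes in sorted(tensors.items(), key=lambda kv: kv[1]):
        if nbytes <= budget:
            inline.append(name)      # small enough to embed in the proto
            budget -= nbytes
        else:
            external.append(name)    # stored as external data on disk

    return inline, external

# Illustrative sizes: a 3 GiB embedding table cannot go inline.
tensors = {"embedding": 3 * 2**30, "head": 16 * 2**20, "bias": 4096}
inline, external = plan_export(tensors)
```

Every tool in the pipeline then has to understand the external-data convention, which is part of what makes large models painful in practice.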
Lots of innovation here. Time for a proper DL compiler.
Will this make it easier to ship ML models to consumers? Let's say I'm making a photo editor, can I ship trained models for various image effects and generation using this, and it will run on the client's best available hardware on Windows, Mac OS, Android, etc?
Sadly, I don't think so. Android already has NNAPI, but this post doesn't mention NNAPI at all. This seems focused on running on servers, instead of running on user devices.
That's a very good question and definitely something of interest. Note that the compiler is only part of the story (as Mika also mentioned here). With OpenXLA we want to be able to take advantage of the best of what each platform can provide, and opsets like StableHLO are meant to provide a portability layer while being expressive enough that targeting specialized hardware efficiently is possible. If you look inside the openxla/iree repo (as well as the iree/iree-samples and iree/iree-jax repos, the paper Scott cited, or users of IREE like SHARK (https://github.com/nod-ai/SHARK#quick-start-for-shark-stable...)) you'll see some examples.
I want to applaud the transition out of TensorFlow and into a new org / community. I’ve been following the community org issues and have enjoyed watching the governance, etc. unfold. It’s cool to see processes like these actually happen!
Also, as someone interested in MLIR - I’m excited that (perhaps sometime in the future), I’ll be able to read the op semantics outside of the TensorFlow docs :)
You can today, though we're still narrowing some performance and feature set gaps. There's a downstream distribution of IREE called SHARK that runs Stable Diffusion and other models on AMD GPUs via Vulkan: https://nod.ai/sd-rdna3-ces2023/
Does OpenXLA allow automatic placement of tensors? Eg. if my GPU doesn't have enough RAM for every tensor in my model, can it decide which ones to shuffle off to system RAM, or recompute?
Can a large tensor be split into several small ones?
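For illustration, here is a toy sketch of the two decisions being asked about: choosing which tensors fit in a fixed GPU memory budget (spilling the rest to host RAM), and splitting one large tensor into smaller chunks. The greedy policy and all names are made up for the example; this is not how OpenXLA actually schedules memory.

```python
# Toy sketch of tensor placement under a memory budget, plus
# chunking one large tensor. Illustrative only, not OpenXLA behavior.

def place_tensors(tensors, gpu_budget):
    """tensors: {name: nbytes}. Largest-first greedy placement."""
    gpu, host = [], []
    free = gpu_budget
    for name, nbytes in sorted(tensors.items(), key=lambda kv: -kv[1]):
        if nbytes <= free:
            gpu.append(name)
            free -= nbytes
        else:
            host.append(name)  # spilled to system RAM (or recomputed)
    return gpu, host

def split_tensor(flat, num_chunks):
    """Split a flat buffer into num_chunks nearly-equal pieces."""
    step = -(-len(flat) // num_chunks)  # ceiling division
    return [flat[i:i + step] for i in range(0, len(flat), step)]

gpu, host = place_tensors({"w1": 800, "w2": 300, "act": 500},
                          gpu_budget=1000)
chunks = split_tensor(list(range(10)), 3)
```

A real compiler makes these choices using liveness analysis and cost models rather than a one-pass greedy rule, which is why it can also choose to recompute a tensor instead of storing it.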
In the StableHLO spec, we are talking about this in more abstract terms - "StableHLO opset" - to be able to unambiguously reason about the semantics of StableHLO programs. However, in practice the StableHLO dialect is the primary implementation of the opset at the moment.
I wrote "primary implementation" because e.g. there is also ongoing work on adding StableHLO support to the TFLite flatbuffer schema: https://github.com/tensorflow/tensorflow/blob/master/tensorf.... Having an abstract notion of the StableHLO opset enables us to have a source of truth that all the implementations correspond to.
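The "abstract opset as source of truth" idea can be sketched in miniature: write the op's semantics once as a reference function, then check each concrete implementation (dialect, flatbuffer schema, ...) against it. The functions below are illustrative, loosely modeled on element-wise `stablehlo.add`; they are not the actual spec machinery.

```python
# Sketch: an abstract op spec as executable reference semantics,
# which multiple concrete implementations can be validated against.

def spec_add(lhs, rhs):
    """Reference semantics: element-wise addition, shapes must match."""
    assert len(lhs) == len(rhs), "operands must have the same shape"
    return [a + b for a, b in zip(lhs, rhs)]

def impl_add(lhs, rhs):
    """Some concrete implementation being validated."""
    out = []
    for i in range(len(lhs)):
        out.append(lhs[i] + rhs[i])
    return out

def conforms(impl, spec, cases):
    """Check an implementation against the spec on sample inputs."""
    return all(impl(*c) == spec(*c) for c in cases)

ok = conforms(impl_add, spec_add, [([1, 2], [3, 4]), ([0], [5])])
```

With a single reference semantics, the MLIR dialect and the TFLite flatbuffer encoding can diverge in representation while staying provably in agreement on meaning.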
Triton is lower level than this. The post actually mentions Triton, search for it.
> Extension mechanisms such as Custom-call enable users to write deep learning primitives with CUDA, HIP, SYCL, Triton and other kernel languages so they can take full advantage of hardware features.
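The custom-call mechanism quoted above can be sketched as a name-based kernel registry: the compiled program leaves a named hole, and the user supplies a hand-written kernel under that name for the runtime to dispatch to. Everything below is illustrative (the registry, the kernel name, and the plain-Python "kernel" standing in for real CUDA/HIP/SYCL/Triton code).

```python
# Toy sketch of the custom-call idea: user-registered kernels
# dispatched by name. In reality the kernel would be CUDA/HIP/
# SYCL/Triton code, not Python.

CUSTOM_CALLS = {}

def register_custom_call(name):
    def deco(fn):
        CUSTOM_CALLS[name] = fn
        return fn
    return deco

@register_custom_call("my_fused_gelu")  # hypothetical kernel name
def fused_gelu(xs):
    # Stand-in for a hand-tuned kernel; crude relu-like approximation.
    return [x if x > 0 else 0.0 for x in xs]

def run_custom_call(name, *args):
    """What the runtime does when it hits a custom_call op."""
    if name not in CUSTOM_CALLS:
        raise KeyError(f"no kernel registered for custom_call '{name}'")
    return CUSTOM_CALLS[name](*args)

out = run_custom_call("my_fused_gelu", [1.5, -2.0])
```

The compiler never needs to understand the kernel's internals; it only needs the name, the operand types, and the calling convention.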
ioedward | 3 years ago:
Some of the largest deployments of ML are using PyTorch models, e.g. OpenAI, Meta, Microsoft.
wiradikusuma | 3 years ago:
But for someone who wants to jump on the bandwagon, does anyone have a "guide/map"? To put it simply: "How do I start with AI/ML in 2023? And then what?"
The "2023" part is important. If you're bringing someone new into the JS world, you'd probably show them Vue/React and not jQuery/Prototype.
burmako | 3 years ago:
Also, OpenXLA is one of the external organizations in the Coordination section of the working group charter: https://w3c.github.io/machine-learning-charter/charter.html. We're looking forward to collaborating with WebML folks!
Havoc | 3 years ago:
Definitely feels like critical mass
mikaraento | 3 years ago:
(I work for Google and I work on client-side StableHLO, but I don't speak for Google).