Also, generally I think CoreML isn't the best backend. The best solution for ORT would probably be to introduce a pure MPS execution provider (https://github.com/microsoft/onnxruntime/issues/21271), but given they've already bought into CoreML, the effort may not be worth the reward for the core team. Which is fair enough, as it's a pretty mammoth task.
If you double-click the CoreML file on a Mac, it opens in Xcode, where there is a profiler you can run. The profiler will show you the operations the model uses and what the bit depth is.
My experiences with ONNX have not been pleasant. Conversions of models written in TensorFlow and PyTorch often fail. I recommend using TFLite or ExecuTorch for deployment to edge devices instead.
Agreed. To be honest, I have seen some speedups with ONNX, but the process, especially on macOS, is a bit messy. I'll try out ExecuTorch and see how it compares; cheers for the recommendation.
On the CoreML side, this is likely because the Neural Engine natively supports FP16, and offloading some or all layers to the ANE significantly reduces inference time and power usage when running models. You can inspect this in the Xcode profiler to see what runs on each part of the device, and at what precision.
Yeah, I can see why they let it be that way, but the fact that the behaviour is pretty undefined is what bugged me. I suppose it depends on what your goals are: efficiency vs. reproducibility.
Also, I ran a test of FP16 vs. FP32 for a large matmul on the Apple GPU, and the FP16 calculation was 1.28x faster, so it makes sense that they'd go for FP16 as a default.
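A comparison along those lines is easy to reproduce with PyTorch's MPS backend. A rough sketch (the exact speedup will depend on the chip and matrix size; the benchmark only runs when an Apple GPU is actually available):

```python
import time
import torch

def bench_matmul(dtype: torch.dtype, n: int = 2048, iters: int = 10,
                 device: str = "cpu") -> float:
    """Average seconds per (n x n) @ (n x n) matmul at the given dtype."""
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    for _ in range(3):           # warm-up
        a @ b
    if device == "mps":
        torch.mps.synchronize()  # wait for queued GPU work before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "mps":
        torch.mps.synchronize()
    return (time.perf_counter() - t0) / iters

if torch.backends.mps.is_available():
    t32 = bench_matmul(torch.float32, device="mps")
    t16 = bench_matmul(torch.float16, device="mps")
    print(f"fp32 {t32 * 1e3:.2f} ms, fp16 {t16 * 1e3:.2f} ms, "
          f"speedup {t32 / t16:.2f}x")
```

The explicit `synchronize()` calls matter: MPS kernels are queued asynchronously, so timing without them measures dispatch, not compute.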
While this is a bit too harsh - and the solution is naive at best - the problem is real.
The idea of bitwise reproducibility for floating point computations is treated as completely laughable in every part of the DL landscape. Meanwhile, in just about every other area that uses floating point computation, it has been the de facto standard for decades.

It ranges from NVidia not guaranteeing bitwise reproducibility even on the same GPU (https://docs.nvidia.com/deeplearning/cudnn/backend/v9.17.0/d...) to the frameworks somehow being even worse, where the best you can do is order them by how bad they are - with TensorFlow far down at the bottom and JAX (currently) at the top - and try to use the least bad one.

This is a huge issue for anyone serious about developing novel models, and I see no one talking about it, let alone trying to solve it.
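The mechanism behind this nondeterminism is simple: floating point addition is not associative, so any change in reduction order (thread count, tiling, kernel selection) can change the result bitwise. A minimal illustration in plain Python:

```python
import random

# Floating point addition is not associative: regrouping the same
# three terms changes the low-order bits of the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b, a == b)   # 0.6000000000000001 0.6 False

# The same effect at reduction scale: summing the same multiset of
# numbers in a different order gives a (almost always) different total.
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
ys = sorted(xs)
print(sum(xs) == sum(ys))
```

A GPU that picks a different reduction tree per launch is doing exactly the second experiment on every kernel call, which is why even same-hardware runs are not guaranteed to match bitwise.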