Decoding (token generation) is memory-bound; prefill (prompt processing) is compute-bound. The ARM Macintoshes have lots of memory bandwidth but not a lot of compute power, so they're great at generating text but slow at tasks like analyzing long documents, where prompt processing dominates. I've never done fine-tuning, but my understanding is that it's a highly parallelizable compute hog as well.
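A back-of-the-envelope roofline sketch of why that is: decoding reads every weight once per generated token (bandwidth-limited), while prefill batches the prompt tokens into big matmuls (compute-limited). All the hardware numbers below are illustrative assumptions, not measurements of any particular Mac:

```python
# Rough roofline estimate for a 7B-parameter model in fp16.
# Hardware numbers are assumptions for illustration only.

PARAMS = 7e9          # model parameters
BYTES_PER_PARAM = 2   # fp16 weights
MEM_BW = 400e9        # bytes/s memory bandwidth (assumed)
COMPUTE = 10e12       # FLOP/s sustained (assumed)

# Decode: each generated token streams every weight from memory once,
# so throughput is capped by bandwidth, not FLOPs.
decode_tok_s = MEM_BW / (PARAMS * BYTES_PER_PARAM)

# Prefill: prompt tokens are processed together as large matmuls,
# ~2 FLOPs per parameter per token, so FLOPs are the ceiling.
prefill_tok_s = COMPUTE / (2 * PARAMS)

print(f"decode  ~{decode_tok_s:.1f} tok/s (bandwidth-limited)")
print(f"prefill ~{prefill_tok_s:.1f} tok/s (compute-limited)")
```

With these assumed numbers, prefill throughput comes out over an order of magnitude higher than decode throughput, which matches the intuition that bandwidth-rich, compute-poor chips feel fast at generation and slow at digesting long prompts.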
Gracana|1 year ago
You might like this article, which looks at the arithmetic intensity of LLM processing: https://www.baseten.co/blog/llm-transformer-inference-guide/