Decoding (token generation) is memory-bound; prefill (prompt processing) is compute-bound. The ARM Macintoshes have lots of memory bandwidth but not a lot of compute power, so they're great at generating text but slow at tasks like analyzing long documents, where prompt processing dominates. I've never done fine-tuning, but my understanding is that it's a highly parallelizable compute hog as well.
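A back-of-the-envelope roofline sketch of why that is: decoding reads every weight once per generated token (bandwidth-limited), while prefill batches the prompt tokens into big matmuls (compute-limited). All the hardware numbers below are illustrative assumptions, not measurements of any particular Mac:

```python
# Rough roofline estimate for a 7B-parameter model in fp16.
# Hardware numbers are assumptions for illustration only.

PARAMS = 7e9          # model parameters
BYTES_PER_PARAM = 2   # fp16 weights
MEM_BW = 400e9        # bytes/s memory bandwidth (assumed)
COMPUTE = 10e12       # FLOP/s sustained (assumed)

# Decode: each generated token streams every weight from memory once,
# so throughput is capped by bandwidth, not FLOPs.
decode_tok_s = MEM_BW / (PARAMS * BYTES_PER_PARAM)

# Prefill: prompt tokens are processed together as large matmuls,
# ~2 FLOPs per parameter per token, so FLOPs are the ceiling.
prefill_tok_s = COMPUTE / (2 * PARAMS)

print(f"decode  ~{decode_tok_s:.1f} tok/s (bandwidth-limited)")
print(f"prefill ~{prefill_tok_s:.1f} tok/s (compute-limited)")
```

With these assumed numbers, prefill throughput comes out over an order of magnitude higher than decode throughput, which matches the intuition that bandwidth-rich, compute-poor chips feel fast at generation and slow at digesting long prompts.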
Gracana|1 year ago
You might like this article, which looks at the arithmetic intensity of LLM processing: https://www.baseten.co/blog/llm-transformer-inference-guide/