top | item 35584178

m1el | 2 years ago

If you're doing inference on a neural network, each weight has to be read at least once per token, so you end up reading at least the entire size of the model per token. If your model is 60GB and you're reading it from the hard drive, your bare-minimum time per token is limited by the drive's read throughput. MacBooks have ~4GB/s sequential read speed, which means your inference time per token will be strictly more than 15 seconds. If your model is in RAM, then (according to Apple's advertising) your memory bandwidth is 400GB/s, which is 100x the hard drive speed, so memory throughput will be much less of a bottleneck.
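The bandwidth bound above can be sketched in a few lines. The figures are the comment's own illustrative numbers (60GB model, 4GB/s SSD, 400GB/s RAM), not measurements:

```python
def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec when reading every weight once per token
    is the bottleneck: tokens/sec <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Illustrative numbers from the comment above (assumptions, not benchmarks)
ssd = max_tokens_per_sec(60, 4)    # reading weights from a ~4GB/s SSD
ram = max_tokens_per_sec(60, 400)  # reading weights from ~400GB/s unified memory

print(f"SSD: {1 / ssd:.1f} s/token")  # -> SSD: 15.0 s/token
print(f"RAM: {1 / ram:.2f} s/token")  # -> RAM: 0.15 s/token
```

This is only a lower bound on latency; real inference also spends time on compute and non-sequential access patterns, so actual tokens/sec will be lower still.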

rahimnathwani | 2 years ago

Your answer applies equally to GPU and CPU, no?

The comment to which you replied was asking about the need for a GPU, not the need for a lot of RAM.