MLA uses way more FLOPs in order to conserve memory bandwidth; the H20 has plenty of memory bandwidth and almost no FLOPs. MLA makes sense on the H100/H800, but on the H20, GQA-based models are a much better option.
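The bandwidth side of this argument comes down to KV-cache size per token. A rough back-of-envelope comparison, using illustrative DeepSeek-V3-like dimensions (128 heads of dim 128, a 512-dim compressed latent plus a 64-dim decoupled RoPE key; these numbers are assumptions for the sketch, not taken from this thread):

```python
# Per-token, per-layer KV-cache size in elements for each attention variant.
# Dimensions are illustrative, not authoritative.

def kv_cache_per_token(n_kv_heads: int, head_dim: int) -> int:
    # Both K and V are cached: two tensors of n_kv_heads * head_dim each.
    return 2 * n_kv_heads * head_dim

n_heads, head_dim = 128, 128

mha = kv_cache_per_token(n_heads, head_dim)  # full multi-head attention
gqa = kv_cache_per_token(8, head_dim)        # grouped-query, 8 KV groups (assumed)
mla = 512 + 64                               # compressed latent + decoupled RoPE key

print(mha, gqa, mla)  # MLA caches far less than MHA, and less than GQA here,
                      # at the cost of extra decompression FLOPs at attention time.
```

The smaller cache is what MLA buys with its extra compute: on bandwidth-starved parts the trade pays off, on FLOP-starved parts it doesn't.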
Not sure what you are referring to; do you have a pointer to a technical write-up, perhaps? In both training and inference, MLA uses far fewer FLOPs than MHA, which is the gold standard, and has much better accuracy (model performance) than GQA (see the comparisons in the DeepSeek papers, or try DeepSeek models vs. Llama on long context).
More generally, with any hardware architecture, you can optimize throughput for your main goal (initially training; later inference) by balancing the other parameters of the model architecture. Even if training ends up suboptimal, if you want a public model to have global impact, you aim for the next generation of Nvidia inference hardware.
Didn't DeepSeek figure out how to train with mixed precision and so get much more out of the cards, with many of the training steps able to run at what were traditionally post-training-quantization precisions (block-compressed)?
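The core trick behind block-compressed training is per-block scaling: each tile of a matrix gets its own absmax scale, so an outlier in one block doesn't wreck the dynamic range of all the others. A minimal sketch, using int8 as a stand-in for FP8 and a 128x128 block size (the block size mirrors the weight tiling described in the DeepSeek-V3 report; the function names and dimensions here are made up for illustration):

```python
import numpy as np

def quantize_blockwise(w: np.ndarray, block: int = 128):
    """Quantize w in (block x block) tiles, one absmax scale per tile."""
    rows, cols = w.shape
    q = np.empty_like(w, dtype=np.int8)
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = np.abs(tile).max() / 127.0  # absmax scale for this block
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = np.round(tile / s).astype(np.int8)
    return q, scales

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, block: int = 128):
    rows, cols = q.shape
    w = np.empty(q.shape, dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            w[i:i + block, j:j + block] = (
                q[i:i + block, j:j + block] * scales[i // block, j // block]
            )
    return w

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_blockwise(w)
err = np.abs(dequantize_blockwise(q, s) - w).max()
print(err)  # per-block scaling keeps the reconstruction error small
```

Real FP8 training adds more on top (separate activation/weight tilings, high-precision accumulation), but the per-block scale is the piece that lets large chunks of training run at quantization-era precisions.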