heinrichf | 11 months ago
- Gemma3 12B: ~100 t/s on prompt eval; 15 t/s on eval
- MistralSmall3 24B: ~500 t/s on prompt eval; 10 t/s on eval
Do you know what difference in architecture could make prompt eval (prefill) so much slower on the 2x smaller Gemma3 model?
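To put the two sets of numbers in perspective, here is a rough sketch of how they combine into end-to-end latency. The workload (a 2000-token prompt and a 300-token reply) is a made-up assumption, and the function name is hypothetical; only the t/s figures come from the measurements above.

```python
def estimated_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total time = time to process the prompt + time to generate the reply."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical workload: 2000-token prompt, 300-token reply.
PROMPT_TOKENS = 2000
OUTPUT_TOKENS = 300

# Throughput figures taken from the measurements quoted above.
gemma = estimated_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps=100, decode_tps=15)
mistral = estimated_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps=500, decode_tps=10)

print(f"Gemma3 12B:        {gemma:.0f} s")
print(f"MistralSmall3 24B: {mistral:.0f} s")
```

With a long prompt like this, the slower prefill cancels out Gemma3's faster generation; with short prompts the generation speed would dominate instead.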
alekandreev|11 months ago
remuskaos|11 months ago
When I set the context size to 2048 (openwebui's default), inference is almost twice as fast as when I set it to 4096. I can't set the context size any higher because my GPU only has 12GB of RAM and ollama crashes for larger context sizes.
Still, I find that thoroughly odd. With the larger context size (4096), GPU usage is only 50% as seen in nvtop. I have no idea why.
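One thing that might be worth checking (an assumption on my part, not something confirmed in this thread) is how much memory the KV cache takes at each context size, since it grows linearly with context length and competes with the model weights for the 12GB. A minimal sketch of the standard estimate for full attention follows. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real values for any particular model, and models that use sliding-window attention on some layers will need less than this.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV cache size for full attention.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, for illustration only.
N_LAYERS = 48
N_KV_HEADS = 8
HEAD_DIM = 256

for ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"context {ctx:5d}: ~{gib:.2f} GiB of KV cache")
```

If weights plus cache no longer fit in GPU memory, some of the work may end up on the CPU, which could be one reason for low GPU utilization, but that is a guess to verify rather than a diagnosis.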