artembugara | 6 months ago
so, the 20b model.
Can someone explain what I would need in terms of resources (GPUs, I assume) to run 20 concurrent processes, assuming I need 1k tokens/second throughput on each (so 20 x 1k total)?
Also, is this model better than or comparable to gpt-4.1-nano for information extraction, and would it be cheaper to host the 20b myself?
mlyle|6 months ago
Multiply the number of A100s you need as necessary.
Here you don't really need the RAM; if you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.
Even with A100s, the sweet spot in batching is not going to give you 1k tokens/process/second. Of course, you could go up to H100s...
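A rough sketch of why batching raises aggregate but not per-stream throughput in a bandwidth-bound decode, using assumed round numbers (A100 HBM ~2 TB/s, ~3.6 GB of activated weights read per decode step, all overheads ignored):

```python
# Bandwidth-bound decode model: the activated weights are read from HBM
# once per decode step, and that one read serves the entire batch.
hbm_bw = 2.0e12        # bytes/s, assumed A100-class memory bandwidth
step_bytes = 3.6e9     # bytes of activated weights read per decode step

step_time = step_bytes / hbm_bw   # seconds per decode step
per_stream = 1 / step_time        # tokens/s ceiling for EVERY stream
print(f"per-stream ceiling ~= {per_stream:.0f} t/s")

# Batching multiplies aggregate throughput but leaves per-stream alone
# (until you become compute-bound or run out of KV-cache memory).
batch = 20
print(f"aggregate ~= {per_stream * batch:.0f} t/s")
```

Under these assumptions each stream tops out around ~555 t/s no matter the batch size, which is why the 1k/process/second target is out of reach even though the aggregate (20 streams) looks fine.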
petuman|6 months ago
3.6B activated params at Q8 (~1 byte/param) x 1000 t/s = 3.6 TB/s of memory bandwidth just for the activated model weights (there's also context). So pretty much straight to a B200 and the like. 1000 t/s per user/agent is way too fast anyway; make it 300 t/s and you could get away with a 5090 / RTX PRO 6000.
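The arithmetic behind that estimate, spelled out (a sketch assuming decode is memory-bandwidth-bound and the activated weights are re-read for every token):

```python
# Per-stream bandwidth requirement = activated params x bytes/param x tokens/s.
active_params = 3.6e9    # activated parameters per token (MoE)
bytes_per_param = 1.0    # Q8 quantization, ~1 byte per parameter
tokens_per_sec = 1000    # the asked-for per-stream rate

bandwidth = active_params * bytes_per_param * tokens_per_sec
print(f"{bandwidth / 1e12:.1f} TB/s")   # matches the 3.6 TB/s in the comment

# Dropping the target to 300 t/s cuts the requirement to ~1.1 TB/s,
# which is within reach of high-end consumer cards.
bandwidth_300 = active_params * bytes_per_param * 300
print(f"{bandwidth_300 / 1e12:.2f} TB/s")
```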
mythz|6 months ago
[1] https://ollama.com/library/gpt-oss
artembugara|6 months ago
but I need to understand the 20 x 1k tokens/second throughput.
I assume it might just be too early to know the answer.
PeterStuer|6 months ago
Also keep in mind this model uses 4-bit layers for the MoE parts. Unfortunately, native accelerated 4-bit support only arrived with Blackwell on NVIDIA, so your 3090/4090/A6000/A100s are not going to be fast. An RTX 5090 will be your best starting point among traditional cards. Maybe the unified-memory mini-PCs like the Spark systems or the Mac mini could be an alternative, but I don't know them well enough.
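One upside of those 4-bit MoE layers is the weight footprint. A back-of-envelope estimate, where the MoE/dense parameter split is an assumption on my part (only the ~21B total is from the model card), MXFP4 is taken as ~4.25 bits/param including block scales, and the non-MoE weights are assumed to be bf16:

```python
# Rough VRAM estimate for the 20b model's weights (KV cache not included).
total_params = 20.9e9
moe_params = 19.0e9                  # ASSUMED share held by the MoE experts
dense_params = total_params - moe_params

moe_bytes = moe_params * 4.25 / 8    # MXFP4: 4-bit values + block scales
dense_bytes = dense_params * 2       # bf16, 2 bytes/param

total_gb = (moe_bytes + dense_bytes) / 1e9
print(f"~{total_gb:.0f} GB of weights")
```

That lands in the mid-teens of GB, i.e. the weights fit on a single 24 GB consumer card; the Blackwell point above is about speed (pre-Blackwell GPUs must dequantize the 4-bit weights on the fly), not about whether the model fits.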
mrm_crackerhues|6 months ago
[deleted]
spott|6 months ago
You are unlikely to match Groq on off-the-shelf hardware, as far as I'm aware.
coolspot|6 months ago