(no title)
namanski | 1 year ago
On your question about running these models locally: I doubt that just upgrading your RAM will give you the throughput you see on the website. More RAM lets the model fit at all, but you might still get pretty bad tokens/sec.
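As a rough sanity check on why: token generation is mostly memory-bandwidth-bound, since each new token has to stream roughly all the model weights from memory once. A back-of-the-envelope sketch (the bandwidth figures are rough assumptions, not measurements):

    # Generation is roughly memory-bandwidth-bound: every token streams
    # approximately all model weights from memory once.
    # All numbers below are illustrative assumptions.
    model_params = 70e9              # Llama 3 70B
    bits_per_weight = 2.0            # ~2-bit quantization
    model_bytes = model_params * bits_per_weight / 8   # ~17.5 GB

    ram_bandwidth = 50e9             # dual-channel DDR4, ~50 GB/s (assumed)
    vram_bandwidth = 448e9           # RTX 3080-class GPU, ~448 GB/s (approx.)

    print(ram_bandwidth / model_bytes)    # ~2.9 tokens/s ceiling from system RAM
    print(vram_bandwidth / model_bytes)   # ~25 tokens/s ceiling if it all fit in VRAM

So even with plenty of RAM, system memory bandwidth caps a 2-bit 70B at a few tokens/s; the numbers you see on the website come from hardware with far more memory bandwidth.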
ChristophGeske | 1 year ago
I am currently testing the limits and got Llama 3 70B in a 2-bit-quantized form to run on my laptop with fairly low specs: an RTX 3080 (laptop version) with 8GB VRAM and 16GB system RAM. It runs at 1.2 tokens/s, which is a bit slow. The biggest issue, however, is the time it takes for the first token to be printed, which fluctuates between 1.8s and 45s.
I tested the same model on a 4070 (desktop PC version) with 16GB VRAM and 32GB system RAM, and it runs at about 3-4 tokens per second. The 4070 also suffers from a long time to first token; I think it was around 12s in my limited testing.
I'm still trying to find out how to speed up the time to first token. 4 tokens per second is usable for many cases, because that's about reading speed.
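In case it's useful, this is roughly how I measure both numbers with llama-cpp-python. The GGUF filename, prompt, and n_gpu_layers value below are placeholders; set the layer count to whatever fits in your VRAM:

    import time
    from llama_cpp import Llama

    # Placeholder filename: point this at whatever 2-bit 70B GGUF you downloaded.
    llm = Llama(
        model_path="Meta-Llama-3-70B-Instruct.IQ2_XS.gguf",
        n_gpu_layers=20,   # placeholder: offload as many layers as fit in VRAM
        n_ctx=2048,
    )

    start = time.perf_counter()
    first = None
    n_tokens = 0
    for chunk in llm("Explain quantization in one paragraph.",
                     max_tokens=128, stream=True):
        if first is None:
            first = time.perf_counter()   # first token arrived
        n_tokens += 1
    end = time.perf_counter()

    print(f"time to first token: {first - start:.1f}s")
    print(f"decode speed: {n_tokens / (end - first):.1f} tokens/s")

Counting streamed chunks approximates the token count closely enough for this purpose, and splitting the timer at the first chunk separates prompt processing from decode speed.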
There are also 1-bit-quantized 70B models appearing, so there might be ways to make it run even a bit faster on consumer GPUs.
I think we are right at the edge of usability here, and I keep testing.
I can't tell exactly how this strong quantization affects output quality; information about that is mixed and seems to depend on the form of quantization as well.