It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is roughly a slightly faster 3080 with 24GB of VRAM. I can fit the entire Q4 model in memory with 128K context (and I could probably do 256K, since I still have about 4GB of VRAM free). Prompt processing runs at about 1K tokens/second and generation at around 100 tokens/second. Plenty fast for agentic use via Opencode.
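For a rough sense of what those rates mean per agent turn, here's a back-of-the-envelope sketch. The function name and the example token counts are made up for illustration; the only inputs taken from the numbers above are the ~1K tokens/second prompt rate and ~100 tokens/second generation rate. It assumes both rates stay constant and ignores any prompt cache reuse between turns, so treat the results as ballpark figures only.

```python
def turn_latency_seconds(prompt_tokens, output_tokens,
                         prompt_rate=1000.0, gen_rate=100.0):
    """Estimate wall-clock seconds for one request.

    prompt_rate and gen_rate are tokens/second, defaulting to the
    approximate figures quoted above. Assumes constant rates and no
    cache reuse (both are simplifications).
    """
    prefill = prompt_tokens / prompt_rate   # time to process the prompt
    decode = output_tokens / gen_rate       # time to generate the reply
    return prefill + decode


# Hypothetical turn sizes, chosen only to show the scaling.
for prompt, output in [(2_000, 300), (20_000, 500), (128_000, 500)]:
    seconds = turn_latency_seconds(prompt, output)
    print(f"{prompt:>7} prompt + {output} output tokens: ~{seconds:.0f} s")
```

Under these assumptions short turns are dominated by generation, while a cold prompt that fills the whole 128K context would spend a couple of minutes on prompt processing alone.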
rahimnathwani|1 day ago
I'm curious which one you're using.
suprjami|1 day ago
msuniverse2026|1 day ago
pja|1 day ago
Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulkan.
Vulkan is easier to get going using the Mesa OSS drivers under Linux; HIP might give you slightly better performance.
wirybeige|1 day ago