asabla | 6 months ago
I'm on a 5090, so it's not an apples-to-apples comparison. But I'm getting ~150 t/s for the 20B version using a ~16000-token context size.
steinvakt2 | 6 months ago
And flash attention doesn't work on the 5090 yet, right? So currently the 4090 is probably faster, or?
PeterStuer | 6 months ago
I don't think the 4090 has native 4-bit support, which will probably have a significant impact.
diggan | 6 months ago
> And flash attention doesn't work on 5090 yet, right?

Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) and another Blackwell card (RTX Pro 6000), so I think it should work on the 5090 as well; it's the same architecture after all.
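For context, a minimal sketch of enabling flash attention when loading a GPT-OSS GGUF through the llama-cpp-python bindings; the model path, quantization, and context size are placeholders, not the commenter's exact setup:

```python
from llama_cpp import Llama

# Hypothetical local GGUF file; substitute your own quantized GPT-OSS 20B build.
llm = Llama(
    model_path="gpt-oss-20b.Q4_K_M.gguf",
    n_ctx=16384,       # roughly the ~16000-token context mentioned above
    n_gpu_layers=-1,   # offload all layers to the GPU
    flash_attn=True,   # enable flash attention if the build/GPU supports it
)

out = llm("Explain flash attention in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```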
modeless | 6 months ago
Cool, what software?
asabla | 6 months ago
Initial testing has only been done with Ollama. I plan to test llama.cpp and vLLM when there is enough time.
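As a rough sketch (assuming a recent ollama-python client and the published gpt-oss:20b tag, neither of which the commenter confirms), a throughput figure like the ~150 t/s above can be computed from the eval counters Ollama returns:

```python
import ollama

# Assumes the model has already been pulled: `ollama pull gpt-oss:20b`
resp = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_second:.1f} t/s")
```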