top | item 47192229

winternewt | 1 day ago

And if you don't want to buy a Mac? An 80 GB NVIDIA GPU costs $10,000 (equivalent to roughly 40 years of a ChatGPT Plus subscription) and will probably be obsolete in 5-7 years anyway. What are my options if I want a decent coding agent at a reasonable price?


timschmidt|1 day ago

I'm able to run the Unsloth quants on an ancient dual-socket Xeon 1U server I keep around for homelab stuff. It has 8 DDR3 channels, which gives me about as much memory bandwidth as two channels of DDR5 :-/ But it has 16 DIMM slots, and the sticks are much cheaper, so it holds 256 GB right now. I have to run the minimum-size Unsloth quant of the largest open-weight models, and they definitely feel a bit dazed. The machine can take up to 1.5 TB of DDR3, which would let me run many of the largest models unquantized, but at 1/4 of the already abysmal ~1 token/s I see now, which is only really usable with multiple agents running a kanban-style async development process. Nothing interactive. That said, I picked up the hardware at the local surplus for $25 and it's vintage ~2010. Pretty impressive what this enterprise gear can do.

Power consumption? Don't ask. A subscription is cheaper.
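That ~1 token/s figure roughly matches a back-of-the-envelope bandwidth estimate: at batch size 1, decode speed is bounded by memory bandwidth divided by the bytes of weights read per token. The figures below (DDR3-1333 per-channel bandwidth, ~85 GB of active weights for a large dense quant) are assumptions for illustration, not measurements from the post:

```python
# Back-of-the-envelope: decode tokens/s ~= memory bandwidth (GB/s)
# divided by active weight bytes read per token (GB), dense model, batch 1.
def tokens_per_second(bandwidth_gbs, active_weight_gb):
    return bandwidth_gbs / active_weight_gb

# Assumed: 8 channels of DDR3-1333 at ~10.6 GB/s each -> ~85 GB/s aggregate.
bw = 8 * 10.6
print(round(tokens_per_second(bw, 85), 1))  # ~1.0 token/s for an ~85 GB quant
```

This also explains the "1/4 of the speed unquantized" remark: four times the weight bytes per token means a quarter of the decode rate on the same bandwidth.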

paganel|1 day ago

> Power consumption

That’s the thing: at the end of the day, power consumption will matter most for the end user who doesn’t have money to burn, because I suspect that in the majority of cases the electricity cost will exceed the price of the hardware itself within a few months of intense use, a year at most.
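A quick worked example of why that comment holds for old enterprise gear. The wattage and electricity price below are hypothetical round numbers, not figures from the thread:

```python
# Assumed: a ~400 W server running inference around the clock at $0.30/kWh.
watts = 400
price_per_kwh = 0.30
hours_per_month = 24 * 30

monthly_power_cost = watts / 1000 * hours_per_month * price_per_kwh
print(round(monthly_power_cost, 2))  # prints 86.4
```

At ~$86/month, a $25 surplus server pays more for electricity in its first week than it cost to buy, and a $20/month subscription is indeed cheaper.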

zepearl|1 day ago

I downloaded Ollama ( https://github.com/ollama/ollama/releases ) and experimented with a few Qwen models ( https://huggingface.co/Qwen/collections ).

My performance when using an RTX 5070 12GiB VRAM, Ryzen 7 9700X 8 cores CPU, 32GiB DDR5 6000MT (2 sticks):

  - "qwen2.5:7b": ~128 tokens/second (this model fits 100% in the VRAM).
  - "qwen2.5:32b": ~4.6 tokens/second.
  - "qwen3:30b-a3b": ~42 tokens/second (this is a MoE model with multiple specialized "brains") (this uses all 12GiB VRAM + 9GiB system RAM, but the GPU usage during tests is only ~25%).
  - "qwen3.5:35b-a3b": ~17 tokens/second, but it's highly unstable and crashes -> currently not usable for me.

So currently my sweet spot is "qwen3:30b-a3b" - even if the model doesn't completely fit on the GPU, it's still fast enough. "qwen3.5" has been disappointing so far, but maybe things will change (maybe Ollama needs some special optimizations for the 3.5 series?).

I would therefore deduce that the most important factor is the amount of VRAM, and that performance would be similar even on an older GPU with the same memory (e.g. an RTX 3060, which also has 12 GiB of VRAM)?
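Those numbers line up with a rough size estimate for 4-bit quants. The constants below (~4.7 bits/weight for a Q4_K_M-style GGUF, ~2 GB of KV-cache/runtime overhead) are assumptions for the sketch, not measured values:

```python
# Rough GGUF footprint at ~4.7 bits/weight (Q4_K_M-ish quant),
# plus an assumed ~2 GB for KV cache and runtime overhead.
def q4_size_gb(params_billion, bits_per_weight=4.7):
    return params_billion * bits_per_weight / 8

def fits_in_vram(params_billion, vram_gb, overhead_gb=2.0):
    return q4_size_gb(params_billion) + overhead_gb <= vram_gb

print(fits_in_vram(7, 12))   # True  -> runs fully on the GPU, fast
print(fits_in_vram(32, 12))  # False -> spills to system RAM, slow
```

A 7B quant (~4-5 GB) fits comfortably in 12 GiB, while a dense 32B quant (~19 GB) cannot, which matches the ~128 vs ~4.6 tokens/second gap above.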

Performance without a GPU, tested by using a Ryzen 9 5950X 16 cores CPU, 128GiB DDR4 3200 MT:

  - "qwen2.5:7b": ~9 tokens/second
  - "qwen3:32b": ~2 tokens/second
  - "qwen3:30b-a3b": ~16 tokens/second

siquick|1 day ago

Rent an H100 on Modal, which scales down to zero when not in use - you can set the timeout period.

Cold-boot times are around 5 minutes, but if your usage periods are predictable it can work out OK. It comes to about $2 an hour.

Still far more expensive than a ChatGPT sub.
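The break-even is easy to compute from the two prices quoted above ($2/hour rental vs. a ~$20/month subscription):

```python
# Break-even between a scale-to-zero GPU rental and a flat subscription.
rent_per_hour = 2.0    # H100 rental price quoted above
sub_per_month = 20.0   # typical ChatGPT Plus-style subscription

break_even_hours = sub_per_month / rent_per_hour
print(break_even_hours)  # prints 10.0
```

So the rental only competes on price if you use fewer than about 10 GPU-hours a month - a coding agent running for hours a day blows past that quickly.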

flyingjoe|1 day ago

Do you have some reference on what setup you're talking about? I'd like to integrate it into my IDE (cursor/vscode) - are there docs on such a setup?

segmondy|1 day ago

GPUs are not going obsolete anytime soon. The Nvidia P40/P100 launched in 2016, 10 years ago, and they're still popular in the local-inference space. My first set of GPUs was a bunch of P40s I bought 3 years ago for $150 apiece. At one point they went all the way up to $450, but the price is now back down in the $200 range. I think I've gotten my value out of those, and I suspect they'll still be crunching out tokens for at least 3 more years. They still beat 90% of CPU/system-memory inference combos.

krenerd|1 day ago

Indeed, the point is that it's going for $150

Keyframe|1 day ago

> What are my options if I want a decent coding agent at a reasonable price?

I'd even come at it from another angle: what are my options if I want a decent coding agent, on the level of what Claude does, at any given price? Let's say a few tens of thousands of dollars? I've had a limited look at what's available to run locally, and nothing is on par.

renewiltord|1 day ago

Does not exist AFAIK. Even other labs struggle to match Claude-level performance on real-world tasks. My experience is that no open model is close. You can get an RTX 6000 Pro Blackwell (the Max-Q variant is better on power - it draws half as much). I've heard good things about Qwen3 Coder Next, but I couldn't get tool calling to perform well, though that's likely PEBKAC.

If you want to spend big bucks, get an H200 with 141 GB, but honestly the RTX 6000 Pro is good enough until you know what you want. The workstation edition is good; it takes care of cooling etc.

Tbh it's even better to just get a model through the cloud. If you want, you can rent a GPU. Then see if it's what you want.

atwrk|1 day ago

A Strix Halo with 128GB unified memory is less than $2k and the more suitable alternative to a mac. I'm pretty happy with my device (Bosgame M5).

segmondy|1 day ago

The Macs outperform it, and I figure a Mac is a better general-purpose computer than a Strix Halo. If budget is a problem, then a Strix Halo is a decent alternative.

Keyframe|1 day ago

> A Strix Halo with 128GB unified memory is less than $2k

Where did you get that price? Everywhere I've looked it's around 3k euros, which is around $3.5k.

rookonaut|1 day ago

Can you elaborate more on your use cases, models, setup,...?

khalic|1 day ago

You can rent GPUs; this comes with security, maintenance, and performance overhead, but it also has a few advantages.

But right now, a Mac is the easiest way because of their memory architecture.

am17an|1 day ago

Honestly, you can run this on a 16 GB VRAM GPU with llama.cpp. Just try it!
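llama.cpp makes this work by offloading only as many transformer layers to the GPU as fit, keeping the rest in system RAM (its `-ngl` / `--n-gpu-layers` option). A minimal sketch of that sizing decision, using assumed round numbers (model size, layer count, ~2 GB runtime overhead), not measured values:

```python
# Hypothetical sketch: choose how many layers to offload so the
# GPU-resident portion of the model fits in available VRAM.
def max_gpu_layers(vram_gb, n_layers, model_gb, overhead_gb=2.0):
    per_layer_gb = model_gb / n_layers       # assume layers are equal-sized
    budget = vram_gb - overhead_gb           # VRAM left after runtime overhead
    return min(n_layers, int(budget / per_layer_gb))

# e.g. a ~19 GB quant with 64 layers on a 16 GB card:
print(max_gpu_layers(16, 64, 19))  # prints 47
```

So on a 16 GB card most (but not all) layers of a ~19 GB quant run on the GPU, which is exactly the partial-offload regime where llama.cpp still performs respectably.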