top | item 45857993


pahae | 3 months ago

I wish the big providers would offer some sort of trial period where you can evaluate models in a _realistic_ setting yourself (i.e. CLI tools or IDE integrations). I wouldn't even mind strict limits -- just give me two hours or so of usage and I'd already be happy. Seriously.

My use case is probably pretty far from the usual tasks: I'm currently implementing a full observability platform based on VictoriaMetrics / VictoriaLogs + Grafana. It's quite elaborate and has practically no overlap with the usual/cloud solutions you find out there. For example, it uses an authenticated query stack: I use the Grafana OAuth token to authenticate queries by injecting matchers via prom-label-proxy, then forward the query to promxy for fan-out to different datasources (using the label filter to only query some of them). The IaC stuff is also not mainstream, as I'm not using any of the big cloud providers, but the provider I use nonetheless has a Terraform provider.
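For anyone curious what that chain looks like, here's a rough sketch of the Grafana -> prom-label-proxy -> promxy topology. All label names, ports, and targets below are placeholders I made up, not my actual config -- check the prom-label-proxy and promxy READMEs for the exact flags your versions support:

```shell
# prom-label-proxy enforces a label matcher on every incoming PromQL query,
# so a caller can only see series carrying the enforced label value.
# "datasource" is a hypothetical tenant label here; in practice the value
# would be derived per-request from the Grafana OAuth token.
prom-label-proxy \
  -insecure-listen-address 127.0.0.1:8080 \
  -upstream http://127.0.0.1:8082 \
  -label datasource \
  -label-value team-a

# promxy sits behind it and fans the query out. Each server group is
# tagged with the same label that prom-label-proxy matches on, so the
# matcher doubles as a datasource selector (minimal promxy.yaml sketch):
cat > promxy.yaml <<'EOF'
promxy:
  server_groups:
    - static_configs:
        - targets: ["vmselect-a:8481"]   # placeholder VictoriaMetrics endpoint
      labels:
        datasource: team-a
    - static_configs:
        - targets: ["vmselect-b:8481"]
      labels:
        datasource: team-b
EOF
promxy --config=promxy.yaml --bind-addr=127.0.0.1:8082
```

Because promxy drops server groups whose labels can't match the query's matchers, the injected `datasource="team-a"` matcher means only the first group is ever queried.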

As you can imagine, there's probably not much training data for most of this, so the quality of the responses varies widely. From my experience so far, Claude (Sonnet 4.5) does a _much_ better job than GPT-5 (Codex or normal) with the day-to-day tasks. Stuff like keeping documentation up to date, spotting inconsistencies, helping me find blind spots in the alerting rules, etc. It also seems to do better working with provided documentation / links.

I'd been using Claude for a couple of weeks but recently switched to Codex after my Claude subscription ran out. I was really curious after reading a lot of good things about it, but I gotta say, so far I'm not impressed. Compared to Claude it gives wrong answers much more frequently (at least in this domain), and the results it produces take much more effort to clean up than Claude's -- probably to the point where I could just invest the time myself. It might be that I don't yet know how to prompt GPT correctly, but given the same prompt, Claude does a better job 90% of the time.

Anyway, I guess this is my long-winded way of saying that the quality of responses "off the beaten track" varies widely, and it's worth testing several models yourself. Especially if your work is not 70+% coding. Even then, I guess many benchmarks have ceased being useful by now?


tim333 | 3 months ago

There's the GitHub Copilot 30-day trial? "Access to Anthropic Claude Sonnet 4, GPT-5, Gemini 2.5 Pro, and more" -- 300 premium requests to use the latest models and code review.