h4ny | 6 months ago
At the moment it feels like most people "reviewing" models depend on their beliefs and agendas, and there is no objective way to evaluate and compare models (many benchmarks can be gamed).
The blurring of the boundaries between technical overviews, news, opinion, and marketing is truly concerning.
epolanski|6 months ago
You are not going to get the same output from GPT5 or Sonnet every time.
And this obviously compounds across many different steps.
E.g. give GPT5 the code for a feature (by pointing at some files and tests) and tell it to review it, find improvement opportunities, and write them down: depending on the size of the code, etc., the answers will differ slightly.
I often do it in Cursor by having multiple agents review a PR. Each agent: - has to write down its own pr-number-review-model.md (e.g. pr-15-review-sonnet4.md) - has to review the reviews in the other files
Then I review it myself and try to decide what's valuable in there and what isn't. And to my disappointment (in myself): - they often point to valid flaws I wouldn't have thought of - they miss the "end-to-end" or general view of how the code fits into a program/process/business. What I mean: sometimes the real feedback would be that we don't need the feature at all. But you need to have those conversations with the AI earlier.
x187463|6 months ago
physix|6 months ago
vineyardmike|6 months ago
I think you'll always have some disagreement, in life generally but especially for things like this. Code has a level of subjectivity: good variable names, the right amount of abstraction, verbosity, over-complexity, etc. are at least partially matters of opinion. That makes benchmarking something subjective tough. Furthermore, LLMs aren't deterministic, and sometimes you just get a bad seed from the RNG.
Not only that, but the harness and prompt used to guide the model make a difference. Claude responds to the word "ultrathink", but if GPT-5 uses "think harder" instead, what should go in the prompt?
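A shared harness that targets several models has to account for this: the reasoning trigger phrase differs by model family, so one fixed prompt can't serve both. A minimal sketch of that adaptation, assuming a simple prefix-match on model names (the mapping and `build_prompt` helper are illustrative, not any vendor's API):

```python
# Reasoning-trigger phrases by model family, per the comment above.
# The mapping itself is an assumption for illustration.
THINKING_HINTS = {
    "claude": "ultrathink",
    "gpt-5": "think harder",
}

def build_prompt(model: str, task: str) -> str:
    """Prepend the family-specific thinking hint, if one is known."""
    for family, hint in THINKING_HINTS.items():
        if model.lower().startswith(family):
            return f"{hint}: {task}"
    return task  # unknown family: send the task unchanged
```

So a benchmark that sends the identical prompt string to every model may be measuring harness fit as much as model quality.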
Anecdotally, I've had the best luck with agentic coding when using Claude Code with Sonnet. Better than Sonnet with other tools, and better than Claude Code with other models. But I mostly use Go and Dart, and I aggressively manage the context. I've found GPTs can't write Zig at all, but Gemini can, and both can write Python excellently. All that said, if I didn't like an answer, I'd prompt again; but if I liked the answer, I never tried again with a different model to see if I'd like its answer even more. So it's hard to know what could've been.
I've used a ton of models and harnesses. Cursor is good too, and I've been impressed with more models in Cursor. I don't get the hype around Qwen, though, because I've found it makes lots of small(er) changes in a loop, and that's noisy and expensive. Gemini is also very smart but worse at following my instructions, though I never took the time to experiment with prompting it.
jjfoooo4|6 months ago
I heavily discount same-day commentary; there's a quid pro quo between early access and favorable reviews (and yes, folks publishing early commentary aren't explicitly agreeing to write favorable things, but there's obvious bias baked in).
I don't think it's all that concerning; you can discount reviews that come out so quickly that it's unlikely the reviewer has really used the model much.
muzani|6 months ago
Just pick something and use it. AI models are interchangeable. It's not as big a decision as buying a car, or even a durian.
isaacremuant|6 months ago
Can't help but laugh at this. It's like you just discovered skepticism and how the world actually works.
qsort|6 months ago