Not OP, but what I do is keep specific test prompts saved that I feed to new models. Something that's only from my brain and not copied from the internet.
For example, Simon Willison (@simonw) has them draw an SVG of a pelican on a bicycle[0], something that has never been photographed and that the LLM can't have a direct reference for, so it has to figure it out.
I have similar things, but not as visual. I know what the result should be and what it should look like. Then I compare and contrast.
Mistral.ai, for example, is by far the fastest (hardware clearly overprovisioned compared to load), but it also produces utter bullshit and hallucinations. With ChatGPT and Claude I can kinda feel the results slowing down or getting worse when USAians are awake and hogging the resources (they're either throttling or just plain using a shittier model under load). Deepseek even has a cheaper API price for off-hours.
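The approach above can be sketched as a tiny eval harness. This is a minimal illustration only: the prompts, the keyword-based scoring, and the `fake_model` stub are all hypothetical placeholders, not anyone's actual setup, and in practice you'd swap in a real API call per provider and eyeball the outputs rather than keyword-match them.

```python
def score_response(response: str, required_keywords: list[str]) -> float:
    """Fraction of expected keywords present in the model's answer."""
    hits = sum(1 for kw in required_keywords if kw.lower() in response.lower())
    return hits / len(required_keywords)

# Saved test prompts with hand-written expectations ("only from my brain").
# Both the prompt and the expected markers are illustrative examples.
CANARY_PROMPTS = {
    "svg-pelican": {
        "prompt": "Draw an SVG of a pelican riding a bicycle.",
        "expect": ["<svg", "circle", "path"],  # crude structural check
    },
}

def evaluate(model_fn, prompts=CANARY_PROMPTS) -> dict[str, float]:
    """Run every canary prompt through model_fn and score the results."""
    return {name: score_response(model_fn(spec["prompt"]), spec["expect"])
            for name, spec in prompts.items()}

# Stub standing in for a real model API call, so the sketch is runnable.
def fake_model(prompt: str) -> str:
    return '<svg><circle r="5"/><path d="M0 0"/></svg>'
```

Running the same harness against each new model and comparing scores (and the raw outputs) side by side is the "compare and contrast" step.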
theshrike79|7 months ago
[0] https://simonwillison.net/2025/Jun/6/six-months-in-llms/