13pixels | 19 days ago
Have you found that forcing the LLM into a structured scoring framework reduces its tendency to hallucinate specs? Or does it just hallucinate the scores with more confidence?
Also, curious if you've tried different models for the "scoring" vs "reasoning" steps. We've found Claude is much better at adhering to complex constraints than GPT-4o for tasks like this.
boundedreason | 18 days ago
The balance I've been trying to find is between "show me all your work," so you know the math is all there and correct (perfect for enterprise), and telling friends or my parents, who don't care to see the math in the background, why it needs to be there for this to work right. I have seen scores "ballparked" by ChatGPT: the rankings didn't change in the end, but the scores were a couple tenths off.
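A minimal sketch of the kind of check I mean, assuming a simple weighted-criteria rubric (the criteria, weights, and reported score are all illustrative, not from any real run):

```python
# Recompute a weighted rubric score to catch "ballparked" LLM arithmetic.

def weighted_score(ratings, weights):
    """Return the weighted average of per-criterion ratings."""
    total_weight = sum(weights.values())
    return sum(ratings[c] * weights[c] for c in ratings) / total_weight

# Hypothetical rubric for a car comparison.
weights = {"price": 0.4, "reliability": 0.35, "comfort": 0.25}
ratings = {"price": 7.0, "reliability": 9.0, "comfort": 8.0}

expected = weighted_score(ratings, weights)  # the exact math
reported = 8.2                               # score the model "ballparked"

# Flag totals that drift more than a tenth from the real arithmetic.
if abs(reported - expected) > 0.1:
    print(f"mismatch: model said {reported}, math says {expected:.2f}")
```

Even when the rankings survive, a check like this makes the drift visible instead of trusting the model's own totals.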
I've used ChatGPT 5.2, Claude, and Gemini, but I've never switched models between steps; that sounds interesting! I've found the same as you with Claude: ChatGPT is a close second, and Gemini doesn't give me the type of response I'd prefer for keeping things smooth and traceable. I'm looking to buy a new car right now, and each of the three models has given me the same top 3 each time, so I take that as reassurance.