13pixels | 19 days ago
Have you found that forcing the LLM into a structured scoring framework reduces its tendency to hallucinate specs? Or does it just hallucinate the scores with more confidence?
Also, curious if you've tried different models for the "scoring" vs "reasoning" steps. We've found Claude is much better at adhering to complex constraints than GPT-4o for tasks like this.
boundedreason | 18 days ago
The balance I've been trying to find is between "show me all your work," so you know the math is all there and correct (perfect for enterprise), and telling friends or my parents, who don't care to see the math in the background, why it needs to be there for this to work right. I have seen scores "ballparked" by ChatGPT: the rankings didn't change in the end, but the scores were a couple tenths off.
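A minimal sketch of the kind of check I mean, assuming a simple weighted-criteria rubric (the criteria, weights, and reported score are all illustrative, not from any real run):

```python
# Recompute a weighted rubric score to catch "ballparked" LLM arithmetic.

def weighted_score(ratings, weights):
    """Return the weighted average of per-criterion ratings."""
    total_weight = sum(weights.values())
    return sum(ratings[c] * weights[c] for c in ratings) / total_weight

# Hypothetical rubric for a car comparison.
weights = {"price": 0.4, "reliability": 0.35, "comfort": 0.25}
ratings = {"price": 7.0, "reliability": 9.0, "comfort": 8.0}

expected = weighted_score(ratings, weights)  # the exact math
reported = 8.2                               # score the model "ballparked"

# Flag totals that drift more than a tenth from the real arithmetic.
if abs(reported - expected) > 0.1:
    print(f"mismatch: model said {reported}, math says {expected:.2f}")
```

Even when the rankings survive, a check like this makes the drift visible instead of trusting the model's own totals.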
I've used ChatGPT 5.2, Claude, and Gemini, but I've never switched models between steps; that sounds interesting! I've found the same as you with Claude: ChatGPT is a close second, and Gemini doesn't give me the type of response I'd prefer for keeping things smooth and traceable. I'm looking to buy a new car right now, and each of the three models has given me the same top 3 each time, so I take that as reassurance.