That's the thing with documentation; there are hardly any situations where a simple true/false works. Product decisions have many caveats and evolving behaviors coming from different people. At that point, a numerical grading format isn't something we even want — we want reasoning, not ratings.
kyeb|1 month ago
You would think so! But that's only optimal if the model already has all the information in recent context to make a well-informed decision.
In practice, this is a neat context engineering trick: the different LLM calls in the "courtroom" each have different context, so they can contribute independent bits of reasoning to the overall "case".
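The "courtroom" pattern described above can be sketched roughly as follows: several model calls each see only their own slice of context and produce independent opinions, which a final "judge" call aggregates. This is a minimal illustration, not anyone's actual implementation; `call_llm` is a hypothetical stand-in for a real model API, stubbed here so the structure is clear.

```python
def call_llm(role: str, context: str, question: str) -> str:
    # Hypothetical model call; a real implementation would hit an LLM API
    # with `context` in the prompt. Stubbed for illustration.
    return f"[{role}] given {context!r}, my take on {question!r}: ..."

def courtroom(question: str, context_slices: dict[str, str]) -> str:
    # Each "witness" call sees only its own context slice, so its
    # reasoning is an independent contribution to the overall case.
    opinions = [
        call_llm(role, ctx, question)
        for role, ctx in context_slices.items()
    ]
    # The "judge" call sees the opinions (not the raw slices) and decides.
    return call_llm("judge", "\n".join(opinions), question)

verdict = courtroom(
    "Is feature X still supported?",
    {"docs": "the published docs", "tickets": "recent support tickets"},
)
```

The key design point is that no single call needs the full context window: each witness reasons over a manageable slice, and only the distilled opinions reach the judge.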
aryamanagraw|1 month ago