That's the thing with documentation; there are hardly any situations where a simple true/false works. Product decisions have many caveats and evolving behaviors coming from different people. At that point, a numerical grading format isn't something we even want — we want reasoning, not ratings.
kyeb|1 month ago
You would think so! But that's only optimal if the model already has all the information in recent context to make a well-informed decision.
In practice, this is a neat context engineering trick: the different LLM calls in the "courtroom" each have different context, so they can contribute independent bits of reasoning to the overall "case".
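The "courtroom" pattern described above can be sketched roughly as follows: several model calls each see only their own slice of context and produce independent opinions, which a final "judge" call aggregates. This is a minimal illustration, not anyone's actual implementation; `call_llm` is a hypothetical stand-in for a real model API, stubbed here so the structure is clear.

```python
def call_llm(role: str, context: str, question: str) -> str:
    # Hypothetical model call; a real implementation would hit an LLM API
    # with `context` in the prompt. Stubbed for illustration.
    return f"[{role}] given {context!r}, my take on {question!r}: ..."

def courtroom(question: str, context_slices: dict[str, str]) -> str:
    # Each "witness" call sees only its own context slice, so its
    # reasoning is an independent contribution to the overall case.
    opinions = [
        call_llm(role, ctx, question)
        for role, ctx in context_slices.items()
    ]
    # The "judge" call sees the opinions (not the raw slices) and decides.
    return call_llm("judge", "\n".join(opinions), question)

verdict = courtroom(
    "Is feature X still supported?",
    {"docs": "the published docs", "tickets": "recent support tickets"},
)
```

The key design point is that no single call needs the full context window: each witness reasons over a manageable slice, and only the distilled opinions reach the judge.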
aryamanagraw|1 month ago