(no title)
disconcision | 7 months ago
while it is not realistic to insist every study account for every possible objection, i would argue that for this kind of capability work, it is in general worth at least modest effort to establish a human baseline.
i can understand why people might not care about this, for example if their only goal is assessing whether or not an llm-based component can achieve a certain level of reliability as part of a larger system. but i also think that there is similar, and perhaps even more pressing broad applicability for considering the degree to which llm failure patterns approximate human ones. this is because at this point, human are essentially the generic all-purpose subsystem used to fill gaps in larger systems which cannot be filled (practically, or in principle) by simpler deterministic systems. so when it comes to a problem domain like this one, it is hard to avoid the conclusion that humans provide a convenient universal benchmark to which comparison is strongly worth considering.
(that said, i acknowledge that authors probably cannot win here. if they provided even a modest-scale human study, i am confident commenters would criticize their sample size)
No comments yet.