Measuring What Matters: Construct Validity in Large Language Model Benchmarks

1 point | Cynddl | 3 months ago | arxiv.org

No comments yet.