This strikes me as kind of ironic -- you'd think a language model would do better on essay prompts and multiple-choice reading comprehension questions than on calculations. I wonder if there are more details about these benchmarks somewhere, so we can see what's actually happening in these cases.
jltsiren|3 years ago