Did you look at the examples? There's a big difference between "if I have four apples and two cats, and I give away 1 apple, how many apples do I have," which is one kind of irrelevant information that at least appears applicable, and "if I have four apples and give away one apple, how many apples do I have? Also, did you know cats use their tails to help balance?", which really wouldn't confuse most humans.
krisoft|7 months ago
And I think it would. I think a lot of people would ask the invigilator whether something is wrong with the test, or answer both questions, or write a short answer to the cat question too, or get confused and give up.
That is the kind of question where, if it were put on a test, I would expect kids to start squirming and looking at each other and the teacher right as they reach it.
I’m not sure how big this effect is, but it would be very surprising if there were no effect and unsuspecting, unwarned people performed the same on the “normal” and the “distractions” tests. Especially if the irrelevant information is phrased as a question, like in your example.
I’ve heard from teachers that students get distracted if you add irrelevant details to word problems. This is obviously anecdotal, but the teachers I chatted with about this thought it’s because people are trained through their whole education that every element of a word problem must be used. So when you add extra bits, people’s minds desperately try to use them.
But the point is not that I’m right. Maybe I’m totally wrong. The point is that if the paper wants to state this as fact one way or the other, it should have performed an experiment, or cited prior research, or avoided stating an unsubstantiated opinion about human behaviour and stuck to describing the AI.
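To make the suggestion concrete, the experiment being asked for is just a matched-pair comparison: take each word problem, make a variant with a distractor sentence appended, and score any answerer (human or model) on both sets. A minimal, hypothetical sketch of that setup:

```python
# Hypothetical sketch of the controlled comparison suggested above:
# pair each word problem with a "distractor" variant, then score an
# answerer on both sets. All names here are made up for illustration.

def make_pair(problem, answer, distractor):
    """Return (clean, distracted) variants of one word problem."""
    return (
        {"prompt": problem, "answer": answer},
        {"prompt": f"{problem} {distractor}", "answer": answer},
    )

def score(answer_fn, items):
    """Fraction of items that answer_fn gets right."""
    correct = sum(1 for it in items if answer_fn(it["prompt"]) == it["answer"])
    return correct / len(items)

clean, distracted = make_pair(
    "If I have four apples and give away one apple, how many apples do I have?",
    3,
    "Also, did you know cats use their tails to help balance?",
)

# With matched pairs, the gap between score(fn, clean_set) and
# score(fn, distracted_set) measures the distraction effect directly.
```

The same harness works for a human study or an LLM eval; only `answer_fn` changes.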
CJefferson|7 months ago
Many students clearly try to answer exams by pattern matching, and I've seen a lot of exams where students "matched" on a pattern based on one word in a question and did something totally wrong.
jonathanlydall|7 months ago
For example, customer service reps often vaguely match your request to a possibly or only vaguely applicable templated response.
Technically savvy customers who tend to explain problems in detail are probably more likely to get a non-applicable canned response, as the CS rep gets frustrated with the amount of information and latches onto the first phrase which relates to a templated response without really considering context.
My reply’s getting a little tangential now, but I feel this is good life advice: I’ve found I’m more likely to get decent customer service if I keep my requests as short as possible.
The first sentence needs to state the issue I need help with. In some cases a bulleted list of things I’ve tried helps, and then I’m sure to include essential info like an account number, e.g.:
I’m getting error 13508 when I try to log into my account. I’ve already tried the following solutions with no success:
- Clearing my browser cache and cookies.
- Restarting my computer.
- Running all software updates.
My account number: xxx
What is the next step here?
viccis|7 months ago
But in general I do not think these models are claiming to replicate the performance of a distracted or otherwise low-performing pupil. I think they should be evaluated against humans who are capable of completing word problems containing context that is not strictly necessary to the math question. The reason the tests I mentioned use these word problems is that they evaluate someone's ability to think in abstract mathematical terms about everyday situations, which obviously involve lots of unimportant information the person must choose to consider or not.
tl;dr: I think a reasonably competent high school student could answer the apple-and-cat question, which is absolutely a reasonable bar for an LLM to clear. If university students are failing these questions, then they have not been taught test-taking skills, which should be considered a mathematical failure just as unacceptable as the LLM's, not a mitigating similarity for the latter.
kazinator|7 months ago
We can easily cherry pick our humans to fit any hypothesis about humans, because there are dumb humans.
The issue is that AI models which, on the surface, appear to be similar to the smarter quantile of humans in solving certain problems, become confused in ways that humans in that problem-solving class would not be.
That's obviously because the language model is not generally intelligent; it's just retrieving tokens from a high-dimensional, statistically fit function. The extra info injects noise into the calculation, which confounds it.
lupusreal|7 months ago
Maybe I'm totally wrong about that, but they really should have tested humans too; without that comparison, this result seems lacking.