top | item 44726587


jmilloy | 7 months ago

Did you look at the examples? There's a big difference between "if I have four apples and two cats, and I give away 1 apple, how many apples do I have", which is one kind of irrelevant information that at least appears applicable, and "if I have four apples and give away one apple, how many apples do I have? Also, did you know cats use their tails to help balance?", which really wouldn't confuse most humans.


krisoft|7 months ago

> which really wouldn't confuse most humans

And I think it would. I think a lot of people would ask the invigilator to see if something is wrong with the test, or maybe answer both questions, or write a short answer to the cat question too, or get confused and give up.

That is the kind of question where, if it were put on a test, I would expect kids to start squirming and looking at each other and the teacher, right as they reach that one.

I’m not sure how big this effect is, but it would be very surprising if there were no effect and unsuspecting, unwarned people performed the same on the “normal” and the “distractions” tests. Especially if the irrelevant information is phrased as a question, like in your example.

I have heard from teachers that students get distracted when irrelevant details are added to word problems. This is obviously anecdotal, but the teachers I chatted with about this thought it is because people are trained through their whole education that every element of a word problem must be used. So when extra bits are added, people’s minds desperately try to use them.

But the point is not that I’m right. Maybe I’m totally wrong. The point is that if the paper wants to state it as fact one way or the other, it should have performed an experiment. Or cited prior research. Or avoided stating an unsubstantiated opinion about human behaviour and stuck to describing the AI.

diamond559|7 months ago

Yeah you're right, if that human is 5 years old or has crippling ADHD.

bugbuddy|7 months ago

An LLM’s source of “knowledge” is almost purely statistical. The prompt injections create statistical noise that makes the token search a crapshoot. My guess is there are certain words and phrases that generate and amplify the statistical noise.
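A toy sketch of that intuition (entirely illustrative: the candidate tokens and logit values are made up, not taken from any real model). The idea is that a distractor in the prompt perturbs the next-token logits, flattening the softmax distribution so the decoded answer becomes less predictable:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits at the answer position,
# over made-up candidate tokens ["3", "5", "cats"].
clean_logits = [4.0, 1.0, 0.1]   # prompt without the distractor
noisy_logits = [2.5, 2.0, 1.8]   # same prompt plus irrelevant text (assumed shift)

p_clean = softmax(clean_logits)
p_noisy = softmax(noisy_logits)

# Clean prompt: sharply peaked on the correct answer (~0.93).
# Noisy prompt: much flatter (~0.48), so greedy or sampled
# decoding is far more likely to pick a wrong token.
print(max(p_clean), max(p_noisy))
```

Whether real prompt injections shift logits in exactly this way is an open empirical question; the sketch only shows why a flatter distribution makes the "token search" less reliable.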

throwanem|7 months ago

I wonder if there's variation at play here in testing culture, whether spatially or temporally or both.

CJefferson|7 months ago

As someone who has written and graded a lot of university exams, I'm sure a decent number of students would write the wrong answer to that. A bunch of students would write 5 (adding all the numbers). Others would write "3 apples and 2 cats", which is technically not what I'm looking for (though personally I would give it full marks; some wouldn't).

Many students clearly try to answer exams by pattern matching, and I've seen a lot of exams where students "matched" on a pattern based on one word in a question and did something totally wrong.

jonathanlydall|7 months ago

Many professionals in lower-skilled jobs sometimes lean too heavily on pattern matching too.

For example, customer service reps often match your request with a templated response that is only vaguely applicable, or not applicable at all.

Technically savvy customers who try to explain problems in detail are probably more likely to get a genuinely non-applicable canned response, as the CS rep gets frustrated with the amount of information and latches onto the first phrase that relates to a templated response without really considering context.

My reply is getting a little tangential now, but I feel this is good life advice: I’ve found I’m more likely to get decent customer service if I keep my requests as short as possible.

The first sentence needs to state the issue I need help with. In some cases a bulleted list of things I’ve tried helps, and then I make sure to include essential info like an account number, e.g.:

I’m getting error 13508 when I try to log into my account. I’ve already tried the following solutions with no success:

- Clearing my browser cache and cookies.

- Restarting my computer.

- Running all software updates.

My account number: xxx

What is the next step here?

jaccola|7 months ago

The parent's whole point is contrary to this (they agree with you): the context didn't even include numbers to pattern match on!

kazinator|7 months ago

When you try to wing your way through a question by pattern matching, you are not applying intelligence. Your interests lie elsewhere, so you are just fumbling your way through the activity at hand just to get through it.

viccis|7 months ago

I agree that poor test takers are easily distracted, and this is the reason that "word problems" are heavily emphasized in preparation for tests like the SAT or state proficiency exams.

But in general I do not think these models claim to replicate the performance of a distracted or otherwise low-performing pupil. I think they should be evaluated against humans who are capable of completing word problems containing context that is not inherently necessary to the math question. The reason the tests I mentioned use these word problems is that they are a way to evaluate someone's ability to think in abstract mathematical terms about everyday situations, which obviously involve lots of unimportant information the person must choose to consider or not.

tl;dr: I think a reasonably competent high school student could answer the apple-and-cat question, which is absolutely a reasonable bar for an LLM to clear. If university students are failing these questions, then they have not been taught test-taking skills, which should be considered a mathematical failure just as unacceptable as the LLM's, not a mitigating similarity for the latter.

wagwang|7 months ago

Yes, especially interview questions that include a stupid "real life example" that is usually irrelevant to the question.

wongarsu|7 months ago

If asked verbally, that would absolutely confuse some humans. Easily enough to triple the error rate for that specific question (granted, it's easier than the actual questions, but still). Even in a written test with time pressure, it would probably still have a statistically significant effect.

kazinator|7 months ago

The problem with your reasoning is that some humans cannot solve the problem even without the irrelevant info about cats.

We can easily cherry pick our humans to fit any hypothesis about humans, because there are dumb humans.

The issue is that AI models which, on the surface, appear to be similar to the smarter quantile of humans in solving certain problems, become confused in ways that humans in that problem-solving class would not be.

That's obviously because the language model is not generally intelligent; it's just retrieving tokens from a high-dimensional, statistically fit function. The extra info injects noise into the calculation, which confounds it.

cantor_S_drug|7 months ago

Is the model thinking, "what is the cat doing here?", and then starting to think it is being tested?

lawlessone|7 months ago

A human would immediately identify it as a trick.

metalman|7 months ago

"Wouldn't confuse most humans": yes, but no. The first presumption is that we are talking about humans doing math in some sort of internet setting. The second presumption is that this human has been affected by the significant percentage of the internet devoted to cats, and that their response is going to be either frustration and outrage at cats invading math, or massive relief at having cat memes worked into something otherwise tedious. And the third presumption is that a large number of "humans" won't be aware of the cats-in-math thing at all, because they immediately offloaded the task to an LLM.

graeme|7 months ago

It absolutely would if you start hitting working memory constraints. And at the margin, some people who would be 50:50 on a given math problem will hit working memory constraints.

lupusreal|7 months ago

Any kind of distraction is likely to impact human test scores, unless the test is well below their level or they're otherwise very comfortable with the subject matter. Math specifically makes most of the general public feel a bit in over their head, so tossing random cat facts into the mix is going to get people more confused and nervous.

Maybe I'm totally wrong about that, but they really should have tested humans too; without that comparison, this result seems lacking.