top | item 42002065

alexwebb2 | 1 year ago

Demonstrably false.

https://chatgpt.com/share/6722ca8a-6c80-800d-89b9-be40874c5b...

https://chatgpt.com/share/6722ca97-4974-800d-99c2-bb58c60ea6...

TZubiri|1 year ago

It's worth noting that this may not be the result of a pure LLM; it's possible that ChatGPT is using "actions", explicitly:

1. Run the query through a classifier to figure out whether the question involves numbers or math.
2. Extract the function and the operands.
3. Do the math operation with standard non-LLM mechanisms.
4. Feed the solution back to the LLM.
5. Concatenate the math answer with the LLM answer via string substitution.

So in a strict sense this is not very representative of the logical capabilities of an LLM.
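The pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not ChatGPT's actual implementation; all function names are made up, and the "classifier" and extractor here are crude regexes standing in for what would really be model-driven components.

```python
import ast
import operator
import re

# Map AST operator nodes to ordinary arithmetic (the non-LLM "tool").
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def looks_like_math(query: str) -> bool:
    """Step 1: a crude stand-in for a classifier that flags math queries."""
    return bool(re.search(r"\d+\s*[-+*/]\s*\d+", query))

def extract_expression(query: str) -> str:
    """Step 2: pull the arithmetic expression out of the question."""
    match = re.search(r"\d[\d\s.+\-*/()]*", query)
    return match.group().strip() if match else ""

def evaluate(expr: str) -> float:
    """Step 3: do the math with standard, non-LLM mechanisms."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def answer(query: str) -> str:
    """Steps 4-5: splice the exact result back into the reply text."""
    if looks_like_math(query):
        expr = extract_expression(query)
        return f"The result of {expr} is {evaluate(expr)}."
    return "(hand the query to the LLM as-is)"
```

The point of the sketch is that the arithmetic never touches the LLM: a correct answer from such a system tells you about the tool, not about the model's reasoning.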

digging|1 year ago

Then what's the point of ever talking about LLM capabilities again? We've already hooked them up to other tools.

This confusion was introduced at the top of the thread. If the argument is "LLMs plus tooling can't do X," the argument is wrong. If the argument is "LLMs alone can't do X," the argument is worthless. In fact, if the argument is that binary at all, it's a bad argument and we should laugh it out of the room; the idea that a lay person uninvolved with LLM research or development could make such an assertion is absurd.

thomashop|1 year ago

It shows you when it's calling functions. I also ran the same test with Llama, which runs locally and cannot make function calls, and it still works.

astrange|1 year ago

Minor edits to well-known problems do easily fool current models, though. Here's one that 4o and o1-mini fail on but o1-preview passes. (It's the mother/surgeon riddle, so it's kinda gore-y.)

https://chatgpt.com/share/6723477e-6e38-8000-8b7e-73a3abb652...

https://chatgpt.com/share/6723478c-1e08-8000-adda-3a378029b4...

https://chatgpt.com/share/67234772-0ebc-8000-a54a-b597be3a1f...

_flux|1 year ago

I think you didn't use the "share" function; I cannot open any of these links. Can you do it in a private browser session (so you're not logged in)?

TaylorAlexander|1 year ago

At this point I really only take rigorous research papers into account when considering this stuff. Just this month, Apple published the research the parent post is referring to. A systematic study is far more compelling than an anecdote.

https://machinelearning.apple.com/research/gsm-symbolic

famouswaffles|1 year ago

That study shows that 4o, o1-mini, and o1-preview's new scores are all within the margin of error on 4/5 of their new benchmarks (some even see increases). The one that isn't involves changing more than names.

Changing names does not affect the performance of SOTA models.

zmgsabst|1 year ago

Only if there isn’t a systemic fault, eg bad prompting.

Their errors appear to disappear when you correctly set the context from conversational to adversarial testing; Apple is actually testing the social context, not the model's ability to reason.

I’m just waiting for Apple to release their GSM-NoOp dataset to validate that; preliminary testing shows it’s the case, but we’d prefer to use the same dataset so it’s an apples-to-apples comparison. (They claim it will be released “soon”.)

gruez|1 year ago

To be fair, the claim wasn't that it always produces the wrong answer, just that there exist circumstances where it does. A pair of examples where it was correct hardly justifies a "demonstrably false" response.