bt1a | 6 months ago
i only call this out because you're selling it and don't hypothesize* on why they fail your simple problems. i suppose an easily aced bench wouldn't be very marketable
Kuinox | 6 months ago
Most of the time they produce a correct summation table but fail to copy the sum correctly into the final result. That is not a tokenisation problem (you can change the output format to rule that out). I have a separate benchmark that tests specifically this: when the input is too large, the LLM fails to accurately copy the correct token. I suspect the positional embeddings are not perfectly learned, and that sometimes causes a mistake.
The prompt is quite short, it uses structured output, and I can generate a nice graph of the % of good responses across the difficulty of the question (which is just the total digit count of the input numbers).
LLMs have a 100% success rate on these sums until they reach a frontier; past that, their accuracy collapses at various speeds depending on the model.
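A harness like the one described (short prompt, answers binned by total digit count) can be sketched roughly as below. This is my own reconstruction, not the actual benchmark: `ask_model` is a placeholder for whatever LLM client is used, and the exact prompt wording is a guess.

```python
import random

def make_problem(total_digits: int) -> tuple[str, int]:
    """Generate two addends whose combined digit count is total_digits."""
    d1 = random.randint(1, total_digits - 1)
    d2 = total_digits - d1
    a = random.randint(10 ** (d1 - 1), 10 ** d1 - 1)
    b = random.randint(10 ** (d2 - 1), 10 ** d2 - 1)
    return f"What is {a} + {b}? Answer with digits only.", a + b

def ask_model(prompt: str) -> str:
    """Placeholder: swap in a real LLM call. Constraining the output
    to digits (structured output) rules out formatting noise."""
    raise NotImplementedError

def accuracy_by_difficulty(trials: int = 50, max_digits: int = 40) -> dict[int, float]:
    """Fraction of exactly-correct answers per total digit count."""
    results = {}
    for n in range(2, max_digits + 1):
        correct = 0
        for _ in range(trials):
            prompt, expected = make_problem(n)
            if ask_model(prompt).strip() == str(expected):
                correct += 1
        results[n] = correct / trials
    return results
```

Plotting `accuracy_by_difficulty()` against `n` would show the flat 100% region followed by the collapse past the model-specific frontier.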
bwfan123 | 6 months ago
Even when the algorithm's steps are laid out precisely, they cannot be followed. Perhaps LLMs should be trained on Turing machine specs and be given a tape lol.
Constraint satisfaction and combinatorics, where the search space is exponential and the techniques are not formalized (not enough data in the training set), remain hard for machines, as seen in Problem 6 of the IMO, which could not be solved by LLMs. I suspect there is an aspect of human intelligence here that is not yet captured in LLMs.
[1] - https://machinelearning.apple.com/research/illusion-of-think...
energy123 | 6 months ago
The temp 0.7-1.0 defaults are not designed for reconstructing context with perfect accuracy.