mlb_hn | 4 years ago

I get the tokenization argument and it may influence it a bit, but I suspect the n-digit math issue has more to do with the way it samples during search (in the BPE link gwern references some experiments I'd done on improving n-digit math by chunking with commas, http://gptprompts.wikidot.com/logic:math). Since it samples left to right on the first pass, it can't predict well when digits carry from right to left.
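A minimal sketch of the comma-chunking idea (the prompt wording and function names here are my own, not taken from the linked page):

```python
# Group digits into 3-digit chunks with commas so each chunk is more
# likely to map to a consistent BPE token than an arbitrary digit run.
def chunk_number(n: int) -> str:
    """Insert commas every three digits, e.g. 1234567 -> '1,234,567'."""
    return f"{n:,}"

def math_prompt(a: int, b: int) -> str:
    # Hypothetical prompt template; the exact wording is an assumption.
    return f"Q: What is {chunk_number(a)} + {chunk_number(b)}?\nA:"

print(math_prompt(1234567, 89012))
```

Python's `,` format specifier happens to produce exactly this grouping, so no manual string slicing is needed.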

I think you can mitigate the search issue a bit if you have the prompt double-check itself after the fact (e.g. https://towardsdatascience.com/1-1-3-wait-no-1-1-2-how-to-ha...). Works differently depending on the size of the model, though.
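A sketch of the double-check pattern: after the model produces a draft answer, append a verification step and ask it to re-derive. The template wording below is an assumption, not the linked article's exact format.

```python
# Build a prompt that asks the model to verify its own draft answer.
# Template phrasing is hypothetical; adjust for your model/use case.
def double_check_prompt(question: str, draft_answer: str) -> str:
    return (
        f"Q: {question}\n"
        f"A: {draft_answer}\n"
        f"Let me double-check that.\n"
        f"Q: {question}\n"
        f"A (verified):"
    )

print(double_check_prompt("What is 17 + 25?", "42"))
```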


moyix | 4 years ago

Yup, quite possible that this has something to do with it. There is other work showing that giving LMs a "scratchpad" for intermediate computations allows them to do much better not just at arithmetic but also things like predicting the output of some code: https://arxiv.org/abs/2112.00114
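As a toy illustration of the scratchpad idea for arithmetic (my own sketch, not the paper's format): write out each column's digit sum and carry as explicit intermediate lines, which is exactly the right-to-left carry information a single left-to-right pass can't see.

```python
# Toy "scratchpad" for addition: emit each column's digit sum and carry
# as an explicit intermediate step, then the final result.
def addition_scratchpad(a: int, b: int) -> list[str]:
    steps = []
    da, db = str(a)[::-1], str(b)[::-1]  # process digits right to left
    carry, digits = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        carry_in = carry
        s = x + y + carry_in
        carry, digit = divmod(s, 10)
        digits.append(str(digit))
        steps.append(f"column {i}: {x} + {y} + {carry_in} = {s} "
                     f"-> digit {digit}, carry {carry}")
    if carry:
        digits.append(str(carry))
    steps.append(f"result: {''.join(reversed(digits))}")
    return steps

for line in addition_scratchpad(57, 68):
    print(line)
```

Each intermediate line is the kind of "scratch" text the model can condition on when producing later tokens, instead of having to resolve all the carries in one shot.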

mlb_hn | 4 years ago

Definitely. It also works on text translation/comprehension, like emojis! https://aidungeon.medium.com/introducing-ai-dungeon-translat.... For actual benchmarks, a scratchpad improves GPT-Davinci on WiC from 50% accuracy (chance) to nearly 70%.

I think the check-and-validate approach is a different sort of scratchpad, but maybe not. Seems like there are at least three types: some for pulling implicit info out of the network (viz. WiC), some for intermediate steps (viz. coding), and some for verification, like here.

ravi-delia | 4 years ago

I feel like attention would largely mitigate that, no? Has anyone looked at what the weights are while doing addition?