Wow, a lot of grumpiness in here. If it's true that adding 20 or so tokens to encode column location / decimal position triples math performance on out-of-band tasks, that's a big deal. It's a simple fix, it improves performance A LOT, and they even indicate it's not just a party trick, in that the LLM can use the information to do better on related tasks like sorting and list-making.
This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
I'm more interested in the question of how we can find other useful concepts for data -> embedding space like this; can we incept our tokenization inception so it has more inception?
This is cool, but special casing digits is unsatisfying.
It makes me think that the authors have correctly identified an issue (positional embeddings) but don't propose a general solution.
I'm not sure if such a thing is possible, but if it is, it would feel more complete. (Fwiw, positional embeddings have had issues for a long time! So a general solution to this would benefit more than just arithmetic. Helpfully, we now have a really good specific example to serve as a baseline for any generalization we seek)
> This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
This is much different from how tokenization works today. Adding tokens to the vocabulary is free; everything outside that (i.e. the string -> tokens mapping) is going to be a major pain in the ass. Doable, but annoying and error-prone.
It's also obvious and it's hacky. Frankly I'm stunned this hasn't been tried yet. The people thinking this is a stepping stone to More Intelligence are missing the forest for the trees.
Deep learning is always and only ever about representing data abstractly. The more abstractions you can make irrelevant (why would you have to learn how to do math when the base-10 perspective on ASCII-digits is already provided for you?) the more you've biased your architecture to readily learn and understand the problem space.
Intelligence doesn't exist where Divine Creator gave you access to this or that faculty. It's developing those faculties yourself by reasoning through the process of composing your own mental model about the problem.
I think the problem here is that 'understanding' is not the same as curve fitting.
If all one is doing is giving a model lots of data and fitting curves, it's not really 'understanding'; it's brute-forcing its way (with gradient descent), storing the weights, and finally approximating the solution when a query is passed in.
This is not the same as understanding. Human intelligence can operate deterministically as well as non-deterministically. We can listen to language, which is by its nature non-deterministic, and convert it into deterministic operations and vice versa, i.e. we can operate on some logic and explain it in multiple ways to other people.
Understanding requires much less data than brute forcing your way into pattern recognition.
When you see a simple expression like 2 * 4, you are able to understand that it's equivalent to 2 + 2 + 2 + 2, and that in turn means 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 <- count that and you've got your answer.
Because you 'understand' this basic concept and all the operations in between, you are able to compute more examples. But you only need to understand it once. Once you understand multiplication and addition and all the tricks in between, you are able to compute 23 * 10 without being fed 23 * 10 as prior data. Understanding is very different from fitting a curve. You can reach conclusions and understanding through pattern recognition, but it's important to differentiate 'approximation' from 'calculation'. If you understand something in its entirety, you should be able to calculate an outcome deterministically.
Right now LLMs lack 'understanding' and seem to only 'approximate', which may look like 'understanding' but is actually not.
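The reduction described above (multiplication as repeated addition, addition as repeated counting) can be written out directly; the function names here are mine, just for illustration:

```python
def multiply(a: int, b: int) -> int:
    """Multiplication defined as repeated addition: 2 * 4 == 2 + 2 + 2 + 2."""
    total = 0
    for _ in range(b):
        total += a
    return total

def multiply_by_counting(a: int, b: int) -> int:
    """Reduce all the way down to counting ones: 1 + 1 + ... (a * b times)."""
    return sum(1 for _ in range(a) for _ in range(b))
```

Understanding the rule once is enough: `multiply(23, 10)` yields 230 without `23 * 10` ever appearing as stored data.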
I think you are mixing layers of abstraction. To make a crude but, I think, not unhelpful analogy: 'understanding' is a natural-language concept that is our way to describe what's happening in our heads, and like most other such concepts it is resistant to any clear definition and will exhibit sorites-type paradoxes when one is attempted. It belongs to the presentation layer of the stack, while the process of curve fitting, however it is implemented, with whatever NN structure (like transformers) or maybe something else entirely, belongs to the physical layer of the stack -- akin to frequency modulation.
While I am unsure whether LLMs are really understanding, whatever that means, I think it is not difficult to believe that any form of understanding we implement will involve 'curve fitting' as a central part.
This seems like it's confusing how we conceptualize the training/learning process with what the system is actually doing. We conceptualize tuning parameters as curve fitting, and we conceptualize predicting the next token as maximizing probability. But that doesn't mean there is anything like curve fitting or probability maximizing happening as the system's parameters converge.
The core feature of curve fitting is learning explicit examples and then interpolating (in an uninformative manner) between unlearned examples. But there's no reason to think this completely describes what the system is doing, in the sense that there are no more informative descriptions of its behavior. Take an example that LLMs are surprisingly good at, creating poetry given arbitrary constraints. Imagine the ratio of the poems it has seen during its training over the number of unique poems it could create in principle. This number would be vanishingly small. Interpolating between two strings representing well-formed poems in an uninformative manner (i.e. some finite polynomial) will not generate well-formed poems. The only way you could move between two examples of well-formed poems while staying on the manifold of well-formed poems is if you captured all relevant features of the manifold. But I fail to see a difference between capturing all relevant features of the poetry-manifold and understanding poetry.
What LLMs do can be described as curve fitting only in the most uninformative sense possible. What they do is discover features of the structures referred to by the training text and competently deploy these features in predicting the next token. A human who could do this would be considered to understand said structure.
It seems like a hack to be honest. Problem at hand is not to make transformers do addition of 100 digit numbers. Problem is the current systems can’t reason about things, math included.
Optimizing for a certain use case is not gonna take us where we wanna be. We want to have a system that can learn to reason.
> Problem is the current systems can’t reason about things
Sounds like the AGI argument trap: they're not able to reason, but we can't succinctly define what reasoning is.
I don't come with a reasoning chip. Whatever I call reasoning happens as a byproduct of my neural process.
I do think that the combination of a transformer network and calls to customized reasoning chips (systems that search and deduce answers, like Wolfram Alpha or logic/proof systems) may be a short-stop to something that can perform reason and execution of actions better than humans, but is not AGI.
As I understand it, conceptually they just changed 346 + 23 = ? to (1: 3, 2: 4, 3: 6) + (1: 2, 2: 3) = ?
So it is not that much of a specific hack. There could be a broader principle here where something is holding transformers back in a general fashion, and we might be able to improve on the architecture!
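A rough sketch of the idea in plain code (note: the paper aligns digits from the least significant end, so that matching indices form the "columns" of long addition; everything below is my own illustration, not the paper's learned-embedding mechanism):

```python
def annotate(n: int) -> dict:
    """Pair each digit with its place index, counted from the least
    significant digit, so the columns of two numbers line up."""
    return {i: int(d) for i, d in enumerate(reversed(str(n)), start=1)}

def add_annotated(x: dict, y: dict) -> int:
    """Column-wise addition with carry: each step only needs the digits
    at one matching index, never the whole number."""
    result, carry = {}, 0
    for i in range(1, max(max(x), max(y)) + 1):
        s = x.get(i, 0) + y.get(i, 0) + carry
        result[i], carry = s % 10, s // 10
    if carry:
        result[max(result) + 1] = carry
    return sum(d * 10 ** (i - 1) for i, d in result.items())
```

With the position made explicit, `add_annotated(annotate(346), annotate(23))` is pure column bookkeeping, which is plausibly what the explicit positional signal buys the transformer.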
how do you argue that these models are not able to reason?
deductive reasoning is just drawing specific conclusions from general patterns, something I would argue these models can do (of course not always, and they are still pretty bad in most cases)
the point I'm trying to make is that reasoning is sometimes overrated and put at the top of the cognitive ladder; I have seen it compared to self-awareness and the like. I know you are probably not saying it that way, I just wanted to let it out.
I believe there is fundamental work still to be done, maybe models that are able to draw patterns by comparing experiences, but this kind of work can be useful, as it makes us reflect on every step of what these models do and on how much the learned internal representation can be optimized.
We as humanity are building a reasoning machine bottom-up. It can't reason... yet. Expecting a magical switch that will make it reason about anything and everything is unreasonable. Starting with arithmetic makes perfect sense.
I didn't test every LLM out there, but all of those I tested failed with something as basic as "What is the number of words in the sentence coming before the next one? Please answer."
For things like this where we have computationally cheap, well understood, reliable tools available (aka calculator) it seems better to train the model in tool use.
I guess perhaps the techniques could be generalized though?
Generalizable techniques are mostly the point of papers like this one, yes. What they show here is that apparently fundamental problems with transformer reasoning can be fixed by encoding data in a more sophisticated manner. This is exciting. I've been thinking for a long time that tokenization schemes are low-hanging fruit for improving coding-LLM performance; this isn't exactly the same thing, but it's in the same general area. Smartness and reasoning ability with the current set of algorithmic techniques seems to have topped out around GPT-4 level, which implies that further leaps in mental abilities must come from improving things other than training-set size.
For example, whilst replacing the need for a calculator isn't very important, one obvious research direction would be to explore adding extra embeddings to code inputs, perhaps that are being computed by an IDE.
I think understanding mathematics is what LLMs really need at the moment, far more important than video generation, which is just another form of CGI [1]. After deep learning and transformers, understanding mathematics and its proofs (not just arithmetic) will be the next game changer for LLMs and a turning point for humanity.
[1] Why LLMs like ChatGPT and Google Bard are bad at math:
> understanding mathematics and its proofs not just arithmetic will be the next game changer for LLM
Why?
I definitely agree that such capabilities would represent a major advance (and very likely go together with game changing increases of capabilities in other areas). I also think using AI to write formal math proofs in e.g. Lean is very cool.
However, by itself, it seems like this capability wouldn't be very useful, commercially for example. Do you think this capability is exceptionally informative merely because it has to go together with other capabilities? It's not impossible to have a (maybe somewhat limited) formal math AI that will remain mostly irrelevant to the everyday world (like FormalGeo).
Something I've been thinking about is how the Minds -- the super-human AI hyper-computers that fly the ships in the Culture series of novels -- are described. The image built up in my head [1] is that they're hybrids blending neural networks and regular compute substrates. They can calculate, simulate, and reason in combination.
There have been crude attempts at this already, hooking in Mathematica and Python into ChatGPT. I say crude, because these add-ons are controlled via output tokens.
What I would like to see is a GPT-style AI that also has compute blocks, not just transformer blocks. I don't mean compute in the sense of "matrix multiply for weights and biases", but literally an ALU-style block of basic maths operations available for use by the neurons.
One thought that I had was that this could be via activations that have both a floating-point activation value and "baggage" such as a numerical value from the input. Like a token in a traditional parser, that can represent a constant string or an integer with its decoded value.
The newer, truly multi-modal models gave me a related idea: Just like how they can have "image" tokens and "audio" tokens, I wonder if they could be given "numeric data" tokens or "math symbol" tokens. Not in the same way that they're given mixed-language text tokens, but dedicated tokens that are fed into both the transformer blocks and also into ALU blocks.
Just an idle thought...
[1] Every reader reads into a story something unique, which may or may not align with what the author intended. This is my understanding, coloured by my own knowledge, etc, etc...
The problem, if you embed an ALU like that, is how to train the model to use it properly. And then it's not clear whether the model actually needs to do that in the middle of a pass that, at the end, is going to produce a single token anyway.
Controlling that stuff via output tokens actually kinda makes sense by analogy, since that is how we use calculators etc. But I do agree that specialized tokens that are used specifically to activate tools like that might be a better idea than just using plain text to signal in-band. And production of such specialized tokens can be easily trained.
I like this idea a lot. Right now we are going the long, hard way round: post-training, we ask an LLM to recognise that it needs compute, then write a compute request, then feed the compute answer back into the tokenization loop.
It probably does make sense to add a mini CPU as a layer / tool / math primitive. I wonder how you'd train it to use such a thing? In my mind it's not really a layer per se, but a set of function calls a layer could route to when it wants, weighting the response appropriately.
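One hedged sketch of how such routing could be trained end-to-end, borrowing the soft-selection idea from Neural Arithmetic Logic Units: compute every candidate exact operation and blend them with learned weights, so gradients flow through the selection. All names below are mine; this is a toy, not any existing framework's API:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

class SoftALU:
    """A toy 'layer' that applies several exact arithmetic ops to two
    scalar inputs and returns a softmax-weighted mix. Only the selection
    logits are trainable; the ops themselves are exact, not learned."""

    def __init__(self, rng=None):
        rng = rng or np.random.default_rng(0)
        self.logits = rng.normal(size=4)  # one logit per candidate op

    def forward(self, a: float, b: float) -> float:
        ops = np.array([a + b, a - b, a * b, a / b if b != 0 else 0.0])
        w = softmax(self.logits)          # differentiable soft routing
        return float(w @ ops)
```

During training the logits would be pushed to saturate so one op dominates, after which the layer's output is near-exact arithmetic rather than an approximation learned from data.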
I just wonder: if numbers were written right to left, would LLMs be much better at arithmetic? You can 'predict' the least significant digit by reusing the already-written digits in the computation, but to generate the most significant ones, you generally need to do the entire computation in one go.
Yes. This has already been demonstrated by "Teaching Arithmetic to Small Transformers" https://arxiv.org/abs/2307.03381 , I'm not sure what OP adds except demonstrating that you can do that via the embedding itself rather than the tokenization.
> We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges.
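The least-significant-first intuition is easy to check concretely: emitting digits from the right means each output digit depends only on one column plus a single carry, exactly the kind of local state an autoregressive model can track. A toy sketch (function name mine):

```python
def add_reversed(a: str, b: str) -> str:
    """Add two numbers given as *reversed* digit strings ('346' -> '643'),
    emitting output digits strictly left to right. Each emitted digit
    needs only the current column and a one-digit carry."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        s = da + db + carry
        out.append(str(s % 10))
        carry = s // 10
    if carry:
        out.append(str(carry))
    return "".join(out)  # answer, still reversed

# 346 + 23: feed '643' and '32', read the answer back reversed
assert add_reversed("643", "32") == "963"  # i.e. 369
```

Writing most-significant-digit-first offers no such shortcut: the very first output digit can depend on every input digit (consider 999...9 + 1), which is the asymmetry the quoted formatting experiments exploit.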
This is an interesting idea but probably hard to verify.
A tangent is that positional systems were originally invented with least digit first, I believe.
The Babylonian sexagesimal system was like that as was the Arabic one (where first is on the right).
The most-significant-digit-first convention arose when right-to-left numbers were adopted into left-to-right scripts without reversing them in writing. To this day we read the more common smaller numbers least significant digit first, to varying degrees:
16 = six-teen (English), sechzehn (German)
98 = achtundneunzig (German), achtennegentig (Dutch), ثمانية وتسعون (Arabic), all literally 'eight-and-ninety'
I'm curious about the framing of research like this.. "The poor performance of transformers on arithmetic tasks" (relative to what?) and how that informs the adjacent conversation on progress towards AGI.
Some say AGI has already been achieved, others that it's years or decades away. When I dig into the disagreement, it often partially depends on the perspective of how competent humans are on the tasks in question, with the optimists being, I think, more realistic about variance in human intelligence and the pessimists seeming to reserve the term "general intelligence" for possessing a nearly perfect suite of capabilities that many otherwise intelligent people practically don't have.
For example with arithmetic, this study cites another [Dziri et al. 2023], that says:
"For instance, humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively."
I still see value in normative statements about human capability in AI & AGI research, but I think we'll need to move towards explicit statistical framing.
DeepMind's Position paper "Levels of AGI for Operationalizing Progress on the Path to AGI" has a schema like this, where AGI capabilities are defined across 2 axes of Performance level X Generality (narrow vs general), and the Performance levels are measured by comparison with "Percentile of skilled adults" able to perform the task.. https://arxiv.org/pdf/2311.02462#page=3.40
Within that framing, this paper's title or result might be "Achieving AGI Competency in Arithmetic", or "Expertise", or "Virtuosity", i.e. on par respectively with 50th, 90th or 99th percentile of skilled adults.
Exactly, we need a much more granular approach to evaluating intelligence and generality. Our current conception of intelligence largely works because humans share evolutionary history and partake in the same 10+ years of standardized training. As such, many dimensions of our intelligence correlate quite a bit, and you can likely infer a person's "general" proficiency or education by checking only a subset of those dimensions. If someone can't do arithmetic then it's very unlikely that they'll be able to compute integrals.
LLMs don't share that property, though. Their distribution of proficiency over various dimensions and subfields is highly variable and only slightly correlated. Therefore, it makes no sense to infer the ability or inability to perform some magically global type of reasoning or generalization from just a subset of tasks, the way we do for humans.
AGI is like consciousness, 75% of the people in any given conversation are talking about different things.
Truthfully, we're going to see that improving language models towards AGI works out the same way self-driving cars did: we're going to feel like we're 85% of the way there out of the gate, then we're going to keep tripping over things for the next 15 years.
At least with AGI, we can just throw up our hands, use an easier definition and take the W.
I don't understand the framing of your comment. You act like the LLM's feelings are going to be hurt if you say it isn't a real AGI. "Well, you can't do basic math expected of fifth graders, but there are dumb fifth graders too, so here's the 'human-level intelligence' participation trophy anyway."
The issue that separates "AGI" from current AI systems is the lack of generality. (Humour me.)
In particular, the lack of reasoning capability. And what the pessimists argue here is that there is no road to get there for current systems. Transformers are approximation machines, and are generalized for that specific task. But that's also where it stops, they can't do things that aren't such pattern-approximation.
Optimizing a transformer for arithmetic isn't a step towards AGI, because it is not generalizing. You'd need to do this for every conceivable task and subtask. This is the exact reason why imperative-programmed AI architectures were discarded.
Put bluntly, this approach will never get you a transformer that won't shit itself when asked to do novel reasoning tasks, such as novel mathematics. (Which, I will remind the reader, anything but the most basic programming work counts as.)
And critically, the fundamental architecture of these transformer systems doesn't allow combining them with other AI systems to acquire generalized capabilities. There's no way to make an LLM hook into a computer algebra system; you can only feed the 'finished' output of one system into another.
The other day I was wondering if LLMs are bad at maths because they don't have readily apparent access to the concept of "columns". Apparently the answer is yes.
Vertical alignment across lines is pretty important for humans learning operations on digits, but the way we encode lines with a \n separator doesn't really help. In a recent Code Bullet video, GPT really struggled with any kind of vertical-alignment task. I wonder if it would do better with a fixed 80-column width...
Isn't it more that they don't have ready access to the much-more-fundamental concept of decimal numbers?
My understanding was that they tokenized them into chunks and tried to learn associations between the chunks, the same as if one was breaking apart English words.
So "2+2=4" isn't being treated that differently from "all's well that ends well." This might lead to a kind of Benny's Rules [0] situation, where sufficient brute-force can make a collection of overfitted non-arithmetic rules appear to work.
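A toy illustration of why chunked tokenization hides digit structure. The merge table below is made up; real BPE vocabularies differ, but the failure mode is the same: frequent digit runs become opaque single tokens and column alignment is lost:

```python
# Hypothetical merge table, not any real tokenizer's vocabulary.
MERGES = ["28", "42", "100", "2+2"]

def toy_tokenize(text: str) -> list:
    """Greedy longest-match against a tiny made-up BPE-style vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for m in sorted(MERGES, key=len, reverse=True):
            if text.startswith(m, i):
                tokens.append(m)
                i += len(m)
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# '1283' splits as ['1', '28', '3']: the model never sees that the '2'
# sits in the hundreds column, because it's fused into the '28' chunk.
```

Under this toy scheme `toy_tokenize("1283")` gives `['1', '28', '3']` while `toy_tokenize("1293")` gives four single digits, so two numbers of the same length don't even get the same token layout.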
I went through the paper and thought immediately about how did they implement it; I missed they published their code as well. Here is the link for everyone who skimmed past it: https://github.com/mcleish7/arithmetic/tree/main
It's basically the same as feature engineering in pre-deep machine learning: constructing features with high information content can significantly reduce the amount of data and computation needed to fit a useful model. And sometimes it's impossible to fit a useful model without careful feature engineering, either because the model itself is constrained in some way or because there isn't enough data or both.
It's analogous to making a choice of inductive bias within the model itself. We literally could not do LLMs without the carefully-constructed transformer architecture. Why should we expect to make further progress without paying more attention to the embeddings?
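The feature-engineering analogy is easy to make concrete: a linear model cannot fit y = x² from x alone, but fits it exactly once the squared feature is supplied, much as a transformer struggles with digit alignment until the position signal is supplied. A minimal sketch:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 50)
y = x ** 2

# Raw feature: best least-squares *line* through a parabola - a poor fit.
A_raw = np.stack([x, np.ones_like(x)], axis=1)
_, res_raw, *_ = np.linalg.lstsq(A_raw, y, rcond=None)

# Engineered feature: add x**2 as a column and the same linear model
# now fits the data exactly - no extra capacity or data needed.
A_eng = np.stack([x ** 2, x, np.ones_like(x)], axis=1)
_, res_eng, *_ = np.linalg.lstsq(A_eng, y, rcond=None)
```

The model class never changed; only the representation did, which is the whole point being made about embeddings.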
Since models are very good at writing very short computer programs, and computer programs are very good at mathematical calculations, would it not be considerably more efficient to train them to recognise a "what is x + y" type problem, and respond with the answer to "write and execute a small javascript program to calculate x + y, then share the result"?
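A minimal sketch of that dispatch loop. The regex and function names are mine, and deployed systems use structured tool-call tokens rather than pattern matching on prompts, but the shape is the same: detect an arithmetic question, evaluate it exactly, and fall back to the model otherwise:

```python
import re

def model_generate(prompt: str) -> str:
    """Stand-in for an actual LLM call (hypothetical)."""
    return "(free-text model answer)"

def answer(prompt: str) -> str:
    """Route 'what is X + Y' style questions to exact evaluation rather
    than asking the model to predict the result digits token by token."""
    m = re.search(r"what is (\d+)\s*([+\-*])\s*(\d+)", prompt.lower())
    if m:
        x, op, y = int(m.group(1)), m.group(2), int(m.group(3))
        return str({"+": x + y, "-": x - y, "*": x * y}[op])
    return model_generate(prompt)
```

For example, `answer("What is 123 + 456?")` returns the exact string `"579"` without the model ever generating a digit.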
From a getting-answers perspective yes; from an understanding-LLMs perspective no. If you read the abstract you can see how this goes beyond arithmetic and helps with long-form reasoning.
But that's not all that relevant to the question "can LLMs do math". People don't really need ChatGPT to replace a calculator. They are interested in whether the LLM has learned higher reasoning skills from its training on language (especially since we know it has "read" more math books than any human could in a lifetime). Responding with a program that reuses the + primitive in JS proves no such thing. Even responding with a description of the addition algorithm doesn't prove that it has "understood" maths if it can't actually run that algorithm itself; it's essentially looking up a memorized definition. The only real proof is actually having the LLM itself perform the addition (without any special-case logic).
This question is of course relevant only in a research sense, in seeking to understand to what extent and in what ways the LLM is acting as a stochastic parrot vs gaining a type of "understanding", for lack of a better word.
This is a cromulent approach, though it would be far more effective to have the LLM generate computer-algebra-system instructions.
The problem is that it's not particularly useful: As the problem complexity increases, the user will need to be increasingly specific in the prompt, rapidly approaching being fully exact. There's simply no point to it if your prompt has to (basically) spell out the entire program.
And at that point, the user might as well use the backing system directly, and we should just write a convenient input DSL for that.
"Syntax-Aware Transformer Models for Neural Machine Translation" by Yang et al. (2019). This model enhances the transformer architecture with syntax-aware attention mechanisms that consider dependency parse trees.
"Context-Aware Neural Machine Translation Learns Anaphora Resolution" by Bawden et al. (2018). This paper explores integrating context and syntax into neural machine translation models.
I think the main problem is the way we turn raw mathematical symbols and equations into tokens; this suboptimal tokenization may decrease performance.
I think that's far from the only problem.
To me the most obvious problem is that we use right-to-left numbers (think about the order in which you write digits when doing long addition) in a left-to-right language.
Without a special number-flipping step, the transformer is forced to produce the output token by token, i.e. from left to right. Without the ability to store additional internal state, this turns addition into an O(N²) problem purely due to the suboptimal output ordering!
What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.
And not only addition: all four arithmetic operations. The technique proposed in the article (imposing a strong inductive bias for addition) kind of works for multiplication, but not for subtraction or division (clearly; I can't even find those words in the paper). As a practical way to build a machine to do arithmetic, this is out of the question.
We've known how to mechanise arithmetic since the 1640s, when Blaise Pascal built his Pascaline. What is the point in demonstrating it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns into a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever).
So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
Ok, you want the general answer? Consider a discrete-time Markov process with memory length N on a finite state space. Train a transformer with context length N on sample trajectories with SGD. Can you expect the transformer to become a good approximation of the dynamics of the Markov process? More specifically, suppose your Markov process is generated by some algorithm/Turing machine coupled with some random data. Then, can you expect the transformer to learn to emulate the behavior of the underlying Turing machine, even when run on data which was not in the initial distribution?
Another way to phrase it: Given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical processes by observing sample trajectories?
This question can be somewhat mathematically stated but it is quite difficult because there are still some words in there where I used common sense. For example mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.
Hence any instance where transformers fail to learn a Markov process is very interesting. Example: addition of random numbers.
> With positions resolved, we can study the logical extrapolation ability of transformers
They are interested in how well they can make a neural net logically extrapolate outside its training set, once encoding barriers are removed. They show that in fact even quite small language models can do this successfully once we're not confusing them with bad encodings anymore.
This seems like fundamental work. It was only a few years ago that Google employees were arguing LLMs were nothing more than "stochastic parrots". Well, that take will go down in history as one of the worst takes on AI ever. I don't think anyone really believed it anymore by 2024, but the huge and opaque datasets meant people could always argue that maybe a given answer wasn't an example of logical reasoning or extrapolation, that maybe the model had just seen that specific question before. But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers. It's not just repeating answers it's seen in its dataset. It should kill off the parrot meme for good.
It’s not about arithmetic but about embeddings. The positional embeddings used in transformers are rather simplistic. If they can add this one new capability to transformers by using different embeddings then maybe there are other capabilities that are within reach.
I think there is a good reason to find low-hanging fruits that pay dividends on these types of tasks, not because solving addition with a transformer is a good idea, but because it could improve performance in other parts of the network. Maybe there are other subsequences that could be annotated in this way? Per paragraph, tokens per word, who knows.
Obviously, the "best" way to do addition on a computer is by doing it exactly.
One is that research into what the limits of the architecture are is useful. Maths has a nice property of being very easy to verify and you can construct logical processes with it. It's a useful testbed.
Second is there are a lot more places that understanding how to do arithmetic help, outside of just doing sums on their own.
>What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.
Nobody's going to be replacing calculators with transformers sure but many are and will be using transformers to solve problems arithmetic is a necessary component of.
>So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
You don't need to shove anything down for transformers to get arithmetic. Just changing how numbers are tokenized works. But that requires an entire retrain so why not explore other techniques?
And what does any of this have to do with AGI ? You know how terrible humans are at arithmetic right ?
> What is the point of this work? [...] We already know how to hard-code a (literally) infinitely more accurate addition machine.
There are many situations where it is useful for the LLM to get basic arithmetic right.
For example, if someone asks your LLM to explain this line of code [1] which takes a 28x28 px input image, is the right explanation that 28×28÷4×64=9216 ? Or is that the wrong explanation?
And being able to get 100-digit arithmetic right 99% of the time might make us feel reassured that the 4-digit arithmetic we need from the model will be right an even higher percentage of the time.
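Assuming the linked line resembles the well-known PyTorch MNIST example (two 3x3 stride-1 convolutions, then a 2x2 max-pool, flattened into a 9216-wide linear layer), the proposed explanation can be checked mechanically, which is exactly the kind of arithmetic we'd want the model to get right:

```python
def conv_out(size: int, kernel: int) -> int:
    """Output side length of a stride-1, no-padding convolution."""
    return size - kernel + 1

# Assumed architecture: 28x28 input -> 3x3 conv -> 3x3 conv -> 2x2 max-pool,
# with 64 output channels before flattening.
side = conv_out(conv_out(28, 3), 3) // 2   # 28 -> 26 -> 24 -> 12
flat = side * side * 64                    # 12 * 12 * 64 = 9216

# The hand-wavy candidate explanation from the question:
naive = 28 * 28 // 4 * 64                  # = 12544, which is NOT 9216
```

Under these assumptions the proposed explanation is wrong: ignoring the shrinkage from the two convolutions overcounts the flattened size, and an LLM that can't do the digit arithmetic can't catch that.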
Seriously? They say it right in the introduction. The goal is to learn how to infer algorithmic processes directly from data. Much like how MNIST was used in the early days of NNs, you have to start with small toy problems that are representative of the problem domain. Once you have success with that, you can scale up problem complexity.
General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
I would even appreciate seeing more papers on approaches that didn’t work very well so it saves other researchers from going in the wrong direction. That alone would be enough justification for publishing an article.
Meanwhile I'm over here using Claude 3 Opus to do trig and calculus problems as well as generate the LaTeX representation of the equations. It doesn't need to be 100% accurate in my case (purely for fun), but I follow its reasoning and it's pretty consistent, at least enough for "orders of magnitude" and first-order effects. I was gonna post some of the chats about physics but probably nobody cares.
I did do some followup research. The math in its complex reasoning "tracks", but when I asked it to do 4-digit x 4-digit multiplication, it got most of it right except for a weird random digit error in the middle (?!) of the correct answer, lol. Now I want to run CLUTRR against Claude since it seems nobody has published that yet.
It's probably on-par or better than humans get unaided. Hell, I'd bet due to transcription errors it's better than what humans get in a lot of settings, even when aided by a calculator.
vessenes|1 year ago
This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
I'm more interested in the question of how we can find other useful concepts for data -> embedding space like this; can we incept our tokenization inception so it has more inception?
wrsh07|1 year ago
It makes me think that the authors have correctly identified an issue (positional embeddings) but don't propose a general solution.
I'm not sure if such a thing is possible, but if it is, it would feel more complete. (Fwiw, positional embeddings have had issues for a long time! So a general solution to this would benefit more than just arithmetic. Helpfully, we now have a really good specific example to serve as a baseline for any generalization we seek)
nprateem|1 year ago
refulgentis|1 year ago
This is muchhhhh different from how tokenization works today. Adding tokens to the vocabulary is free; everything outside that (i.e. string -> tokens) is going to be a major pain in the ass. Doable, but annoying and error-prone.
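To make the pain concrete, here's a rough sketch of the kind of digit-aware pre-tokenization step that would have to be bolted onto the string -> tokens pipeline (hypothetical code; the function name and output format are made up, this is not any real tokenizer's API):

```python
import re

def digit_aware_pretokenize(text):
    """Split runs of digits into single digits, tagging each with its
    offset from the least significant place; non-digit chunks pass
    through untagged. Purely illustrative, not a real tokenizer."""
    pieces = []
    for chunk in re.split(r"(\d+)", text):
        if chunk.isdigit():
            n = len(chunk)
            # one (digit, position-from-least-significant) pair per digit
            pieces.extend((d, n - 1 - i) for i, d in enumerate(chunk))
        elif chunk:
            pieces.append((chunk, None))
    return pieces

print(digit_aware_pretokenize("x = 305"))
# [('x = ', None), ('3', 2), ('0', 1), ('5', 0)]
```

Even this toy version shows the annoyance: every consumer of the token stream now has to handle two kinds of token, and the positional tags have to survive detokenization, caching, and training-data preprocessing.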
badrunaway|1 year ago
uoaei|1 year ago
Deep learning is always and only ever about representing data abstractly. The more abstractions you can make irrelevant (why would you have to learn how to do math when the base-10 perspective on ASCII-digits is already provided for you?) the more you've biased your architecture to readily learn and understand the problem space.
Intelligence doesn't exist where a Divine Creator gave you access to this or that faculty; it's developing those faculties yourself by reasoning through the process of composing your own mental model of the problem.
zacksiri|1 year ago
If all one is doing is giving a model lots of data and fitting curves, it's not really 'understanding' but brute-forcing its way (with gradient descent), storing the weights, and finally approximating the solution when a query is passed in.
This is not the same as understanding. Human intelligence can operate deterministically as well as non-deterministically. We can listen to language, which is by its nature non-deterministic, and convert it into deterministic operations, and vice versa; i.e. we can operate on some logic and explain it in multiple ways to other people.
Understanding requires much less data than brute forcing your way into pattern recognition.
When you see a simple expression like 2 * 4 you are able to understand that it's equivalent to 2 + 2 + 2 + 2, and that in turn means 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 <- Count that and you've got your answer.
Because you 'understand' this basic concept and all the operations in between, you are able to compute more examples. But you only need to understand it once. Once you understand multiplication and addition and all the tricks in between, you are able to compute 23 * 10 without being fed 23 * 10 as prior data. Understanding is very different from fitting a curve. You can reach conclusions and understanding through pattern recognition, but it's important to differentiate 'approximation' from 'calculation'. If you understand something in its entirety you should be able to calculate an outcome deterministically.
Right now LLMs lack 'understanding', and seems to only 'approximate' which may seem like 'understanding' but is actually not.
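The "understand once, compute anything" reduction above can be written out as a toy program (hypothetical code, purely illustrative of the argument, not of how any model works):

```python
def add(a, b):
    """Addition as counting: move one unit at a time from b to a."""
    while b > 0:
        a, b = a + 1, b - 1
    return a

def mul(a, b):
    """Multiplication as repeated addition: 2 * 4 = 2 + 2 + 2 + 2."""
    total = 0
    for _ in range(b):
        total = add(total, a)
    return total

print(mul(2, 4))    # 8
print(mul(23, 10))  # 230, never "seen" as prior data
```

The point of the sketch is that two small deterministic rules generalize to every input, which is the kind of behavior the comment contrasts with curve fitting.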
zyklu5|1 year ago
While I am unsure whether LLMs are really understanding, whatever that means, I think it is not difficult to believe that any form of understanding we implement will involve 'curve fitting' as a central part.
hackinthebochs|1 year ago
The core feature of curve fitting is learning explicit examples and then interpolating (in an uninformative manner) between unlearned examples. But there's no reason to think this completely describes what the system is doing, in the sense that there are no more informative descriptions of its behavior. Take an example that LLMs are surprisingly good at, creating poetry given arbitrary constraints. Imagine the ratio of the poems it has seen during its training over the number of unique poems it could create in principle. This number would be vanishingly small. Interpolating between two strings representing well-formed poems in an uninformative manner (i.e. some finite polynomial) will not generate well-formed poems. The only way you could move between two examples of well-formed poems while staying on the manifold of well-formed poems is if you captured all relevant features of the manifold. But I fail to see a difference between capturing all relevant features of the poetry-manifold and understanding poetry.
What LLMs do can be described as curve fitting only in the most uninformative description possible. What they do is discover features of the structures referred to by the training text and competently deploy these features in predicting the next token. A human who could do this would be considered to understand said structure.
msoad|1 year ago
Optimizing for a certain use case is not gonna take us where we wanna be. We want to have a system that can learn to reason.
sshine|1 year ago
Sounds like the AGI argument trap: they're not able to reason, but we can't succinctly define what reasoning is.
I don't come with a reasoning chip. Whatever I call reasoning happens as a byproduct of my neural process.
I do think that the combination of a transformer network and calls to customized reasoning chips (systems that search and deduce answers, like Wolfram Alpha or logic/proof systems) may be a short-stop to something that can perform reason and execution of actions better than humans, but is not AGI.
golol|1 year ago
josehackernews|1 year ago
Deductive reasoning is just drawing specific conclusions from general patterns, something I would argue these models can do (of course not always, and they are still pretty bad in most cases).
The point I'm trying to make is that reasoning is sometimes overrated and put at the top of the cognitive ladder; sometimes I have seen it compared to self-awareness or stuff like that. I know you are probably not saying it that way, just wanted to let it out.
I believe there is fundamental work still to be done, maybe models that are able to draw patterns by comparing experiences, but this kind of work can be useful as it makes us reflect on every step of what these models do, and on how much the learned internal representation can be optimized.
baq|1 year ago
psychoslave|1 year ago
grumpopotamus|1 year ago
Have you tried asking GPT-4 any questions that require reasoning to solve? If so, what did you ask, and what did it get wrong?
Havoc|1 year ago
I guess perhaps the techniques could be generalized though?
mike_hearn|1 year ago
For example, whilst replacing the need for a calculator isn't very important, one obvious research direction would be to explore adding extra embeddings to code inputs, perhaps computed by an IDE.
0-_-0|1 year ago
verticalscaler|1 year ago
teleforce|1 year ago
[1] Why LLMs like ChatGPT and Google Bard are bad at math:
https://www.xda-developers.com/why-llms-are-bad-at-math/
staunton|1 year ago
Why?
I definitely agree that such capabilities would represent a major advance (and very likely go together with game changing increases of capabilities in other areas). I also think using AI to write formal math proofs in e.g. Lean is very cool.
However, by itself, it seems like this capability wouldn't be very useful, commercially for example. Do you think this capability is exceptionally informative merely because it has to go together with other capabilities? It's not impossible to have a (maybe somewhat limited) formal math AI that will remain mostly irrelevant to the everyday world (like FormalGeo).
jiggawatts|1 year ago
There have been crude attempts at this already, hooking in Mathematica and Python into ChatGPT. I say crude, because these add-ons are controlled via output tokens.
What I would like to see is a GPT-style AI that also has compute blocks, not just transformer blocks. I don't mean compute in the sense of "matrix multiply for weights and biases", but literally an ALU-style block of basic maths operations available for use by the neurons.
One thought that I had was that this could be via activations that have both a floating-point activation value and "baggage" such as a numerical value from the input. Like a token in a traditional parser, that can represent a constant string or an integer with its decoded value.
The newer, truly multi-modal models gave me a related idea: Just like how they can have "image" tokens and "audio" tokens, I wonder if they could be given "numeric data" tokens or "math symbol" tokens. Not in the same way that they're given mixed-language text tokens, but dedicated tokens that are fed into both the transformer blocks and also into ALU blocks.
Just an idle thought...
[1] Every reader reads into a story something unique, which may or may not align with what the author intended. This is my understanding, coloured by my own knowledge, etc, etc...
int_19h|1 year ago
Controlling that stuff via output tokens actually kinda makes sense by analogy, since that is how we use calculators etc. But I do agree that specialized tokens that are used specifically to activate tools like that might be a better idea than just using plain text to signal in-band. And production of such specialized tokens can be easily trained.
vessenes|1 year ago
I like this idea a lot. Right now we are going the long/hard way round: post-training, asking an LLM to know it needs compute, then write a compute request, then feed the compute answer back into a tokenization loop.
It probably does make sense to add a mini CPU as a layer / tool / math primitive. I wonder how you'd train it to use such a thing? In my mind it's not really a layer per se, but a set of function calls a layer could route to when it wants, weighting the response appropriately.
torginus|1 year ago
gwern|1 year ago
> We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges.
weinzierl|1 year ago
A tangent: positional number systems were, I believe, originally written least significant digit first.
The Babylonian sexagesimal system was like that, as was the Arabic one (where "first" is on the right).
The most-significant-digit-first convention came about when right-to-left numbers were used in left-to-right scripts without being reversed in writing. To this day we read the more common smaller numbers least significant digit first, to varying degrees:
16 = six-teen (English), sech-zehn (German sechzehn)
98 = acht-und-neunzig (German, "eight and ninety"), acht-en-negentig (Dutch), ثمانية وتسعون (Arabic, "eight and ninety")
lupire|1 year ago
17 + 14 = 20 + 11 = 30 + 1 = 31
vs 17 + 14 = 10 + 10 + 10 + 1 = 31
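Least-significant-digit-first is also the order the schoolbook carry algorithm wants: the carry only ever propagates toward more significant digits, so a left-to-right process can emit each output digit immediately. A quick illustrative sketch (digits stored least significant first, e.g. 17 -> [7, 1]):

```python
def add_lsb_first(a, b):
    """Schoolbook addition over digit lists, least significant digit
    first. The carry only flows 'forward', which is roughly why this
    digit order is the easy one for a left-to-right sequence model."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s % 10)
        carry = s // 10
    if carry:
        out.append(carry)
    return out

print(add_lsb_first([7, 1], [4, 1]))  # 17 + 14 -> [1, 3], i.e. 31
```

With most-significant-first digits, the same algorithm would need lookahead (or a second pass) to know whether a carry is coming, which matches the paper's observation that digit order and position matter.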
spencerchubb|1 year ago
pmayrgundter|1 year ago
Some say AGI has already been achieved, others that it's years or decades away. When I dig into the disagreement, it often partially depends on the perspective of how competent humans are on the tasks in question, with the optimists being, I think, more realistic about variance in human intelligence and the pessimists seeming to reserve the term "general intelligence" for possessing a nearly perfect suite of capabilities that many otherwise intelligent people practically don't have.
For example with arithmetic, this study cites another [Dziri et al. 2023], that says:
"For instance, humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively."
But this isn't the case: 5-6% of the population has dyscalculia (https://en.wikipedia.org/wiki/Dyscalculia) but can be otherwise normal.
I still see value in normative statements about human capability in AI & AGI research, but I think we'll need to move towards explicit statistical framing.
DeepMind's Position paper "Levels of AGI for Operationalizing Progress on the Path to AGI" has a schema like this, where AGI capabilities are defined across 2 axes of Performance level X Generality (narrow vs general), and the Performance levels are measured by comparison with "Percentile of skilled adults" able to perform the task.. https://arxiv.org/pdf/2311.02462#page=3.40
Within that framing, this paper's title or result might be "Achieving AGI Competency in Arithmetic", or "Expertise", or "Virtuosity", i.e. on par respectively with 50th, 90th or 99th percentile of skilled adults.
Last5Digits|1 year ago
LLMs don't share that property, though. Their distribution of proficiency over various dimensions and subfields is highly variable and only slightly correlated. Therefore, it makes no sense to infer the ability or inability to perform some magically global type of reasoning or generalization from just a subset of tasks, the way we do for humans.
CuriouslyC|1 year ago
Truthfully we're going to see that improving language models towards AGI works out the same way self driving cars did - we're going to feel like we're 85% of the way there out of the gate, then we're going to keep tripping over things for the next 15 years.
At least with AGI, we can just throw up our hands, use an easier definition and take the W.
edflsafoiewq|1 year ago
ADeerAppeared|1 year ago
This nitpicking is a red herring.
The issue that separates "AGI" from current AI systems is the lack of generality. (Humour me.)
In particular, the lack of reasoning capability. And what the pessimists argue here is that there is no road to get there for current systems. Transformers are approximation machines, and are generalized for that specific task. But that's also where it stops, they can't do things that aren't such pattern-approximation.
Optimizing a transformer for arithmetic isn't a step towards AGI, because it is not generalizing. You'd need to do this for every conceivable task and subtask. This is the exact reason why imperative-programmed AI architectures were discarded.
Put bluntly, this approach will never get you a transformer that won't shit itself when asked to do novel reasoning tasks, such as novel mathematics (which, I will remind the reader, anything but basic programming work counts as).
And critically, the fundamental architecture of these transformer systems doesn't allow combining them with other AI systems to acquire generalized capabilities. There's no way to make an LLM hook into a computer algebra system; you can only feed 'finished' output of one system into another.
infogulch|1 year ago
Vertical alignment across lines is pretty important for humans learning operations on digits, but the way we encode lines with a \n separator doesn't really help. In a recent Code Bullet video, GPT really struggled with any kind of vertical alignment task. I wonder if it would do better with a fixed 80-column width...
Terr_|1 year ago
My understanding was that they tokenized them into chunks and tried to learn associations between the chunks, the same as if one was breaking apart English words.
So "2+2=4" isn't being treated that differently from "all's well that ends well." This might lead to a kind of Benny's Rules [0] situation, where sufficient brute-force can make a collection of overfitted non-arithmetic rules appear to work.
[0] https://blog.mathed.net/2011/07/rysk-erlwangers-bennys-conce...
matrix2596|1 year ago
topherjaynes|1 year ago
byt3h3ad|1 year ago
nerdponx|1 year ago
It's basically the same as feature engineering in pre-deep machine learning: constructing features with high information content can significantly reduce the amount of data and computation needed to fit a useful model. And sometimes it's impossible to fit a useful model without careful feature engineering, either because the model itself is constrained in some way or because there isn't enough data or both.
It's analogous to making a choice of inductive bias within the model itself. We literally could not do LLMs without the carefully-constructed transformer architecture. Why should we expect to make further progress without paying more attention to the embeddings?
Shrezzing|1 year ago
Grimblewald|1 year ago
simiones|1 year ago
This question is of course relevant only in a research sense, in seeking to understand to what extent and in what ways the LLM is acting as a stochastic parrot vs gaining a type of "understanding", for lack of a better word.
gmerc|1 year ago
ADeerAppeared|1 year ago
The problem is that it's not particularly useful: As the problem complexity increases, the user will need to be increasingly specific in the prompt, rapidly approaching being fully exact. There's simply no point to it if your prompt has to (basically) spell out the entire program.
And at that point, the user might as well use the backing system directly, and we should just write a convenient input DSL for that.
unknown|1 year ago
[deleted]
andrepd|1 year ago
kjhcvkek77|1 year ago
skyde|1 year ago
Basically, if a word contains a prefix, suffix, or root word, we could encode each token's position relative to the start of the word in the embedding.
skyde|1 year ago
"Syntax-Aware Transformer Models for Neural Machine Translation" by Yang et al. (2019). This model enhances the transformer architecture with syntax-aware attention mechanisms that consider dependency parse trees.
"Context-Aware Neural Machine Translation Learns Anaphora Resolution" by Bawden et al. (2018). This paper explores integrating context and syntax into neural machine translation models.
michaelnny|1 year ago
ynik|1 year ago
threatofrain|1 year ago
wantsanagent|1 year ago
winddude|1 year ago
CyberDildonics|1 year ago
YeGoblynQueenne|1 year ago
And not only addition: all four arithmetic operations. The technique proposed in the article (imposing a strong inductive bias for addition) kind of works for multiplication, but not for subtraction or division (clearly; I can't even find the words in the paper). As a practical way to build a machine to do arithmetic this is out of the question.
We've known how to mechanise arithmetic since the 1640s, with Blaise Pascal and his Pascaline. What is the point in demonstrating that it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns into a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever).
So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
golol|1 year ago
Another way to phrase it: Given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical processes by observing sample trajectories?
This question can be somewhat mathematically stated but it is quite difficult because there are still some words in there where I used common sense. For example mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.
Hence any instance where transformers fail to learn a Markov process is very interesting. Example: addition of random numbers.
mike_hearn|1 year ago
> With positions resolved, we can study the logical extrapolation ability of transformers
They are interested in how well they can make a neural net logically extrapolate outside its training set, once encoding barriers are removed. They show that in fact even quite small language models can do this successfully once we're not confusing them with bad encodings anymore.
This seems like fundamental work. It was only a few years ago that Google employees were arguing LLMs were nothing more than "stochastic parrots". Well, that take will go down in history as one of the worst takes on AI ever. I don't think anyone really had any doubt by 2024 that this wasn't true, but the huge and opaque datasets meant people could always argue that maybe this wasn't an example of logical reasoning or extrapolation, maybe it had just seen this specific question before. But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers. It's not just repeating answers it's seen in its dataset. It should kill off the parrot meme for good.
zarzavat|1 year ago
dagss|1 year ago
If everyone were using horses, what would you have said about the first prototype car? Probably that it was a very slow, clumsy, and failure-prone thing.
toxik|1 year ago
Obviously, the "best" way to do addition on a computer is by doing it exactly.
Chinjut|1 year ago
IanCal|1 year ago
One is that research into what the limits of the architecture are is useful. Maths has a nice property of being very easy to verify and you can construct logical processes with it. It's a useful testbed.
Second is that there are a lot more places where understanding how to do arithmetic helps, outside of just doing sums on their own.
famouswaffles|1 year ago
Nobody's going to be replacing calculators with transformers sure but many are and will be using transformers to solve problems arithmetic is a necessary component of.
>So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
You don't need to shove anything down for transformers to get arithmetic. Just changing how numbers are tokenized works. But that requires an entire retrain so why not explore other techniques?
And what does any of this have to do with AGI? You know how terrible humans are at arithmetic, right?
michaelt|1 year ago
There are many situations where it is useful for the LLM to get basic arithmetic right.
For example, if someone asks your LLM to explain this line of code [1] which takes a 28x28 px input image, is the right explanation that 28×28÷4×64=9216 ? Or is that the wrong explanation?
And being able to get 100-digit arithmetic right 99% of the time might make us feel reassured that the 4-digit arithmetic we need from the model will be right an even higher % of the time.
[1] https://github.com/pytorch/examples/blob/37a1866d0e0118875d5...
Xcelerate|1 year ago
Seriously? They say it right in the introduction. The goal is to learn how to infer algorithmic processes directly from data. Much like how MNIST was used in the early days of NNs, you have to start with small toy problems that are representative of the problem domain. Once you have success with that, you can scale up problem complexity.
General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
I would even appreciate seeing more papers on approaches that didn’t work very well so it saves other researchers from going in the wrong direction. That alone would be enough justification for publishing an article.
r2_pilot|1 year ago
r2_pilot|1 year ago
gmerc|1 year ago
traverseda|1 year ago
mike_hearn|1 year ago