> the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”
You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.
All that being said, the refine.ink tool certainly has an interesting approach, which I'm not sure I've seen before. They review a single piece of writing, and it takes up to an hour, and it costs $50. They are probably running the LLM very painstakingly and repeatedly over combinations of sections of your text, allowing it to reason about the things you've written in a lot more detail than you get with a plain run of a long-context model (due to the limitations of sparse attention).
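If that guess is anywhere close, the pipeline might look roughly like this (a speculative sketch; ask_model is a hypothetical wrapper for whatever LLM API they actually use, and the pairwise-section scheme is purely my guess at their approach):

    from itertools import combinations

    def ask_model(prompt: str) -> str:
        """Hypothetical wrapper around whatever LLM API the tool uses."""
        raise NotImplementedError

    def review(paper_sections: dict[str, str]) -> list[str]:
        findings = []
        # Pass 1: each section on its own, in depth.
        for name, text in paper_sections.items():
            findings.append(ask_model(f"Review this section ('{name}') in detail:\n\n{text}"))
        # Pass 2: pairs of sections, so inconsistencies between distant
        # parts of the paper get full attention rather than sparse attention.
        for (a, ta), (b, tb) in combinations(paper_sections.items(), 2):
            findings.append(ask_model(
                f"Do '{a}' and '{b}' contradict each other or leave gaps?\n\n{ta}\n\n---\n\n{tb}"))
        return findings

That many calls would explain both the hour of wall-clock time and the $50 price.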
It's neat. I wonder what other kinds of tasks we could improve AI performance on by scaling time and money (which, in the grand scheme, is usually still a bargain compared to a human worker).
I created a code review pipeline at work with a similar tradeoff and we found the cost is worth it. Time is a non-issue.
We could run Claude on our code and call it a day, but we have hundreds of style, safety, etc rules on a very large C++ codebase with intricate behaviour (cooperative multitasking be fun).
So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness, at roughly the same order of magnitude in price. Much better than humans, and it beats every commercial tool.
"scaling time" on the other hand is useless. You can just divide the problem with subagents until it's time within a few minutes because that also increases quality due to less context/more focus.
> You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.
I made a cursed CPU in the game 'Turing Complete' and had an older version of Claude build me an assembler for it.
Good luck finding THAT in the training data. :-P
(just to be sure, I then had it write actual programs in that new assembly language)
>You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data.
This is just as stuck in a moment in time as "they only do next word prediction." What does this even mean anymore? Are we supposed to believe that a review of a paper that didn't exist when that model was trained (it's putatively not an "LLM", but IDK enough about it to be pushy there) is somehow just regurgitated training data? Does that even make sense? We're not in the regime of regurgitating training data (if we really ever were). We need to let go of these frames, which were barely true when they took hold. Some new shit is afoot.
A while ago I did the nanoGPT tutorial. I went through some math with pen and paper and noticed that the loss function for 'predict the next token' and 'predict the next 2 tokens' (or n tokens) is identical.
That was a bit of a shock to me, so I wanted to share the thought. Basically, I think it's not unreasonable to say LLMs are trained to predict the next book instead of a single token. Hope this is useful to someone.
LLMs are trained to do whole-book prediction; at training time we throw in whole books at a time. It's only when sampling that we generate one or a few tokens at a time.
But this might be misleadingly interpreted as an LLM having "thought out an answer" before generating tokens, which is an incorrect conclusion. Not suggesting you did.
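To make "whole book prediction" concrete: at training time the model makes a prediction at every position of the sequence in parallel, and the loss sums over all of them. A minimal sketch (PyTorch, toy sizes, with a stand-in for the real transformer stack):

    import torch
    import torch.nn.functional as F

    vocab, d = 100, 32
    tokens = torch.randint(0, vocab, (1, 16))   # one tiny "book" of 16 tokens
    emb = torch.nn.Embedding(vocab, d)
    head = torch.nn.Linear(d, vocab)

    hidden = emb(tokens)     # stand-in for the transformer stack
    logits = head(hidden)    # (1, 16, vocab): a next-token prediction at every position

    # Teacher forcing: position t is trained to predict token t+1, for all t at once.
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                           tokens[:, 1:].reshape(-1))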
Yes, most people (including myself) do not understand how modern LLMs work (especially if we consider the most recent architectural and training improvements).
There's the 3b1b video series, which does a pretty good job, but now we are interfacing with models that probably have parameter counts in each layer larger than the entire first models we interacted with.
The novel insights that these models can produce are truly shocking, I would guess even for someone who does understand the latest techniques.
I highly recommend Build a Large Language Model (From Scratch) [1] by Sebastian Raschka. It provides a clear explanation of the building blocks used in the first versions of ChatGPT (GPT-2, if I recall correctly). The output of the model is a huge vector of n elements, where n is the number of tokens in the vocabulary. We use that huge vector as a probability distribution to sample the next token given an input sequence (i.e., a prompt). Under the hood, the model has several building blocks like tokenization, skip connections, self-attention, masking, etc. The author does a great job explaining all the concepts. It is very useful for understanding how LLMs work.
[1] https://www.manning.com/books/build-a-large-language-model-f...
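As a toy illustration of that last step, this is roughly how sampling from that huge output vector looks (PyTorch; the logits here are random stand-ins for a real model's output):

    import torch

    logits = torch.randn(50_000)                  # one score per token in the vocabulary
    probs = torch.softmax(logits / 0.8, dim=-1)   # temperature 0.8 turns scores into a distribution
    next_token = torch.multinomial(probs, num_samples=1)   # sample the next token id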
The whole next-word thing is interesting, isn't it? I like to see it through Dennett's "competence and comprehension" lens. You can predict the next word competently with shallow understanding. But you could also do it well with understanding or comprehension of the full picture - a mental model that allows you to predict better. Are the AIs stumbling into these mental models? Seems like it. However, because these are such black boxes, we do not know how they are stringing these mental models together. Is it a random pick from 10 models built up inside the weights? Is there any system-wide cohesive understanding, whatever that means?
Exploring what a model can articulate using self-reflection would be interesting. Can it point to internal cognitive dissonance because it has been fed both evolution and intelligent design, for example? Or do these exist as separate models to invoke depending on the prompt context, because all that matters is being rewarded by the current user?
Given their failure on novel logic problems, generation of meaningless text, tendency to do things like delete tests, and incompetence at simple mathematics, it seems very unlikely they have built any sort of world model. It's remarkable how competent they are given the way they work.
"Predict the next word" is a terrible summary of what these machines do, though; they certainly do more than that, but there are significant limitations.
‘Reasoning’ etc are marketing terms and we should not trust the claims made by companies who make these models.
The Turing test had too much confidence in humans it seems.
It has always occurred to me that LLMs may be like the language center of the brain, and that there should be a "whole damn rest of the brain" behind it to steer it.
LLMs miss very important concepts, like the concept of a fact. There is no "true", just consensus text on the internet given a certain context. Like that study recently where LLMs gave wrong info if there was the biography of a poor person in the context.
> You can predict the next word competently with shallow understanding.
I don't get this. When you say "predict the next word" what you mean is "predict the word that someone who understands would write next". This cannot be done without an understanding that is as complete as that of the human whose behaviour you are trying to predict. Otherwise you'd have the paradox that understanding doesn't influence behaviour.
Dennett also came to my mind reading the title, but in a different sense. When people came up with the theory of evolution, it was hard for many people to conceive how you get from "subtly selecting among random changes" to "building a complex mechanism such as a human". I think Dennett offers a nice analogy with a skyscraper: how can it be built if cranes are only so tall?
In a similar way, LLMs build small abstractions, first on words, learning how to subtly rearrange them without changing meaning; then they start to pick up logic patterns such as "if B follows from A, and we're given A, then B"; and eventually they learn to reason in various ways.
It's the scale of the whole process that defies human understanding.
(Also, modern LLMs are not just next-word predictors anymore; there is a reinforcement learning component as well.)
> Are the AIs stumbling into these mental models? Seems like it.
Since nature decided to deprive me of telepathic abilities, when I want to externalize my thoughts to share with others, I'm bound to this joke of a substitute we call language. I must either produce sounds that encode my meaning, or gesture, or write symbols, or basically find some way to convey my inner world by using bodily senses as peripherals. Those who receive my output must do the work in reverse to extract my meaning, the understanding in my message. Language is what we call a medium that carries our meaning to one another's psyche.
LLMs, as their name alludes, are trained on language, the medium, and they're LARGE. They're not trained on the meaning, like a child would be, for instance. Saying that by their sole analysis of the structure and patterns in the medium they're somehow capable of stumbling upon the encoded meaning is like saying that it's possible to become an engineer, by simply mindlessly memorizing many perfectly relevant scripted lines whose meaning you haven't the foggiest.
Yes, on the surface the illusion may be complete, but can the medium somehow become interchangeable with the meaning it carries? Nothing indicates this. Everything an LLM does still very much falls within the parameters of "analyze humongous quantity of texts for patterns with massive amount of resources, then based on all that precious training, when I feed you some text, output something as if you know what you're talking about".
I think the seeming crossover we perceive is just us losing sight of the scale and significance of the resources required to get them to fool us.
Searle's Chinese Room experiment but without knowing what's in the room, and when you try to peek in you just see a cloud of fog and are left to wonder if it's just a guy with that really big dictionary or something more intelligent.
It's honestly disheartening and a bit shocking how everyone has started repeating the "predict the next syllable" criticism.
The language model predicts the next syllable by FIRST arriving in a point in space that represents UNDERSTANDING of the input language. This was true all the way back in 2017 at the time of Attention Is All You Need. Google had a beautiful explainer page of how transformers worked, which I am struggling to find. Found it. https://research.google/blog/transformer-a-novel-neural-netw...
The example was and is simple and perfect. The word bank exists. You can tell what bank means by its proximity to words, such as river or vault. You compare bank to every word in a sentence to decide which bank it is. Rinse, repeat. A lot. You then add all the meanings together. Language models are making a frequency association of every word to every other word, and then summing it to create understanding of complex ideas, even if it doesn't understand what it is understanding and has never seen it before.
That all happens BEFORE "autocompleting the next syllable."
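Here's a stripped-down sketch of that mechanic (PyTorch, random stand-in embeddings, and omitting the learned query/key/value projections a real transformer uses), just to show how "bank" ends up as a blend weighted toward its context:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    words = ["the", "boat", "reached", "the", "bank", "of", "the", "river"]
    d = 8
    x = torch.randn(len(words), d)      # pretend embeddings; a real model's are learned

    scores = x @ x.T / d ** 0.5         # every word scores every other word
    weights = F.softmax(scores, dim=-1) # e.g. how much "bank" attends to "river"
    contextual = weights @ x            # "bank" becomes a blend weighted toward its context
    print(weights[4])                   # attention paid by "bank" (index 4) to each word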
The magic part of LLMs is understanding the input. Being able to use that to make an educated guess of what comes next is really a lucky side effect. The fact that you can chain that together indefinitely with some random number generator thrown in and keep saying new things is pretty nifty, but a bit of a show stealer.
What really amazes me about transformers is that they completely ignored prescriptive linguistic trees and grammar rules and let the process decode the semantic structure fluidly and on the fly. (I know Google uses encode/decode backwards from what I am saying here.) This lets people write crazy run-on sentences that break every rule of English (or your favorite language) but are still parsable as instructions.
It is really helpful to remember that transformers' origins are in language translation. They are designed to take text and apply a modification to it while keeping the meaning static. They accomplish this by first decoding meaning. The fact that they then pivoted from translation to autocomplete is a useful thing to remember when talking to them. A task a language model excels at is taking text, reducing it to meaning, and applying a template. So a good test might be "take Frankenstein and turn it into a Magic School Bus episode." Frankenstein is reduced to meaning, the Magic School Bus format is the template, and the meaning is output in the form of the template. This is a translation, although from English to English, represented as two completely different forms. Saying "find all the wild rice recipes you can, normalize their ingredients to 2 cups of broth, and create a table with ingredient ranges (min-max) for each ingredient option" is closer to a translation than it is to "autocomplete." Input -> Meaning -> Template -> Output. With my last example, the template itself is also generated from its own meaning calculation.
A lot has changed since 2017, but the interpreter being the real technical achievement still holds true, imho. I am more impressed with AI's ability to parse what I am saying than I am by its output (image models notwithstanding).
'Predict the next token' is true but not explanatory. It's like saying humans 'fire neurons.' Technically correct, but it explains nothing useful about the behavior you're actually observing. The debate isn't whether the description is accurate - it's whether it's at the right level of abstraction.
It is probably the first-time aha moment the author is talking about. But under the hood, it is probably not as magical as it appears to be.
Suppose you prompted the underlying LLM with "You are an expert reviewer in..." and a bunch of instructions, followed by the paper. The LLM knows from training that 'expert reviewer' is an important term (skipping over and oversimplifying here) and that its response should be framed as what it knows an expert reviewer would write. LLMs are good at picking up (or copying) the patterns of a response, but the underlying layer that evaluates things against a structural and logical understanding is missing. So, in corner cases, you get responses that are framed impressively but do not contain any meaningful input. This trait makes LLMs great at demos but weak at consistently finding novel, interesting things.
If the above is true, the author will find after several reviews that the agent they use keeps picking up on the same/similar things (collapsed behavior that makes it good at coding type tasks) and is blind to some other obvious things it should have picked up on. This is not a criticism, many humans are often just as collapsed in their 'reasoning'.
LLMs are good at 8 out of 10 tasks, but you don't know which 8.
I think this is a thing not often discussed here, but I too have this experience. An LLM can be fantastic if you write a 25-pager then later need to incorporate a lot of comments with sometimes conflicting arguments/viewpoints.
LLMs can be really good at "get all the arguments against this", "incorporate this viewpoint into this text while making it more concise", "are these views actually contradictory, or can I write it such that they align? Consider incentives".
If you know what you're doing and understand the matter deeply (and that is very important) you'll find that the LLM is sometimes better at wording what you actually mean, especially when not writing in your native language. Of course, you study the generated text, make small changes, make it yours, make sure you feel comfortable with it etc. But man can it get you over that "how am I going to write this down"-hump.
Also: "Make an executive summary" "Make more concise", are great. Often you need to de-linkedIn the text, or tell it to "not sound like an American waiter", and "be business-casual", "adopt style of rest of doc", etc. But it works wonders.
The "predict the next word" to a current llm is at the same level as a "transistor" (or gate) is to a modern cpu. I don't understand llms enough to expand on that comparison, but I can see how having layers above that feed the layers below to "predict the next word" and use the output to modify the input leading to what we see today. It is turtles all the way down.
It's a good comparison. It's about abstraction and layers. Modern LLMs aren't just models; they're all the infrastructure around prompting and context management and mixtures of experts.
The next-word bit may be slightly higher than an individual transistor, possibly functional units.
There is a big difference, because I understand how those transistors produce a picture on a screen, but I don't understand how LLMs do what they do. The difference is so big that the comparison is useless.
Now the machines are getting better than we are. It's exciting and a little bit terrifying.
We were polymers that evolved intelligence. Now the sand is becoming smart.
It's clear that in the general case "predict the next word" requires arbitrarily good understanding of everything that can be described with language. That shouldn't be mysterious. What's mysterious is how a simple training procedure with that objective can in practice achieve that understanding. But then again, does it? The base model you get after that simple training procedure is not capable of doing the things described in the article. It is only useful as a starting point for a much more complex reinforcement learning procedure that teaches the skills an agent needs to achieve goals.
RL is where the magic comes from, and RL is more than just "predict the next word". It has agents and environments and actions and rewards.
Superhuman chess engines are now trained from just a one-bit reward signal: win/lose. That says absolutely nothing about the complexity the model develops inside. They even learned the rules of the game just from that reward.
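For the flavor of learning from nothing but a win/lose bit, here's a toy REINFORCE sketch (PyTorch, a two-action "game"; nothing like real chess training, just the bare mechanism of reinforcing rewarded actions):

    import torch

    # Toy REINFORCE: a two-action policy learns from a single win/lose bit,
    # never being told which action was "correct".
    logits = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=0.1)

    for _ in range(300):
        probs = torch.softmax(logits, dim=0)
        action = torch.multinomial(probs, 1).item()
        reward = 1.0 if action == 0 else 0.0        # the "environment" only says win or lose
        loss = -reward * torch.log(probs[action])   # policy gradient: reinforce rewarded actions
        opt.zero_grad(); loss.backward(); opt.step()

    print(torch.softmax(logits, dim=0))  # probability mass has shifted toward the winning move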
IMO, the writer is overzealous with their comments on LLMs. As a coder, it reads like an outsider trying out a product that has amazed me over and over, so many times already.
> They aren’t perfect, but the kind of analysis the program is able to do is past the point where technology looks like magic.
But as you use this product over a long period of time, there are many obvious gaps - hallucinations / repeated tool calls / out of context outputs / etc.
To me, refine.ink sounds like a company that has built heavy tooling around some super-high-context-window LLMs and then some very good prompts. Their claim is to compare it against any good off-the-shelf LLM with any prompt. But when you are spending a bunch of money to build a whole ecosystem around LLMs, it's obvious that a bare off-the-shelf model is not going to beat their output.
I won't be surprised if the next version of an LLM within the next few months completely outperforms their output -- that's usually the case with all the coding tools and scaffoldings. They are rendered useless by a superior LLM.
It's called emergent behavior. We understand how an LLM works, but do not have even a theory about how the behavior emerges from among the math. We understand ants pretty well, but how exactly does anthill behavior come from ant behavior? It's a tricky problem in systems engineering, where predicting emergent behavior (such as emergencies) would be lovely.
> but do not have even a theory about how the behavior emerges from among the math
Actually we have an awful lot of those.
I'm not sure if emergent is quite the right term here. We carefully craft a scenario to produce a usable gradient for a black box optimizer. We fully expect nontrivial predictions of future state to result in increasingly rich world models out of necessity.
It gets back to the age old observation about any sufficiently accurate model being of equal complexity as the system it models. "Predict the next word" is but a single example of the general principle at play.
The good news is that despite being incredibly complex, it’s still a lot simpler than ants because it is at least all statistical linguistics (as far as LLMs are concerned anyways).
> but do not have even a theory about how the behavior emerges
We fully do. There is a significant quality difference between English language output and other languages which lends a huge hint as to what is actually happening behind the scenes.
> but how exactly does anthill behavior come from ant behavior?
You can't smell what ants can. If you did I'm sure it would be evident.
> Nothing you write will matter if it is not quickly adopted to the training dataset.
That is my take too. I was surprised to see how many people object to their works being trained on. It's how you can leave your mark: opening access for AI now, just as for the last 25 years it was opening access to people (no restrictions on access, being indexed in Google).
"On reflection I have started to worry again. In 10 to 20 years nobody will read anything any more, they just will read LLM digests. So, the single most important task of a writer starting right now is to get your efforts wired in to the LLMs"
Your words will be like a drop in the ocean, an ocean whose volume keeps increasing every year. Also, if nobody reads anything anymore, what's the point?
Most people value their time and work and don't want to give it away for free to some billionaire so they can reproduce it as slop for their own private profit.
That's to say, most people recognize when they're getting fucked over and are correct to object to it.
People who produced the works LLMs are trained on are not compensated for the value they are now producing, and their skills are increasingly less valued in a world with LLMs. The value the LLMs are producing is being captured by employees of AI companies who are driving up rent in the Bay Area, and driving up the cost of electricity and water everywhere else.
Your surprise at people's objections makes sense if you can't count.
The article talks about LLMs reviewing Econ papers.
I’m hesitant to call this an outright win, though.
Perhaps the review service the author is using is really good.
Almost certainly the taste, expertise and experience of the author is doing unseen heavy lifting.
I found that using prompts to do submission reviews for conferences tended to make my output worse, not better.
Letting the LLM analyze submissions resulted in me disconnecting from the content. To the point I would forget submissions after I closed the tab.
I ended up going back to doing things manually, using them as a sanity check.
On the flip side, weaker submissions using generative tools became a nightmare, because you had to wade through paragraphs of fluff to realize there was no substantive point.
It’s to the point that I dread reviewing.
I am going to guess that this is relatively more useful for experts, who will submit stronger work, than for novices and journeymen, who will still make foundational errors.
> On reflection I have started to worry again. In 10 to 20 years nobody will read anything any more, they just will read LLM digests. So, the single most important task of a writer starting right now is to get your efforts wired in to the LLMs. Nothing you write will matter if it is not quickly adopted to the training dataset. As the art of pushing your results to the top of the google search was the 1990s game, getting your ideas into the LLMs is today’s. Refine is no different. It’s so good, everyone will use it. So whether refine and its cousins take a FTPL or new Keynesian view in evaluating papers is now all determining for where the consensus of the profession goes.
I expected a bit more cynicism and less merrily going with the downward-spiral flow from a “grumpy” blogger.
Oh, but it's a grumpy economist.
I think it's funny that at Google I invented and productized the next-word (and next-action) predictor in Gmail and Hangouts chat, and I've never had a single person come to me and ask how this all works.
To me LLMs are incredibly simple. Next word, next sentence, next paragraph, and next answer are stacked attention layers which identify manifolds and run in reverse to keep the attention head on track for the next token. It's pretty straightforward math, and you can sit down and make a tiny LLM pretty easily on your home computer with a good-sized bag of words and context.
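As an existence proof at the very bottom of that ladder, here's a character-level bigram "LLM" that trains in seconds on a laptop (PyTorch; no attention stack, just the same next-token objective at toy scale):

    import torch
    import torch.nn.functional as F

    text = "the quick brown fox jumps over the lazy dog " * 50
    chars = sorted(set(text))
    stoi = {c: i for i, c in enumerate(chars)}
    data = torch.tensor([stoi[c] for c in text])

    V = len(chars)
    table = torch.nn.Embedding(V, V)   # bigram "LLM": logits for the next char given the current one
    opt = torch.optim.AdamW(table.parameters(), lr=1e-2)

    for step in range(200):
        i = torch.randint(0, len(data) - 1, (64,))
        logits = table(data[i])                     # predict the next character from the current one
        loss = F.cross_entropy(logits, data[i + 1])
        opt.zero_grad(); loss.backward(); opt.step()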
To me it's baffling that everyone goes around constantly saying that not even Nobel Prize winners know how this works, that it's a huge mystery.
Has anyone thought to ask the actual people, like me and others, who invented this?
When people talk about understanding, they mean knowing how the underlying mechanism works, often by finding an analog in real life.
A lot of people in tech thrive on the mystery and don't like explaining things in simple terms. It makes what they do seem more valuable if no one can understand what they're talking about. At the same time, being vague and mysterious can help hide someone's own misunderstandings. When you speak clearly you need to be accurate, because it's more obvious when you're wrong.
It’s interesting to read about the use and leverage of LLMs outside of programming.
I’m not too familiar with the history, but the import of this article is brushing up on my nose hairs in a way that makes me think a sort of neo-Sophistry is on the horizon.
It is really interesting how great and also how terrible LLMs can be at the same time. For example, I had a really annoying bug yesterday: I missed one character, "_". Asking ChatGPT for help led to a lot of feedback that was arguably okay but not currently relevant (because there was a fatal flaw in the code).
I remade the conversation with personal information stripped here: https://chatgpt.com/share/699fef77-b530-8007-a4ed-c3dda9461d...
> On reflection I have started to worry again. In 10 to 20 years nobody will read anything any more, they just will read LLM digests. So, the single most important task of a writer starting right now is to get your efforts wired in to the LLMs. Nothing you write will matter if it is not quickly adopted to the training dataset. As the art of pushing your results to the top of the google search was the 1990s game, getting your ideas into the LLMs is today’s. Refine is no different. It’s so good, everyone will use it. So whether refine and its cousins take a FTPL or new Keynesian view in evaluating papers is now all determining for where the consensus of the profession goes.
For more recent comments, see https://dwarkesh.com/p/gwern-branwen https://gwern.net/llm-writing https://www.lesswrong.com/posts/34J5qzxjyWr3Tu47L/is-buildin... https://gwern.net/blog/2025/ai-cannibalism https://gwern.net/blog/2025/good-ai-samples https://gwern.net/style-guide
The scaling will continue until morale improves. I advise people to skate to where the puck will be, and to ask themselves: "if I knew for a fact that LLMs could do something I am doing in 1-2 years, would I still want to do it? If not, what should I be doing now instead?"
Do you think the submitter intended this as an ad? His post history doesn't seem suspicious.
Or do you think the article's author wrote this as an ad? He's a reputable academic who seems impressed with an AI tool he used and is honestly sharing his thoughts.
For reference, he published the 80-page inflation mini-book two weeks ago asking for feedback: https://www.grumpy-economist.com/p/inflation
Personally I believe them, considering the content of the article.
I know this sounds insane but I've been dwelling on it. Language models are digital Ouija boards. I like the metaphor because it offers multiple conflicting interpretations. How does a Ouija board work? The words appear. Where do they come from? It can be explained in physical terms. Or in metaphysical terms. Collective summing of psychomotor activity. Conduits to a non-corporeal facet of existence. Many caution against the Ouija board as a path to self-inflicted madness, others caution against the Ouija board as a vehicle to bring poorly understood inhuman forces into the world.
There are two completely different ways to understand how a Ouija board works: occult and scientific.
Scientific: it's a combined response from the collective unconscious of everyone participating. In other words, it's a probabilistic result of an "answer" to the question everyone hears.
Occult: if an entity is present, it's basically the unshielded response of that entity, collectively moving everyone's body the same way, as a form of a mild channel. Since Ouija doesn't specify forming a circle and requesting the presence of a specific entity, there's a good chance of some being hostile. Or you all get nothing at all, and basically garbage as part of the divination/communication.
But comparing Ouija to LLMs? An LLM with the same weights, the same hyperparameters, and the same questions (and greedy decoding or a fixed sampling seed) will give the same answers. That is deterministic, at least in that narrow sense. A Ouija board is not deterministic, and cannot be tested in any meaningful scientific sense.
I have come to think “predict the next token” is not a useful way to explain how LLMs work to people unfamiliar with LLM training and internals. It’s technically correct, but at this point saying that and not talking about things like RLVR training and mechanistic interpretability is about as useful as framing talking with a person as “engaging with a human brain generating tokens” and ignoring psychology.
At least AI-haters don't seem to be talking about "stochastic parrots" quite so much now. Maybe they finally got the memo.
That is the exact thing to say, because that is exactly what it does, despite how it does so. It is not useful to say it if you are an AI-shill, though. You brought up AI-haters, so I think I am entitled to bring up AI-shills.
I think talking to people unfamiliar with LLM training using words like "RLVR training and mechanistic interpretability" is about as useful as a grave robber in a crematorium.
Must one be an "AI-hater" to use the term "stochastic parrot"? Which is probably in response to all the emergent AGI claims and pointless discussions about LLMs being conscious.
Sampling over a probability distribution is not as catchy as "stochastic parrot", but I have personally stopped telling believers that their imagined event horizon of transistor scale is not going to deliver them to their wished-for automated utopia, because one cannot reason with people who did not reach their conclusions by reasoning.
I prefer to use the term "spicy autocomplete" myself.
Technical concepts can be broken down into ideas anyone can understand if they're interested. Token prediction is at the core of what these tools do, and is a good starting point for more complex topics.
On the other hand, calling these tools "intelligent", capable of "reasoning" and "thought", is not only more confusing and can never be simplified, but dishonest and borderline gaslighting.
>I don’t know how you get here from “predict the next word.”
The question puts the horse behind the buggy. The main point isn't "from"; it is how you get to "predict the next word." During training the LLM builds inside itself a compressed, aggregated representation - a model - of what is fed into it. Given that model, you can "predict the next word", and you can do a lot of other things as well.
For a simple starting point toward understanding, I'd suggest looking back at the key foundational stone that started it all - the "sentiment neuron":
https://openai.com/index/unsupervised-sentiment-neuron/
"simply predicting the next character in Amazon reviews resulted in discovering the concept of sentiment.
...
Digging in, we realized there actually existed a single “sentiment neuron” that’s highly predictive of the sentiment value."
Economics is the attempt to take sociology and add numbers to make it look like a hard science. The fintechbros then seem to think that because they can make numbers go up, this proves it's a hard science.
We've banned this account. You can't post vile comments like this, no matter who or what it's about, and we obviously have to ban accounts that do this if we're to have any standards at all. You've been warned before about your style of commenting, so it's not that you don't know the expectations here.
I use LLMs for different research-related tasks and I surely can relate. In the past few months, the latest models have become better than me at many tasks. And I am not an ad.
Sort of the lowest-hanging fruit imaginable. Just because it became "fundamental" to the process doesn't mean it gained any quality.