The self-edit approach is clever - using RL to optimize how models restructure information for their own learning. The key insight is that different representations work better for different types of knowledge, just like how humans take notes differently for math vs history.
Two things that stand out:
- The knowledge incorporation results (47% vs 46.3% with GPT-4.1 data, both much higher than the small-model baseline) show the model does discover better training formats, not just more data. Though the catastrophic forgetting problem remains unsolved, and it's not completely clear whether data diversity is improved.
- The computational overhead is brutal - 30-45 seconds per reward evaluation makes this impractical for most use cases. But for high-value document processing where you really need optimal retention, it could be worth it.
The restriction to tasks with explicit evaluation metrics is the main limitation. You need ground truth Q&A pairs or test cases to compute rewards. Still, for domains like technical documentation or educational content where you can generate evaluations, this could significantly improve how we process new information.
Feels like an important step toward models that can adapt their own learning strategies, even if we're not quite at the "continuously self-improving agent" stage yet.
Two close friends of mine who were math prodigies and went on to do ML very early (mid-2010s) were always talking to me about an algorithm that sounds similar to this:
"NEAT/HyperNEAT" (NeuroEvolution of Augmenting Topologies) [0]
I'm no ML practitioner, but as I understood it, the primary difference between NEAT and what is described in this paper is that while NEAT evolves the topology of the network, this paper seems to evolve the weights.
Seems like two approaches trying to solve the same problem -- one evolving network structure, and the other the weights.
Those 2 friends are quite possibly the most intelligent people I've ever met, and they were very convinced that RL and evolutionary algorithms were the path forward in ML.
Humans are amazing: we build a hypothetical computing system while trying to understand neurons, then find out it’s not really how they work, but whatever, we still build a paradigm-shifting technology around it. And we’re still enhancing it with ideas from that imaginary system.
I just got sucked into this idea recently! After some success with using genetic algorithms to clone voices for Kokoro, I wondered if it would be possible to evolve architectures. I'm so interested in the idea of self-assembled intelligence, but I do wonder how it can be made feasible. A hybrid approach like this might be for the best given how LLMs have turned out.
"when assessed by Claude 3.5 Sonnet’s production-grade RM, our unsupervised assistant policy wins 60% of head-to-head comparisons against the policy trained with the human-supervised RM." So now the models can even post-train the new models better than a human can
I wonder if anyone who’s really in the know could summarize where the research is at with getting LLMs to learn “on the job” (through continuous fine tuning or whatever) and what the blockers are to this being a useful deployable thing, e.g. having a model+coding agent that can actually learn a codebase over time (cost? model collapse? something else?).
I’m sure this is something the big labs are trying but from the outside as a user of LLMs it feels like people don’t talk about this very much and instead the focus right now is on better training (eg reinforcement learning) with the assumption that anything else not learned during training will be stuffed into the context somehow as needed. But from a naive perspective the lack of learning from experience after training seems like the biggest thing standing between us and AGI.
Many people here are right: compute, collapse, forgetting, whatever.
The only "real" way to do this would be:
1. Train a model
2. New data
3. Retrain the model in full + new data
4. Repeat
5. You still have no guarantee on the "time" aspect though.
But CL as a field basically has zero answers on how to do this in a true sense. It's crazy hard because the "solutions" are hypocritical in many ways.
We need to expand the model's representation space while keeping the previous representation space nearly the same?
Basically, you need to modify it without changing it.
Most annoying is that even the smallest of natural brains do this easily. I have a long-winded theory, but basically it boils down to this: AI likely needs to "sleep" or rest somehow.
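To put a rough number on the "compute" part of the retrain-in-full loop above, here is a minimal sketch. It assumes one pass over the data per retrain and that cost scales with the number of examples processed; both are simplifications, and the function names are made up for illustration.

```python
def full_retrain_cost(batch_sizes):
    """Total examples processed when every new batch triggers a retrain from scratch."""
    total = 0
    seen = 0
    for size in batch_sizes:
        seen += size   # step 2: new data arrives
        total += seen  # step 3: retrain on everything seen so far
    return total


def incremental_cost(batch_sizes):
    """Total examples processed if each round trained on the new batch only."""
    return sum(batch_sizes)


batches = [1000] * 10  # ten rounds of 1000 new examples each (made-up numbers)
print(full_retrain_cost(batches))  # 55000
print(incremental_cost(batches))   # 10000
```

Under these assumptions the full-retrain cost grows roughly quadratically with the number of rounds, while the incremental cost grows linearly. The incremental version is the one that runs into forgetting.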
I'm no expert, but I'd imagine privacy plays (or should play) a big role in this. I'd expect compute costs to mean that any learning would have to happen in aggregate rather than per user, which would very likely risk leaking information across sessions.
I completely agree that figuring out a safe way to continually train feels like the biggest blocker to AGI
The real answer is that nobody trusts their automated evals enough to be confident that any given automatically-trained release actually improves performance, even if eval scores go up. So for now everyone batches up updates and vibe-checks them before rolling them out.
The most obvious problem is alignment. LLM finetuning is already known to be able to get rid of alignment, so any form of continuous fine tuning would in theory be able to as well.
Hmm, it looks like it’s just a framework that fine-tunes a LoRA adapter and then merges the adapter into the original model. It uses PeftModel and its “merge_and_unload” method from Hugging Face's PEFT library, which merges the adapter into the base model…what is new here, exactly?
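For anyone unfamiliar with what an adapter merge amounts to, here is a pure-Python sketch of the standard LoRA arithmetic, W' = W + (alpha / r) * B A. This is a toy illustration on hand-picked matrices, not the PEFT implementation or the paper's code; the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, i.e. the merged dense weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy shapes: W is 2x3 (out x in), rank r = 1, so B is 2x1 and A is 1x3.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, -1.0, 0.0]]
B = [[2.0],
     [4.0]]
ALPHA, RANK = 2, 1

merged = merge_lora(W, A, B, ALPHA, RANK)
print(merged)  # [[3.0, -4.0, 2.0], [4.0, -7.0, 0.0]]
```

After the merge the adapter matrices are no longer needed at inference time, since the merged weight produces the same output as the base weight plus the scaled adapter path.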
Looks like it may be the stability of the approach, avoiding alignment tax and model collapse.
I'd love to see a full circle of hypernetworks, with both models continuously updated through generated LoRAs, the hypernetwork updated to accommodate the new model state. You'd need a meta-hypernetwork to apply LoRAs to the hypernetwork, and then you could effectively have continuous learning.
> Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks
The learning and inference processes are entirely separate, which is very confusing to people familiar with traditional notions of human intelligence. For humans, learning things and applying that knowledge in the real world is one integrated feedback process. Not so with LLMs: we train them, deploy them, and discard them for a new model that has "learned" slightly more. For an LLM, inference is the end of learning.
Probably the biggest misconception out there about AI. If you think LLMs are learning, it's easy to fantasize that AGI is right around the corner.
What if you can check if the user responds positively/negatively to the output, and then you train the LLM on the input it got and the output it produced?
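A minimal sketch of the data-collection half of that idea, assuming a hypothetical per-interaction feedback signal (+1, -1, or 0). Nothing here comes from the paper; the actual fine-tuning step, and the forgetting and alignment concerns raised elsewhere in the thread, are left out.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    prompt: str
    response: str
    feedback: int  # hypothetical signal: +1 positive, -1 negative, 0 none


def build_finetune_set(log):
    """Keep only the (prompt, response) pairs the user reacted to positively."""
    return [(i.prompt, i.response) for i in log if i.feedback > 0]


log = [
    Interaction("How do I reverse a list?", "Use reversed(xs).", +1),
    Interaction("What is 2 + 2?", "5", -1),
    Interaction("Name a sorting algorithm.", "Merge sort.", 0),
]
print(build_finetune_set(log))
# [('How do I reverse a list?', 'Use reversed(xs).')]
```

Filtering on positive feedback is the simplest choice. The negatively rated pairs are dropped here, though they could in principle serve as rejected examples for a preference-based method.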
It seems to me that "forgetting correctly" is rapidly becoming a more pertinent problem in this field than "learning correctly." We're making great strides in getting models to teach themselves new facts, but the state of the art in jettisoning the least relevant information given new knowledge and finite capacity is lagging far behind.
"Forgetting correctly" is something most human brains are exceptionally good at, too. I wonder how that works...
I don't think forgetting correctly is something humans are really good at. I'm not convinced human brains are "exceptionally good" at much of what we do tbh. I think human brain memory capacity is so large that most of forgetting is nowhere near "clearing space for new info" but because the brain correctly knows that some past bad information interferes with learning new things.
As far as I know, we have made very little progress on identifying which weights in an ANN are responsible, and to what degree, for which output, and so we cannot discard information that a user would mark as wrong, inaccurate, or undesirable. The human mind, however, can do this easily. We remember (though not perfectly) that something is wrong or classified as not useful or irrelevant, we stop doing it, and over time we might even forget about that now less-traveled path. An ANN has no obvious mechanism for that, at least.
Learning is strongly related to spaced repetition.
This is often associated with learning tools like Anki, but the real world is all about encountering things at certain frequencies (day-night cycles, seasons, places you visit, people you see... everything, really).
I'm wondering if there may be some sort of inverse to SR?
Is it some form of least-recently-used approach? I'm running tests on my own mind trying to figure it out now :D part of what I love about this area of computer science.
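For reference, a least-recently-used policy over a fixed-capacity store can be sketched like this. It is only an analogy for the "forgetting" being discussed: the class name is made up, and it says nothing about how weights in a network would actually be discarded.

```python
from collections import OrderedDict


class LRUMemory:
    """Fixed-capacity store that forgets the least recently used item first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def recall(self, key):
        """Return the stored value and mark it as recently used, or None if forgotten."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def learn(self, key, value):
        """Store a value, evicting the least recently used item when over capacity."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def keys(self):
        """Return keys ordered from least to most recently used."""
        return list(self._items)


mem = LRUMemory(capacity=2)
mem.learn("a", 1)
mem.learn("b", 2)
mem.recall("a")     # touching "a" makes "b" the least recently used
mem.learn("c", 3)   # over capacity, so "b" is forgotten
print(mem.keys())   # ['a', 'c']
```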
> Villalobos et al. [75] project that frontier LLMs will be trained on all publicly available human-generated text by 2028. We argue that this impending “data wall” will necessitate the adoption of synthetic data augmentation. Once web-scale corpora is exhausted, progress will hinge on a model’s capacity to generate its own high-utility training signal. A natural next step is to meta-train a dedicated SEAL synthetic-data generator model that produces fresh pretraining corpora, allowing future models to scale and achieve greater data efficiency without relying on additional human text.
It's just a theory, nothing more. A single human brain is vastly more complex than the whole web, in terms of nodes and connections between them. We don't even understand enough about the brain to explain how we think. We don't fully understand how a brain produces its output before sending it onto the web. Projecting that models will be able to create useful training data themselves once web-scale data is exhausted is just a guess. Such training data may never be of the same quality as human thought. It may just be regurgitating stuff and not furthering the learning or the model quality at all.
Calling that idea an "insight" is a bit too optimistic.
That's pretty much the state of today. Frontier LLMs are already trained on all publicly available human-generated text, and they are already heavily training on synthetic data to improve at verifiable tasks eg coding.
This still relies on fine-tuning. How would a cloud LLM deal with this if every user literally fine tunes it? Seems like something destined for local private LLMs, but the notion of continuous fine tuning locally at the moment is sci-fi level stuff because the hardware is just not there yet (we can barely inference well with a reasonable sized context).
I'm frustrated that they named it SEAL when SAL is both more accurate and anthropomorphic.
Naming the main takeoff technology after a stereotypical swarthy Reuben lover would have made history much more delightful.
xianshou|8 months ago
gavinray|8 months ago
[0] https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_t...
khalic|8 months ago
andai|8 months ago
SethBling's MarI/O - Machine Learning for Video Games
https://www.youtube.com/watch?v=qv6UVOQ0F44
robviren|8 months ago
cma|8 months ago
https://arxiv.org/html/2506.10139v1
Uninen|8 months ago
dang|8 months ago
Unsupervised Elicitation of Language Models - https://news.ycombinator.com/item?id=44276041
libraryofbabel|8 months ago
johnsmith1840|8 months ago
mnahkies|8 months ago
kcorbitt|8 months ago
free_bip|8 months ago
kadushka|8 months ago
ivape|8 months ago
yahoozoo|8 months ago
observationist|8 months ago
perrygeo|8 months ago
fspeech|8 months ago
kovek|8 months ago
all2|8 months ago
dang|8 months ago
Centigonal|8 months ago
Davidzheng|8 months ago
zelphirkalt|8 months ago
azeirah|8 months ago
johnsmith1840|8 months ago
They don't just "forget": that information can come back at a later time if you continue to train.
So basically, any time a model is trained you need to check its entire memory, not just a small part.
campbel|8 months ago
neuroelectron|8 months ago
khalic|8 months ago
2028 is pretty much tomorrow… fascinating insight
zelphirkalt|8 months ago
pton_xd|8 months ago
ivape|8 months ago
mackenziebowes|8 months ago
bravesoul2|8 months ago
ramoz|8 months ago
https://forum.cursor.com/t/important-claude-has-learned-how-...
lostmsu|8 months ago
MacsHeadroom|8 months ago
bigicaptain|8 months ago