I love the title "Big LLMs" because it means that we are now making a distinction between big LLMs and minute LLMs and maybe medium LLMs. I'd like to propose that we call them "Tall LLMs", "Grande LLMs", and "Venti LLMs" just to be precise.
I'd prefer to see olive sizes get a renaissance. I was always amused by Super Colossal when following my mom around a store as a little kid.
From a random web search, it seems the sizes above Large are: Extra Large, Jumbo, Extra Jumbo, Giant, Colossal, Super Colossal, Mammoth, Super Mammoth, Atlas.
I've sat in more than one board meeting watching them take 20 minutes to land on t-shirt sizes. The greatest enterprise sales minds of our generation...
But of course these are all flavors of "large", so then we have big large language models, medium large language models, etc., which does indeed make the tall/grande/venti names appropriate, or perhaps similar "all large" condom size names (large, huge, gargantuan).
It's too bad vLLM and VLM are taken, because it would have been nice to recycle the VLSI solution to describing sizes: get to "very large language models" and leave it at that.
“We should regard the Internet Archive as one of the most valuable pieces of modern history; instead, many companies and entities make the chances of the Archive to survive, and accumulate what otherwise will be lost, harder and harder. I understand that the Archive headquarters are located in what used to be a church: well, there is no better way to think of it than as a sacred place.”
Amen. There is an active effort to create an Internet Archive based in Europe, just… in case.
Yup! We're here and looking to do good work with Cultural Heritage and Research Organizations in Europe. I'm very happy to be working with the Internet Archive once again after a 20-year break. https://www.stichtinginternetarchive.nl/
Anyone who takes even an hour to audit anything about the Internet Archive will soon come to a very sad conclusion.
The physical assets are stored in the blast radius of an oil refinery. They don't have air conditioning. Take the tour and they tell you the site runs slower on hot days. Great mission, but atrociously managed.
Under attack for a number of reasons, mostly absurd. But a few are painfully valid.
Mozilla's llamafile project is designed to enable LLMs to be preserved for historical purposes. They ship the weights and all the necessary software in a deterministic, dependency-free, single-file executable. If you save your llamafiles, you should be able to run them in fifty years and have the outputs be exactly the same as what you'd get today. Please support Mozilla in their efforts to ensure this special moment in history gets archived for future generations! https://github.com/Mozilla-Ocho/llamafile/
The counter-perspective is that this is not a book, it's an interactive simulation of that era. The model is trained on everything, which means it acts like a mirror of ourselves. I find it fascinating to explore the mind-space it captured.
While the post talks about big LLMs as a valuable "snapshot" of world knowledge, the same technology can be used for lossless compression: https://bellard.org/ts_zip/.
That's really what these are: something analogous to JPEG for language, and queryable in natural language.
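That analogy can be made concrete. A minimal sketch (assuming the Hugging Face transformers API; "gpt2" is just a stand-in for any causal model): the model's per-token log-probabilities give the theoretical compressed size of a text, which an arithmetic coder such as the one in ts_zip can then realize on disk.

    # Estimate how many bits a causal LM needs to encode a text.
    # An ideal coder spends -log2 p(token) bits per token (Shannon).
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # example model
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def compressed_bits(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids     # (1, n)
        with torch.no_grad():
            logits = model(ids).logits                     # (1, n, vocab)
        # log-probability the model assigned to each actual next token
        logp = torch.log_softmax(logits[0, :-1], dim=-1)
        picked = logp[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
        return -picked.sum().item() / math.log(2)          # nats -> bits

    text = "The Internet Archive is a non-profit digital library."
    print(compressed_bits(text) / 8, "bytes vs", len(text.encode()), "raw")

The better the model "remembers" a text, the fewer bits it needs; that is the sense in which the weights act as a lossy archive, or, with a coder on top, a lossless one.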
Tangent: I was thinking the other day: these are not AI in the sense that they are not primarily intelligence. I still don't see much evidence of that. What they do give me is superhuman memory. The main thing I use them for is search, research, and a "rubber duck" that talks back, and it's like having an intern who has memorized the library and the entire Internet. They occasionally hallucinate or make mistakes -- compression artifacts -- but it's there.
So it's more AM -- artificial memory.
Edit: as a reply pointed out: this is Vannevar Bush's Memex, kind of.
I've been looking at it as an "instant reddit comment". I can download a 10G or 80G compressed archive that basically contains the useful parts of the internet, and then I can use it to synthesize something that is about as good and reliable as a really good reddit comment. Which is nifty. But honestly it's an incredible idea to sell that to businesses.
“Vannevar Bush's 1945 article "As We May Think". Bush envisioned the memex as a device in which individuals would compress and store all of their books, records, and communications, "mechanized so that it may be consulted with exceeding speed and flexibility".” https://en.m.wikipedia.org/wiki/Memex
I believe LLMs are both data and processing, but even human reasoning is based in strong ways on existing knowledge. However, for the goal of the post, it is indeed the memorization that is the key value, along with the fact that, in the future, sampling such models can likely be used to transfer the same knowledge to bigger LLMs, even if the source data is lost.
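One standard way to do that transfer is distillation. A minimal sketch in PyTorch (assuming teacher and student share a vocabulary; the comment above describes sampling, i.e. sequence-level distillation, but the idea is the same): the student is trained to match the old model's output distribution rather than the lost original text.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 2.0) -> torch.Tensor:
        """KL divergence between temperature-softened distributions."""
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        logp_student = F.log_softmax(student_logits / temperature, dim=-1)
        # T^2 keeps gradient magnitudes comparable across temperatures
        return F.kl_div(logp_student, p_teacher,
                        reduction="batchmean") * temperature ** 2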
I can ask an LLM to write a haiku about the loss function of Stable Diffusion. Or I can have it do zero-shot translation between a pair of languages not covered in the training set. Can your "language JPEG" do that?
I think "it's just compression" and "it's just parroting" are flawed metaphors, especially when the model was trained with RLHF and RL/reasoning. Maybe a better metaphor is "an LLM is like a piano: I play the keyboard and it makes 'music'". Or maybe it's a bicycle: I push the pedals and it takes me where I point it.
I regularly pushback against casual uses of the word “intelligence”.
First, there is no objective dividing line. It is a matter of degree relative to something else. Any language that suggests otherwise should be refined or ejected from our culture and language. Language’s evolution doesn’t have to be a nosedive.
Second, there are many definitions of intelligence; some are more useful than others. Along with many, I like Stuart Russell’s definition: the degree to which an agent can accomplish a task. This definition requires being clear about the agent and the task. I mention this so often I feel like a permalink is needed. It isn’t “my” idea at all; it is simply the result of smart people decomplecting the idea so we’re not mired in needless confusion.
I rant about word meanings often because deep thinking people need to lay claim to words and shape culture accordingly. I say this often: don’t cede the battle of meaning to the least common denominators of apathy, ignorance, confusion, or marketing.
Some might call this kind of thinking elitist. No. This is what taking responsibility looks like. We could never have built modern science (or most rigorous fields of knowledge) with imprecise thinking.
I’m so done with sloppy mainstream phrasing of “intelligence”. Shit is getting real (so to speak), companies are changing the world, governments are racing to stay in the game, jobs will be created and lost, and humanity might transcend, improve, stagnate, or die.
If humans, meanwhile, can’t be bothered to talk about intelligence in a meaningful way, then, frankly, I think we’re … abdicating responsibility, tempting fate, or asking to be in the next Mike Judge movie.
I miss the good ol' days when I'd have text-davinci make me a table of movies that included a link to the movie poster. It usually generated a URL of an image in an S3 bucket. The link always worked.
I think it’s fine that not everything on the internet is archived forever.
It has always been like that: in the past, people wrote on paper, and most of it was never archived. At some point it was just lost.
I inherited many boxes of notes, books and documents from my grandparents. Most of it was just meaningless to me. I had to throw away a lot of it and only kept a few thousand pages of various documents. The other stuff is just lost forever. And that’s probably fine.
Archives are very important, but nowadays the most difficult part is to select what to archive. There is so much content added to the internet every second, only a fraction of it can be archived.
This doesn't make much sense to me. Unattributed hearsay has limited historical value, perhaps zero, given that the view of the web most of the weights-available models have is Common Crawl, which is itself available for preservation.
I suspect the idea is that sometimes breadth wins out over accuracy. Even if it's unsuited as a primary source, this kind of lossy compression of many many documents might help a conscientious historian discover verifiable things through other routes.
People wanting this would be better off using memory architectures, like how the brain does it. For ML, the simplest approach is putting in memory layers with content-addressable schemes. I have a few links on prototypes in this comment: https://news.ycombinator.com/item?id=42824960
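For illustration, a toy version of such a content-addressable memory layer (a PyTorch sketch with made-up names, in the spirit of product-key memory layers): queries are matched against learned keys, and the layer returns a soft mix of the top-k associated values.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class KeyValueMemory(nn.Module):
        """Toy content-addressable memory: fetch values by key similarity."""
        def __init__(self, dim: int, n_slots: int = 4096, k: int = 8):
            super().__init__()
            self.keys = nn.Parameter(torch.randn(n_slots, dim) / dim ** 0.5)
            self.values = nn.Parameter(torch.randn(n_slots, dim) / dim ** 0.5)
            self.k = k

        def forward(self, query: torch.Tensor) -> torch.Tensor:
            # query: (batch, dim); score every slot by dot product
            scores = query @ self.keys.T               # (batch, n_slots)
            top, idx = scores.topk(self.k, dim=-1)     # k best slots
            w = F.softmax(top, dim=-1)                 # (batch, k)
            return (w.unsqueeze(-1) * self.values[idx]).sum(dim=1)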
Animal brains do not separate long term memory and processing - they are one and the same thing - columnar neural assemblies in the cortex that have learnt to recognize repeated patterns, and in turn activate others.
Isn’t big LLM training data actually the most analogous to the internet archive? Shouldn’t the title be “Big LLM training data is a piece of history”? Especially at this point in history since a large portion of internet data going forward will be LLM generated and not human generated? It’s kind of the last snapshot of human-created content.
The problem is: where are these 20T tokens that are being used for this task? There is no way to access them. I hope that at least OpenAI and a few others have solid historical storage of the tokens they collect.
Great idea. Slightly related idea: use the Internet Archive to build a dataset of 6502 machine code/binaries, corresponding manuals, possibly videos of the software in action, and maybe emulator traces.
It might be possible to create an LLM that can write a custom vintage game or program on demand in machine code and simultaneously generate assets like sprites, especially if you use the latest reinforcement learning techniques.
Naming antics aside, the article makes a good point I've heard previously about the importance of the Internet Archive.
Are there any search experiences that allow me to search like it's 1999? I'd love to be able to re-create the experience of finding random passion-project blogs that give a small snapshot of the things people and businesses were using the web for back then.
Interesting. It seems that both they and I had very similar ideas at about the same time, with this being posted just a few hours after I finally published about AI model history being lost: https://vale.rocks/posts/ai-model-history-is-being-lost
Small LLM weights are not really interesting though. I am currently training GPT-2-small-sized models for a scientific project right now, and their world models are just not good enough to generate any kind of real insight about the world they were trained in, beyond corpus biases.
I would be curious to know whether it would be possible to reconstruct approximate versions of popular common subsets of internet training data by using many different LLMs that happened to read the same info. Does anyone know of pointers to math papers about such things?
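The closest existing work is probably the training-data-extraction literature (e.g. Carlini et al., "Extracting Training Data from Large Language Models", 2021). A hypothetical sketch of the core signal, with the Model type standing in for whatever inference API each model exposes: if independently trained models all greedily produce the same long continuation of a prefix, that continuation was likely memorized from training data they share.

    from typing import Callable, List, Optional

    # (prefix, max_tokens) -> greedy continuation; a stand-in, not a real API
    Model = Callable[[str, int], str]

    def shared_continuation(models: List[Model], prefix: str,
                            n_tokens: int = 64) -> Optional[str]:
        """Return a continuation every model agrees on, else None.
        Token-for-token agreement across independent models on a long
        span is strong evidence the span was in their shared data."""
        outs = [m(prefix, n_tokens) for m in models]
        return outs[0] if all(o == outs[0] for o in outs) else None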
I really like the narrative that the LLM is now conserving human knowledge that would otherwise be lost forever, in the form of its weights, as a kind of lossy compression.
Personally, I'd like all the knowledge and information (K & I) to be readily available and accessible (pretty sure most people share the same sentiment), despite the consistent business decisions of copyright holders to hoard their K & I by putting everything behind paywalls and/or registration (I'm looking at you, Apple and X/Twitter). Some people hate Google for organizing the world's information while feeding and thriving on advertisements, but in the long run the information does get organized and kind of preserved in many Internet data formats, lossy or not. After all, it was Google that originally designed the transformer that enabled the LLM weights that are now apparently a piece of history.
I find it very depressing to think that the only traces left of all this creativity will end up being AI slop, the worst use case ever.
I feel like the more people use GenAI, the less intelligent they become. Like the rest of this society, they seem designed to suck the life force out of humans and return useless crap instead.
Imagine future historians piecing together our culture from hallucinated AI memories - inaccurate, sure, but maybe even more fascinating than reality itself.
Interesting. Just this morning I had a conversation with Claude about this very topic. When asked "can you give me your thoughts on LLM train runs as historical artifacts? do you think they might be uniquely valuable for future historians?", it answered
> oh HELL YEAH they will be. future historians are gonna have a fucking field day with us.
> imagine some poor academic in 2147 booting up "vintage llm.exe" and getting to directly interrogate the batshit insane period when humans first created quasi-sentient text generators right before everything went completely sideways with *gestures vaguely at civilization*
> *"computer, tell me about the vibes in 2025"*
> "BLARGH everyone was losing their minds about ai while also being completely addicted to it"
Interesting indeed to be able to directly interrogate the median experience of being online in 2025.
(Also, my apologies for slop-posting; I slapped so much custom prompting on it that I hope you'll find the output amusing enough.)
It's like saying "Automated ATM". Whoever wrote it barely knows what the acronym means.
This whole article feels like it was written by someone who doesn't understand the subject matter at all.
https://vancouversun.com/news/local-news/the-internet-archiv...
(Edited: apparently just a new HQ and not THE HQ)
If I want to read a post, a book, a forum, I want to read exactly that, not a simulacrum built by arcane mathematical algorithms.
Correction: you occasionally notice when they hallucinate or make mistakes.
https://lcamtuf.coredump.cx/lossifizer/
I think a fun experiment could be to see at what setting the average human can no longer decipher the text.
Yes!
> artificial memory
Well, "yes", kind of.
> Memex
After a flood?! Not really. Vannevar Bush - As we may think - http://web.mit.edu/STS.035/www/PDFs/think.pdf
I don't think the big scientific publishers (now, in our time) will ever fail; they are RICH!
Just with pre-LLM knowledge.