WingNews

rain1|8 months ago

It's extremely interesting how powerful a language model is at compression.

When you train it to be an assistant model, it's better at compressing assistant transcripts than it is general text.

There is an eval that I have a lot of interest in and respect for, UncheatableEval (https://huggingface.co/spaces/Jellyfish042/UncheatableEval), which tests how good a language model an LLM is by applying it to a range of compression tasks.

This task is essentially impossible to 'cheat'. Compression is a benchmark you cannot game!
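To make the scoring idea concrete, here is a minimal sketch of how a predictive model turns into a compression score: each symbol costs -log2(p) bits, where p is the probability the model gave it before seeing it. This is not UncheatableEval's code. The toy adaptive byte model, the sample text, and the function names are all assumptions for illustration, and zlib is included only as a familiar baseline.

```python
import math
import zlib
from collections import Counter


def adaptive_code_length_bits(data: bytes) -> float:
    """Ideal code length, in bits, under a toy adaptive order-0 byte model.

    Each byte is predicted from counts of the bytes seen so far (with
    add-one smoothing over all 256 values) and then costs -log2(p) bits.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for byte in data:
        p = (counts[byte] + 1) / (seen + 256)
        total_bits += -math.log2(p)
        counts[byte] += 1  # the model only learns from data it has already paid for
        seen += 1
    return total_bits


def bits_per_byte(total_bits: float, data: bytes) -> float:
    """Normalize a code length so texts of different sizes are comparable."""
    return total_bits / len(data)


# Hypothetical sample text; a real eval would use fresh, unseen documents.
sample = ("the cat sat on the mat and the dog sat on the log. " * 40).encode("utf-8")

model_bpb = bits_per_byte(adaptive_code_length_bits(sample), sample)
zlib_bpb = bits_per_byte(8 * len(zlib.compress(sample, 9)), sample)

print(f"toy adaptive model: {model_bpb:.3f} bits/byte")
print(f"zlib level 9:       {zlib_bpb:.3f} bits/byte")
```

Swapping the toy model for an LLM's next-token probabilities gives the same kind of number. Lower is better, and the only way to lower it on unseen text is to predict that text better.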

soulofmischief|8 months ago

Knowledge is learning relationships by decontextualizing information into generalized components. Application of knowledge is recontextualizing these components based on the problem at hand.

This is essentially just compression and decompression. It's just that with prior compression techniques, we never tried leveraging the inherent relationships encoded in a compressed data structure, because our compression schemes did not leverage semantic information in a generalized way and thus did not encode very meaningful relationships other than "this data uses the letter 'e' quite a lot".

A lot of that comes from the sheer amount of data we throw at these models, which provides enough substrate for semantic compression. Compare that to common compression schemes in the wild, where data is compressed in isolation without contributing its information to some model of the world. It turns out that because of this, we've been leaving a lot on the table with regard to compression. Another factor has been the speed/efficiency tradeoff. GPUs have allowed us to put a lot more into efficiency, and the expectation that many language models only need to produce text as fast as a human can read it means that we can optimize even further for efficiency over speed.

Also, shout out to Fabrice Bellard's ts_zip, which leverages LLMs to compress text files. https://bellard.org/ts_zip/
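The "uses the letter 'e' quite a lot" remark above corresponds to an order-0 model, which only knows symbol frequencies. A rough sketch of how much even one character of context helps, assuming a made-up sample sentence and in-sample empirical estimates (which flatter the context model on short texts):

```python
import math
from collections import Counter, defaultdict


def order0_entropy(text: str) -> float:
    """Empirical entropy in bits per character from letter frequencies alone."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def order1_entropy(text: str) -> float:
    """Empirical conditional entropy H(next char | previous char), bits per character."""
    followers_by_context = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        followers_by_context[prev][nxt] += 1
    n_pairs = len(text) - 1
    total = 0.0
    for followers in followers_by_context.values():
        context_total = sum(followers.values())
        for c in followers.values():
            total -= c / n_pairs * math.log2(c / context_total)
    return total


# Hypothetical sample; any English text shows the same direction of effect.
sample = ("knowledge is learning relationships by decontextualizing "
          "information into generalized components ") * 20

h0 = order0_entropy(sample)
h1 = order1_entropy(sample)
print(f"frequencies only:      {h0:.3f} bits/char")
print(f"one char of context:   {h1:.3f} bits/char")
```

A language model pushes the same idea much further: its context is thousands of tokens, and its statistics come from a large corpus rather than from the file being compressed.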

MPSimmons|8 months ago

Agreed. It's basically lossy compression for everything it's ever read. And the quantization impacts the lossiness, but since a lot of text is super fluffy, we tend not to notice as much as we would when we, say, listen to music that has been compressed in a lossy way.

exe34|8 months ago

Wikipedia is about 24GB, so if you're allowed to drop 2/3 of the details and make up the missing parts by splicing in random text, 8GB doesn't sound too bad.

To me the amazing thing is that you can tell the model to do something, even follow simple instructions in plain English, like make a list or write some python code to do $x. That's the really amazing part.

Nevermark|8 months ago

It blows my mind that I can ask for 50 synonyms, instantly get a great list with great meaning summaries.

Then ask for the same list sorted and get that nearly instantly.

These models have a short time context for now, but they already have a huge “working memory” relative to us.

It is very cool. And indicative that vastly smarter models are going to be achieved fairly easily, with new insight.

Our biology has had to ruthlessly work within our biological/ecosystem energy envelope, and with the limited value/effort returned by a pre-internet pre-vast economy.

So biology has never been able to scale. Just get marginally more efficient and effective within tight limits.

Suddenly, (in historical, biological terms), energy availability limits have been removed, and limits on the value of work have compounded and continue to do so. Unsurprising that those changes suddenly unlock easily achieved vast untapped room for cognitive upscaling.

b112|8 months ago

Not to mention, Language Modeling is Compression https://arxiv.org/pdf/2309.10668

So text wikipedia at 24G would easily hit 8G with many standard forms of compression, I'd think. If not better. And it would be 100% accurate, full text and data. Far more usable.

It's so easy for people to not realise how massive 8GB really is, in terms of text. Especially if you use ascii instead of UTF.
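For a feel of what "standard forms of compression" do to text, here is a small round-trip sketch using Python's standard-library codecs. The sample string is a made-up, highly repetitive stand-in, so the ratios it prints are far better than real encyclopedia text would give and say nothing about Wikipedia itself. The point is only that the output is bit-for-bit identical to the input.

```python
import bz2
import lzma
import zlib

# Hypothetical stand-in text; real prose is much less repetitive than this.
text = ("Wikipedia is a free online encyclopedia written and maintained "
        "by a community of volunteers. ") * 200
raw = text.encode("utf-8")

codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

ratios = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(raw)
    # Lossless: decompressing must give back exactly the original bytes.
    assert decompress(packed) == raw
    ratios[name] = len(raw) / len(packed)
    print(f"{name:5s} {len(raw):>7d} -> {len(packed):>5d} bytes  ({ratios[name]:.1f}x)")
```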

thecosas|8 months ago

A neat project you (and others) might want to check out: https://kiwix.org/

Lots of various sources that you can download locally to have available offline. They're even providing some pre-loaded devices in areas where there may not be reliable or any internet access.

nico|8 months ago

For reference (according to Google):

> The English Wikipedia, as of June 26, 2025, contains over 7 million articles and 63 million pages. The text content alone is approximately 156 GB, according to Wikipedia's statistics page. When including all revisions, the total size of the database is roughly 26 terabytes (26,455 GB)

sharkjacobs|8 months ago

better point of reference might be pages-articles-multistream.xml.bz2 (current pages without edit/revision history, no talk pages, no user pages) which is 20GB

https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?

pcrh|8 months ago

Wikipedia itself describes its size as ~25GB without media [0]. And it's probably more accurate and with broader coverage in multiple languages compared to the LLM downloaded by the GP.

[0] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

mapt|8 months ago

What happens if you ask this 8gb model "Compose a realistic Wikipedia-style page on the Pokemon named Charizard"?

How close does it come?

tasuki|8 months ago

8.1 GB is a lot!

It is 64,800,000,000 bits.

I can imagine 100 bits sure. And 1,000 bits why not. 10,000 you lose me. A million? That sounds like a lot. Now 64 million would be a number I can't well imagine. And this is a thousand times 64 million!
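The arithmetic above checks out, assuming decimal gigabytes (10^9 bytes); the comment does not say whether GB or GiB is meant.

```python
GB = 10**9  # assumption: decimal gigabytes, not GiB (2**30 bytes)

model_bytes = 81 * GB // 10      # 8.1 GB, kept as an exact integer
model_bits = model_bytes * 8

print(f"{model_bits:,} bits")    # 64,800,000,000 bits

# How many times "64 million" fits into that bit count.
times_64_million = model_bits / 64_000_000
print(times_64_million)          # 1012.5, i.e. roughly a thousand
```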

swyx|8 months ago

the study of language models from an information theory/compression POV is a small field but increasingly important for efficiency/scaling - we did a discussion about this today https://www.youtube.com/watch?v=SWIKyLSUBIc&t=2269s

divbzero|8 months ago

The Encyclopædia Britannica has about 40,000,000 words [1] or about 0.25 GB if you assume 6 bytes per word. It’s impressive but not outlandish that an 8.1 GB file could encode a large swath of human information.

[1]: https://en.wikipedia.org/wiki/Encyclopædia_Britannica
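The estimate above as a quick calculation. The 6 bytes per word figure is the comment's own assumption (roughly five letters plus a space), and gigabytes are taken as decimal.

```python
words = 40_000_000
bytes_per_word = 6          # the comment's assumption, not a measured value

britannica_bytes = words * bytes_per_word
britannica_gb = britannica_bytes / 10**9
print(f"{britannica_bytes:,} bytes, about {britannica_gb:.2f} GB uncompressed")

# How many uncompressed Britannicas would fit in the 8.1 GB file.
model_gb = 8.1
copies = model_gb / britannica_gb
print(f"about {copies:.0f} copies")
```

So the "about 0.25 GB" in the comment is 0.24 GB rounded up, and the file is a few dozen times larger than the raw text of the encyclopedia.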

unknown|8 months ago

[deleted]

agumonkey|8 months ago

Intelligence is compression some say

Nevermark|8 months ago

Very much so!

The more and faster a “mind” can infer, the less it needs to store.

Think how many fewer facts a symbolic system that can perform calculus needs to store, vs. an algebraic, or just arithmetic system, to cover the same numerical problem solving space. Many orders of magnitude fewer.

The same goes for higher orders of reasoning. General or specific subject related.

And higher order reasoning vastly increases capabilities extending into new novel problem spaces.

I think model sizes may temporarily drop significantly, after every major architecture or training advance.

In the long run, “A circa 2025 maxed M3 Ultra Mac Studio is all you need!” (/h? /s? Time will tell.)

tshaddox|8 months ago

Some say that. But what I value even more than compression is the ability to create new ideas which do not in any way exist in the set of all previously-conceived ideas.

goatlover|8 months ago

How well does that apply to robotics or animal intelligence? Manipulating the real world is more fundamental to human intelligence than compressing text.

hamilyon2|8 months ago

Crystallized intelligence is. I am not sure about fluid intelligence.

penguin_booze|8 months ago

I don't know why, but I was reminded of Douglas Hofstadter's talk: Analogy is cognition: https://www.youtube.com/watch?v=n8m7lFQ3njk&t=964s.

dgrabla|8 months ago

Back in the '90s, we joked about putting “the internet” on a floppy disk. It’s kind of possible now.

Lu2025|8 months ago

Yeah, those guys managed to steal the internet.

Wowfunhappy|8 months ago

How does this compare to, say, the compression ratio of a lossless 8K video and a 240p Youtube stream of the same video?

mr_toad|8 months ago

I will never tire of pointing out that machine learning models are compression algorithms, not compressed data.

inopinatus|8 months ago

I kinda made an argument the other day that they are high-dimensional lossy decompression algorithms, which might be the same difference but looking the other way through the lens.

dcl|8 months ago

ML algorithms are compression algorithms, the trained models are compressed data.

unknown|8 months ago

[deleted]

ysofunny|8 months ago

they're an upgraded version of self-executable zip files that compress knowledge like mp3 compresses music, without knowing exactly wtf either music or knowledge is

the self-execution is the interactive chat interface.

wikipedia gets "trained" (compiled+compressed+lossy) into an executable you can chat with, and you can pass this through another pretrained A.I. that can talk out the text or transcribe it.

I think writing compilers is now officially a defunct skill, of historical and conservation interest more than anything else; but I don't like saying "conservation", it's a bad framing. I'd rather say "legacy connectivity", which is a form of continuity or backwards compatibility

Nevermark|8 months ago

It is truly incredible.

One factor is the huge redundancy pervasive in our communication.

(1) There are so many ways to say the same thing, that (2) we have to add even more words to be precise at all. Without a verbal indexing system we (3) spend many words just setting up context for what we really want to say. And finally, (4) we pervasively add a great deal of intentionally non-informational creative and novel variability, and mood inducing color, which all require even more redundancy to maintain reliable interpretation, in order to induce our minds to maintain attention.

Our minds are active resistors of plain information!

All four factors add so much redundancy that it’s probably fair to say most of our communication (by bits, characters, words, etc.) is pure redundancy, maybe 95%, 98%, or more!

Another helpful compressor is that many facts fall among a few “reasonably expected” alternative answers. So it takes just a little biasing information to encode the right option.
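The "few reasonably expected alternatives" point can be put in numbers: picking one of k equally likely options costs log2(k) bits, and if a model already leans toward the right one, the average cost drops further. A small sketch with made-up probabilities:

```python
import math


def surprisal_bits(p: float) -> float:
    """Bits needed to encode an outcome that the model gave probability p."""
    return -math.log2(p)


def entropy_bits(probs) -> float:
    """Average bits per fact when outcomes follow the given distribution."""
    return sum(p * surprisal_bits(p) for p in probs if p > 0)


# Hypothetical numbers: a fact with four plausible answers.
uniform = [0.25, 0.25, 0.25, 0.25]   # no idea which answer is right
skewed = [0.85, 0.05, 0.05, 0.05]    # model strongly expects the first answer

open_ended = math.log2(1_000_000)    # same fact picked from a million candidates

print(f"one of a million candidates: {open_ended:.2f} bits")
print(f"one of four, no prior:       {entropy_bits(uniform):.2f} bits")
print(f"one of four, strong prior:   {entropy_bits(skewed):.2f} bits")
```

With the strong prior the average cost falls below one bit per fact, which is the sense in which "a little biasing information" is enough.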

Finally, the way we reason seems to be highly common across everything that matters to us. Even though we have yet to identify and characterize this informal human logic. So once that is modeled, that itself must compress a lot of relations significantly.

Fuzzy Logic was a first approximation attempt at modeling human “logic”. But has not been very successful.

Models should eventually help us uncover that “human logic”, by analyzing how they model it. Doing so may let us create even more efficient architectures. Perhaps significantly more efficient, and even provide more direct non-gradient/data based “thinking” design.

Nevertheless, the level of compression is astounding!

We are far less complicated cognitive machines than we imagine! Scary, but inspiring too.

I personally believe that common PCs of today, maybe even high end smart phones circa 2025, will be large enough to run future super intelligence when we get it right, given internet access to look up information.

We have just begun to compress artificial minds.

holoduke|8 months ago

Yea. Same for an 8 GB Stable Diffusion image generator. Sure, not the best quality. But there is so much information inside.

unknown|8 months ago

[deleted]

ljlolel|8 months ago

How big is Wikipedia text? Within 3X that size with 100% accuracy

phkahler|8 months ago

Google AI response says this for compressed size of wikipedia:

"The English Wikipedia, when compressed, currently occupies approximately 24 GB of storage space without media files. This compressed size represents the current revisions of all articles, but excludes media files and previous revisions of pages, according to Wikipedia and Quora."

So 3x is correct but LLMs are lossy compression.

unknown|8 months ago

[deleted]

stronglikedan|8 months ago

I've been doing the AI course on Brilliant lately, and it's mindblowing the techniques that they come up with to compress the data.

tomkaos|8 months ago

Same thing with image models. A 4 GB Stable Diffusion model can draw and represent anything humanity knows.

alternatex|8 months ago

How about a full glass of wine? Filled to the brim.

pinoy420|8 months ago

[deleted]

Workaccount2|8 months ago

I don't like the term "compression" used with transformers because it gives the wrong idea about how they function. Like that they are a search tool glued onto a .zip file, your prompts are just fancy search queries, and hallucinations are just bugs in the recall algo.

Although strictly speaking they have lots of information in a small package, they are F-tier compression algorithms because the loss is bad, unpredictable, and undetectable (i.e. a human has to check it). You would almost never use a transformer in place of any other compression algorithm for typical data compression uses.

Wowfunhappy|8 months ago

A .zip is lossless compression. But we also have plenty of lossy compression algorithms. We've just never been able to use lossy compression on text.

angusturner|8 months ago

There is an excellent talk by Jack Rae called “compression for AGI”, where he shows (what I believe to be) a little known connection between transformers and compression:

In one view, you can treat LLMs as SOTA lossless compression algorithms, where the number of weights doesn’t count towards the description length. Sounds crazy but it’s true.
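One way to see how the weights can be left out of the description length: if the decoder can rebuild the same model step by step from the data it has already decoded, the model never has to be transmitted. The sketch below shows this with a toy adaptive character model and exact arithmetic coding using fractions. It is a stand-in for the idea as described above, not the method from the talk. The alphabet, class, and function names are assumptions, and sending the message length separately is a simplification.

```python
import math
from fractions import Fraction

ALPHABET = "abcdefghijklmnopqrstuvwxyz ."


class AdaptiveModel:
    """Toy stand-in for an LLM: add-one-smoothed symbol counts, updated as it goes."""

    def __init__(self):
        self.counts = {s: 1 for s in ALPHABET}
        self.total = len(ALPHABET)

    def interval(self, symbol):
        """Return the [low, high) slice of [0, 1) currently assigned to symbol."""
        low = 0
        for s in ALPHABET:
            if s == symbol:
                return Fraction(low, self.total), Fraction(low + self.counts[s], self.total)
            low += self.counts[s]
        raise ValueError(f"symbol {symbol!r} is not in ALPHABET")

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def encode(text):
    """Narrow [0, 1) once per symbol, then pick the shortest binary fraction inside."""
    model = AdaptiveModel()
    low, width = Fraction(0), Fraction(1)
    for ch in text:
        a, b = model.interval(ch)
        low, width = low + width * a, width * (b - a)
        model.update(ch)
    nbits = 0
    while True:
        scale = 2 ** nbits
        code = math.ceil(low * scale)
        if Fraction(code, scale) < low + width:
            return code, nbits
        nbits += 1


def decode(code, nbits, length):
    """Replay the same model from scratch; no model parameters are received."""
    model = AdaptiveModel()
    value = Fraction(code, 2 ** nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        target = (value - low) / width
        for s in ALPHABET:
            a, b = model.interval(s)
            if a <= target < b:
                out.append(s)
                low, width = low + width * a, width * (b - a)
                model.update(s)
                break
    return "".join(out)


message = "the cat sat on the mat. the cat sat on the mat."
code, nbits = encode(message)
restored = decode(code, nbits, len(message))
print(f"{len(message) * 8} bits raw -> {nbits} bits coded, lossless: {restored == message}")
```

The decoder starts with nothing but the code for the model and the compressed bits, yet recovers the text exactly. In this view, what counts toward the description length is the code that builds the model plus the bits spent on the data, not the parameters the model ends up with.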