
Show HN: Z80-μLM, a 'Conversational AI' That Fits in 40KB

514 points | quesomaster9000 | 2 months ago | github.com

How small can a language model be while still doing something useful? I wanted to find out, and had some spare time over the holidays.

Z80-μLM is a character-level language model with 2-bit quantized weights ({-2,-1,0,+1}) that runs on a Z80 with 64KB RAM. The entire thing (inference, weights, chat UI) fits in a 40KB .COM file that you can run in a CP/M emulator, and hopefully even on real hardware!
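For scale: 2-bit weights pack four to a byte. A minimal sketch of what such packing might look like (the bit codes and helper names here are hypothetical, not the project's actual on-disk format, and assume the weight count is a multiple of 4):

```python
# Hypothetical 2-bit packing for the weight grid {-2, -1, 0, +1}:
# four weights per byte, most significant pair first.
CODE = {-2: 0b00, -1: 0b01, 0: 0b10, 1: 0b11}
DECODE = {v: k for k, v in CODE.items()}

def pack(weights):
    """Pack a list of grid weights (length a multiple of 4) into bytes."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for w in weights[i:i + 4]:
            b = (b << 2) | CODE[w]
        out.append(b)
    return bytes(out)

def unpack(data, n):
    """Recover the first n weights from packed bytes."""
    ws = []
    for b in data:
        for shift in (6, 4, 2, 0):
            ws.append(DECODE[(b >> shift) & 0b11])
    return ws[:n]

ws = [-2, -1, 0, 1, 1, 0, -1, -2]
assert unpack(pack(ws), len(ws)) == ws   # round-trips losslessly
```

At this density, ~30KB of the .COM file could hold well over 100K weights.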

It won't write your emails, but it can be trained to play a stripped down version of 20 Questions, and is sometimes able to maintain the illusion of having simple but terse conversations with a distinct personality.

--

The extreme constraints nerd-sniped me and forced interesting trade-offs: trigram hashing (typo-tolerant, but loses word order), 16-bit integer math, and some careful massaging of the training data to keep the examples 'interesting'.
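Trigram hashing here means something like the following sketch (bucket count and hash are illustrative, not the project's actual code): each character trigram is hashed into a fixed-size bag of counts, so a single typo only corrupts the few trigrams that touch it, while word order is discarded because only counts survive.

```python
# Illustrative trigram-hashing featurizer; the hash function and bucket
# count are assumptions, not taken from the project.
def trigram_features(text: str, n_buckets: int = 256) -> list[int]:
    """Hash each character trigram into a fixed-size bag-of-features vector."""
    text = text.lower()
    feats = [0] * n_buckets
    for i in range(len(text) - 2):
        h = 0
        for ch in text[i:i + 3]:
            h = (h * 31 + ord(ch)) & 0xFFFF   # simple multiplicative hash
        feats[h % n_buckets] += 1
    return feats

a = trigram_features("are you an animal")
b = trigram_features("are you an aminal")   # 'animal' with a transposition typo
overlap = sum(min(x, y) for x, y in zip(a, b))
print(f"{overlap}/{sum(a)} trigram features survive the typo")
```

Only the trigrams overlapping the swapped letters differ, so most of the feature vector is unchanged, which is where the typo tolerance comes from.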

The key was quantization-aware training that accurately models the inference code limitations. The training loop runs both float and integer-quantized forward passes in parallel, scoring the model on how well its knowledge survives quantization. The weights are progressively pushed toward the 2-bit grid using straight-through estimators, with overflow penalties matching the Z80's 16-bit accumulator limits. By the end of training, the model has already adapted to its constraints, so no post-hoc quantization collapse.
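The straight-through-estimator step can be sketched in a few lines of NumPy. This is a toy single linear layer in the spirit of the description, not the project's training code; the shapes, data, and learning rate are made up, and the 16-bit overflow penalty is omitted for brevity.

```python
import numpy as np

GRID = np.array([-2.0, -1.0, 0.0, 1.0])

def quantize(w):
    """Snap each float weight to the nearest point on the 2-bit grid."""
    return GRID[np.abs(w[..., None] - GRID).argmin(axis=-1)]

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, (4, 8))                # float "shadow" weights
X = rng.normal(0.0, 1.0, (32, 8))               # toy inputs
Y = X @ quantize(rng.normal(0.0, 0.7, (8, 4)))  # targets reachable on the grid

def loss(Wq):
    return float(np.mean((X @ Wq.T - Y) ** 2))

initial = loss(quantize(W))
for _ in range(500):
    Wq = quantize(W)            # forward pass uses the quantized weights,
    err = X @ Wq.T - Y          # exactly as the integer inference would
    grad = err.T @ X / len(X)   # gradient w.r.t. Wq ...
    W -= 0.05 * grad            # ... applied "straight through" to W

final = loss(quantize(W))
print(f"loss: {initial:.3f} -> {final:.3f}")
```

The float weights drift freely, but the loss is always scored on their quantized shadow, so by the end there is nothing left to lose to post-hoc quantization.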

Eventually I ended up spending a few dollars on Claude API to generate 20 questions data (see examples/guess/GUESS.COM), I hope Anthropic won't send me a C&D for distilling their model against the ToS ;P

But anyway, happy code-golf season everybody :)

122 comments


nineteen999|2 months ago

This couldn't be more perfectly timed... I have an Unreal Engine game with both VT100 terminals (for running coding agents) and Z80 emulators, and a serial bridge that allows coding agents to program the CP/M machines:

https://i.imgur.com/6TRe1NE.png

Thank you for posting! It's unbelievable how someone sometimes just drops something that fits right into what you're doing. However bizarre it seems.

quesomaster9000|2 months ago

Oh dear, it seems we've... somehow been psychically linked...

I developed a browser-based CP/M emulator & IDE: https://lockboot.github.io/desktop/

I was going to post that, but wanted a 'cool demo' instead, and fell down the rabbit hole.

sixtyj|2 months ago

Connections: Alternative History of Technology by James Burke documents these "coincidences".

simonjgreen|2 months ago

Super intrigued but annoyingly I can’t view imgur here

rahen|2 months ago

I love it, instant Github star. I wrote an MLP in Fortran IV for a punched card machine from the sixties (https://github.com/dbrll/Xortran), so this really speaks to me.

The interaction is surprisingly good despite the lack of attention mechanism and the limitation of the "context" to trigrams from the last sentence.

This could have worked on 60s-era hardware and would have completely changed the world (and science fiction) back then. Great job.

noosphr|2 months ago

Stuff like this is fascinating. Truly the road not taken.

Tin foil hat on: I think that a huge part of the major buyout of RAM by AI companies is to keep people from realising that we are essentially at the home-computer-revolution stage of LLMs. I have a 1TB RAM machine which, with custom agents, outperforms all the proprietary models. It's private, secure, and won't let me be monetized.

Dwedit|2 months ago

In before AI companies buy up all the Z80s and raise the prices to new heights.

nubinetwork|2 months ago

Too late, they stopped being available last year.

giancarlostoro|2 months ago

This is something I've been wondering about myself. What's the "Minimally Viable LLM" that can have simple conversations. Then my next question is, how much can we push it so it can learn from looking up data externally, can we build a tiny model with an insanely larger context window? I have to assume I'm not the only one who has asked or thought of these things.

Ultimately, if you can build an ultra tiny model that can talk and learn on the fly, you've just fully localized a personal assistant like Siri.

fho|2 months ago

You might be interested in RWKV: https://www.rwkv.com/

Not exactly "minimally viable", but a "what if RNNs were good for LLMs" case study.

-> insanely fast on CPUs

qingcharles|2 months ago

I think what's amazing to speculate is how we could have had some very basic LLMs in at least the 90s if we'd invented the tech previously. I wonder what the world would be like now if we had?

Dylan16807|2 months ago

For your first question, the LLM someone built in Minecraft can handle simple conversations with 5 million weights, mostly 8 bits.

I doubt it would be able to make good use of a large context window, though.

andrepd|2 months ago

We should show this every time a Slack/Teams/Jira engineer tries to explain to us why a text chat needs 1.5GB of ram to start up.

dangus|2 months ago

> It won't write your emails, but it can be trained to play a stripped down version of 20 Questions, and is sometimes able to maintain the illusion of having simple but terse conversations with a distinct personality.

You can buy a kid’s tiger electronics style toy that plays 20 questions.

It’s not like this LLM is a bastion of glorious efficiency; it’s just stripped down to fit on the hardware.

Slack/Teams handles company-wide video calls and can render anything a web browser can, and they run an entire App Store of apps, all from a cross-platform application.

Including Jira in the conversation doesn’t even make logical sense. It’s not a desktop application that consumes memory. Jira has such a wide scope that the word “Jira” doesn’t even describe a single product.

vedmakk|2 months ago

If one were to train an actual secret (e.g. a passphrase) into such a model, one that a user would need to guess by asking the right questions: could this secret be easily reverse-engineered / inferred by having access to the model's weights, or would it be safe to assume that one could only get to the secret by asking the right questions?

Kiboneu|2 months ago

I don’t know, but your question reminds me of this paper which seems to address it on a lower level: https://arxiv.org/abs/2204.06974

“Planting Undetectable Backdoors in Machine Learning Models”

“ … On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees. …”

ronsor|2 months ago

> this secret be easily reverse engineered / inferred by having access to models weights

It could with a network this small. More generally this falls under "interpretability."

roygbiv2|2 months ago

Awesome. I've just designed and built my own Z80 computer, though right now it has 32KB ROM and 32KB RAM. This will definitely change on the next revision, so I'll be sure to try it out.

wewewedxfgdf|2 months ago

RAM is very expensive right now.

gcanyon|2 months ago

So it seems like with the right code (and maybe a ton of future infrastructure for training?) Eliza could have been much more capable back in the day.

antonvs|2 months ago

The original ELIZA ran on an IBM 7094 mainframe, in the 1960s. That machine had 32K x 36-bit words, and no support for byte operations. It did support 6-bit BCD characters, packed 6 per word, but those were for string operations, and didn't support arithmetic or logical operations.

This means that a directly translated 40 KB Z80 executable might be a tight squeeze on that mainframe, because 40K > 32K, counting words, not bytes. Of course if most of that size is just 2-bit weight data then it might not be so bad.

ELIZA running on later hardware would have been a different story, with the Z80 - released in 1976 - being an example.
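The capacity comparison above can be checked with quick arithmetic (the ~30KB weight figure below is a guess for illustration, not from the project):

```python
# IBM 7094 memory: 32K words of 36 bits, with no byte addressing.
words = 32 * 1024
bits_per_word = 36
total_bytes = words * bits_per_word // 8
print(total_bytes)          # 147456 bytes, i.e. 144 KB of raw bits

# Naive translation, one Z80 byte per 36-bit word: doesn't fit.
com_bytes = 40 * 1024
print(com_bytes > words)    # True: 40K bytes need more than 32K words

# But 2-bit weights pack 18 per 36-bit word (4 per byte):
weight_bytes = 30 * 1024    # hypothetical: ~30KB of the .COM is weights
weight_words = (weight_bytes * 4 + 17) // 18
print(weight_words)         # only a few thousand words for all the weights
```

So the squeeze is entirely in the code translation; dense weight data is, if anything, easier to fit on the word-addressed machine.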

gwern|2 months ago

So if it's not using attention, and it processes the entire input into an embedding in one go, I guess this is neither a Transformer nor an RNN but just an MLP?

orbital-decay|2 months ago

Pretty cool! I wish free-input RPGs of old had fuzzy matchers. They worked by exact keyword matching and it was awkward. I think the last game of that kind (where you could input arbitrary text when talking to NPCs) was probably Wizardry 8 (2001).

Peteragain|2 months ago

There are two things happening here. A really small LLM mechanism which is useful for thinking about how the big ones work, and a reference to the well known phenomenon, commonly dismissively referred to as a "trick", in which humans want to believe. We work hard to account for what our conversational partner says. Language in use is a collective cultural construct. By this view the real question is how and why we humans understand an utterance in a particular way. Eliza, Parry, and the Chomsky bot at http://chomskybot.com work on this principle. Just sayin'.

Zee2|2 months ago

This is super cool. Would love to see a Z80 simulator set up with these examples to play with!

dmd|2 months ago

https://3e.org/private/z80ulmweb/

It's just one-shot AI slop - literally, the prompt was 'make a web based version of [github url of this project]' and it spat this out. It appears to work fine.

I'll keep it up for a couple of months and then it'll be auto-deleted, no sense in keeping it around longer than that.

bartread|2 months ago

This is excellent. Thing I’d like to do if I had time: get it running on a 48K Spectrum. 10 year old me would have found that absolutely magical back in the 1980s.

tomduncalf|2 months ago

This was my first thought too haha. That would be mind blowing

jrdres|1 month ago

It runs, but it would be very slow on actual hardware.

I tried on a cycle-accurate emulator of a TRS-80 Model I with Omikron CP/M mapper. Most Z-80 machines of the time were 4MHz, but the TRS-80 was only 1.77 MHz.

1. Type "GUESS", get question prompt.

2. User types: "Are you an animal?", ENTER key

3. Wait 25 seconds

4. Program prints "N"

5. Wait 20 seconds

6. Program prints "O"

7. Wait 23 seconds

8. Program prints linefeed, returns to question prompt

Total time to return 2-char answer to user's question: 1 min 9 sec or so. I bet a longer answer would take proportionally longer.

"The wonder isn't that it does it well, it's a wonder it does it at all."

gp2000|1 month ago

Though it'll still be kinda slow on a Model I, I've written Z-80 code for the network evaluation that's about 9 times faster. I imagine the pull request will end up in the main repo, but for now you can find it in https://github.com/gp48k/z80ai

I think I can do a little bit better; maybe 10% faster.

MagicMoonlight|2 months ago

What I really want is a game where each of the NPCs has a tiny model like this, so you can actually talk to them.

GuB-42|2 months ago

I thought about this, chatbots existed well before LLMs (Eliza: 1966!) and the only time I have seen a commercially successful game with a (very simple) chatbot was Quake III Arena!

Quake 3 is probably the last game where you would expect a chatbot, as there are few games where storytelling matters less, and it is a very little-known feature; but Quake 3 bots can react to what you say in the chat, in addition to the usual taunts.

But that's the thing: Quake 3 can do it because it is inconsequential. In a story-driven game like an RPG, NPCs have a well-defined spot in the story and gameplay; they tell you exactly what you need to know, so as not to disrupt the flow of the story. Tell you too much, and they will spoil the big reveal; tell you too little, and you don't know what to do; tell you irrelevant details, and you get lost chasing them. It has to be concise and to the point, so that those who don't really care know what to do to advance the story, but with enough flavor to make the world feel alive. It is really hard to find the right balance, and if, in addition, you have to incorporate a chatbot, it borders on impossible.

It looks like a good idea on the surface, but it most likely isn't, unless it is clearly not part of the main gameplay loop, as in Quake 3.

Some people have had some success using a (big) LLM as a DM in D&D, which I think is easier since it can make up the story as it advances; it is much harder to make up game elements in a computer RPG that are not programmed in.

vatary|2 months ago

It's pretty obvious this is just a stress test for compressing and running LLMs. It doesn't have much practical use right now, but it shows us that IoT devices are gonna have built-in LLMs really soon. It's a huge leap in intelligence—kind of like the jump from apes to humans. That is seriously cool.

acosmism|2 months ago

I'll echo that practicality only surfaces once it is apparent what can be done. Yeah, this feels like a "running DOOM on pregnancy test devices" type of moment.

anonzzzies|2 months ago

Luckily I have a very large amount of MSX computers, zx, amstrad cpc etc and even one multiprocessor z80 cp/m machine for the real power. Wonder how gnarly this is going to perform with bankswitching though. Probably not good.

jacquesm|2 months ago

Between this and RAM prices Zilog stock must be up! Awesome hack. Now apply the same principles to a laptop and take a megabyte or so, see what that does.

boznz|2 months ago

Great work. What is your timeline to AGI ?

fuzzfactor|2 months ago

Can't possibly be further than just around the corner.

a_t48|2 months ago

Nice - that will fit on a Gameboy cartridge, though bank switching might make it super terrible to run. Each bank is only 16k. You can have a bunch of them, but you can only access one bank at a time (well, technically two - bank 0 is IIRC always accessible).

ColonelPhantom|2 months ago

Each layer of the LM is also at most 16 KiB, so if you want to minimize bank switching, I think making sure each layer is in one bank would be enough? Bank switching shouldn't give much overhead anyway unless it complicates an inner loop, which would be avoided if no layers are split across banks.
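The layout suggested above amounts to first-fit bin packing with a "no layer crosses a bank boundary" constraint. A sketch, with hypothetical layer sizes:

```python
BANK_SIZE = 16 * 1024   # Game Boy switchable-bank size

def assign_banks(layer_sizes):
    """First-fit: place each layer wholly inside one 16 KiB bank so the
    inner loop never has to switch banks mid-layer. Sizes are hypothetical."""
    banks = []       # free bytes remaining in each bank
    placement = []   # bank index chosen for each layer
    for size in layer_sizes:
        assert size <= BANK_SIZE, "a layer must fit in a single bank"
        for i, free in enumerate(banks):
            if free >= size:
                banks[i] -= size
                placement.append(i)
                break
        else:
            banks.append(BANK_SIZE - size)
            placement.append(len(banks) - 1)
    return placement, len(banks)

placement, n_banks = assign_banks([14336, 9216, 9216, 4096])
print(placement, n_banks)   # [0, 1, 2, 1] 3
```

With layers placed like this, the bank register only changes between layers, outside the hot multiply-accumulate loop.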

ant6n|2 months ago

You have 32KB of ROM, plus 8KB of RAM, on the original Game Boy. The Game Boy Color has more. Bank switching is super fast as well. Given that the model weights are likely streamed, I doubt bank switching is a problem.

Biggest pain point is likely the text input.

jasonjmcghee|2 months ago

For future projects and/or for this project, there are many LLMs available more than good enough to generate that kind of synthetic data (20 Qs) with permissive terms of use. (So you don’t need to stress about breaking TOS / C&D etc)

Zardoz84|2 months ago

Meanwhile, Eliza was ported to BASIC and was run on many home computers in the 80s.

magicalhippo|2 months ago

As far as I know, the last layer is very quantization-sensitive, and is typically not quantized, or quantized lightly.

Have you experimented with having it less quantized, and evaluated the quality drop?

Regardless, very cool project.

kouteiheika|2 months ago

(Not OP)

It depends on the model, but from my experiments (quantizing one layer of a model to 2-bit and then training the model with that layer in 2-bit to fix the damage), the first layer is the most sensitive, and yes, the last layer is sensitive too. The middle layers take quantization the best.

Different components of a layer also have a different sensitivity; e.g. the MLP downscale block damages the model the most when quantized, while quantizing the Q projection in self attention damages the model the least.
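The shape of such an experiment can be sketched on a toy network: quantize exactly one layer at a time and measure how much the output moves. This is purely illustrative (random untrained weights, symmetric uniform quantization); the real measurements above were on trained models.

```python
import numpy as np

rng = np.random.default_rng(1)
layers = [rng.normal(0.0, 0.3, (16, 16)) for _ in range(4)]
X = rng.normal(0.0, 1.0, (64, 16))

def quantize(w, bits=2):
    """Symmetric uniform quantization to 2**bits levels."""
    levels = 2 ** bits
    scale = np.abs(w).max() / (levels // 2)
    return np.clip(np.round(w / scale), -levels // 2, levels // 2 - 1) * scale

def forward(ws):
    h = X
    for w in ws[:-1]:
        h = np.maximum(h @ w, 0.0)   # ReLU hidden layers
    return h @ ws[-1]

ref = forward(layers)
errs = []
for i in range(len(layers)):
    ws = list(layers)
    ws[i] = quantize(ws[i])          # quantize exactly one layer
    errs.append(float(np.mean((forward(ws) - ref) ** 2)))
    print(f"layer {i}: output MSE {errs[-1]:.5f}")
```

On trained models, per-layer (and per-component) runs of this kind are what surface the first/last-layer sensitivity described above.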

coolius|2 months ago

This is impressive; those are some very restrictive requirements. I wonder what we could run on more powerful hardware such as an ESP32 or RP2040. Has anyone tried this?

pdyc|2 months ago

Interesting. I'm wondering how far it can go if we remove some of these limitations but try to solve an extremely specific problem, like generating a regex based on user input. I know small models (270M range) can do that, but can it be done in, say, the <10MB range?

Waterluvian|2 months ago

Generate an LLM that is designed to solve one extremely specific problem: answering the ultimate question of life, the universe, and everything.

Even with modern supercomputing the computation would be outpaced by the heat death of the universe, so token output must be limited to a single integer.

dirkt|2 months ago

Eliza's granddaughter.

lostmsu|2 months ago

Did you train the model with quantization awareness? How?

DrNosferatu|2 months ago

Awesome! Anyone for a port to the MSX?

A web version would also be cool.

Y_Y|2 months ago

Very cool. Did you consider using sparse weights?

integricho|2 months ago

Someone add it to collapseos please :)

codetiger|2 months ago

Imagine this working on a Game Boy back in those days. It would've sounded like magic.

Sharlin|2 months ago

I don’t think this could beat an ELIZA-style bot in how magical it feels, given the extreme terseness of its replies.

lodovic|2 months ago

I love these thought experiments. Looking at the code size, it would have been possible for someone to come up with this back in the day, similar to the idea of a million monkeys on typewriters eventually producing Shakespeare.

alfiedotwtf|2 months ago

And would have lasted 3 minutes.

Speaking of - I remember my first digital camera (Fujitsu 1Mb resolution using SmartMedia)… it used so much power that you could take 20-30 photos and then needed to replace all 4 batteries lol

numpad0|2 months ago

Flip phones had predictive texts since forever. LLMs are just* supercharged predi[ctive text algorithms are computer algorithms that are]

qingcharles|2 months ago

"Look, my Game Boy passes the Turing Test!"

*burns you at the stake*