Show HN: Z80-μLM, a 'Conversational AI' That Fits in 40KB
514 points | quesomaster9000 | 2 months ago | github.com
Z80-μLM is a character-level language model with 2-bit quantized weights ({-2,-1,0,+1}) that runs on a Z80 with 64KB RAM. The entire thing (inference, weights, chat UI) fits in a 40KB .COM file that you can run in a CP/M emulator, and hopefully even on real hardware!
It won't write your emails, but it can be trained to play a stripped-down version of 20 Questions, and it is sometimes able to maintain the illusion of having simple but terse conversations with a distinct personality.
--
The extreme constraints nerd-sniped me and forced interesting trade-offs: trigram hashing (typo-tolerant, loses word order), 16-bit integer math, and some careful massaging of the training data meant I could keep the examples 'interesting'.
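The trigram-hashing idea above can be sketched roughly like this. Everything here is a toy illustration, not the project's actual code: the bucket count, the hash function, and the function name are all made up.

```python
# Toy sketch of typo-tolerant trigram hashing, as described above.
# The bucket count and hash function are hypothetical, not the
# project's actual choices.
def trigram_features(text: str, n_buckets: int = 1024) -> list[int]:
    counts = [0] * n_buckets
    padded = f"  {text.lower()} "  # pad so edge trigrams exist
    for i in range(len(padded) - 2):
        # Cheap multiplicative hash kept within 16 bits, since the
        # Z80 inference code works in 16-bit integer math.
        h = 0
        for ch in padded[i:i + 3]:
            h = (h * 31 + ord(ch)) & 0xFFFF
        counts[h % n_buckets] += 1
    return counts

# A small typo only perturbs the few buckets its trigrams touch...
a = trigram_features("are you an animal")
b = trigram_features("are you an animl")
```

Because the result is an unordered bag of hashed trigram counts, small typos only shift a handful of buckets, while most word-order information is discarded, which is the trade-off the author mentions.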
The key was quantization-aware training that accurately models the inference code limitations. The training loop runs both float and integer-quantized forward passes in parallel, scoring the model on how well its knowledge survives quantization. The weights are progressively pushed toward the 2-bit grid using straight-through estimators, with overflow penalties matching the Z80's 16-bit accumulator limits. By the end of training, the model has already adapted to its constraints, so no post-hoc quantization collapse.
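A minimal sketch of the straight-through-estimator trick described above, under loud assumptions: the toy linear model, data, and learning rate are all invented for illustration. Only the core mechanism matches the description: score the loss through the quantized forward pass, but apply the gradient to the underlying float weights as if quantization were the identity.

```python
import random

# Hedged sketch of quantization-aware training with a straight-through
# estimator (STE). Not the project's actual training loop.
LEVELS = [-2.0, -1.0, 0.0, 1.0]          # the 2-bit weight grid

def quantize(ws):
    # Snap each float weight to the nearest allowed level.
    return [min(LEVELS, key=lambda q: abs(q - w)) for w in ws]

random.seed(0)
target = [1.0, -1.0, 0.0, -2.0]          # representable on the grid
xs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
ys = [sum(t * xi for t, xi in zip(target, x)) for x in xs]

def loss(ws):
    return sum((sum(w * xi for w, xi in zip(ws, x)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

w = [random.gauss(0, 1) for _ in range(4)]   # float "shadow" weights
loss_before = loss(quantize(w))

lr = 0.05
for _ in range(300):
    wq = quantize(w)                     # forward pass sees 2-bit weights
    grad = [0.0] * 4
    for x, y in zip(xs, ys):
        err = sum(q * xi for q, xi in zip(wq, x)) - y
        for i in range(4):
            grad[i] += err * x[i] / len(xs)
    for i in range(4):
        w[i] -= lr * grad[i]             # STE: update the float weights

loss_after = loss(quantize(w))
```

The real project additionally penalizes values that would overflow the Z80's 16-bit accumulators, so the model adapts to its integer limits during training instead of collapsing under post-hoc quantization.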
Eventually I ended up spending a few dollars on Claude API to generate 20 questions data (see examples/guess/GUESS.COM), I hope Anthropic won't send me a C&D for distilling their model against the ToS ;P
But anyway, happy code-golf season everybody :)
nineteen999|2 months ago
https://i.imgur.com/6TRe1NE.png
Thank you for posting! It's unbelievable how someone sometimes just drops something that fits right into what you're doing, however bizarre it seems.
quesomaster9000|2 months ago
I developed a browser-based CP/M emulator & IDE: https://lockboot.github.io/desktop/
I was going to post that, but wanted a 'cool demo' instead, and fell down the rabbit hole.
sixtyj|2 months ago
unknown|2 months ago
[deleted]
simonjgreen|2 months ago
rahen|2 months ago
The interaction is surprisingly good despite the lack of attention mechanism and the limitation of the "context" to trigrams from the last sentence.
This could have worked on 60s-era hardware and would have completely changed the world (and science fiction) back then. Great job.
noosphr|2 months ago
Tin foil hat on: I think that a huge part of the major buyout of RAM by AI companies is to keep people from realising that we are essentially at the home-computer-revolution stage of LLMs. I have a 1TB RAM machine which, with custom agents, outperforms all the proprietary models. It's private, secure, and won't let me be monetized.
Dwedit|2 months ago
nubinetwork|2 months ago
giancarlostoro|2 months ago
Ultimately, if you can build an ultra tiny model that can talk and learn on the fly, you've just fully localized a personal assistant like Siri.
andy12_|2 months ago
[1] https://x.com/karpathy/status/1938626382248149433
fho|2 months ago
Not exactly "minimal viable", but a "what if RNNs were good for LLMs" case study.
-> insanely fast on CPUs
qingcharles|2 months ago
Dylan16807|2 months ago
I doubt it would be able to make good use of a large context window, though.
andrepd|2 months ago
dangus|2 months ago
You can buy a kid’s tiger electronics style toy that plays 20 questions.
It’s not like this LLM is a bastion of glorious efficiency; it’s just stripped down to fit on the hardware.
Slack/Teams handle company-wide video calls, can render anything a web browser can, and run an entire app store of apps, all from a cross-platform application.
Including Jira in the conversation doesn’t even make logical sense. It’s not a desktop application that consumes memory. Jira has such a wide scope that the word “Jira” doesn’t even describe a single product.
vedmakk|2 months ago
Kiboneu|2 months ago
“Planting Undetectable Backdoors in Machine Learning Models”
“ … On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees. …”
ronsor|2 months ago
It could with a network this small. More generally this falls under "interpretability."
bitwize|2 months ago
(edit: change url)
roygbiv2|2 months ago
wewewedxfgdf|2 months ago
gcanyon|2 months ago
antonvs|2 months ago
This means that a directly translated 40 KB Z80 executable might be a tight squeeze on that mainframe, because 40K > 32K, counting words, not bytes. Of course if most of that size is just 2-bit weight data then it might not be so bad.
ELIZA running on later hardware would have been a different story, with the Z80 (released in 1976) being an example.
gwern|2 months ago
orbital-decay|2 months ago
Peteragain|2 months ago
nrhrjrjrjtntbt|2 months ago
Zee2|2 months ago
Imustaskforhelp|2 months ago
dmd|2 months ago
It's just one-shot AI slop - literally, the prompt was 'make a web based version of [github url of this project]' and it spat this out. It appears to work fine.
I'll keep it up for a couple of months and then it'll be auto-deleted, no sense in keeping it around longer than that.
bartread|2 months ago
tomduncalf|2 months ago
jrdres|1 month ago
I tried on a cycle-accurate emulator of a TRS-80 Model I with Omikron CP/M mapper. Most Z-80 machines of the time were 4MHz, but the TRS-80 was only 1.77 MHz.
1. Type "GUESS", get question prompt.
2. User types: "Are you an animal?", ENTER key
3. Wait 25 seconds
4. Program prints "N"
5. Wait 20 seconds
6. Program prints "O"
7. Wait 23 seconds
8. Program prints linefeed, returns to question prompt
Total time to return 2-char answer to user's question: 1 min 9 sec or so. I bet a longer answer would take proportionally longer.
"The wonder isn't that it does it well, it's a wonder it does it at all."
gp2000|1 month ago
I think I can do a little bit better; maybe 10% faster.
MagicMoonlight|2 months ago
GuB-42|2 months ago
Quake 3 is probably the last game where you would expect a chatbot, as there are few games where storytelling matters less, and it is a little-known feature, but Quake 3 bots can react to what you say in the chat, in addition to the usual taunts.
But that's the thing: Quake 3 can do it because it is inconsequential. In a story-driven game like an RPG, NPCs have a well-defined spot in the story and gameplay; they tell you exactly what you need to know, so as not to disrupt the flow of the story. Tell you too much, and they spoil the big reveal; tell you too little, and you don't know what to do; tell you irrelevant details, and you get lost chasing them. It has to be concise and to the point, so that those who don't really care know what to do to advance the story, but with enough flavor to make the world feel alive. It is really hard to find the right balance, and if, in addition, you have to incorporate a chatbot, it borders on impossible.
It looks like a good idea on the surface, but it most likely isn't, unless it is clearly not part of the main gameplay loop, as in Quake 3.
Some people had some success using a (big) LLM as a DM in D&D, which I think is easier since it can make up the story as it advances, it is much harder to make up game elements in a computer RPG that are not programmed in.
vatary|2 months ago
acosmism|2 months ago
anonzzzies|2 months ago
alfiedotwtf|2 months ago
teaearlgraycold|2 months ago
jacquesm|2 months ago
boznz|2 months ago
fuzzfactor|2 months ago
RustyRussell|2 months ago
a_t48|2 months ago
ColonelPhantom|2 months ago
ant6n|2 months ago
Biggest pain point is likely the text input.
jasonjmcghee|2 months ago
Zardoz84|2 months ago
magicalhippo|2 months ago
Have you experimented with having it less quantized, and evaluated the quality drop?
Regardless, very cool project.
kouteiheika|2 months ago
It depends on the model, but from my experiments (quantizing one layer of a model to 2-bit and then training the model with that layer in 2-bit to fix the damage), the first layer is the most sensitive, and yes, the last layer is sensitive too. The middle layers tolerate quantization the best.
Different components of a layer also have different sensitivity; e.g. the MLP downscale block damages the model the most when quantized, while quantizing the Q projection in self-attention damages the model the least.
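A toy version of that per-layer probe might look like the following. This is a hypothetical illustration, not the actual experimental setup: the tiny tanh MLP, dimensions, and grid levels are all made up.

```python
import math
import random

# Illustrative probe: quantize one layer of a toy 3-layer tanh MLP at
# a time and measure how far the output drifts from the float model.
random.seed(1)
DIM = 8
LEVELS = [-1.0, -0.5, 0.0, 0.5]   # a toy 2-bit grid for this sketch

def rand_matrix():
    return [[random.gauss(0, 0.5) for _ in range(DIM)] for _ in range(DIM)]

def quantize(m):
    # Snap every weight in the matrix to the nearest allowed level.
    return [[min(LEVELS, key=lambda q: abs(q - w)) for w in row] for row in m]

def forward(layers, v):
    for m in layers:
        v = [math.tanh(sum(w * a for w, a in zip(col, v)))
             for col in zip(*m)]
    return v

layers = [rand_matrix() for _ in range(3)]
inputs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]
ref = [forward(layers, v) for v in inputs]

damage = []
for i in range(3):
    test_layers = list(layers)
    test_layers[i] = quantize(test_layers[i])   # quantize only layer i
    out = [forward(test_layers, v) for v in inputs]
    mse = sum((a - b) ** 2 for o, r in zip(out, ref)
              for a, b in zip(o, r)) / (len(inputs) * DIM)
    damage.append(mse)
```

Comparing the entries of `damage` gives a crude per-layer sensitivity ranking; a real experiment would, as described, also retrain with the quantized layer frozen to see how much damage can be recovered.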
coolius|2 months ago
pdyc|2 months ago
Waterluvian|2 months ago
Even with modern supercomputing the computation would be outpaced by the heat death of the universe, so token output must be limited to a single integer.
dirkt|2 months ago
lostmsu|2 months ago
DrNosferatu|2 months ago
A web version would also be cool.
Y_Y|2 months ago
integricho|2 months ago
bytesandbits|2 months ago
NooneAtAll3|2 months ago
codetiger|2 months ago
Sharlin|2 months ago
lodovic|2 months ago
alfiedotwtf|2 months ago
Speaking of - I remember my first digital camera (Fujitsu 1Mb resolution using SmartMedia)… it used so much power that you could take 20-30 photos and then needed to replace all 4 batteries lol
numpad0|2 months ago
qingcharles|2 months ago
*burns you at the stake*
devhouse|2 months ago
[deleted]