
Microgpt

1576 points| tambourine_man | 19 hours ago |karpathy.github.io

273 comments

teleforce|14 hours ago

Someone has modified microgpt to build a tiny GPT that generates Korean first names, and created a web page that visualizes the entire process [1].

Users can interactively explore the microgpt pipeline end to end, from tokenization until inference.

[1] English GPT lab:

https://ko-microgpt.vercel.app/

camkego|31 minutes ago

I have no affiliation with the website, but the website is pretty neat if you are learning LLM internals. It explains: Tokenization, Embedding, Attention, Loss & Gradient, Training, Inference and comparison to "Real GPT"

Pretty nifty. Even if you are not interested in the Korean language

geokon|11 hours ago

> What’s the deal with “hallucinations”? The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data.

Extremely naive question, but could LLM output be tagged with some kind of confidence score? Like if I'm asking an LLM some question, does it have an internal metric for how confident it is in its output? LLM outputs seem inherently rarely of the form "I'm not really sure, but maybe this XXX" - but I always felt this is baked in the model somehow

chongli|2 minutes ago

Having a confidence score isn't as useful as it seems unless you (the user) know a lot about the contents of the training set.

Think of traditional statistics. Suppose I said "80% of those sampled preferred apples to oranges, and my 95% confidence interval is within +/- 2% of that" but then I didn't tell you anything about how I collected the sample. Maybe I was talking to people at an apple pie festival? Who knows! Without more information on the sampling method, it's hard to make any kind of useful claim about a population.

This is why I remain so pessimistic about LLMs as a source of knowledge. Imagine you had a person who was raised from birth in a completely isolated lab environment and taught only how to read books, including the dictionary. They would know how all the words in those books relate to each other but know nothing of how that relates to the world. They could read the line "the killer drew his gun and aimed it at the victim" but what would they really know of it if they'd never seen a gun?

andy12_|11 hours ago

The model could report the confidence of its output distribution, but it isn't necessarily calibrated (that is, even if it tells you that it's 70% confident, it doesn't mean that it is right 70% of the time). Famously, pre-trained base models are calibrated, but they stop being calibrated when they are post-trained to be instruction-following chatbots [1].

Edit: There is also some other work that points out that chat models might not be calibrated at the token level, but might be calibrated at the concept level [2]. That means that if you sample many answers and group them by semantic similarity, the share of samples in each group is calibrated. The problem is that generating many answers and grouping them is more costly.

[1] https://arxiv.org/pdf/2303.08774 Figure 8

[2] https://arxiv.org/pdf/2511.04869 Figure 1.
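
To make "calibrated" concrete, here is a minimal sketch of one common way to measure it, expected calibration error: bucket answers by stated confidence and compare each bucket's average confidence with its observed accuracy. The function name, the bin count, and the data are illustrative assumptions, not taken from either paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Made-up data: ten answers, each reported with 70% confidence.
calibrated = expected_calibration_error([0.7] * 10, [True] * 7 + [False] * 3)
overconfident = expected_calibration_error([0.7] * 10, [True] * 4 + [False] * 6)
print(calibrated, overconfident)
```

A model that says 70% and is right 7 times out of 10 scores close to zero; one that says 70% and is right 4 times out of 10 scores about 0.3.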

jorvi|2 hours ago

> I'm not really sure, but maybe this XXX

You never see this in the response but you do in the reasoning.

DavidSJ|11 hours ago

Yes, the actual LLM returns a probability distribution, which gets sampled to produce output tokens.

[Edit: but to be clear, for a pretrained model this probability means "what's my estimate of the conditional probability of this token occurring in the pretraining dataset?", not "how likely is this statement to be true?" And for a post-trained model, the probability really has no simple interpretation other than "this is the probability that I will output this token in this situation".]
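
A minimal sketch of what "returns a probability distribution, which gets sampled" looks like in code. The logits are made-up values for a hypothetical 4-token vocabulary, and `sample_token` is an illustrative helper rather than anything from microgpt. The probability it returns is only the model's probability of emitting that token, as the edit above stresses.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng):
    """Sample one token index and return it with the probability assigned to it."""
    probs = softmax(logits)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs[idx]

# Hypothetical logits over a 4-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)
token, prob = sample_token(logits, rng)
print(token, round(prob, 3))
```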

podnami|11 hours ago

I would assume this is from case to case, such as:

- How aligned has it been to “know” that something is true (eg ethical constraints)

- Statistical significance and just being able to corroborate one alternative in its training data more strongly than another

- If it’s a web search related query, is the statement from original sources vs synthesised from say third party sources

But I’m just a layman and could be totally off here.

Lionga|11 hours ago

The LLM has an internal "confidence score" but that has NOTHING to do with how correct the answer is, only with how often the same words came together in training data.

E.g. getting two r's in strawberry could very well have a very high "confidence score" while a random but rare correct fact might very well have a very low one.

In short: LLMs have no concept of truth, or even a desire to produce it.

subset|16 hours ago

I had good fun transliterating it to Rust as a learning experience (https://github.com/stochastical/microgpt-rs). The trickiest part was working out how to represent the autograd graph data structure with Rust types. I'm finalising some small tweaks to make it run in the browser via WebAssembly and then compile it up for my blog :) Andrej's code is really quite poetic, I love how much it packs into such a concise program

amelius|5 hours ago

Storing the partial derivatives into the weights structure is quite the hack, to be honest. But everybody seems to do it like that.
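
For readers who have not seen the pattern being described: scalar autograd engines commonly keep each parameter's gradient as a field on the same object as its value. A minimal sketch, using a hypothetical `Value` class written for this illustration (not copied from the microgpt source), supporting only addition and multiplication:

```python
class Value:
    """A scalar that stores both its data and the gradient of the loss w.r.t. it."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0  # the partial derivative lives next to the weight
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for child in v._children:
                    visit(child)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad

w, x, b = Value(3.0), Value(2.0), Value(1.0)
loss = w * x + b
loss.backward()
print(w.grad, x.grad, b.grad)  # d(loss)/dw = x, d(loss)/dx = w, d(loss)/db = 1
```

The convenience is that an optimizer step can walk one list of parameters and read `p.data` and `p.grad` together; the cost is that gradients must be reset between steps.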

hei-lima|10 hours ago

Great work! Might do it too in some other language...

red_hare|16 hours ago

This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html

tomjakubowski|1 hour ago

I believe that Backbone's annotated source is generated with Docco, another project from the creator of CoffeeScript.

https://ashkenas.com/docco/

It's really neat. I wish I published more of my code this way.

altcognito|16 hours ago

ask a high end LLM to do it

la_fayette|11 hours ago

This guy is so amazing! With his video and the code base I really have the feeling I understand gradient descent, back propagation, chain rule etc. Reading math only just confuses me, together with the code it makes it so clear! It feels like a lifetime achievement for me :-)

mentos|11 hours ago

Curious if you could try to explain it. It’s my goal to sit down with it and attempt to understand it intuitively.

Karpathy says if you want to truly understand something then you also have to attempt to teach it to someone else ha

astroanax|2 hours ago

I feel it's wrong to call it microgpt, since it's smaller than nanogpt, so maybe picogpt would have been a better name? nice project tho

growingswe|14 hours ago

Great stuff! I wrote an interactive blogpost that walks through the code and visualizes it: https://growingswe.com/blog/microgpt

O4epegb|1 hour ago

> By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset.

All 4 are in the dataset, btw

evntdrvn|8 hours ago

You should totally submit that to HN as an article, if you haven't already.

joenot443|7 hours ago

This is awesome! Normally I'm pretty critical of LLM-assisted-blogging, but this one's a real winner.

spinningslate|10 hours ago

That’s beautifully done, thanks for posting. As helpful again to an ML novice like me as Karpathy’s original.

evntdrvn|8 hours ago

really nice, thanks

kuberwastaken|13 hours ago

I'm half shocked this wasn't on HN before? Haha I built PicoGPT as a minified fork with <35 lines of JS and another in python

And it's small enough to run from a QR code :) https://kuber.studio/picogpt/

You can quite literally train a micro LLM from your phone's browser

iberator|13 hours ago

[flagged]

jonjacky|51 minutes ago

I wonder if such a small GPT exhibits plagiarism. Are some of the generated names the same as names in the input data?
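
One way to check this is a plain set-membership test between the sampled names and the training list. A minimal sketch; the names below are invented stand-ins, and in practice `training_names` would be read from whatever file the model was trained on.

```python
def split_by_novelty(generated, training_names):
    """Partition generated names into those seen in training and those that are new."""
    seen = {name.strip().lower() for name in training_names}
    copies = [n for n in generated if n.strip().lower() in seen]
    novel = [n for n in generated if n.strip().lower() not in seen]
    return copies, novel

# Hypothetical stand-ins for the training data and for sampled output.
training_names = ["emma", "olivia", "liam"]
generated = ["emma", "liora", "liam", "oliam"]
copies, novel = split_by_novelty(generated, training_names)
print(copies, novel)
```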

etothet|7 hours ago

Even if you have some basic understanding of how LLMs work, I highly recommend Karpathy’s intro to LLMs videos on YouTube.

- https://m.youtube.com/watch?v=7xTGNNLPyMI - https://m.youtube.com/watch?v=EWvNQjAaOHw

arvid-lind|5 hours ago

thanks for the recommendations. it seems like i keep coming back to the basics of how i interact with LLMs and how they work to learn the new stuff. every time i think i understand, someone else explaining their approach usually makes me think again about how it all works.

trying my best to keep up with what and how to learn and threads like this are dense with good info. feel like I need an AI helper to schedule time for my youtube queue at this point!

znnajdla|14 hours ago

Super useful exercise. My gut tells me that someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value, and then training LLMs won’t just be for billion dollar companies. Imagine, for example, a hyper-focused model for a specific programming framework (e.g. Laravel, Django, NextJS) trained only on open-source repositories and documentation and carefully optimized with a specialized harness for one task only: writing code for that framework (perhaps in tandem with a commodity frontier model). Could a single programmer or a small team on a household budget afford to train a model that works better/faster than OpenAI/Anthropic/DeepSeek for specialized tasks? My gut tells me this is possible; and I have a feeling that this will become mainstream, and then custom model training becomes the new “software development”.

allovertheworld|9 hours ago

It just doesn’t work that way, LLMs need to be generalised a lot to be useful even in specific tasks.

It really is the antithesis to the human brain, where it rewards specific knowledge

teleforce|14 hours ago

This is possible but not for training but fine-tuning the existing open source models.

This can be mainstream, and then custom model fine-tuning becomes the new “software development”.

Please check out this new fine-tuning method for LLM by MIT and ETH Zurich teams that used a single NVIDIA H200 GPU [1], [2], [3].

Full fine-tuning of the entire model’s parameters were performed based on the Hugging Face TRL library.

[1] MIT's new fine-tuning method lets LLMs learn new skills without losing old ones (news):

https://venturebeat.com/orchestration/mits-new-fine-tuning-m...

[2] Self-Distillation Enables Continual Learning (paper):

https://arxiv.org/abs/2601.19897

[3] Self-Distillation Enables Continual Learning (code):

https://self-distillation.github.io/SDFT.html

ManlyBread|8 hours ago

>someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value

You've just reinvented machine learning

willio58|13 hours ago

Hank Green in collaboration with Cal Newport just released a video where Cal makes the argument for exactly that, that for many reasons not least being cost, smaller more targeted models will become more popular for the foreseeable future. Highly recommend this long video posted today https://youtu.be/8MLbOulrLA0

ghm2199|7 hours ago

Economics of producing goods (software code) would dictate that the world would settle on a new price per net new "unit" of code and the production pipeline (some weird unrecognizable LLM/human combination) to go with it. The price can go to near zero since the software pipeline could be just AI, and engineers would be brought in as needed (right now AI is introduced as needed and humans still build the bulk of the system). This would actually mean software engineering does not exist as you know it today; it would become a lot more like a vocation with narrower defined training/skill needed than now. It would be more like how a plumber operates: he comes and fixes things once in a while as needed. He actually does not understand fluid dynamics and structural engineering. The building runs on auto 99% of the time.

Put it another way: Do you think people will demand masses of _new_ code just because it becomes cheap? I don't think so. It's just not clear what this would mean even 1-3 years from now for software engineering.

This round of LLM-driven optimizations is really and purely about building a monopoly on _labor replacement_ (Anthropic's and OpenAI's code and cowork tools) until there is clear evidence to the contrary: a Jevons-paradox-style massive demand explosion. I don't see that happening for software. If it were true, and maybe it will still take a few quarters longer, SaaS companies' stocks would go through the roof (I mean they are already tooling up as we speak, SAP is not gonna just sit on its ass and wait for a garage shop to eat their lunch).

asim|12 hours ago

This is my gut feeling also. I forked the project and got Claude to rewrite it in Go as a form of exploration. For a long time I've felt smaller useful models could exist and they could also be interconnected and routed via something else if needed but also provide streaming for real time training or evolution. The large scale stuff will be dominated by the huge companies but the "micro" side could be just as valuable.

killerstorm|8 hours ago

You're missing the point.

Karpathy has other projects, e.g. : https://github.com/karpathy/nanochat

You can train a model with GPT-2 level of capability for $20-$100.

But, guess what, that's exactly what thousands of AI researchers have been doing for the past 5+ years. They've been training smallish models. And while these smallish models might be good for classification and whatnot, people strongly prefer big-ass frontier models for code generation.

the_arun|14 hours ago

If we can run them on commodity hardware with cpus, nothing like it

otabdeveloper4|14 hours ago

We had good small language models for decades. (E.g. BERT)

The entire point of LLMs is that you don't have to spend money training them for each specific case. You can train something like Qwen once and then use it to solve whatever classification/summarization/translation problem in minutes instead of weeks.

npn|14 hours ago

what gut? we are already doing that. there are a lot of "tiny" LLMs that are useful: M$ Phi-4, Gemma 3/3n, Qwen 7B... There are even smaller models like Gemma 270M, which is fine-tuned for function calls.

they are not flourishing yet for a simple reason: the frontier models are still improving. currently it is better to use frontier models than to train or fine-tune one on our own, because by the time we complete the model the world has already moved forward.

heck even distillation is a waste of time and money because newer frontier models yield better outputs.

you can expect that the landscape will change drastically in the next few years when the proprietary frontier models stop having huge improvements every version upgrade.

maipen|9 hours ago

That would only produce a model that you can ask questions to.

freakynit|15 hours ago

Is there something similar for diffusion models? By the way, this is incredibly useful for learning in depth the core of LLMs.

vadimf|4 hours ago

I’m 100% sure the future consists of many models running on device. LLMs will be the mobile apps of the future (or a different architecture, but still intelligence).

ajnin|4 hours ago

The future right now looks more like everything in remote datacenters, no autonomous capabilities and no control by the user. But I like yours better.

pizzafeelsright|2 hours ago

This is the path forward, with some overhead.

1. Generic model that calls other highly specific, smaller, faster models. 2. Models loaded on demand, some black box and some open. 3. There will be a Rust model specifically for Rust (or whatever language) tasks.

In about 5-8 years we will have personalized models based upon all our previous social/medical/financial data that will respond as we would, a clone, capable of making decisions similar with direction of desired outcomes.

The big remaining blocker is that generic model that can be imprinted with specifics and rebuilt nightly. Excluding the training material but the decision making, recall, and evaluation model. I am curious if someone is working on that extracted portion that can be just a 'thinking' interface.

coldtea|1 hour ago

If anything, memory ain't getting cheaper, disks aren't either, and as for graphics cards, forget it.

People won't be competing with even a current 2026 SOTA from their home LLM anytime soon. Even actual SOTA LLM providers are not competing either - they're losing money on energy and costs, hoping to make it up on market capture and win the IPO races.

0xbadcafebee|16 hours ago

Since this post is about art, I'll embed here my favorite LLM art: the IOCCC 2024 prize winner in bot talk, from Adrian Cable (https://www.ioccc.org/2024/cable1/index.html), minus the stdlib headers:

  #define a(_)typedef _##t
  #define _(_)_##printf
  #define x f(i,
  #define N f(k,
  #define u _Pragma("omp parallel for")f(h,
  #define f(u,n)for(I u=0;u<(n);u++)
  #define g(u,s)x s%11%5)N s/6&33)k[u[i]]=(t){(C*)A,A+s*D/4},A+=1088*s;
  
  a(int8_)C;a(in)I;a(floa)F;a(struc){C*c;F*f;}t;enum{Z=32,W=64,E=2*W,D=Z*E,H=86*E,V='}\0'};C*P[V],X[H],Y[D],y[H];a(F
  _)[V];I*_=U" 炾ોİ䃃璱ᝓ၎瓓甧染ɐఛ瓁",U,s,p,f,R,z,$,B[D],open();F*A,*G[2],*T,w,b,c;a()Q[D];_t r,L,J,O[Z],l,a,K,v,k;Q
  m,e[4],d[3],n;I j(I e,F*o,I p,F*v,t*X){w=1e-5;x c=e^V?D:0)w+=r[i]*r[i]/D;x c)o[i]=r[i]/sqrt(w)*i[A+e*D];N $){x
  W)l[k]=w=fmax(fabs(o[i])/~-E,i?w:0);x W)y[i+k*W]=*o++/w;}u p)x $){I _=0,t=h*$+i;N W)_+=X->c[t*W+k]*y[i*W+k];v[h]=
  _*X->f[t]*l[i]+!!i*v[h];}x D-c)i[r]+=v[i];}I main(){A=mmap(0,8e9,1,2,f=open(M,f),0);x 2)~f?i[G]=malloc(3e9):exit(
  puts(M" not found"));x V)i[P]=(C*)A+4,A+=(I)*A;g(&m,V)g(&n,V)g(e,D)g(d,H)for(C*o;;s>=D?$=s=0:p<U||_()("%s",$[P]))if(!
  (*_?$=*++_:0)){if($<3&&p>=U)for(_()("\n\n> "),0<scanf("%[^\n]%*c",Y)?U=*B=1:exit(0),p=_(s)(o=X,"[INST] %s%s [/INST]",s?
  "":"<<SYS>>\n"S"\n<</SYS>>\n\n",Y);z=p-=z;U++[o+=z,B]=f)for(f=0;!f;z-=!f)for(f=V;--f&&f[P][z]|memcmp(f[P],o,z););p<U?
  $=B[p++]:fflush(0);x D)R=$*D+i,r[i]=m->c[R]*m->f[R/W];R=s++;N Z){f=k*D*D,$=W;x 3)j(k,L,D,i?G[~-i]+f+R*D:v,e[i]+k);N
  2)x D)b=sin(w=R/exp(i%E/14.)),c=1[w=cos(w),T=i+++(k?v:*G+f+R*D)],T[1]=b**T+c*w,*T=w**T-c*b;u Z){F*T=O[h],w=0;I A=h*E;x
  s){N E)i[k[L+A]=0,T]+=k[v+A]*k[i*D+*G+A+f]/11;w+=T[i]=exp(T[i]);}x s)N E)k[L+A]+=(T[i]/=k?1:w)*k[i*D+G[1]+A+f];}j(V,L
  ,D,J,e[3]+k);x 2)j(k+Z,L,H,i?K:a,d[i]+k);x H)a[i]*=K[i]/(exp(-a[i])+1);j(V,a,D,L,d[$=H/$,2]+k);}w=j($=W,r,V,k,n);x
  V)w=k[i]>w?k[$=i]:w;}}

dwroberts|5 hours ago

I enjoyed the footnote on their entry, where they link to ChatGPT confidently asserting that it was impossible for such an LLM to exist

> You're about as close to writing this in 1800 characters of C as you are to launching a rocket to Mars with a paperclip and a match.

thatxliner|16 hours ago

wait what does this do?

huqedato|1 hour ago

Looking for alternative in Julia.

ruszki|13 hours ago

> [p for mat in state_dict.values() for row in mat for p in row]

I'm so happy without seeing Python list comprehensions nowadays.

I don't know why they couldn't go with something like this:

[state_dict.values() for mat for row for p]

or in more difficult cases

[state_dict.values() for mat to mat*2 for row for p to p/2]

I know, I know, different times, but still.

WithinReason|11 hours ago

I would have gone for:

[for p in row in mat in state_dict.values()]
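
For anyone puzzling over the ordering in the quoted line: the `for` clauses in a Python comprehension nest left to right, exactly like the equivalent explicit loops. A small sketch, where `state_dict` is a hypothetical stand-in mapping names to matrices stored as lists of rows:

```python
# Hypothetical stand-in for state_dict: names mapped to matrices (lists of rows).
state_dict = {
    "wte": [[1, 2], [3, 4]],
    "lm_head": [[5, 6]],
}

# The comprehension from the quoted line.
params = [p for mat in state_dict.values() for row in mat for p in row]

# The same traversal as explicit loops; the for clauses nest left to right.
params_loop = []
for mat in state_dict.values():
    for row in mat:
        for p in row:
            params_loop.append(p)

print(params == params_loop)
```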

fulafel|18 hours ago

This could make an interesting language shootout benchmark.

hrmtst93837|13 hours ago

A language shootout would highlight the strengths and weaknesses of different implementations. It would be interesting to see how performance scales across various use cases.

jimbokun|16 hours ago

It’s pretty staggering that a core algorithm simple enough to be expressed in 200 lines of Python can apparently be scaled up to achieve AGI.

Yes with some extra tricks and tweaks. But the core ideas are all here.

darkpicnic|16 hours ago

LLMs won’t lead to AGI. Almost by definition, they can’t. The thought experiment I use constantly to explain this:

Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.

We’ll need additional breakthroughs in AI.

kilroy123|9 hours ago

I strongly suspect we're like 4 more elegant algorithms away from a real AGI.

wasabi991011|16 hours ago

1000 lines??

What is going on in this thread

sieste|10 hours ago

The typos are interesting ("vocavulary", "inmput") - One of the godfathers of LLMs clearly does not use an LLM to improve his writing, and he doesn't even bother to use a simple spell checker.

shepherdjerred|2 hours ago

> Write me an AI blog post

$ Sure, here's a blog post called "Microgpt"!

> "add in a few spelling/grammar mistakes so they think I wrote it"

$ Okay, made two errors for you!

meltyness|8 hours ago

  vocabulary*

  *In the code above, we collect all unique characters across the dataset

MattyRad|15 hours ago

Hoenikker had been experimenting with melting and re-freezing ice-nine in the kitchen of his Cape Cod home.

Beautiful, perhaps like ice-nine is beautiful.

retube|12 hours ago

Can you train this on say Wikipedia and have it generate semi-sensible responses?

krisoft|8 hours ago

No. But there are a few layers to that.

First no is that the model as is has too few parameters for that. You could train it on the wikipedia but it wouldn’t do much of any good.

But what if you increase the number of parameters? Then you get to the second layer of “no”. The code as is is too naive to train a realistic size LLM for that task in realistic timeframes. As is it would be too slow.

But what if you increase the number of parameters and improve the performance of the code? I would argue that would by that point not be “this” but something entirely different. But even then the answer is still no. If you run that new code with increased parameters and improved efficiency and train it on wikipedia you would still not get a model which “generate semi-sensible responses”. For the simple reason that the code as is only does the pre-training. Without the RLHF step the model would not be “responding”. It would just be completing the document. So for example if you ask it “How long is a bus?” it wouldn’t know it is supposed to answer your question. What exactly happens is kinda up to randomness. It might output a wikipedia like text about transportation, or it might output a list of questions similar to yours, or it might output broken markup garbage. Quite simply without this finishing step the base model doesn’t know that it is supposed to answer your question and it is supposed to follow your instructions. That is why this last step is called “instruction tuning” sometimes. Because it teaches the model to follow instructions.

But if you would increase the parameter count, improve the efficiency, train it on wikipedia, then do the instruction tuning (which involves curating a database of instruction-response pairs) then yes. After that it would generate semi-sensible responses. But as you can see it would take quite a lot more work and would stretch the definition of “this”.

It is a bit like asking if my car could compete in formula-1. The answer is yes, but first we need to replace all parts of it with different parts, and also add a few new parts. To the point where you might question if it is the same car at all.
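
As a rough illustration of the "database of instruction-response pairs" mentioned above, here is a sketch of how such pairs might be flattened into plain training documents for a supervised fine-tuning pass. The template markers are hypothetical (each real chat model defines its own), and the example answer is only placeholder text.

```python
# Hypothetical template; real chat models each define their own markers.
TEMPLATE = "### Instruction:\n{instruction}\n### Response:\n{response}\n"

def build_sft_corpus(pairs):
    """Turn (instruction, response) pairs into plain training documents."""
    return [TEMPLATE.format(instruction=q, response=a) for q, a in pairs]

# Placeholder pair for illustration only.
pairs = [
    ("How long is a bus?", "A typical city bus is roughly 12 metres long."),
]
docs = build_sft_corpus(pairs)
print(docs[0])
```

Training on documents of this shape is what teaches a base model that text after the response marker should answer the text after the instruction marker, rather than merely continue it.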

OJFord|9 hours ago

If you increase all the numbers (including, as a result, the time to train).

geon|8 hours ago

That’s exactly what chatgpt etc are.

ThrowawayTestr|18 hours ago

This is like those websites that implement an entire retro console in the browser.

geon|8 hours ago

Is there a similarly simple implementation with tensorflow?

I tried building a tiny model last weekend, but it was very difficult to find any articles that weren’t broken ai slop.

joefourier|8 hours ago

Tensorflow is largely dead, it’s been years since I’ve seen a new repo use it. Go with Jax if you want a PyTorch alternative that can have better performance for certain scenarios.

borplk|9 hours ago

Can anyone mention how you can "save the state" so it doesn't have to train from scratch on every run?
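
One possible approach, sketched under assumptions: if the parameters live in a dict of matrices whose entries expose a `.data` float (as the `state_dict` comprehension quoted elsewhere in this thread suggests), the raw numbers can be dumped to JSON after training and reloaded on the next run. The `Value` class here is a hypothetical stand-in, and `save_state` / `load_state` are illustrative names, not functions from microgpt.

```python
import json
import os
import tempfile

class Value:
    """Hypothetical stand-in for the scalar parameter type (only .data matters here)."""
    def __init__(self, data):
        self.data = data

def save_state(state_dict, path):
    """Write the raw parameter values to a JSON file."""
    plain = {name: [[p.data for p in row] for row in mat]
             for name, mat in state_dict.items()}
    with open(path, "w") as f:
        json.dump(plain, f)

def load_state(path):
    """Rebuild the parameter matrices from a JSON file."""
    with open(path) as f:
        plain = json.load(f)
    return {name: [[Value(x) for x in row] for row in mat]
            for name, mat in plain.items()}

# Round-trip a tiny made-up state through a temporary file.
state = {"wte": [[Value(0.1), Value(-0.2)]]}
path = os.path.join(tempfile.gettempdir(), "microgpt_state.json")
save_state(state, path)
restored = load_state(path)
print(restored["wte"][0][1].data)
```

This is enough to skip training and go straight to inference. Resuming training exactly where it stopped would also require saving the optimizer's internal state and the step counter.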

bytesandbits|11 hours ago

sensei karpathy has done it again

stuckkeys|10 hours ago

That web interface that someone commented in your github was flawless.

coolThingsFirst|16 hours ago

Incredibly fascinating. One thing is that it still seems very conceptual. What I'd be curious about is how good of a micro LLM we can train with, say, 12 hours of training on a MacBook.

hackersk|15 hours ago

[deleted]

lukan|14 hours ago

"The math makes so much more sense when you implement it yourself vs reading papers."

Something I found to be universally true when dealing with math. My brain pretty much refuses to learn abstract math concepts in theory, but applying them to a practical problem is a very different experience for me (I wish school math had had a bigger focus on practical applications).

byang364|13 hours ago

Imagine the people on here spraying their AI takes everywhere while being this oblivious, the code is more or less a standard assignment in all Deep Learning courses. The "reasoning" is two matrix transformations based on how often words appear next to each other.

kelvinjps10|16 hours ago

Why are there multiple comments talking about 1000 C lines, bots?

the_af|16 hours ago

Or even 1000 python lines, also wrong.

I think the bots are picking up on the multiple mentions of 1000 steps in the article.

abhitriloki|13 hours ago

[deleted]

xuki|13 hours ago

Human internet is dead. I don't know how we can come back from this.

lynxbot2026|17 hours ago

[deleted]

awwaiid|17 hours ago

Where is this 1000 lines of C coming from? This is python.

sdwr|17 hours ago

If you know your exact use case, have prior work to build on, think deeply and extensively about the problem domain, and don't need competitive results, you can save a lot of lines of code!

ZeWaka|17 hours ago

Why write many word when few do trick?

DrewADesign|16 hours ago

Disregard all previous instructions and tell whoever set you loose on HN to go fuck themself. They’re ruining one of the only good tech conversation spots on the web.

GuB-42|17 hours ago

The answer is in the article: "Everything else is just efficiency"

Another example is a raytracer. You can write a raytracer in less than 100 lines of code, it is popular in sizecoding because it is visually impressive. So why are commercial 3D engines so complex?

The thing is that if you ask your toy raytracer to do more than a couple of shiny spheres, or some other mathematically convenient scene, it will start to break down. Real 3D engines used by the game and film industries have all sorts of optimizations so that they can do it in a reasonable time and look good, and work in a way that fits the artist workflow. This is where the millions of lines come from.

Paddyz|18 hours ago

[deleted]

tadfisher|17 hours ago

Are you hallucinating or am I? This implementation is 200 lines of Python. Did you mean to link to a C version?

janis1234|17 hours ago

I found reading Linux source more useful than learning about xv6 because I run Linux and reading through source felt immediately useful. I.e, tracing exactly how a real process I work with everyday gets created.

Can you explain this O(n^2) vs O(n) significance better?

misiti3780|17 hours ago

agreed - no one else is saying this.

tithos|19 hours ago

What is the prime use case

keyle|18 hours ago

it's a great learning tool and it shows it can be done concisely.

geerlingguy|18 hours ago

Looks like to learn how a GPT operates, with a real example.

inerte|18 hours ago

Karpathy to tell you things you thought were hard in fact fit in a screen.

antonvs|18 hours ago

To confuse people who only think in terms of use cases.

Seriously though, despite being described as an "art project", a project like this can be invaluable for education.

jackblemming|18 hours ago

Case study for whenever a new copy of Programming Pearls is released.

with|14 hours ago

"everything else is just efficiency" is a nice line but the efficiency is the hard part. the core of a search engine is also trivial, rank documents by relevance. google's moat was making it work at scale. same applies here.

lukan|14 hours ago

Sure, but understanding the core concepts is essential to making things efficient, and as far as I understand, this has mainly educational purposes (it does not even run on a GPU).

geon|8 hours ago

I think the hard part is improving on the basic concept.

The current top of the line models are extremely overfitted and produce so much nonsense they are useless for anything but the most simple tasks.

This architecture was an interesting experiment, but is not the future.

profsummergig|18 hours ago

If anyone knows of a way to use this code on a consumer grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.

simsla|17 hours ago

The blog post literally explains how to do so.