
alexlitz | 2 days ago

I made a blog post on my submission (currently the top handwritten one, at 36 parameters): https://alexlitzenberger.com/blog/building_a_minimal_transfo...

ks2048 | 2 days ago

I didn't look at all the details, but I wanted to see how you did the initial embedding, and I see you do have a 14x5 matrix there. I guess when you're setting things by hand (rather than learning them), the definition of counting "parameters" is a bit unclear. One could say all of those are parameters, even if they're set in a straightforward way.

alexlitz | 1 day ago

Yeah, it's basically an implementation detail. Most of those entries are zero; there's an equivalent 14-parameter sparse matrix for that.
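
To make that concrete, here's a minimal sketch of the equivalence (the column pattern and values below are placeholders, not the ones from the post): a 14x5 matrix with one nonzero per row carries only 14 free values, since the sparsity pattern is fixed structure rather than a tuned quantity.

    import numpy as np

    # Hypothetical pattern: one nonzero per row of a 14x5 embedding.
    # Only the 14 values count as parameters; the column indices are
    # fixed structure, not learned/tuned quantities.
    vocab_size, d_model = 14, 5
    cols = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 4]  # placeholder
    vals = np.arange(1.0, 15.0)                         # 14 parameters

    dense = np.zeros((vocab_size, d_model))
    dense[np.arange(vocab_size), cols] = vals

    def embed_sparse(token_id):
        # Same lookup without materializing the 14x5 matrix.
        out = np.zeros(d_model)
        out[cols[token_id]] = vals[token_id]
        return out

    assert all(np.array_equal(dense[t], embed_sparse(t))
               for t in range(vocab_size))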

sowbug | 2 days ago

I ask this question as someone who can't do much more than confirm that your blog post is written in English by someone who knows math.

Does this result suggest that if we had N clever humans manually building an LLM, they might come up with something as smart as a frontier model, but potentially 45 times smaller? (1644 / 36 ~= 45, N = very large, time not specified)

alexlitz | 2 days ago

I imagine getting things to be polysemantic in a way that doesn't interfere would lead to sublinear scaling. Also, there are smaller trained ones, so the comparison would be more like 311/36 ~= 8.6.