top | item 46204960

(no title)

> …reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits.

At first glance this claim sounds airtight, but it quietly collapses under its own techno-mythology. The so-called “reuse” of the embedding matrix assumes a fixed semantic congruence between representational space and output projection, an assumption that ignores well-known phase drift in post-transformer latent manifolds. In practice, the logits emerging from this setup tend to suffer from vector anisotropification and a mild but persistent case of vocab echoing, where probability mass sloshes toward high-frequency tokens regardless of contextual salience.

Just kidding, of course. The first paragraph above, from OP’s article, makes about as much sense to me as the second one, which I (hopefully fittingly in y’all’s view) had ChatGPT write. But I do want to express my appreciation for being able to “hang out in the back of the room” while you folks figure this stuff out It is fascinating, I’ve learned a lot (even got a local LLM running on a NUC), and very much fun. Thanks for letting me watch, I’ll keep my mouth shut from now on ha!

discuss

tomrod|2 months ago

Disclaimer: working and occasionally researching in the space.

The first paragraph is clear linear algebra terminology, the second looked like deeper subfield specific jargon and I was about to ask for a citation as the words definitely are real but the claim sounded hyperspecific and unfamiliar.

I figure a person needs 12 to 18 months of linear algebra, enough to work through Horn and Johnson's "Matrix Analysis" or the more bespoke volumes from Jeffrey Humpheries to get the math behind ML. Not necessarily to use AI/ML as a tech, which really can benefit from the grind towards commodification, but to be able to parse the technical side of about 90 to 95 percent of conference papers.

danielmarkbruce|2 months ago

One needs about 12 to 18 hours of linear algebra to work though the papers, not 12 to 18 months. The vast majority of stuff in AI/ML papers is just "we tried X and it worked!".

jhardy54|2 months ago

> 12 to 18 months of linear algebra

Do you mean full-time study, or something else? I’ve been using inference endpoints but have recently been trying to go deeper and struggling, but I’m not sure where to start.

For example, when selecting an ASR model I was able to understand the various architectures through high-level descriptions and metaphors, but I’d like to have a deeper understanding/intuition instead of needing to outsource that to summaries and explainers from other people.

woadwarrior01|2 months ago

It's just a long winded way of saying "tied embeddings"[1]. IIRC, GPT-2, BERT, Gemma 2, Gemma 3, some of the smaller Qwen models and many more architectures use weight tied input/output embeddings.

[1]: https://arxiv.org/abs/1608.05859

jcims|2 months ago

The turbo encabulator lives on.

empath75|2 months ago

It's a 28 part series. If you start from the beginning, everything is explained in detail.

miki123211|2 months ago

As somebody who understands how LLMs work pretty well, I can definitely feel your pain.

I started learning about neural networks when Whisper came out, at that point I literally knew nothing about how they worked. I started by reading the Whisper paper... which made about 0 sense to me. I was wondering whether all of those fancy terms are truly necessary. Now, I can't even imagine how I'd describe similar concepts without them.

whimsicalism|2 months ago

i consider it a bit rude to make people read AI output without flagging it immediately

squigz|2 months ago

I'm glad I'm not the only one who has a Turbo Encabulator moment when this stuff is posted.

unethical_ban|2 months ago

I was reading this thinking "Holy crap, this stuff sounds straight out of Norman Rockwell... wait, Rockwell Automation. Oh, it actually is"

QuadmasterXLII|2 months ago

The second paragraph is highly derivative of the adversarial turbo encabulator, which Schmithuber invented in the 90s. No citation of course.

BubbleRings|2 months ago

Are you saying I should have attributed, or ChatGPT should have? I suppose I would have but my spurving bearings were rusty.

ekropotin|2 months ago

I have no idea what you’ve just said, so here is my upvote.