
QueensGambit | 24 days ago

Factorization is key here. It separates dataset-level structure from observation-level computation so the model doesn't waste capacity rediscovering structure.

I've been arguing the same for code generation. LLMs flatten parse trees into token sequences, then burn compute reconstructing hierarchy as hidden states. Graph transformers could be a good solution for both: https://manidoraisamy.com/ai-mother-tongue.html
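The flattening can be seen directly with the standard library: the same source line exists both as a linear token sequence (what an LLM consumes) and as a tree the parser already made explicit. A minimal Python sketch, with a deliberately crude whitespace tokenizer for illustration:

```python
import ast

src = "y = a * x + b"

# Token-level view: the flat sequence a language model typically sees.
# (split() is a stand-in for a real tokenizer.)
tokens = src.split()
print(tokens)  # ['y', '=', 'a', '*', 'x', '+', 'b']

# Tree-level view: operator precedence and nesting are already explicit,
# so a * x binds before + b without the model having to rediscover it.
tree = ast.parse(src)
print(ast.dump(tree.body[0]))
```

The dumped AST shows `BinOp(BinOp(a * x) + b)` nesting that the token sequence only implies, which is the hierarchy the model otherwise reconstructs in its hidden states.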


mkmccjr | 16 days ago

Thank you for your comment, and I sincerely apologize for my slow response! "Rediscovering structure" is exactly the inefficiency I was trying to highlight.

In the physics/science cases I work with, the factorization is usually between the physical law (shared structure) and the experimental conditions (dataset-specific structure). If you don't separate them, the model wastes capacity trying to memorize the noise of the experimental conditions. (It's ineffective as well as wasteful.)
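A toy version of that factorization (a sketch, not the commenter's actual setup): three "experiments" share one law, a common slope, but each has its own offset standing in for experimental conditions. Fitting them jointly with a shared-slope design matrix pools all the evidence for the law instead of memorizing each dataset separately. All names and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared physical law: slope a. Dataset-specific conditions: offsets b_i.
a_true, b_true = 2.0, [0.5, -1.0, 3.0]
xs = [rng.uniform(0, 1, 50) for _ in range(3)]
ys = [a_true * x + b + 0.01 * rng.standard_normal(50)
      for x, b in zip(xs, b_true)]

# Factorized design matrix: one shared slope column, plus one
# intercept column per dataset. The joint least-squares solve
# estimates the law from all 150 points at once, while each
# dataset keeps its own condition parameter.
X = np.zeros((150, 4))
X[:, 0] = np.concatenate(xs)
for i in range(3):
    X[50 * i:50 * (i + 1), 1 + i] = 1.0
y = np.concatenate(ys)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # ~ [2.0, 0.5, -1.0, 3.0]
```

Collapsing this to one unfactorized fit per dataset would spend parameters re-estimating the slope three times, the small-scale analogue of the model "memorizing the noise of the experimental conditions."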

The analogy to code generation makes a lot of sense: flattening a tree into a sequence forces the model to infer syntax that was already explicit. Thank you for the link; I look forward to diving into it!