kevmo314 | 1 month ago

It's an incremental improvement, not really a revolutionary step.

That being said, I think one could adapt an existing model to add mHC by initializing the routing matrix so that it reproduces the regular residual connection and then post-training the hyper-connection matrices. This would let you continue training existing models more efficiently.
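A rough sketch of what that initialization could look like, in plain Python. Everything here is an assumption for illustration, not the paper's formulation: `layer` is a stand-in for a pre-trained block, and `mix`, `read`, and `write` are hypothetical names for the stream-mixing matrix and the weights that feed the block and write its output back. The only point is that with an identity mixing matrix and identical starting streams, the multi-stream model computes the same thing as the plain residual model.

```python
import math

def layer(h):
    # Stand-in for a pre-trained block (attention or MLP); any function works.
    return [math.tanh(2.0 * v) for v in h]

def residual_step(x):
    # Standard residual connection: x + f(x).
    return [xi + fi for xi, fi in zip(x, layer(x))]

def init_hyper_params(n):
    # Hypothetical initialization: identity mixing, uniform read, unit write.
    mix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    read = [1.0 / n] * n
    write = [1.0] * n
    return mix, read, write

def hyper_step(streams, mix, read, write):
    n, d = len(streams), len(streams[0])
    # Layer input is a weighted combination of the residual streams.
    h = [sum(read[i] * streams[i][k] for i in range(n)) for k in range(d)]
    out = layer(h)
    # Each stream gets a mix of all streams plus its share of the layer output.
    return [
        [sum(mix[i][j] * streams[j][k] for j in range(n)) + write[i] * out[k]
         for k in range(d)]
        for i in range(n)
    ]

def readout(streams):
    # Collapse the streams back to one hidden state by averaging.
    n = len(streams)
    return [sum(s[k] for s in streams) / n for k in range(len(streams[0]))]

x = [0.5, -1.0, 2.0]
n_streams = 4
mix, read, write = init_hyper_params(n_streams)
streams = [list(x) for _ in range(n_streams)]  # every stream starts as a copy of x

for _ in range(3):  # three stacked "layers"
    x = residual_step(x)
    streams = hyper_step(streams, mix, read, write)

hyper_out = readout(streams)
max_diff = max(abs(a - b) for a, b in zip(x, hyper_out))
print(max_diff)  # expected to be at floating-point noise level
```

Once `mix`, `read`, and `write` are made trainable, post-training can move them away from this starting point while the first forward pass still matches the original model.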

taykolasinski|1 month ago

That initialization strategy (effectively starting as identity to match the standard residual stream) is clever. It would let you perform surgery on an existing model like Llama-3 and fine-tune it into an mHC architecture.

The main risk I see is that the 7x signal amplification happens very aggressively. Even with a gentle initialization, you’d likely need very strict gradient clipping or a tiny learning rate on those new routing matrices to prevent them from blowing up the pre-trained features in the first few steps.
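To make the clipping idea concrete, here is a minimal sketch in plain Python. The numbers are made up: `routing` is a hypothetical flattened 2x2 routing matrix at its identity initialization, and `routing_grads` is an invented, deliberately large gradient. Clipping the gradient norm and using a small learning rate bounds how far one step can move the matrix from identity.

```python
import math

def clip_by_norm(grads, max_norm):
    # Rescale the gradient so its L2 norm is at most max_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

def sgd_step(params, grads, lr):
    # Plain SGD update, used here only to keep the example small.
    return [p - lr * g for p, g in zip(params, grads)]

routing = [1.0, 0.0, 0.0, 1.0]            # flattened 2x2 identity (hypothetical)
routing_grads = [30.0, -40.0, 0.0, 0.0]   # invented large gradient, L2 norm 50

max_norm = 1.0      # strict clip for the new routing parameters (assumed value)
routing_lr = 1e-4   # tiny learning rate for the new parameters (assumed value)

clipped = clip_by_norm(routing_grads, max_norm)
new_routing = sgd_step(routing, clipped, routing_lr)

# The step can move the matrix by at most routing_lr * max_norm in L2 norm.
max_change = max(abs(a - b) for a, b in zip(routing, new_routing))
print(new_routing, max_change)
```

In PyTorch the same idea would usually be expressed with separate optimizer parameter groups for the routing matrices and `torch.nn.utils.clip_grad_norm_` applied to just those parameters.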

Also, I think there's a mix-up here between mHC (this paper, expressivity) and MLA (latent attention, which provides the massive context efficiency). mHC doesn't save memory, but it might make the model 'smarter' per parameter.

solarkraft|1 month ago

You’re right, I totally mixed this up with MLA.