kir-gadjello | 2 years ago
I'm not sure if this is the right place to ask, but could you consider training an LLM using a more advanced, sparse transformer architecture (specifically, "Terraformer" from this paper https://arxiv.org/abs/2111.12763 and this codebase https://github.com/google/trax/blob/master/trax/models/resea... by Google Brain and OpenAI)? I understand the pressure to focus on training a straightforward LLaMA replication, but as you surely see, it's a legacy dense architecture, which limits its inference performance. This new architecture is not just an academic curiosity but is already validated at scale by Google, providing a 10x+ inference performance boost on the same hardware.
Frankly, the community's compute budget - for training and for inference - isn't infinite, and neither is the public's interest in models that have no advantage (at least in convenience) over closed-source ones, so we should use both of those resources as efficiently as possible. It could be a big step forward if you trained at least LLaMA-Terraformer-7B and 13B foundation models on the whole dataset.