tuned | 4 months ago
- Multi-layer Transformer: N stacked decoder blocks with pre-norm residual connections
- Rotary Position Embeddings (RoPE): replaces learned positional encodings with rotary embeddings for better length generalization
- Multi-Query Attention (MQA): reduces KV-cache size by sharing key/value heads across query heads
- RMSNorm: parameter-free normalization for stability (instead of LayerNorm)
- QK-norm: normalizes queries and keys before attention to prevent numerical instability
- ReLU² MLP: uses a ReLU(x)² activation for better gradient flow on GPUs
- Softcap logits: bounds output logits with tanh(x/15)·15 to prevent extreme values
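A minimal NumPy sketch of several of the listed components — RMSNorm, the ReLU² activation, logit softcapping, RoPE, and QK-norm. This is an illustrative reconstruction under stated assumptions, not the repository's actual implementation: the function names, the rotate-half RoPE layout, and the `base=10000` frequency schedule are assumptions, and the real model would operate on batched multi-head tensors.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Parameter-free RMSNorm: divide by the root-mean-square over the
    # feature axis (no learned scale, unlike LayerNorm).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def relu2(x):
    # ReLU(x)^2 MLP activation: cheap on GPUs, smooth-ish gradients.
    return np.square(np.maximum(x, 0.0))

def softcap(logits, cap=15.0):
    # Softcap: tanh(x/cap) * cap bounds every logit to (-cap, cap)
    # while staying nearly linear for small values.
    return np.tanh(logits / cap) * cap

def rope(x, base=10000.0):
    # Rotary Position Embeddings on a (seq_len, dim) array, dim even.
    # Each feature pair (x1[i], x2[i]) is rotated by an angle that
    # grows with position, encoding position relatively in Q·K dot products.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)              # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def qk_norm_scores(q, k):
    # QK-norm: RMS-normalize queries and keys before the dot product,
    # which bounds attention-logit magnitudes and avoids blowups.
    return rms_norm(q) @ rms_norm(k).T
```

Note that RoPE is a pure rotation, so it preserves vector norms and is the identity at position 0; MQA is not shown above, since it is a head-sharing layout choice (one K/V head serving all query heads, shrinking the KV cache by roughly the number of query heads) rather than a per-tensor transform.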