tape_measure | 6 months ago WORDS IN CAPS are different tokens than lowercase, so maybe the lowercase tokens tie into more trained parts of the manifold.
maxbond|6 months ago That's a super interesting hypothesis. From an information theory perspective, rarer tokens are more informative. Maybe this results in the caps lock tokens being weighted higher by the attention mechanism.
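To put a number on "rarer tokens are more informative": the self-information (surprisal) of a token with probability p is -log2(p) bits. Here is a minimal sketch, assuming a made-up toy corpus where plain strings stand in for real tokenizer output. The function name `surprisal_bits` and the counts are hypothetical. It only illustrates the information-theory half of the comment and says nothing about how an attention mechanism would actually weight such tokens.

```python
import math
from collections import Counter


def surprisal_bits(tokens):
    """Map each distinct token to its self-information -log2(p), in bits."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: -math.log2(n / total) for tok, n in counts.items()}


# Toy corpus (invented for illustration): the lowercase form is common,
# the all-caps form is rare.
corpus = ["important"] * 7 + ["IMPORTANT"] * 1
bits = surprisal_bits(corpus)

for tok, b in sorted(bits.items(), key=lambda kv: kv[1]):
    print(f"{tok!r}: {b:.3f} bits")
```

In this toy setup the all-caps form carries more bits simply because it is rarer. Whether that rarity translates into higher attention weights is the open part of the hypothesis.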