fspeech | 3 months ago
It uses 75% linear attention layers, so it is inherently lower cost. And it is MoE, so the active parameter count is far lower than the total.
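A rough back-of-the-envelope sketch of both points, in case the arithmetic helps. Every number here (layer count, sequence length, head dimension, expert sizes) is a made-up assumption for illustration, not a figure for the model under discussion, and the cost formulas are only the usual asymptotic scalings with constants dropped.

```python
# Illustrative arithmetic only: all sizes below are hypothetical assumptions.

def attention_cost(seq_len, head_dim, n_layers, linear_fraction):
    """Relative attention cost for a stack mixing full and linear attention.

    Rough scaling, constants ignored:
      full attention layer   ~ seq_len^2 * head_dim
      linear attention layer ~ seq_len * head_dim^2
    """
    n_linear = round(n_layers * linear_fraction)
    n_full = n_layers - n_linear
    full_cost = n_full * seq_len * seq_len * head_dim
    linear_cost = n_linear * seq_len * head_dim * head_dim
    return full_cost + linear_cost


def moe_params(shared, per_expert, n_experts, top_k):
    """Return (total, active) parameter counts for a simple MoE layout."""
    total = shared + per_expert * n_experts
    active = shared + per_expert * top_k  # only top_k experts run per token
    return total, active


# Hypothetical configuration (assumed, not taken from the thread).
N_LAYERS, SEQ_LEN, HEAD_DIM = 48, 32768, 128

all_full = attention_cost(SEQ_LEN, HEAD_DIM, N_LAYERS, linear_fraction=0.0)
hybrid = attention_cost(SEQ_LEN, HEAD_DIM, N_LAYERS, linear_fraction=0.75)
attention_ratio = hybrid / all_full

total_params, active_params = moe_params(
    shared=2_000_000_000, per_expert=500_000_000, n_experts=64, top_k=4
)

if __name__ == "__main__":
    print(f"hybrid / all-full attention cost: {attention_ratio:.4f}")
    print(f"total params: {total_params:,}  active params: {active_params:,}")
```

Under these assumptions the attention cost at long context is dominated by the remaining 25% of full-attention layers, and the per-token compute follows the active parameter count rather than the total.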