As it's still work in progress may I suggest? It would be nice if you go beyond what others have already published and add more details. Like different position encodings, MoE, decoding methods, tokenization. As it's educational easy to use should be a priority, of course.
rasbt|2 years ago