Isn't condensing rotary embeddings just a clever hack to circumvent the original training limitations? Is it really a sustainable solution for extending context length in language models?
Yes, it actually works. That's what matters at the end of the day.
But no, eventually you're going to have to fine-tune on something with a larger context, or train a brand-new model with the position embeddings semi-randomized so that it learns to generalize in the first place instead of needing hacks like this to function.
But training a model costs millions and millions of dollars, so we're going to have to wait until some very generous group decides to do all that training and release the model open source.
Or releases the model file for a fee; I'd pay around $200 for a good one.
bioemerl|2 years ago
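For readers unfamiliar with the trick being debated: "condensing" rotary embeddings means scaling position indices down so that a longer sequence maps into the position range the model saw during training. The sketch below is illustrative only (the function name and parameters are made up for this example); real implementations apply the same scaling inside the model's attention layers.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # Standard RoPE: each pair of dimensions rotates at a different frequency.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # "Condensing": multiply positions by scale < 1 so an extended context
    # (e.g. 8192 tokens) is squeezed into the trained range (e.g. 2048).
    # scale = trained_ctx / extended_ctx, here 2048 / 8192 = 0.25.
    angles = np.outer(positions * scale, inv_freq)  # shape (seq_len, dim/2)
    return np.cos(angles), np.sin(angles)

# With scale = 0.25, token 8191 gets the same rotation angles that
# fractional position 2047.75 would get in the original model.
cos_scaled, _ = rope_angles(np.arange(8192), dim=64, scale=2048 / 8192)
cos_manual, _ = rope_angles(np.arange(8192) * (2048 / 8192), dim=64, scale=1.0)
assert np.allclose(cos_scaled, cos_manual)
```

The model never sees positions beyond its trained range, at the cost of finer-grained (and therefore slightly blurred) position resolution, which is why the thread treats it as a workaround rather than a full solution.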
kaiokendev|2 years ago