This sounds like an interesting idea, can you elaborate more may be with a concrete example. I am wondering if this can be implemented easily as a plugin in optillm.
One could argue TF-IDF is a case of an attention layer... but not quadratic in inference/training and kinda just a quotient. Yeah maybe we should go back
codelion|9 months ago
pkoird|9 months ago
knuppar|9 months ago