I noticed the lack of support from ollama and llama.cpp for RWKV. As those are (to my eyes) very strong drivers of experimentation (i.e. supporting them means vastly more outreach) I was considering whether you were considering taking this into your own hands by contributing code to them? Or rather is the fact that you are not (AFAIK) doing it because you lack the bandwidth in terms of man power or any other reason?
It’s really, interesting work. I’m glad you’ve kept at it. I’d like to ask you about two issues.
I keep seeing papers like “Repeat After Me” claiming serious weaknesses of state space vs transformer models. What are the current weaknesses of RWKV vs transformers? Have you mitigated them? If so, how?
The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, etc means I can’t train models with most data legally. Pre-1920’s works in Project Gutenberg are totally public domain. Both the model and the training data would be 100% legal for reproducible research. Would your team be willing to train a 3B-7B model on only Gutenberg and release it to the public domain?
(Note: The Stack without GitHub Issues can be used for permissive code. However, there could be contamination issues like incorrect licenses, PII, etc. So, maybe at least one, 100% legal model. Maybe a second with Gutenberg and The Stack for coding research.)
> The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, etc means I can’t train models with most data legally.
That really depends on whether LLM pretraining ends up held as an infringing use. (Of course, it’ll take a while for the cases to work through the courts and for a body of jurisprudence to be developed on this subject.)
I'm quite interested in repeng [0] (representztion engineering) for steerability of (so fzr transformer based) LLMs and was wondering if anyone had tried such methods on rwkv (or mamba for that matter). Maybe there are some low hanging fruits about it.
One of the interesting "new direction" for RWKV and Mamba (or any recurrent model), is the monitoring and manipulation of the state in between token. For steerability, alignment, etc =)
Not saying its a good or bad idea, but pointing out that having a fixed state in between has interesting applications in this space
lower compute cost especially over longer sequence length. Depending on context length, its 10x, 100x, or even 1000x+ cheaper. (quadratic vs linear cost difference)
Has there been any plans to build a “reasoning” llm using RWKV? With the increase in inference token count caused by such methods, the muhc lower footprint of recurrent architecture could really make a difference for such a use-case.
Ey7NFZ3P0nzAe|1 year ago
nickpsecurity|1 year ago
I keep seeing papers like “Repeat After Me” claiming serious weaknesses of state space vs transformer models. What are the current weaknesses of RWKV vs transformers? Have you mitigated them? If so, how?
The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, etc means I can’t train models with most data legally. Pre-1920’s works in Project Gutenberg are totally public domain. Both the model and the training data would be 100% legal for reproducible research. Would your team be willing to train a 3B-7B model on only Gutenberg and release it to the public domain?
(Note: The Stack without GitHub Issues can be used for permissive code. However, there could be contamination issues like incorrect licenses, PII, etc. So, maybe at least one, 100% legal model. Maybe a second with Gutenberg and The Stack for coding research.)
Example use of Gutenberg:
https://www.tensorflow.org/datasets/catalog/pg19
anon373839|1 year ago
That really depends on whether LLM pretraining ends up held as an infringing use. (Of course, it’ll take a while for the cases to work through the courts and for a body of jurisprudence to be developed on this subject.)
Ey7NFZ3P0nzAe|1 year ago
pico_creator|1 year ago
Its the same principle as open transformer models where an adapter is used to generate the embedding
However currently the core team focus is in scaling the core text model, as this would be the key performance driver, before adapting multi-modal.
The tech is there, the base model needs to be better
Ey7NFZ3P0nzAe|1 year ago
[0] https://github.com/vgel/repeng/issues
pico_creator|1 year ago
Not saying its a good or bad idea, but pointing out that having a fixed state in between has interesting applications in this space
low_tech_punk|1 year ago
pico_creator|1 year ago
bratao|1 year ago
I have a task(OCR cleaning) that I´m evaluating faster options and look like RWKV would be a nice alternative.
littlestymaar|1 year ago
theLiminator|1 year ago
pico_creator|1 year ago
jharohit|1 year ago