kir-gadjello | 2 years ago
Have you considered using Google's sparse "Scaling Transformer" architecture as the base? Even at the 3B scale it can generate 3-4x more tokens per FLOP while remaining competitive in perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.
Here is the paper https://arxiv.org/abs/2111.12763 and the implementation https://github.com/google/trax/blob/master/trax/models/resea... if you are interested.
Hope you get to look into this!
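The core trick the comment refers to can be sketched roughly like this: a cheap controller picks one active unit per block of the feed-forward hidden layer, so only a small slice of the weight matrices participates in each token's computation. This is an illustrative NumPy sketch only — the shapes, the low-rank controller, and names like `sparse_ffn`, `W_in`, `W_out` are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
d_model, d_ff, block = 64, 256, 4   # hidden layer split into blocks of 4 units
n_blocks = d_ff // block            # 64 blocks, one active unit each
d_low = 8                           # low-rank bottleneck for the controller

W_in = rng.standard_normal((d_model, d_ff)) * 0.02
W_out = rng.standard_normal((d_ff, d_model)) * 0.02
C1 = rng.standard_normal((d_model, d_low)) * 0.02   # controller, low-rank
C2 = rng.standard_normal((d_low, d_ff)) * 0.02

def sparse_ffn(x):
    # Cheap low-rank controller scores every hidden unit, then a hard
    # argmax selects one active unit per block (as at inference time).
    scores = ((x @ C1) @ C2).reshape(n_blocks, block)
    active = scores.argmax(axis=1)                 # (n_blocks,)
    idx = np.arange(n_blocks) * block + active     # flat column indices

    # Only the selected columns/rows enter the two big matmuls,
    # cutting their FLOPs by roughly the block factor.
    h = np.maximum(x @ W_in[:, idx], 0.0)          # (n_blocks,) activations
    return h @ W_out[idx, :]                       # (d_model,)

x = rng.standard_normal(d_model)
y = sparse_ffn(x)

# Rough FLOP count: dense FFN vs. controller + sparse matmuls.
dense_flops = 2 * d_model * d_ff * 2
sparse_flops = (2 * d_model * d_low + 2 * d_low * d_ff
                + 2 * d_model * n_blocks * 2)
print(y.shape, dense_flops / sparse_flops)
```

With `block = 4` the two main matmuls shrink 4x; the controller overhead eats into that a bit, which is consistent with the "3-4x more tokens per FLOP" figure being a range rather than a fixed multiplier.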
b33j0r | 2 years ago
Like why did we even get excited? This? Great work.
swyx | 2 years ago
Is that a guess, or is there a source? I'm curious to read more.
kir-gadjello | 2 years ago
I have an expanded list of foundational research that is likely to serve as a basis for GPT-4 here in my blog: https://kir-gadjello.github.io/posts/gpt4-some-technical-hyp...
Hope it helps!
chaxor | 2 years ago
kir-gadjello | 2 years ago
>You are free to:
>Share — copy and redistribute the material in any medium or format
>Adapt — remix, transform, and build upon the material
>for any purpose, even commercially.
Compare this to "IF", the latest release from Stability AI's DeepFloyd lab, whose license, in addition to various restrictive clauses, strictly prohibits commercial use: https://github.com/deep-floyd/IF/blob/develop/LICENSE-MODEL
Repl.it's release is as open as it gets these days, in my book.