Amazing that you can just shove a ton of multimodal data into a big transformer and get a really good multimodal model. I wonder where things will top out. For many years a lot of people (including me) were saying "you can't just take existing architectures, scale them up, feed them a lot of data, and expect something impressive", but here we are.
Ditto. My views changed after thinking “wait, isn’t that what Mother Nature did?” Her solutions just took three billion years of pretraining, decades of individual fine tuning, and ungodly amounts of streaming data. Our solutions are faster due to a vastly more efficient learning algorithm and relevant digital compute.
Now I’ve reached the opposite SWAG hypothesis: given a sufficiently general optimization problem, the more one scales training/inference compute, the more impossible it is to prevent intelligence.
That seems disappointingly prosaic. Shouldn’t there be more to it? Human ego creates anthropocentric bias toward thinking we’re special. We aren’t. At best we’re lucky. And nature doesn’t care about our biases. She repeatedly disabuses us from our “special” places — at the center of the universe, the solar system, the tree of life, and now the spectrum of general intelligence.
This changed my perspective on the LLM “statistical parrot”/“no True Scotsman” critics. Their conviction without evidence (i.e. faith) that SotA models don’t “really” reason comes from insecurity. It’s a loud reaction to egos popping. That’s a trauma I can sympathize with.
You forgot the corollary. What transformers fundamentally reason about is a tensor of shape N x (number of input tokens) x (head embedding size), where N is the number of attention heads. That's the "latent space" between 2 layers of a transformer; that's what attention produces (and it's the same across all layers in almost all transformers, except the first and the last).
Now if you look at this, you might notice ... that's pretty huge for a latent space. Convolutional AI had latent spaces that gradually decreased to maybe 100 numbers, often even smaller. The big transformers have an enormous one: for GPT-3 it's 96 x 2048 x 128. That is a hell of a lot of numbers between 2 layers. And it just keeps the entire input (up to that point) in memory as it slowly fills up the "context". What then reasons about this data is a "layer" of the transformer, which is more or less a resnetted deep neural network.
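A rough numpy sketch of those shapes (toy sizes so it runs instantly; the toy dimensions are made up for illustration, and GPT-3's 96 heads x 2048-token context x 128-dim heads appear only in the final arithmetic):

```python
import numpy as np

# Toy multi-head attention dimensions (hypothetical, for illustration only):
n_heads, seq_len, head_dim = 4, 8, 16

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n_heads, seq_len, head_dim)) for _ in range(3))

# Scaled dot-product attention, all heads at once:
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                # rows sum to 1
out = weights @ v                                        # (heads, seq, head_dim)
print(out.shape)  # (4, 8, 16)

# At GPT-3 scale -- 96 heads, a 2048-token context, 128 dims per head -- the
# tensor sitting between two layers holds:
print(96 * 2048 * 128)  # 25165824 numbers
```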
But convnets were fundamentally limited in how many dimensions they could "think" in; the biggest I've seen were 1000 dimensions, because we couldn't keep their thinking stable with more. But ... we do know how to do that now.
You could look at this to figure out what transformers do if you radically simplify. Nobody can imagine a 100,000 dimensional space. Just doesn't work, does it? But let's say we have a hypothetical transformer with a context size of 2. Let's call token 1 "x" and token 2 "y". You probably see where I'm going with this. This transformer will learn to navigate a plane in a way similar to what it's seen in the training data. "If near (5,5), go north by 32" might be what one neuron in one layer does. This is no different in 100,000 dimensions, except now everybody's lost.
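A toy caricature of that one-neuron rule (entirely made up, just to make the 2D picture concrete): a smooth gate that fires near one point in the plane, scaling a fixed move north.

```python
import math

def neuron_step(x, y):
    # Gaussian "if near (5,5)" gate times a fixed "go north by 32" move.
    gate = math.exp(-((x - 5) ** 2 + (y - 5) ** 2) / 2)  # ~1 near (5,5), ~0 far away
    return x, y + 32 * gate

print(neuron_step(5.0, 5.0))    # (5.0, 37.0) -- full move at the center
print(neuron_step(50.0, 50.0))  # (50.0, 50.0) -- essentially no move far away
```

Stack thousands of such gated moves per layer, and many layers, and you get navigation of the plane; in 100,000 dimensions the mechanism is the same.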
But ... what happens in a convnet with a latent space of 50,000? 100,000? 1,000,000? What happens, for that matter, in a simple deep neural network (i.e. just fully connected layers + softmax) of that size? This was never really tried, for 2 reasons: the hardware couldn't do it at the time, AND the math wouldn't support it (we didn't know how to deal with some of the problems; likely you'd need to "resnet" both convnets and deep neural networks, for example).
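A quick numpy sketch of the stability problem the "resnetting" fixes (widths, depth, and weight scales here are arbitrary choices of mine): with plain layers the signal collapses after a couple dozen layers, while with skip connections it survives.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 512, 20  # a "wide" latent space and a deep stack (hypothetical sizes)
Ws = [rng.standard_normal((d, d)) * 0.01 for _ in range(depth)]

def forward(x, residual):
    for W in Ws:
        h = np.maximum(0, x @ W)        # plain ReLU layer
        x = x + h if residual else h    # "resnet" = add the input back in
    return x

x0 = rng.standard_normal(d)
print(np.linalg.norm(forward(x0, residual=False)))  # collapses toward 0
print(np.linalg.norm(forward(x0, residual=True)))   # stays healthy
```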
Would the "old architectures" just work with such an incredibly massive latent space?
And there's the other side as well: improve transformers ... what about including MUCH more in the context? A long list of previous conversations, for example. The entire text of textbooks: the multiplication table, a list of ways triangles can be proven congruent, the periodic table, physical constants, the expansion rules for differential calculus, "Physics for Scientists and Engineers", the whole thing. Yes, that will absolutely blow out the latent space, but clearly we've decided that a billion or two of extra investment will still allow us to calculate the output.
And its long context recall is quite good! We've already kind of discovered this with Yi, but there are some things one can do with a mega context that you just can't get with RAG.
> but there are some things one can do with a mega context that you just can't get with RAG.
Can you elaborate? In my mind, RAG and "mega context" are orthogonal - RAG is something done by adding documents to the context for the LLM to reference, and "mega context" is just having a big context. No?
> And its long context recall is quite good! We've already kind of discovered this with Yi, but there are some things one can do with a mega context that you just can't get with RAG.
I've got to imagine that a mega-context like this can help RAG work in ways that just aren't possible otherwise, i.e. bringing in many more search results, or more surrounding context around the results, so that the processing can do much more.
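Something like this sketch (all helper names hypothetical): with a small context budget you keep only the top few retrieved chunks; with a mega context you can pack hundreds of results plus their surroundings into one prompt.

```python
# Naive token counter (whitespace split) stands in for a real tokenizer.
def build_prompt(question, results, budget_tokens, tokens=lambda s: len(s.split())):
    picked, used = [], 0
    for r in results:                    # results assumed ordered by retrieval score
        cost = tokens(r)
        if used + cost > budget_tokens:  # stop once the context budget is spent
            break
        picked.append(r)
        used += cost
    return "\n\n".join(picked) + "\n\nQuestion: " + question

docs = [f"chunk {i}: some passage about topic {i % 7}" for i in range(500)]
small = build_prompt("what is topic 3?", docs, budget_tokens=50)
mega = build_prompt("what is topic 3?", docs, budget_tokens=100_000)
print(len(small.split("\n\n")) - 1, "chunks fit in the small context")  # 7
print(len(mega.split("\n\n")) - 1, "chunks fit in the mega context")    # 500
```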
… d) Fully open-sourced a family of 7B-parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens.
In terms of content, I am blown away yet again by the SoTA speeding on by as I try to catch up. Can someone with a more cynical eye point me to competitors or problems with this approach? Because as it stands… that jump to a context length of a million tokens is pretty impressive to an outsider.
Look at how RingAttention is implemented: it's blockwise attention distributed among many GPUs; in other words, brute-force parallelization. For inference they use a TPU v4-128, so you won't be running this at home any time soon.
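For the curious, the blockwise trick itself is simple to sketch in numpy: process K/V one block at a time with a running ("online") softmax, so the full seq x seq score matrix never materializes; RingAttention then spreads those blocks around a ring of devices. A single-device sketch, checked against ordinary attention:

```python
import numpy as np

def blockwise_attention(q, k, v, block=64):
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)   # running max of scores, per query
    l = np.zeros(q.shape[0])           # running softmax denominator
    acc = np.zeros_like(q)             # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)      # scores for this K/V block only
        m_new = np.maximum(m, s.max(-1))
        scale = np.exp(m - m_new)      # rescale old stats to the new max
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(-1)
        acc = acc * scale[:, None] + p @ vb
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
full = np.exp(q @ k.T / np.sqrt(32))
full = (full / full.sum(-1, keepdims=True)) @ v          # ordinary attention
print(np.allclose(blockwise_attention(q, k, v), full))   # True
```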
I wonder why the example videos are in this specific clip-compilation format.
It feels to me that to navigate that, you essentially have to index 500 10-second videos, and that looks a lot easier than retrieving information from an actual 1-hour-long video, because the latter will have a lot more easy-to-mix-up moments. So maybe it hides an inability to answer questions about actual long videos (in the paper, the other example videos cap out at 3 minutes in length, from what I can see).
On the other hand, maybe it's just for results-presentation purposes, because it is much more readily "verifiable" for everyone than saying "trust us, somewhere in this very long video there's the unarguably correct answer".
So if someone happens to know more about that, I'd be very interested.
It's pretty wild watching technology develop where, in February, I genuinely don't have a confident idea of just how far it will progress by December of the same year.
Open models have just been on fire lately, and with the next generation of SotA models to pull synthetic data from when training the next generation of open models, each taking nuanced and clever approaches to infrastructure improvements, I'm pretty much considering all bets to be off.
At this point, the bottleneck is increasingly the human ability to adapt to improving tools rather than limitations in the tools themselves.
- Matei Zaharia, CTO of Databricks
- Pieter Abbeel, Director of the Berkeley Robot Learning Lab and Co-Director of the Berkeley Artificial Intelligence Research (BAIR) lab
- Two talented PhD students: Hao Liu and Wilson Yan
> We curated a large dataset of videos and languages from public book and video datasets, consisting of videos of diverse activities and long-form books.
I didn’t see any other mention of datasets used; is this intentional?
While I’m not sure about this one, many AI companies do hide their training data because it’s illegally obtained (i.e. file-shared copies of copyrighted works). That’s half of why I dropped AI. The “Proving Wrongdoing” part of my article has specific examples of it:
the claim is that a dataset was created. of words and of videos, and that it was created from public datasets of books and videos, those datasets containing books, and videos.
it takes too many words to say almost nothing.
nothing to see here.
if that isn’t the intent, then the authors need to do better.
It means millions of tokens. "Token" in this context means either a text token (as in tokenization https://en.wikipedia.org/wiki/Large_language_model#Probabili...) or a video token, which the paper describes as "each frame in the video is tokenized with VQGAN into 256 tokens" (p. 6).
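Which makes the context arithmetic easy to sanity-check, using the paper's 256 tokens per frame (the 1 fps sampling rate below is my assumption, not the paper's):

```python
tokens_per_frame = 256  # per the paper: VQGAN turns each frame into 256 tokens
fps = 1                 # assumed sampling rate, for illustration
minutes = 60
print(minutes * 60 * fps * tokens_per_frame)  # 921600 tokens for an hour of video
```

So an hour of video at one frame per second already nearly fills a 1M-token context.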
It blows my mind how quickly we are moving with these advances in LLM, and these are just the ones we see in PUBLIC. I'm sure there are more advanced proprietary solutions that we aren't privy to.
This implementation is similar to something Ilya Sutskever said a few months ago, though I think I may be misunderstanding both: I think they are saying robots could learn how to move and what facial expressions to use by watching millions of hours of videos involving humans, a sort of LLM of human behavior. I am not a scientist, so I may have this wrong.
Not that controversial. You just need to map it to the controls correctly. The experience of others can show what a human would do; there needs to be a layer that figures out how to achieve that outcome with whatever tools are on hand.
Person of Interest. It is actually a very well-thought-out and down-to-earth show. Quite interesting to see many of the elements in the show coming to life with recent AI advancements.
brucethemoose2 | 2 years ago
https://huggingface.co/brucethemoose/LargeWorldModel_LWM-Tex...
https://huggingface.co/dranger003/LWM-Text-Chat-128K-iMat.GG...
brucethemoose2 | 2 years ago
We've had Yi 6B 200K context for some time, which is also quite good.
The problem, of course, is hardware requirements and VRAM. This one is particularly hairy since it's not a GQA model.
brucethemoose2 | 2 years ago
https://huggingface.co/LargeWorldModel
And the model page specifically mentions Books3.
agnosticmantis | 2 years ago
- Books3 dataset
- 700B text-image pairs from Laion-2B-en, filtered to only keep images with at least 256 resolution
- 400M text-image pairs from COYO-700M, filtered to only keep images with at least 256 resolution
- 10M text-video pairs from WebVid10M
- 3M text-video pairs from a subset of InternVid10M
- 73K text-video chat pairs from Valley-Instruct-73K
- 100K text-video chat pairs from Video-ChatGPT
0: https://huggingface.co/LargeWorldModel/LWM-Chat-1M-Jax#train...
nickpsecurity | 2 years ago
http://gethisword.com/tech/exploringai/
bArray | 2 years ago
https://futurama.fandom.com/wiki/Ogden_Wernstrom