(no title)
atlex2 | 3 months ago
On the other hand, it's amazing that a 512x512 image can be represented by 85 tokens (as in OAI's API), or 263 tokens per second for video (with Gemini). It's as if the memory vs compute tradeoff has morphed into a memory vs embedding question.
This dichotomy reminds me of the "Apple Rotators - can you rotate the Apple in your head" question. The spatial embeddings will likely solve dynamics questions a lot more intuitively (ie, without extended thinking).
We're also working on this space at FlyShirley - training pilots to fly then training Shirley to fly - where we benefit from established simulation tools. Looking forward to trying Fei Fei's models!
No comments yet.