sly010|7 months ago
Genuine question: How does this work? How does an LLM do object detection? Or more generally, how does an LLM do anything that is not text? I always thought tasks like this were usually just handed off to another (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision. It doesn't make sense to me why Gemini 2 and 2.5 would have different vision capabilities; shouldn't they both have access to the same purpose-trained, state-of-the-art vision model?
sashank_1509|7 months ago
Different models have different vision encoders; they are not shared, since the datasets vary across models and even across model sizes. So performance between models will vary.
What you seem to be thinking is that text models simply call out via API to a vision model, similar to tool use. That is not what's happening; it is much more built in: the forward pass goes through the vision architecture and into the language architecture. Robotics research has been doing this for a while.
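A minimal sketch of what "the forward pass goes through the vision architecture into the language architecture" can look like in a typical vision-language model: a vision encoder produces one embedding per image patch, a learned projection maps those into the language model's token-embedding space, and the resulting "image tokens" are concatenated with the text tokens into a single sequence. All dimensions and the random matrices below are hypothetical stand-ins, just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 16 image patches, a 64-d vision
# encoder, and a 128-d language-model embedding space.
num_patches, vision_dim, text_dim = 16, 64, 128

# 1. Vision encoder output: one embedding per image patch.
patch_embeddings = rng.normal(size=(num_patches, vision_dim))

# 2. A learned projection maps patch embeddings into the LLM's
#    token-embedding space (a random matrix stands in for the
#    trained weights here).
projection = rng.normal(size=(vision_dim, text_dim))
image_tokens = patch_embeddings @ projection   # shape (16, 128)

# 3. Text tokens are embedded as usual (5 tokens here).
text_tokens = rng.normal(size=(5, text_dim))

# 4. The language model's forward pass then sees one combined
#    sequence of image tokens and text tokens, side by side --
#    no API call to a separate model.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (21, 128)
```

The key point is step 4: from the language model's perspective, image patches are just more tokens in the same sequence, which is why vision capability differs per model rather than coming from one shared external service.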
namibj|7 months ago
IIUC they trained the native voice-to-voice models on YouTube-sourced audio. Skipping any intermediate text form is really helpful for fuzzy speech, such as people slurring or mumbling words. Having access to a full world model during voice deciphering also obviously helps in situations that are very context-heavy, for example spoken/Kana/phonetic Japanese (which relies on human understanding of context to parse homophones, and on non-phonetic Han characters (Kanji) in writing to make up for the inability to interject clarification).
simonw|7 months ago
Most vision LLMs don't actually use a separate vision model. https://huggingface.co/blog/vlms is a decent explanation of what's going on.
Most of the big LLMs these days are vision LLMs - the Claude models, the OpenAI models, Grok and most of the Gemini models all accept images in addition to text. To my knowledge none of them are using tool calling to a separate vision model for this.
Some of the local models can do this too - Mistral Small and Gemma 3 are two examples. You can tell they're not tool calling to anything because they run directly out of a single model weights file.
gylterud|7 months ago
For instance, I asked it to compute the symmetry group of a pattern I found on a wallpaper in a Lebanese restaurant this weekend. It realised it was unsure of the symmetries and used a python script to rotate and mirror the pattern and compare to the original to check the symmetries it suspected. Pretty awesome!
famouswaffles|7 months ago
https://www.youtube.com/watch?v=EzDsrEvdgNQ