(no title)

antx | 6 months ago

Also, with the rapid advances in vision-language models, I would be surprised if we don't see an image-to-text-to-voice system that works with real-time video in the not-so-distant future! Like a reverse "Genie": instead of providing a prompt and having it generate a world, you provide a streaming video and it spouts relevant information when changes happen, or on demand, for instance...

gostsamo|6 months ago

It would be great to have it as a backup, but it will always be the heaviest solution in terms of computation and the slowest to respond, so it should be the last one used.

fho|6 months ago

Have you played around with the current vision features? I am pretty sure even GPT-4.1 can give you good descriptions of, e.g., screen captures, including being able to "read" and reproduce text.