How far away are we from something like a helmet with ChatGPT and a video camera installed? I imagine this would be awesome for low-vision people. Imagine having a guide tell you how to walk to the grocery store, and help you grocery shop without an assistant. Of course you have tons of liability issues here, but this is very impressive.
We're planning on getting a phone-carrying lanyard and she will just carry her phone around her neck with Be My Eyes^0 looking out the rear camera, pointed outward. She's DeafBlind, so it'll be bluetoothed to her hearing aids, and she can interact with the world through the conversational AI.
I helped her access the video from the presentation, and it brought her to tears. Now, she can play guitar, and she and the AI can write songs and sing them together.
This is a big day in the lives of a lot of people who aren't normally part of the conversation. As of today, they are.

0: https://www.bemyeyes.com/
Just the ability to distinguish bills would be hugely helpful, although I suppose that's much less of a problem these days with credit cards and digital payment options.
With this capability, how close are y'all to it being able to listen to my pronunciation of a new language (e.g. Italian) and give specific feedback about how to pronounce it like a local?
It completely botched teaching someone to say “hello” in Chinese - it used the wrong tones and then incorrectly told them their pronunciation was good.
I don't think that'd work without a dedicated startup behind it.
The first (and imo the main) hurdle is not reproduction, but just learning to hear the correct sounds. If you don't speak Hindi and are a native English speaker, this [1] is a good example. You can only work on nailing those consonants when they become as distinct to your ear as cUp and cAp are in English.
We can get by by falling back to context (it's unlikely someone would ask for a "shit of paper"!), but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears.
That's because we think we hear things as they are, but it's an illusion. The cup/cap distinction is as subtle to an Eastern European as Hindi consonants or Mandarin tones are to English speakers, because the set of meaningful sound distinctions differs between languages. Relearning the phonetic system requires dedicated work (minimal pairs are one option) and learning enough phonetics to have the vocabulary to discuss sounds as they are. It's not enough to just give feedback.

[1]: https://www.youtube.com/watch?v=-I7iUUp-cX8
After watching the demo, my question isn't about how close it is to helping me learn a language, but about how close it is to being me in another language.
Even styles of thought might be different in other languages, so I don't say that lightly... (stay strong, Sapir-Whorf, stay strong ;)
I was conversing with it in Hinglish (a combination of Hindi and English that folks in urban India use) and it was pretty on point, apart from some use of esoteric Hindi words, but I think with the right prompting we can fix that.
This is damn near one of the most impressive things I've seen. I can only imagine what you'd be capable of with live translation and voice synthesis (ElevenLabs style) integrated into something like Teams: select each person's language and do real-time translation into each person's native language, with their own voice and intonation. That would be NUTS.
Random OpenAI question: while the GPT models have become ever cheaper, the price for the TTS models has stayed in the $15 per 1M characters range. I was hoping this would also become cheaper at some point. There are so many apps (e.g. language learning) that quickly become too expensive given these prices. With the GPT-4o voice (which sounds much better than the current TTS or TTS HD endpoint) I thought maybe the prices for TTS would go down. Sadly that hasn't happened. Is that something on the OpenAI agenda?
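For a sense of scale, a quick back-of-the-envelope (the per-lesson and per-month numbers below are made up; only the $15 per 1M characters rate comes from the pricing above):

    # Rough TTS cost per user for a hypothetical language-learning app.
    # Only the $15 / 1M characters rate is real; usage numbers are assumptions.
    PRICE_PER_CHAR = 15 / 1_000_000   # dollars per character
    chars_per_lesson = 2_000          # assumed
    lessons_per_month = 30            # assumed

    monthly_cost = chars_per_lesson * lessons_per_month * PRICE_PER_CHAR
    print(f"${monthly_cost:.2f} per user per month")  # -> $0.90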
I've always wondered what GPT models lack that makes them "query->response" only. I've always tried to get chatbots to lose the initially needed query, to no avail. What would it take to get a GPT model to freely generate tokens in a thought-like pattern? I think when I'm alone, without a query from another human. Why can't they?
> What would it take to get a GPT model to freely generate tokens in a thought-like pattern?
That's fundamentally not how GPT models work, but you can easily build a framework around them that calls them in a loop. You'd need a special system prompt to get anything "thought-like" that way, and, if you want it to be anything other than stream-of-simulated-consciousness with no relevance to anything, a non-empty "user" prompt each round, which could be as simple as the time, a status update on something in the world, etc.
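A minimal sketch of that kind of loop, assuming the standard openai Python client and a GPT-4o-style chat model (the prompts, loop length, and sleep interval are all illustrative, not anything OpenAI prescribes):

    # Call a chat model in a loop, feeding its previous output back in, with a
    # small non-empty "user" nudge (here just the current time) each round.
    # Assumes the openai package is installed and OPENAI_API_KEY is set.
    import time
    from datetime import datetime
    from openai import OpenAI

    client = OpenAI()
    history = [{"role": "system",
                "content": "You are thinking out loud. Continue your previous "
                           "train of thought and stay concrete."}]

    for _ in range(5):
        history.append({"role": "user",
                        "content": f"Status update: it is now {datetime.now():%H:%M}."})
        reply = client.chat.completions.create(model="gpt-4o", messages=history)
        thought = reply.choices[0].message.content
        print(thought)
        history.append({"role": "assistant", "content": thought})
        time.sleep(60)  # one "thought" per minute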
Apes who've been trained since birth to use sign language, and can reply to remarkably complex questions, have the same issue. The researchers noticed they never once asked a question like "why is the sky blue?" or "why do you dress up?". Zero initiating conversation, but they do reply when you ask what they want.
I suppose it would cost even more electricity to have ChatGPT musing alone though, burning through its nvidia cards...
I think this will be key in a logical proof that statistical generation can never lead to sentience; Penrose will be shown to be correct, at least regarding the computability of consciousness.
You could say, in a sense, that without a human mind to collapse the wave function, the superposition of data in a neural net's weights can never have any meaning.
Even when we build connections between these statistical systems to interact with each other in a way similar to contemplation, they still require a human-created nucleation point on which to root the generation of their ultimate chain of outputs.
I feel like the fact that these models contain so much data has gripped our hardwired obsession for novelty and clouds our perception of their actual capacity to do de novo creation, which I think will be shown to be nil.
An understanding of how LLMs function should probably make this intuitively clear. Even with infinite context and infinite ability to weigh conceptual relations, they would still sit lifeless for all time without some, any, initial input against which they can run their statistics.
It happens sometimes. Just the other day a local TinyLlama instance started asking me questions.
The chat memory was full of mostly nonsense and it asked me a completely random and simple question out of the blue. Have chatbots evolved a lot since it was created?
I think you can get models to "think" if you give them a goal in the system prompt and a memory of previous thoughts, and keep invoking them with cron.
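Something along these lines, run from cron, is roughly what that looks like (a sketch only; the goal text, the thoughts.log memory file, and the schedule are placeholders, and it assumes the openai Python client):

    # One cron-driven "thinking" step: goal in the system prompt, previous
    # thoughts persisted to a file as memory, one new thought appended per run.
    # Example crontab entry:  */15 * * * * /usr/bin/python3 /path/to/think.py
    from pathlib import Path
    from openai import OpenAI

    MEMORY = Path("thoughts.log")               # placeholder memory store
    past = MEMORY.read_text() if MEMORY.exists() else ""

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Your goal: plan a vegetable garden. Think in short steps."},
            {"role": "user",
             "content": f"Your previous thoughts:\n{past}\n\nContinue thinking."},
        ],
    )
    MEMORY.write_text(past + "\n" + resp.choices[0].message.content)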
They are designed for query and response. They don't do anything unless you give them input. Also, there's not much research on the best architecture for running continuous thought loops in the background and how to mix them into the conversational "context". Current LLMs only emulate single-thought synthesis based on long-term memory recall (and some go off to query the Internet).
> I think when I'm alone, without a query from another human.
You are actually constantly queried, but it's stimulation from your senses. There are also neurons in your brain which fire regularly, like a clock that ticks every second.
Do you want to make a system that thinks without input? Then you need to add hidden stimuli via a non-deterministic random number generator, preferably a quantum-based RNG (or it won't be possible to claim the resulting system has free will). Even a single photon hitting your retina can affect your thoughts, and there are no doubt other quantum effects that trip neurons in your brain above the firing threshold.
I think you need at least three or four levels of loops interacting, with varying strength between them. The first level would be the interface to the world, the input and output level (video, audio, text). Data from here is high priority and is capable of interrupting lower levels.
The second level would be short-term memory and context switching. Conversations need to be classified and stored in a database, and you need an API to retrieve old contexts (conversations). You also possibly need context compression (summarization of conversations in case you're about to hit a context window limit).
The third level would be the actual "thinking": a loop that constantly talks to itself to accomplish a goal using the data from all the other levels, but mostly driven by the short-term memory. Possibly you could go super-human here and spawn multiple worker processes in parallel. You need to classify the memories by asking: do I need more information? Where do I find this information? Do I need an algorithm to accomplish a task? What are the completion criteria? Everything here is powered by an algorithm. You would take your data and produce a list of steps that you have to follow to arrive at a conclusion.
Everything you do as a human to resolve a thought can be expressed as a list or tree of steps.
If you've had a conversation with someone and you keep thinking about it afterwards, what has happened is basically that you have spawned a "worker process" that tries to come to a conclusion that satisfies some criteria. Perhaps there was ambiguity in the conversation that you are trying to resolve, or the conversation gave you some chemical stimulation.
The last level would be subconscious noise driven by the RNG, which would filter up with low priority. In the absence of other external stimuli with higher priority, or currently running thought processes, this would drive the spontaneous self-thinking portion (and dreams).
Implement this and you will have something more akin to true AGI (whatever that is) on a very basic level.
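A toy sketch of how those levels could hang together (this is just an illustration of the idea above, with the actual model call stubbed out; the priorities, timings, and the 50-item memory cap are arbitrary assumptions):

    # Level 1: external input (highest priority), level 2: short-term memory,
    # level 3: a "thinking" loop consuming stimuli, level 4: low-priority RNG noise.
    import queue, random, threading, time

    stimuli = queue.PriorityQueue()   # (priority, payload); lower = more urgent

    def external_input():             # level 1: the senses (here, just stdin)
        while True:
            stimuli.put((0, ("external", input())))

    def subconscious_noise():         # level 4: random, low-priority stimuli
        while True:
            time.sleep(random.uniform(5, 30))
            stimuli.put((9, ("noise", f"random seed {random.random():.3f}")))

    def think(payload, memory):       # level 3: stand-in for a real LLM call
        return f"(thinking about {payload!r} with {len(memory)} memories)"

    def main_loop():
        memory = []                   # level 2: short-term memory / context
        while True:
            _, (source, payload) = stimuli.get()   # blocks until a stimulus arrives
            thought = think(payload, memory)
            memory.append((source, payload, thought))
            memory[:] = memory[-50:]  # crude context compression: keep the newest 50
            print(f"[{source}] {thought}")

    threading.Thread(target=external_input, daemon=True).start()
    threading.Thread(target=subconscious_noise, daemon=True).start()
    main_loop()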
In my ChatGPT app or on the website I can select GPT-4o as a model, but my model doesn't seem to work like the demo. The voice mode is the same as before, the images come from DALL-E, and ChatGPT doesn't seem to understand or modify them any better than previously.
I couldn’t quite tell from the announcement, but is there still a separate TTS step, where GPT is generating tones/pitches that are to be used, or is it completely end to end where GPT is generating the output sounds directly?
Licensing the emotion-intoned TTS as a standalone API is something I would look forward to seeing. Not sure how feasible that would be if, as a sibling comment suggested, it bypasses the text-rendering step altogether.
Is it possible to use this as a TTS model? I noticed on the announcement post that this is a single model as opposed to a text model being piped to a separate TTS model.
The web page implies you can try it immediately. Initially it wasn't available.
A few hours later it was in both the web UI and the mobile app - I got a popup telling me that GPT-4o was available. However, nothing seems to be any different. I'm not given any option to use video as an input, and the app can't seem to pick up any new info from my voice.
I'm left a bit confused as to what I can do that I couldn't do before. I certainly can't seem to recreate much of the stuff from the announcement demos.
Sorry to hijack, but how the hell can I solve this? I have the EXACT SAME error on two iOS devices (native app only — web is fine), but not on Android, Mac, or Windows.
Winner of the 'understatement of the week' award (and it's only Monday).
Also top contender in the 'technically correct' category.
Yes! As soon as I saw gdb I was like "that can't be Greg", but sure enough, that's him.
I don't need to imagine that, I've had it for about 8 years. It's OK.
> help you grocery shop without an assistant
Isn't this something you learn as a child? Is that a thing we need automated?
Seems like these would be similar.
Beautiful articulation.
This is an enormous win for humanity.
You can use any open-source model without any prompt whatsoever.
As a language learner, this would be tremendously useful.
I imagine that there is a lot of usage at the HQ, human + AI karaoke?
Ah yes, also known as being co-founder :)
Sadly, the error returned is not related to the cause.
Will it be fully available in the EU with GDPR compliance?