top | item 38206282

(no title)

obscur | 2 years ago

GPT4 is multimodal in the sense that it can take images as input. The person is using a speech to text system such as OpenAIs Whisper and serving screenshots and voice transcripts to GPT4 and GPT4 is returning a text response which is converted to speech using a text to speech system such as OpenAIs TTS API.

discuss

stevofolife|2 years ago

Ah got it! So basically the prompt to GPT4 is an image + text (converted from audio).