Can you share a GitHub link to this? Where are you reducing the latency? Are you processing the raw audio to text? In my experience, ChatGPT's generation time is much faster than local Llama unless you're using something potato, like a 7B model.
Unfortunately it has a really high "works on my machine" factor. I'm using Llama2-chat-13B via mlc-llm + whisper-streaming + Coqui TTS. I just have a bunch of hardcoded paths, and these projects tend to be a real pain to set up, so figuring out a nice way to package it all up with its dependencies in a portable way is the hard part.
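To give a rough idea of the overall shape, here's a minimal sketch of the speech -> text -> LLM -> speech loop. The functions are hypothetical placeholders, not the real APIs of whisper-streaming, mlc-llm, or Coqui TTS (each of which needs its own fiddly setup):

```python
def transcribe_utterance() -> str:
    # Placeholder: whisper-streaming would block here until one
    # utterance from the microphone is finalized as text.
    return "What's the capital of France?"

def generate_reply(prompt: str):
    # Placeholder: mlc-llm would stream tokens from Llama2-chat-13B here.
    yield from ["Paris", " is", " the", " capital", " of", " France", "."]

def speak(text: str) -> None:
    # Placeholder: Coqui TTS would synthesize and play the audio here.
    print(f"[TTS] {text}")

def chat_turn() -> None:
    user_text = transcribe_utterance()
    reply = "".join(generate_reply(user_text))
    speak(reply)

chat_turn()  # -> [TTS] Paris is the capital of France.
```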
modeless|2 years ago
I'm mostly using llama2 because I wanted it to work entirely offline, not because it's necessarily faster, although it is quite fast with mlc-llm. Calling out to GPT-4 is something I'd like to add. I think the right approach is actually to have the local model generate the first few words (maybe even filler words sometimes) and then switch over to the GPT-4 answer whenever it comes back.
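As a toy sketch of that switchover idea, assuming placeholder async producers rather than real client code:

```python
import asyncio

async def local_tokens():
    # Placeholder: a local model is fast to first token, so it can
    # start "talking" almost immediately.
    for tok in ["Well", ",", " let", " me", " think"]:
        await asyncio.sleep(0.05)
        yield tok

async def gpt4_answer() -> str:
    # Placeholder: the remote call takes longer but gives a better answer.
    await asyncio.sleep(0.18)
    return "Paris is the capital of France."

async def respond() -> str:
    remote = asyncio.ensure_future(gpt4_answer())  # fire off GPT-4 immediately
    spoken = []
    async for tok in local_tokens():
        if remote.done():    # GPT-4 answer arrived: stop the local model
            break
        spoken.append(tok)   # keep talking with local output meanwhile
    spoken.append("... " + await remote)
    return "".join(spoken)

print(asyncio.run(respond()))
```

The hard part in practice would be splicing the local filler into the GPT-4 answer so the combined audio still sounds natural.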
kordlessagain|2 years ago