top | item 43901323

(no title)

karimf | 10 months ago

Do you have any information on how long each step take? Like how many ms on each step of the pipeline?

I'm curious how fast it will run if we can get this running on a Mac. Any ballpark guess?

discuss

koljab|10 months ago

LLM and TTS latency get's determined and logged at the start. It's around 220ms for the LLM returning the first synthesizable sentence fragment (depending on the length of the fragment, which is usually something between 3 and 10 words). Then around 80ms of TTS until the first audio chunk is delivered. STT with base.en you can neglect, it's under 5 ms, VAD same. Turn detection model also adds around 20 ms. I have zero clue if and how fast this runs on a Mac.