ajuhasz | 9 days ago
One of our core architecture decisions was to use a streaming speech-to-text model. At any given time, about 80ms of actual audio and about 5 minutes of transcribed audio (text) are in memory. The text is kept to help the STT model know the context of the audio, which gives higher transcription accuracy.
Of these 5-minute transcripts, those that don't become memories are forgotten, so only selected, extracted memories are durably stored. Currently we store the transcript with the memory (this was a request from our prototype users to help them build confidence in the transcription accuracy), but we'll continue to iterate based on feedback on whether this is the correct decision.
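To make the retention model concrete, here is a minimal sketch of how a rolling text context plus selective storage could look. It is an illustration only, not our actual code: `Segment`, `Memory`, `StreamingSession`, and the in-memory `durable_store` list are hypothetical names, and the 80ms audio buffer is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field

# Assumption: the context window is about 5 minutes, as described above.
CONTEXT_WINDOW_S = 5 * 60


@dataclass
class Segment:
    end_time_s: float  # when this piece of text ended, in stream time
    text: str


@dataclass
class Memory:
    summary: str
    transcript: str  # transcript kept alongside the memory


@dataclass
class StreamingSession:
    # Rolling transcript context; lives in RAM only.
    context: deque = field(default_factory=deque)
    # Hypothetical stand-in for whatever persistent storage is used.
    durable_store: list = field(default_factory=list)

    def on_transcribed(self, segment: Segment) -> None:
        """Append new text and drop anything older than the window."""
        self.context.append(segment)
        cutoff = segment.end_time_s - CONTEXT_WINDOW_S
        while self.context and self.context[0].end_time_s <= cutoff:
            self.context.popleft()

    def context_text(self) -> str:
        """Text currently available as context for the STT model."""
        return " ".join(s.text for s in self.context)

    def save_memory(self, summary: str) -> Memory:
        """Persist a selected memory together with its transcript."""
        memory = Memory(summary=summary, transcript=self.context_text())
        self.durable_store.append(memory)
        return memory
```

In this sketch, text that never goes through `save_memory` simply falls out of the deque once it is older than the window, and nothing else holds a reference to it.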