Show HN: Multimodal perception system for real-time conversation
54 points | mert_gerdan | 19 days ago | raven.tavuslabs.org
One thing that’s always bothered me is that almost all conversational systems still reduce everything to transcripts, throwing away a ton of signals that could be used downstream. Some existing emotion-understanding models try to classify those signals into small sets of arbitrary boxes, but they aren’t fast or rich enough to do it with conviction in real time.
So I built a multimodal perception system that encodes visual and audio conversational signals and translates them into natural language by aligning a small LLM on those signals. The agent can "see" and "hear" you, and you can interface with it via an OpenAI-compatible tool schema in a live conversation.
It outputs short natural-language descriptions of what’s going on in the interaction - things like uncertainty building, sarcasm, disengagement, or even a shift in attention within a single turn of a conversation.
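Roughly, the tool schema has the shape below. This is a simplified sketch to show the idea; the tool name and field names here are illustrative placeholders, not the exact production schema:

    # Illustrative sketch only: tool/field names are placeholders,
    # not the actual Raven API.
    perception_tool = {
        "type": "function",
        "function": {
            "name": "report_perception",  # hypothetical tool name
            "description": "Receive a short natural-language read of the "
                           "user's visual/audio state during the turn.",
            "parameters": {
                "type": "object",
                "properties": {
                    "observation": {
                        "type": "string",
                        "description": "e.g. 'uncertainty building', "
                                       "'sarcastic tone', 'attention shifted'",
                    },
                    "turn_id": {"type": "string"},
                },
                "required": ["observation"],
            },
        },
    }

Because it follows the OpenAI tool format, any agent loop that already handles tool calls can consume these perception events without special plumbing.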
Some quick specs:
- Runs in real time per conversation
- Processes ~15 fps video plus overlapping audio alongside the conversation (see the sketch after this list)
- Handles nuanced emotions, whispers vs shouts
- Trained on synthetic + internal convo data
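To make the ~15 fps + overlapping-audio point concrete, the processing loop is conceptually something like this. It's a simplified sketch, not the production pipeline; `video_source`, `audio_source`, and `encoder` are made-up names for illustration:

    import time

    FRAME_INTERVAL = 1 / 15  # ~15 fps video sampling

    def perception_loop(video_source, audio_source, encoder):
        """Sketch: sample video at ~15 fps while audio is consumed in
        overlapping windows, then fuse both on each tick."""
        next_frame_at = time.monotonic()
        while video_source.is_open():
            now = time.monotonic()
            if now >= next_frame_at:
                frame = video_source.read_frame()
                # Overlapping audio window: each window shares audio with
                # the previous one, so prosody context spans tick boundaries.
                audio = audio_source.window(seconds=1.0, hop=FRAME_INTERVAL)
                encoder.ingest(frame, audio, timestamp=now)
                next_frame_at += FRAME_INTERVAL
            time.sleep(0.001)  # yield briefly; avoid busy-waiting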
Happy to answer questions or go deeper on architecture/tradeoffs.
More details here: https://www.tavus.io/post/raven-1-bringing-emotional-intelli...
arctic-true|19 days ago
Another concern I’d have is bias. If I am prone to speaking loudly, is it going to say I’m shrill? If my camera is not aligned well, is it going to say I’m not making eye contact?
mert_gerdan|19 days ago
Bias is a concern for sure, though it adapts to your speech patterns and behaviors over the course of a single conversation. So if it flags you as not making eye contact because, say, your camera is on a different monitor, it'll make that mistake once and won't bring it up again.
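The idea is like keeping a per-conversation baseline for each signal, so behavior is judged relative to you rather than to a global norm. A toy sketch of that idea (illustrative only, not our actual implementation):

    # Toy sketch of per-conversation calibration, not the real system:
    # signals are judged against a running per-speaker baseline, so a
    # habitually loud speaker isn't repeatedly flagged as shouting.
    class SessionBaseline:
        def __init__(self, alpha: float = 0.1):
            self.alpha = alpha  # smoothing factor for the running mean
            self.mean = None

        def update(self, value: float) -> float:
            """Return deviation from this speaker's own baseline."""
            if self.mean is None:
                self.mean = value  # first observation sets the baseline
                return 0.0
            deviation = value - self.mean
            self.mean += self.alpha * deviation  # adapt within the session
            return deviation

    # Usage: loudness is only "a shout" if it's loud *for this speaker*.
    loudness = SessionBaseline()
    for db in [72, 73, 71, 88]:  # habitually loud speaker, then a spike
        print(loudness.update(db))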
ycombiredd|19 days ago
One part of me has a tendency to think "good, take some subjectivity away from a human with poor social skills", but another part of me is repulsed by the concept, because we've seen how otherwise capable humans defer to the perceived "expertise" of the machine, whether out of misplaced trust or laziness (see the recent kerfuffles in the legal field over hallucinated citations, etc.)
Objective classification in CV is one thing, but subjective identification (psychology, pseudoscientific forensic sociology, etc.) via a multimodal model triggers a sort of danger warning in me as an initial reaction.
Neat work, though, from a technical standpoint.
mert_gerdan|19 days ago
Don't want this to turn into a Matt Damon in Elysium type of situation for sure, with that parole-officer scene hahah (which would stem from a poor integration of such subjective signals into existing workflows more than from the availability of those signals).
As for emotional intelligence, I personally see this as a prerequisite for any voice/language model interacting with humans: just as an autonomous car has to be able to identify a pothole, a voice/video agent has to be able to navigate the potholes in a conversation.
rl3|19 days ago
Candidate: That's the hotel.
HR: What?
Candidate: Where I live.
HR: Nice place?
Candidate: Yeah, sure. I guess. Is that part of the test?
HR: No. Just warming you up, that's all.