drdaeman|3 months ago
> The researchers ran the audio and motion data through smaller models that generated text captions and class predictions, then fed those outputs into different LLMs (Gemini-2.5-pro and Qwen-32B) to see how well they could identify the activity.
Maybe I'm not understanding it, but as I read it, the LLMs weren't really important: all they did was further interpret the outputs of a fronting audio-to-text classifier model.
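A minimal sketch of the pipeline as quoted above, assuming hypothetical helper names, dummy outputs, and a generic `llm_complete` callback (none of these identifiers come from the paper):

```python
# Illustrative sketch of the captions-then-LLM pipeline quoted above.
# Helper names, labels, and the prompt wording are assumptions for the
# example, not the paper's actual code.

def classify_audio(audio_clip):
    """Stand-in for the small audio model; returns top class labels."""
    return ["water running", "clinking dishes"]  # dummy output

def caption_motion(imu_window):
    """Stand-in for the small motion-captioning model."""
    return "repetitive arm movement near waist height"  # dummy output

def identify_activity(audio_clip, imu_window, llm_complete):
    """Format the small models' text outputs into a prompt and ask an
    LLM (Gemini-2.5-pro or Qwen-32B in the study) to name the activity."""
    prompt = (
        "Audio events detected: " + ", ".join(classify_audio(audio_clip)) + "\n"
        "Motion description: " + caption_motion(imu_window) + "\n"
        "What activity is the person most likely performing? "
        "Answer with a single activity label."
    )
    return llm_complete(prompt)

# Example with a stubbed LLM call:
print(identify_activity(None, None, lambda prompt: "washing dishes"))
```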
Lerc|3 months ago
You don't need them, but they are one way to do it that people know how to implement.
Identifying patterns is fairly amenable to analytic approaches; interpreting them, less so.
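A toy illustration of that point, showing how pattern identification can be purely analytic with no LLM involved (the thresholds and labels are invented for the example):

```python
import math

def activity_from_accel(samples):
    """Toy analytic classifier: label a window of (x, y, z) accelerometer
    samples by the variance of their magnitude. Thresholds are made up."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    mean = sum(mags) / len(mags)
    var = sum((m - mean) ** 2 for m in mags) / len(mags)
    if var < 0.05:
        return "stationary"
    if var < 1.0:
        return "walking"
    return "running"

print(activity_from_accel([(0.0, 0.0, 9.8), (0.1, 0.0, 9.7), (0.0, 0.1, 9.8)]))
# -> "stationary"
```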