I don't believe video generation can sync nonverbal communication this well: the shrug, eye movement, facial expressions, etc. are all perfectly synced with the voice. As I said, I think it's conditioned on some real footage, perhaps somewhat like ControlNet.
bonoboTP|10 months ago