schopra909 | 1 month ago
We can update the code over the next day or two to add an option to delete the VAE after the text encoding is computed (to save on RAM). We'll then report the GB consumed for 360p and 720p at 2-5 seconds on GitHub so there are more accurate numbers.
Beyond the 10 GB from the T5, there's just a lot of VRAM taken up by the context window of 720p video (even though the model itself is 2B parameters).
storystarling | 1 month ago
Have you tried quantizing the T5? In my experience you can usually run these encoders in 8-bit or even 4-bit with negligible quality loss. Dropping that memory footprint would make this much more viable for consumer hardware.
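A rough sketch of the savings, for anyone following along. It assumes the ~10 GB figure mentioned above corresponds to about 5B parameters stored at 16 bits each; that parameter count is my assumption, not something stated in the thread, and real quantized checkpoints carry some extra overhead for scales and any layers kept in higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~5e9 parameters, chosen so that the 16-bit footprint
# matches the ~10 GB figure quoted in this thread.
ASSUMED_T5_PARAMS = 5e9

for bits in (16, 8, 4):
    gb = weight_memory_gb(ASSUMED_T5_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

Under those assumptions, 8-bit roughly halves the encoder footprint and 4-bit roughly quarters it.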
schopra909 | 1 month ago
The 2B parameters will take up about 4 GB of memory, but activations will take a lot more given the size of the context window for video.
A 720p 5-second video is roughly 100K tokens of context.
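A back-of-the-envelope sketch of both numbers. The 4 GB figure follows if each weight is stored in 2 bytes (fp16/bf16). The token count depends on settings not given here, so the frame rate, VAE compression factors, and patch size below are illustrative assumptions only, picked because they land near 100K; they are not confirmed values for this model.

```python
def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def video_tokens(width: int, height: int, seconds: int,
                 fps: int = 24,           # assumed frame rate
                 spatial_down: int = 8,   # assumed VAE spatial compression
                 temporal_down: int = 4,  # assumed VAE temporal compression
                 patch: int = 2) -> int:  # assumed transformer patch size
    """Rough count of tokens the transformer attends over for one clip."""
    latent_frames = (seconds * fps) // temporal_down
    tokens_per_frame = (width // (spatial_down * patch)) * (height // (spatial_down * patch))
    return latent_frames * tokens_per_frame


print(f"2B params at 2 bytes each: ~{weights_gb(2e9):.0f} GB")
print(f"720p, 5 s: ~{video_tokens(1280, 720, 5):,} tokens")
```

With these assumed settings a 720p, 5-second clip comes out to 108,000 tokens, in line with the rough 100K above.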
schopra909 | 1 month ago
When we started down this path, T5 was the standard (back in 2024).
It likely won’t be the text encoder for subsequent models, given its size (per your point) and age.