top | item 46318786

(no title)

Isn't finetuning the point of the T5 style models, since they perform better for smaller parameter counts?

discuss

It’ll be a major pain in the ass to replicate exactly what they did to make it long context and multimodal. Sucks too because the smol Gemma 3s with same parameter count were neither.

jeffjeffbear|2 months ago

> https://huggingface.co/google/t5gemma-2-1b-1b

From here it looks like it still is long context and multimodal though?

>Inputs and outputs Input:

Text string, such as a question, a prompt, or a document to be summarized

Images, normalized to 896 x 896 resolution and encoded to 256 tokens each

Total input context of 128K tokens Output:

Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document

Total output context up to 32K tokens