jeffjeffbear|2 months ago
Isn't finetuning the point of the T5 style models, since they perform better for smaller parameter counts?
refulgentis|2 months ago
It’ll be a major pain in the ass to replicate exactly what they did to make it long context and multimodal. Sucks too because the smol Gemma 3s with same parameter count were neither.
jeffjeffbear|2 months ago
> https://huggingface.co/google/t5gemma-2-1b-1b
From here it looks like it still is long context and multimodal though?
> Inputs and outputs
> Input:
> Text string, such as a question, a prompt, or a document to be summarized
> Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
> Total input context of 128K tokens
> Output:
> Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
> Total output context up to 32K tokens