aconz2 | 1 year ago
Is this accurate? I thought, for example, that Gemini Pro used image tokens, and GPT-4o something similar.
> without the need for separate image/text encoders
But then they say they pre-trained two different tokenizers, so maybe they just mean that the tokens go into the same attention layers? But then I thought that's how all the multi-modal stuff was happening already?
Two typos: "stabilitize" and "multiplicate".
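One way to read "tokens go into the same attention layer" is the early-fusion setup: a text tokenizer and an image tokenizer each map into one shared vocabulary, and the resulting IDs are concatenated into a single sequence for one transformer. A minimal sketch of that idea, where the tokenizers are crude stand-ins and all names and sizes are illustrative assumptions:

```python
# Hypothetical early-fusion sketch: two tokenizers, one shared vocabulary,
# one flat token sequence. The real tokenizers (BPE for text, a VQ codebook
# for images) are replaced with trivial stand-ins here.

TEXT_VOCAB_SIZE = 65536      # assumed text (BPE) vocabulary size
IMAGE_CODEBOOK_SIZE = 8192   # assumed image (VQ) codebook size

def text_tokenize(text):
    # Stand-in for a real BPE tokenizer: one ID per character.
    return [ord(c) % TEXT_VOCAB_SIZE for c in text]

def image_tokenize(pixels):
    # Stand-in for a VQ image tokenizer: quantize each value to a codebook ID,
    # then offset into the image region of the shared vocabulary.
    return [TEXT_VOCAB_SIZE + (p % IMAGE_CODEBOOK_SIZE) for p in pixels]

def build_sequence(text, pixels):
    # Early fusion: one flat sequence of IDs; the same attention layers
    # attend over text and image tokens alike, and the model can also
    # *sample* IDs from the image region to generate images.
    return text_tokenize(text) + image_tokenize(pixels)

seq = build_sequence("a cat", [12, 9001, 300])
# Every ID lives in one shared vocabulary.
assert all(0 <= t < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE for t in seq)
```

The point of the sketch is only the shared ID space: after tokenization, nothing downstream distinguishes image tokens from text tokens except their position in the vocabulary.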
marcinzm | 1 year ago
huac | 1 year ago
Other work lets the model learn the "tokenization" more explicitly during training. That's more similar to Adept's Fuyu architecture, which I am personally a fan of, but it also does not enable generating images as output.
You can generate images with late fusion as well, though I am not aware of other public work that discloses both early fusion and image generation.
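For contrast, a Fuyu-style input path projects raw image patches linearly into the transformer's embedding space, so there is no discrete image vocabulary to sample from, which is one reason that input style does not by itself enable image generation. A toy sketch, with all shapes and the projection matrix purely illustrative:

```python
# Hypothetical Fuyu-style patch input: a flattened patch is mapped by a
# linear projection directly to a continuous embedding. No codebook, no
# image token IDs. Shapes and weights below are made up for the sketch.

D_MODEL = 8          # assumed transformer width
PATCH_DIM = 4        # assumed flattened patch size

# A fixed "learned" projection matrix (small constants stand in for weights).
W = [[((i + j) % 3) * 0.1 for j in range(D_MODEL)] for i in range(PATCH_DIM)]

def project_patch(patch):
    # patch: list of PATCH_DIM floats -> continuous D_MODEL-dim embedding.
    return [sum(patch[i] * W[i][j] for i in range(PATCH_DIM))
            for j in range(D_MODEL)]

emb = project_patch([0.1, 0.2, 0.3, 0.4])
assert len(emb) == D_MODEL
# The embedding enters attention directly; since image inputs never pass
# through a discrete vocabulary, there is nothing to sample when decoding.
```

The design trade-off the thread is circling: discrete image tokens (early fusion with a shared vocabulary) make image generation a by-product of next-token prediction, while continuous patch projections keep the input pipeline simpler but output-side image generation needs a separate mechanism.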
mountainriver | 1 year ago