top | item 41494734

(no title)

darknoon | 1 year ago

this is somewhat similar, but diffusion transformers typically use a pre-trained text model as the text conditioning whereas, in this case it's integrated and trained together multimodally.

discuss

No comments yet.