item 41494734 | darknoon | 1 year ago
This is somewhat similar, but diffusion transformers typically use a pre-trained text model for the text conditioning, whereas in this case it's integrated and trained together multimodally.