MathYouF | 3 years ago
The moment everyone knew this was going to be big was in 2019, when StyleGAN came out. It relied on a lot of tricks, like aligning face features (such as eyes) and keeping all its training images in a single domain (the most famous being faces), but nonetheless that was the moment everyone in the AI field knew this was going to be big, and so three years ago a lot of big names shifted to this line of research.
The four main innovations since then have been:
1. Transformers
Generalized computation kernels that let a model consider non-local relationships between pixels of an image. Released in 2017, and originally used for language.
2. Pixel Patch Encodings
Encodings of semantic and geometric image information at different resolutions, which represent relationships between image regions better than raw pixels can for the same compute. This is what makes it practical to use Transformers on high-resolution images.
3. CLIP
Contrastive Language–Image Pre-training. Before, the only way we knew to classify an image was as a "face" or "cat" or "ramen". When the genius idea of labeling images with semantically meaningful vectors rather than one-hot encoded classes was revealed, it changed everything in computer vision very quickly, and problems that used to be hard became trivial. Released in 2021.
4. Diffusion Models
GANs penalise you for making an image that does not seem to belong to an existing dataset, which encourages producing the lowest-quality image that still looks like a member of that dataset. Diffusion instead learns to denoise an image; removing noise is perceptually similar to increasing resolution, and people like images that look that way. People with better intuition about diffusion models may be able to add more on why they're superior; I've read all the papers leading up to the latest unCLIP (DALL-E 2), but it's complicated. Released in 2020, with major improvements to the training process made continuously since then.
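To make point 1 concrete, here's a minimal numpy sketch of scaled dot-product self-attention, the core Transformer operation that lets every position attend to every other position (toy shapes only; real models add learned projections and multiple heads):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every token attends to every other
    token, so relationships are not restricted to local neighborhoods
    the way convolution kernels are."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted mix of value vectors

# Toy example: 4 "pixel" tokens with 8-dim features, attending to themselves
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = attention(x, x, x)
print(out.shape)   # (4, 8)
```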
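For point 2, the standard form of this is the ViT patch embedding: chop the image into non-overlapping patches and treat each flattened patch as one token. A minimal sketch (the learned linear projection to the model dimension is omitted):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.
    A 224x224 image becomes 196 patch tokens instead of 50,176 pixel
    tokens, which is what makes attention affordable at high resolution."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)            # (h, w, patch, patch, C)
    return img.reshape(-1, patch * patch * C)     # (num_patches, patch_dim)

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)   # (196, 768)
```

Each 768-dim row would then be linearly projected and fed to the Transformer as one token.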
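For point 4, the training objective really is that simple: mix a known amount of Gaussian noise into an image and train a network to predict that noise. A toy numpy sketch of one noising step, with its exact inversion standing in for a perfect noise predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "image" standing in for pixel data
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))

alpha_bar = 0.7                   # fraction of signal surviving at this step
eps = rng.normal(size=x0.shape)   # the noise the network is trained to predict
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# Given a perfect noise prediction, the clean image is recovered exactly:
x0_hat = (x_t - np.sqrt(1 - alpha_bar) * eps) / np.sqrt(alpha_bar)
print(np.allclose(x0_hat, x0))   # True
```

A real model does this over many noise levels, so sampling walks from pure noise back to a clean image one denoising step at a time.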
Hope this was helpful. All of the above were only implemented for images in any real way in the last three years. Putting them all together is something many groups did only this year, resulting in DALL-E, Stable Diffusion, and Imagen.
I'm working on doing this for 3D, and later for use cases in AR. 3D generation still hasn't been cracked the way images have, but the above will likely contribute to that solution as well. Anyone who's interested in working on that, feel free to message me.
astrange|3 years ago
The models behind Imagen and StableDiffusion are actually simpler than DALLE2, and both are higher quality (SD of course isn’t always since it’s much smaller). That suggests DALLE3 will also be simpler again.
There’s also been very recent work with generalized diffusion models (that use problems other than noise removal and still work) and Google researchers have been tweeting results from a merged Imagen/Parti in the last few days.
nuccy|3 years ago
P.S. It seems raccoons are unimaginable (even for AI) with any sunglasses: if photo-realistic mode is selected for a raccoon, changing to "wearing a sunglasses and" makes no difference :)
MathYouF|3 years ago
The models are a product of their datasets, specifically the relationship between the images and prompts via CLIP. CLIP puts both images and text into a shared coordinate space; imagine just a 2D graph. It tries to ensure that any real image and its caption will each be the other's closest neighbor in that coordinate space.
So if you want a certain image, you have to ask, "What caption would most likely, and most uniquely, be given to the image I'm imagining?"
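That nearest-neighbor picture can be sketched in a few lines of numpy; the 2-D embeddings here are made up purely for illustration:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Pretend 2-D CLIP space: matched image/caption pairs land close
# together, mismatched pairs land far apart.
image_embeds = normalize(np.array([[1.0, 0.1],    # photo of a raccoon
                                   [0.1, 1.0]]))  # photo of ramen
text_embeds  = normalize(np.array([[0.9, 0.2],    # "a raccoon"
                                   [0.2, 0.9]]))  # "a bowl of ramen"

sims = image_embeds @ text_embeds.T   # cosine similarities
# Each image's nearest caption is its own, and vice versa:
print(sims.argmax(axis=1))   # [0 1]
print(sims.argmax(axis=0))   # [0 1]
```

Training pushes the diagonal of that similarity matrix up and everything else down, which is why a well-chosen caption pulls generation toward the image you have in mind.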
I'm sure this advice is way less helpful than what you'd find in the prompt-engineering Discord channels and guides I've seen.
themoonisachees|3 years ago
MathYouF|3 years ago
Secondly, there's vastly more labeled image data in the world than 3D data, so creating a CLMP (contrastive language and mesh pairing) model is harder.
It's very late but I may be able to give a much better answer on more of the nuances of 3D generation tomorrow.
PoignardAzur|3 years ago
I knew about transformers, CLIP and diffusion, but pixel patch encodings are new to me.
Can you give me more details / point me towards an explainer? A quick duckduckgo search didn't help.
johnthewise|3 years ago
1: https://arxiv.org/pdf/2010.11929.pdf