top | item 32590482


MathYouF | 3 years ago

Like anyone deep in a field, I know maybe several thousand people who could give a better answer, but I'll make an effort to provide one since I don't see any good ones posted yet.

The moment everyone knew this was going to be big was in 2019, when StyleGAN came out. It used a lot of tricks, like aligning face features (such as eyes), and kept all its pictures to a single domain (the most famous being faces), but nonetheless that was the moment everyone in the AI field knew this was going to be big, and a lot of big names shifted to this line of research three years ago.

The four main innovations since then have been:

1. Transformers

Generalized computation kernels that let a model consider non-localised relationships between pixels of an image, rather than only local neighborhoods. Released in 2017, and originally used for language.
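
To make "non-localised relationships" concrete, here's a minimal numpy sketch of single-head self-attention over a flattened toy image (random weights stand in for trained ones): every pixel's output is a weighted mix of every other pixel, with no locality constraint.

```python
import numpy as np

# Toy self-attention over a flattened 4x4 "image": every pixel attends
# to every other pixel, so relationships are global, not just local.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(16, 8))          # 16 pixels, 8-dim features each

Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = pixels @ Wq, pixels @ Wk, pixels @ Wv

scores = Q @ K.T / np.sqrt(8)              # (16, 16) pixel-to-pixel affinities
scores -= scores.max(axis=1, keepdims=True)  # numerical stability for softmax
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

out = weights @ V                          # each output mixes all 16 pixels
```

The (16, 16) score matrix is the point: a convolution only sees a small window, while attention scores every pixel pair.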

2. Pixel Patch Encodings

Encodings of semantic and geometric image information at different resolutions, which represent relationships between image regions better than raw pixels can for the same compute. This is what allows using Transformers on high-resolution images.
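
The compute win is easy to see with numbers. A rough sketch (16x16 patches, the size used in ViT, is an assumption for illustration): attention cost is quadratic in sequence length, so going from pixels to patches shrinks the sequence from ~50k tokens to a couple hundred.

```python
import numpy as np

# A 224x224 RGB image as a sequence of individual pixels is 50,176
# tokens; cut into 16x16 patches it is only 196 tokens, which a
# Transformer's quadratic attention can handle at reasonable cost.
image = np.zeros((224, 224, 3))
p = 16

patches = image.reshape(224 // p, p, 224 // p, p, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * 3)

print(patches.shape)  # (196, 768): 196 patch tokens, 768 values each
```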

3. CLIP

Contrastive Language-Image Pre-training. Before, the only way we knew to classify an image was as a "face" or "cat" or "ramen". When the idea of labeling images with semantically meaningful vectors rather than one-hot encoded classes arrived, it changed everything in computer vision very quickly, and problems that used to be hard became trivial. Released in 2021.
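
A rough numpy sketch of the contrastive setup (random vectors stand in for the real encoders, which are an image Transformer and a text Transformer): matched image/caption pairs should sit on the diagonal of a similarity matrix, and training pushes the diagonal up and everything else down.

```python
import numpy as np

def normalize(v):
    # Unit-normalize so dot products are cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image_embeds = normalize(rng.normal(size=(3, 64)))  # 3 images, embedded
text_embeds = normalize(rng.normal(size=(3, 64)))   # their 3 captions, embedded

# CLIP's training objective: make similarity[i][i] (matched pairs) high
# and similarity[i][j] for i != j (mismatched pairs) low.
similarity = image_embeds @ text_embeds.T           # (3, 3) cosine similarities
best_caption = similarity.argmax(axis=1)            # caption each image picks
```

Once trained, "classification" is just comparing an image embedding against the embeddings of whatever text labels you like, which is why hard problems became trivial.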

4. Diffusion Models

GANs penalise you for making an image that does not seem to be part of an existing dataset, which encourages producing the lowest-quality image that still looks like a member of that dataset. Diffusion instead learns to denoise an image; removing noise is perceptually similar to increasing resolution, and people like images that look that way. People with better intuition about diffusion models may be able to add more on why they're superior; I've read all the papers leading up to the latest unCLIP (Dalle2), but it's complicated. Released in 2020, with major improvements to the training process made continuously since then.
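
The "learns to denoise" part can be sketched in a few lines. This shows only the forward (noising) side of DDPM-style diffusion with a standard linear beta schedule; the trained network's job is to predict the noise that was mixed in, which is the denoising direction.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(size=(8, 8))            # a toy "clean image"

# Linear noise schedule; alpha_bar[t] is how much clean signal survives
# after t noising steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

t = 500
noise = rng.normal(size=x0.shape)
# Closed-form forward process: blend clean image with Gaussian noise.
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# Training would minimize || model(xt, t) - noise ||^2; sampling runs
# the learned denoiser backwards from pure noise to an image.
```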

Hope this was helpful. All of the above were only implemented for images in any real way in the last three years. Putting them all together is something many groups did only this year, resulting in DallE, Stable Diffusion, and Imagen.

I'm working on doing this for 3D and later for use cases in AR. 3D generation still hasn't been cracked the same way images have, but the above will likely contribute to the solution there as well. Anyone who's interested in working on that, feel free to message me.

astrange|3 years ago

> I've read all the papers leading up to the latest unCLIP (Dalle2) but it's complicated. Released in 2020, with major improvements to the training process continuously being made since then.

The models behind Imagen and StableDiffusion are actually simpler than DALLE2, and both are higher quality (SD of course isn’t always since it’s much smaller). That suggests DALLE3 will also be simpler again.

There’s also been very recent work with generalized diffusion models (that use problems other than noise removal and still work) and Google researchers have been tweeting results from a merged Imagen/Parti in the last few days.

nuccy|3 years ago

Thanks for answering. Since you mentioned your work on text-to-3D: what are the ways to make the image/3D model actually photo- (or rather reality-) realistic? Even the (presumably) hand-picked examples from Google on the linked page lack the support bars of the sunglasses, and include floating cups of wine with a base-less Eiffel Tower in the background.

P.S. It seems raccoons are unimaginable (even for AI) with any sunglasses: if photo-realistic mode is selected for a raccoon, changing to "wearing a sunglasses and" makes no difference :)

MathYouF|3 years ago

I know as much about how to get the best image outputs from text inputs as the person who designed an airport knows the best place to eat in it. The emergent properties of the system are a result of the data put into it, so I can only discuss the system itself, not what it ended up doing with the data in that system.

The models are a product of their datasets, specifically the relationship of images and prompts via CLIP. CLIP puts both images and text into a coordinate space; imagine just a 2D graph. It tries to ensure that for any real image and its caption, each will be the other's closest neighbor in that coordinate space.

So if you want a certain image, you have to ask "what caption would be most likely and most uniquely given to the image I'm imagining".
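
That "nearest neighbor in coordinate space" framing can be sketched as a retrieval: given an image embedding, the caption it "would get" is the closest caption embedding. Random vectors stand in for real CLIP outputs here.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
caption_embeds = normalize(rng.normal(size=(100, 64)))  # 100 candidate captions
image_embed = normalize(rng.normal(size=(64,)))         # the image you imagine

# With everything in one unit-normalized space, "which caption fits this
# image best" is just an argmax over cosine similarities.
best = int((caption_embeds @ image_embed).argmax())
```

Prompting is the same operation run in reverse: you pick the caption vector, and generation tries to land on an image whose embedding sits next to it.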

I'm sure this advice is far less helpful than what you'd find in the prompt-engineering Discord channels and guides I've seen.

themoonisachees|3 years ago

Is 3d a different problem, or a similar one but considerably harder? I'd expect the data encoding (vertices vs pixels) to change a bit about it but I'm not familiar enough to know.

MathYouF|3 years ago

Pixel values are discrete (width x height x 3 channels, each an integer from 0 to 255) while vertex positions are continuous, so that is one major difference.
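
To illustrate the discrete-vs-continuous point (a minimal sketch, not any particular 3D format): an image is a fixed-shape grid of integers, while a mesh is a variable-length set of real-valued vertices plus connectivity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Image: fixed shape, discrete 0-255 integer values.
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Mesh: continuous 3D vertex positions plus face connectivity; the
# number of vertices varies from shape to shape.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])  # one triangle over those vertices
```

The fixed grid is what lets image models predict a distribution per pixel; a generator for meshes has to handle continuous coordinates and variable structure at once.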

Secondly, there's vastly more labeled image data in the world than 3D data, so creating a CLMP (contrastive language and mesh pairing) model is harder.

It's very late but I may be able to give a much better answer on more of the nuances of 3D generation tomorrow.

PoignardAzur|3 years ago

Interesting!

I knew about transformers, CLIP and diffusion, but pixel patch encodings are new to me.

Can you give me more details / point me towards an explainer? A quick duckduckgo search didn't help.

johnthewise|3 years ago

I don't quite remember whether it was first used in the ViT paper [1], but it's a fairly straightforward idea. You treat the patches of an image like words in a sentence: reduce each patch (num_of_pixels x num_of_pixels) with a linear projection so that it can actually be processed and the sparse pixel information is compressed, add positional encodings to carry the location of each patch, and from that point on treat them the way language models treat words, with Transformers. Essentially, words are a human-constructed but information-dense representation of language, whereas images are quite sparse, because individual pixel values don't really change much of an image.

1: https://arxiv.org/pdf/2010.11929.pdf
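The steps above can be sketched in numpy (random matrices stand in for the learned projection and positional embeddings; patch size 16 and model dimension 128 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(size=(224, 224, 3))
p, d_model = 16, 128

# 1. Split into 16x16 patches and flatten each: 196 patches of 768 values.
patches = image.reshape(14, p, 14, p, 3).transpose(0, 2, 1, 3, 4).reshape(196, -1)

# 2. Linear projection down to the model dimension, discarding sparse
#    pixel-level detail.
W = rng.normal(size=(p * p * 3, d_model)) * 0.02
tokens = patches @ W                       # (196, d_model)

# 3. Add positional embeddings (learned in practice, random here) so the
#    Transformer knows where each patch came from.
pos = rng.normal(size=(196, d_model)) * 0.02
tokens = tokens + pos
# From here, `tokens` is fed to a standard Transformer, exactly like a
# sequence of word embeddings.
```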