Jack000 | 2 years ago
This kind of reminds me of DALL-E 1, where the image is represented as 256 image tokens and then generated one token at a time. That approach is the most direct way to adapt a causal-LM architecture, but it clearly didn't make a lot of sense, because images don't have a natural top-to-bottom, left-to-right order.
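To make the token-at-a-time idea concrete, here is a toy sketch of that generation loop, with a random stand-in for the model (the grid and codebook sizes here are illustrative, not DALL-E's actual configuration; a real system would sample from a trained causal LM conditioned on the text prompt and all previous image tokens):

```python
import random

VOCAB_SIZE = 512   # toy codebook size (hypothetical, not DALL-E's)
GRID = 16          # 16x16 = 256 image tokens, as described above

def sample_token(prefix):
    # Stand-in for a causal LM's next-token distribution: a real model
    # would condition on the prompt plus every previously emitted token.
    rng = random.Random(hash(tuple(prefix)) & 0xFFFF)
    return rng.randrange(VOCAB_SIZE)

def generate_image_tokens():
    tokens = []
    for _ in range(GRID * GRID):
        # strictly left-to-right, top-to-bottom -- the imposed raster
        # order the comment above is objecting to
        tokens.append(sample_token(tokens))
    return tokens

tokens = generate_image_tokens()
```

The point of the sketch is the loop structure: each of the 256 tokens is sampled conditioned only on the tokens before it in an arbitrary raster order, even though neighboring image patches above and below are just as relevant as the one to the left.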
For vector graphics, the closest analogue to pixel-wise convolution would be the Minkowski sum. I wonder whether a Minkowski-sum-based diffusion model would work for SVG images.
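For readers unfamiliar with the operation: the Minkowski sum of two shapes A and B is {a + b : a in A, b in B}, which dilates A by B much as convolving an image with a kernel spreads mass around each pixel. A minimal sketch for the convex-polygon case (where the sum is just the convex hull of all pairwise vertex sums; general polygons need a decomposition step this sketch omits):

```python
from itertools import product

def convex_hull(points):
    # Andrew's monotone chain; returns hull vertices in CCW order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def minkowski_sum(poly_a, poly_b):
    # Convex polygons only: the sum is the hull of all pairwise vertex sums.
    sums = [(ax + bx, ay + by) for (ax, ay), (bx, by) in product(poly_a, poly_b)]
    return convex_hull(sums)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
kernel = [(0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5)]  # small diamond "kernel"
dilated = minkowski_sum(square, kernel)  # octagon: square with beveled corners
```

Summing the unit square with a small diamond rounds the square out into an octagon, the vector-graphics analogue of blurring an image with a small kernel.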
Jack000 | 2 years ago
You could start off with a random polygon and the reverse diffusion process would slowly turn it into a text glyph.