shaileshm | 2 years ago
"We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh."
This idea is simply beautiful and so obvious in hindsight.
"To define the tokens to generate, we consider a practical approach to represent a mesh M for autoregressive generation: a sequence of triangles."
More from the paper. Just so cool!
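As a rough sketch of that triangles-as-a-sequence idea (a hypothetical helper with made-up conventions, not the paper's actual tokenizer): put each face into a canonical order, sort the faces, and flatten everything into one long sequence a model could generate autoregressively.

```python
import numpy as np

def mesh_to_sequence(vertices, faces):
    """Flatten a triangle mesh into one sequence of vertex coordinates.

    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    Each face contributes 9 numbers (3 vertices x 3 coordinates).
    """
    vertices = np.asarray(vertices, dtype=float)
    faces = np.asarray(faces, dtype=int)
    # Canonical ordering: sort indices within each face, then sort the faces
    # lexicographically, so the same mesh always yields the same sequence.
    faces = np.sort(faces, axis=1)
    faces = faces[np.lexsort(faces.T[::-1])]
    return vertices[faces].reshape(-1)  # (F * 9,) flat coordinate sequence

verts = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
tris = [[2, 0, 1], [0, 3, 1]]
seq = mesh_to_sequence(verts, tris)
print(seq.shape)  # one 9-number group per triangle -> (18,)
```

The canonical ordering matters because a mesh has no inherent face order; fixing one removes ambiguity from the sequence the model has to learn.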
legel|2 years ago
Here's what I find really compelling in this field (it's my profession):
This has me star-struck lately -- 3D meshing from a single image, a very large 3D reconstruction model trained on millions of all kinds of 3D models... https://yiconghong.me/LRM/
_hark|2 years ago
Quantized embeddings are just that: embeddings, but with some discrete structure introduced into the NN so that the representations there are not continuous. A typical way to do this these days is to learn a codebook, VQ-VAE style. Basically, we take some intermediate continuous representation learned in the normal way and replace it in the forward pass with the nearest "quantized" code from our codebook. This biases the learning since we can't differentiate through the lookup, so in the backward pass we just pretend we never took the quantization step (the "straight-through" estimator), but it seems to work well. There's a lot more that could be said about why one might want to do this: the value of discrete vs. continuous representations, efficiency, modularity, etc.
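A minimal numpy sketch of that forward-pass lookup (made-up shapes and codebook values, not from any particular implementation; the gradient trick isn't shown since it only matters in the backward pass):

```python
import numpy as np

def quantize(z, codebook):
    """Replace each continuous embedding with its nearest codebook entry.

    z: (N, D) continuous embeddings; codebook: (K, D) learned codes.
    Returns the quantized embeddings and the chosen code indices.
    """
    # Squared Euclidean distance from every embedding to every code: (N, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)   # index of the nearest code per embedding
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.2], [0.9, 1.1]])
z_q, idx = quantize(z, codebook)
print(idx)  # -> [0 1]
```

In training, the decoder would see `z_q` while gradients flow back to `z` as if the lookup were the identity; the codebook itself is typically updated toward the encoder outputs assigned to each code.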
godelski|2 years ago
Do we have strong evidence that other models don't scale, or have we just put more time into transformers?
Convolutional ResNets seem to scale on vision and language: (cv) https://arxiv.org/abs/2301.00808, (cv) https://arxiv.org/abs/2110.00476, (nlp) https://github.com/HazyResearch/safari
MLPs also seem to scale: (cv) https://arxiv.org/abs/2105.01601, (cv) https://arxiv.org/abs/2105.03404
I don't see a strong reason to turn away from attention either, but I also don't think anyone has thrown a billion-parameter MLP or conv model at a problem. We've put a lot of work into attention, transformers, and scaling them (thousands of papers each year!), and we definitely don't see that effort for other architectures. The "ResNet Strikes Back" paper is great for one reason in particular: it reminds us not to get lost in the hype, because our advancements are coupled. We've learned a lot of training techniques since the original ResNet days, and applying them to ResNets makes them much better too, really closing the gap, at least in vision (where I research). It's easy to get railroaded in research when we have publish-or-perish and hype-driven reviewing.
fjkdlsjflkds|2 years ago
The difference from a "normal" convolution is that you can have arbitrary connectivity in the graph, rather than the usual connectivity induced by a regular Euclidean grid, but the underlying idea is the same. To compute the result of the operation at any single place (i.e., node), you perform a linear operation over that node and its neighbourhood (i.e., connected nodes), the same way that in a convolutional neural network you compute the value of a pixel from its own value and those of its neighbours when performing a convolution.
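A toy numpy version of that idea (hypothetical adjacency, features, and weights; a mean-aggregation variant rather than any specific GNN library's layer):

```python
import numpy as np

def graph_conv(x, adj, W):
    """One graph-convolution step: x: (N, D_in) node features,
    adj: (N, N) 0/1 adjacency matrix, W: (D_in, D_out) shared weights."""
    # Include each node itself in its neighbourhood (self-loop),
    # mirroring how a conv kernel covers the centre pixel too.
    a = adj + np.eye(len(x))
    deg = a.sum(axis=1, keepdims=True)   # neighbourhood sizes
    aggregated = (a @ x) / deg           # mean over node + neighbours
    return aggregated @ W                # same linear map at every node

# Triangle graph: every node connected to the other two.
adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]], dtype=float)
x = np.array([[1.0], [2.0], [3.0]])
W = np.array([[1.0]])
out = graph_conv(x, adj, W)
print(out.ravel())  # each node averages all three features -> [2. 2. 2.]
```

The key property is weight sharing: `W` is applied identically at every node, just as a conv kernel is applied identically at every pixel; only the neighbourhood structure changes.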