This is what a truly revolutionary idea looks like. There are so many details in the paper. Also, we know that transformers can scale. Pretty sure this idea will be used by a lot of companies to train the general 3D asset creation pipeline. This is just too great.
"We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh."
This idea is simply beautiful and so obvious in hindsight.
"To define the tokens to generate, we consider a practical approach to represent a mesh M for autoregressive generation: a sequence of triangles."
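To make the "sequence of triangles" idea concrete, here is a toy sketch of flattening a mesh into a discrete token sequence. This is my own illustration, not the paper's actual tokenizer — the paper works on learned quantized embeddings from graph convolutions rather than raw quantized coordinates — but it shows why a mesh can be fed to an autoregressive model at all:

```python
import numpy as np

def mesh_to_token_sequence(vertices, faces, n_bins=128):
    """Toy tokenizer: each triangle becomes 9 quantized coordinate
    tokens (3 vertices x 3 coords), and triangles are sorted so a
    mesh maps to one canonical sequence for autoregressive models."""
    v = np.asarray(vertices, dtype=np.float64)
    # Normalize into the unit cube, then quantize to n_bins levels.
    v = (v - v.min(axis=0)) / (np.ptp(v, axis=0).max() + 1e-9)
    q = np.clip((v * n_bins).astype(np.int64), 0, n_bins - 1)

    # One row of 9 integers per triangle.
    tris = q[np.asarray(faces)].reshape(len(faces), 9)

    # Lexicographic sort gives a canonical triangle ordering.
    order = np.lexsort(tris.T[::-1])
    return tris[order].reshape(-1)

# A unit square split into two triangles:
verts = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
faces = [[0, 1, 2], [0, 2, 3]]
tokens = mesh_to_token_sequence(verts, faces)
print(len(tokens))  # 2 triangles x 9 coordinates = 18 tokens
```

The canonical ordering matters: without it, the same mesh would correspond to many different sequences, which makes next-token prediction much harder to learn.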
It's cool, but it's also par for the course in 3D reconstruction today. I wouldn't describe this paper as particularly innovative or exceptional.
What do I think is really compelling in this field (given that it's my profession)?
This has me star-struck lately -- 3D meshing from a single image, a very large 3D reconstruction model trained on millions of all kinds of 3D models... https://yiconghong.me/LRM/
Another thing to note here is this looks to be around seven total days of training on at most 4 A100s. Not all really cutting edge work requires a data center sized cluster.
I mean, I don't see a strong reason to turn away from attention either, but I also don't think anyone has thrown a billion-parameter MLP or conv model at a problem. We've put a lot of work into attention, transformers, and scaling them. Thousands of papers each year! We definitely don't see that for other architectures. The ResNet Strikes Back paper is great partly because it reminds us all not to get lost in the hype, and that our advancements are coupled. We've learned a lot of training techniques since the original ResNet days, and applying those to ResNets makes them a lot better too, really closing the gap, at least in vision (where I research). It is easy to get railroaded in research when we have publish-or-perish and hype-driven reviewing.
As a machine learning engineer who dabbles with Blender and hobby gamedev, this is pretty impressive, but not quite to the point of being useful in any practical manner (as far as the limited furniture examples are concerned).
A competent modeler can make these types of meshes in under 5 minutes, and you still need to seed the generation with polys.
I imagine the next step will be to have the seed generation controlled by an LLM, and to start adding image models to the autoregressive parts of the architecture.
> A competent modeler can make these types of meshes in under 5 minutes.
I don't think this general complaint about AI workflows is that useful. Most people are not a competent <insert job here>. Most people don't know a competent <insert job here>, or can't afford to hire one. Even something that takes longer than a professional would, at worse quality, is for many things better than _nothing_, which is the realistic alternative for most people who would use something like this.
> A competent modeler can make these types of meshes in under 5 minutes
Sweet. Can you point me to these modelers who work on-demand and bill for their time in 5 minute increments? I’d love to be able to just pay $1-2 per model and get custom <whatever> dropped into my game when I need it.
> A competent modeler can make these types of meshes in under 5 minutes
It's not about competent modellers, any more than SD is for expert artists.
It's about giving tools to the non-experts. And also about freeing up those competent modellers to work on more interesting things than the 10,000 chair variants needed for future AAA games. They can work on making unique and interesting characters instead, or novel futuristic models that aren't in the training set and require real imagination combined with their expertise.
The mesh topology here would see these rejected as assets in basically any professional context. A competent modeler could make much higher quality models, more suited to texturing and deformation, in under five minutes. A speed modeler could make the same in under a minute. And a procedural system like Blender's geometry nodes can already spit out an endless variety of such models. But the pace of progress is staggering.
Just like a competent developer can use LLMs to bootstrap workflows, a competent modeler will soon have tools like this as part of their normal workflow. A casual user will be able to do things they otherwise wouldn't have been able to, but an expert in the ML model's knowledge domain can really make it shine.
I really believe that the more experienced you are in a particular use case, the more use you can get out of an ML model.
Unfortunately, it's those very same people who seem to be the most resistant to adopting this without really giving it the practice required to get somewhere useful with it. I suppose part of the problem is that we expect it to be a magic wand. But it's really just the new Photoshop, or Blender, or Microsoft Word, or PowerPoint...
Most people open those apps, click mindlessly for a bit, and promptly leave, never to return. And so it is with "AI".
I can imagine one use case in typical architectural design, where the architect creates a design and always faces this stumbling block when wanting to make it look as lively as possible: sprinkling lots of convincing assets everywhere.
Since they are generated, variations are much easier to come by than buying a couple of asset packs.
This is a very underrated comment. As with any tech demo, if they don't show it, it can't do it. It is very, very easy to imagine a generalization of these things to other purposes which, if the model could actually do it, would have made for a different presentation.
Perhaps one way to look at this could be auto-scaffolding. The typical modelling and CAD tools might include this feature to get you up and running faster.
Another massive benefit is composability. If the model can generate a cup and a table, it also knows how to generate a cup on a table.
Think of all the complex gears and machine parts this could generate in the blink of an eye, while staying relevant to the project, rotated and positioned exactly where you want them. Very similar to how GitHub Copilot works.
I don't see that LLMs have come much further in 3D animation than in programming in this regard: they can spit out bits and pieces that look okay in isolation, but a human needs to solve the puzzle. And often solving the puzzle means rewriting or redoing most of the pieces.
We're safe for now but we should learn how to leverage the new tech.
So you're probably familiar with the role of a bidding producer; imagine the difficulty they are facing: on one side they have filmmakers saying they just read that such-and-such is now created by AI, while that is news to the bidding producer, and their VFX/animation studio clients are scrambling as everything they do is new again.
I don't know; 3D CGI has already been moving at breakneck speed for the last three decades without any AI. Today's tools are qualitatively different (sculpting, simulation, auto-rigging, etc.).
It looks like the input is itself a 3D mesh? So the model is doing "shape completion" (e.g. they show generating a chair from just some legs)... or possibly generating "variations" when the input shape is more complete?
But I guess it's a starting point... maybe you could use another model that does worse quality text-to-mesh as the input and get something more crisp and coherent from this one.
It sure feels like every remaining hard problem (i.e., the ones where we haven't made much progress since the 90s) is in line to be solved by transformers in some fashion. What a time to be alive.
The next breakthrough will be the UX to create 3d scenes in front of a model like this, in VR. This would basically let you _generate_ a permanent, arbitrary 3D environment, for any environment for which we have training data.
Diffusion models could be used to generate textures.
edit edit: Maybe credit LeCun or something? Mark going all in on the metaverse was definitely not because he somehow predicted deep learning would take off. Even the people who trained the earliest models weren't sure how well it would work.
Even if this is “only” mesh autocomplete, it is still massively useful for 3D artists. There’s a disconnect right now between how characters are sculpted and how characters are animated. You’d typically need a time consuming step to retopologize your model. Transformer based retopology that takes a rough mesh and gives you clean topology would be a big time saver.
Another application: take the output of your gaussian splatter or diffusion model and run it through MeshGPT. Instant usable assets with clean topology from text.
Lol, "for 3D artists"? This will be used 99% by people who have never created a mesh by hand in their lives, to replace their need to hire a 3D artist: programmers who don't want to (or can't) pay a designer, architects who never learned anything other than CAD, Fiverr jobs, et al.
I don't think people here realize how we are inching toward automating the automation itself, and the programmers who will be able to make a living out of this will be a tiny fraction of those who can make a living out of it today.
What you have to understand is that these methods are very sensitive to what is in distribution and out of distribution. If you just plug in user data, it will likely not work.
Dang, this is getting so good! Still got a ways to go, with the weird edges, but at this point, that feels like 'iteration details' rather than an algorithmic or otherwise complex problem.
It's really going to speed up my pipeline to not have to pipe all of my meshes into a procgen library with a million little mesh modifiers hooked up to drivers. Instead, I can just pop all of my meshes into a folder, train the network on them, and then start asking it for other stuff in that style, knowing that I won't have to re-topo or otherwise screw with the stuff it makes, unless I'm looking for more creative influence.
Of course, until it's all the way to that point, I'm still better served by the procgen; but I'm very excited by how quickly this is coming together! Hopefully by next year's Unreal showcase, they'll be talking about their new "Asset Generator" feature.
Games, and pretty much any other experience, being generated by AI is obvious to anyone paying attention at this point. But how would it work? Are current AI-generated images and videos using rasterisation? Will they use rasterisation, path tracing, or any other traditional rendering technique, or will it be an entirely different thing?
I'm not a 3D artist, but why are we still, for lack of a better word, "stuck" with having / wanting to use simple meshes? I appreciate the simplicity, but isn't this an unnecessary limitation of mesh generation? It feels like an approach that imitates the constraints of having both limited hardware and artist resources. Shouldn't AI models help us break these boundaries?
Fantastic, but still useless from a professional perspective. E.g. a mesh that represents a cube as 12 triangles is a better representation of the form than previous efforts, but barely more usable.
Whilst it might not be the solution I'm waiting for, I can now see it as possible. If an AI model can handle triangles, it might handle edge loops and NURBS curves.
This is fantastic! You can sketch the broad strokes of the shape you want, and this will generate some "best" matches around that.
What I really appreciate about this is that they took the concept (transformers) and applied it in a quite different-from-usual domain. Thinking outside of the (triangulated) box!
So you train it with vector sequences that represent furniture and it predicts the next token (triangles). How is this different from ChatGPT being trained on the same sequences, outputting all the 3D locations and triangle sizes/lengths in sequence, and having a 3D program piece it together?
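Mechanically it isn't that different: the same next-token sampling loop works regardless of the vocabulary. A minimal sketch (all names here are illustrative, with a uniform distribution standing in for a trained model) of why the sampling machinery is shared while the tokenizer and decoder differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tokens(next_dist, vocab_size, n_tokens):
    """Generic autoregressive sampling: the loop is identical whether
    the vocabulary is subwords (a chatbot) or quantized mesh
    coordinates. What differs is the tokenizer, the learned model
    behind next_dist, and the decoder back to geometry."""
    seq = []
    for _ in range(n_tokens):
        probs = next_dist(seq)  # model's next-token distribution
        seq.append(int(rng.choice(vocab_size, p=probs)))
    return seq

# Stand-in "model": uniform over a 128-level coordinate vocabulary.
uniform = lambda seq: np.full(128, 1.0 / 128)

tokens = sample_tokens(uniform, vocab_size=128, n_tokens=18)
# Decode: every 9 tokens form one triangle (3 vertices x 3 coords).
triangles = np.array(tokens).reshape(-1, 3, 3)
print(triangles.shape)  # (2, 3, 3)
```

The practical difference the paper engineers is upstream of this loop: a vocabulary learned from mesh geometry and topology, which a text-trained chatbot's tokenizer has no notion of.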
Great work. But I don't get from the demo how it knows what object to autocomplete the mesh with - if you give it four posts as an input, how does it know to autocomplete as a table and not a dog?
So maybe the next step is something like CLIP, but for meshes? CLuMP?
It would be nice to see, and be part of, a field doing work that humans could not do, instead of creating work that just replaces what humans already know how to do.
First, you use the word "transformers" to mean "autoregressive models", they are not synonymous, second, this model beats Polygen on every metric, it's not even close.
godelski|2 years ago
Do we have strong evidence that other models don't scale or have we just put more time into transformers?
Convolutional resnets look to scale on vision and language: (cv) https://arxiv.org/abs/2301.00808, (cv) https://arxiv.org/abs/2110.00476, (nlp) https://github.com/HazyResearch/safari
MLPs also seem to scale: (cv) https://arxiv.org/abs/2105.01601, (cv) https://arxiv.org/abs/2105.03404
WhitneyLand|2 years ago
Being able to model something is very different from being able to do it with the fewest triangles and/or without losing detail.
sram1337|2 years ago
edit: Seems like mesh completion is the main input-output method, not just a neat feature.
j7ake|2 years ago
So much more refreshing than the dense abstract, intro, results paper style.
stuckinhell|2 years ago
Indie games already seem pretty derivative these days. I think this tech will kill them in the mid-term as big companies adopt it.