This is what a truly revolutionary idea looks like. There are so many details in the paper. Also, we know that transformers can scale. Pretty sure this idea will be used by a lot of companies to train the general 3D asset creation pipeline. This is just too great.
"We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh."
This idea is simply beautiful and so obvious in hindsight.
"To define the tokens to generate, we consider a practical approach to represent a mesh M for autoregressive generation: a sequence of triangles."
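To make the "sequence of triangles" idea concrete, here is a toy sketch of flattening a mesh into a discrete token sequence. This is my own illustration, not the paper's actual tokenizer — the paper works on learned quantized embeddings from graph convolutions rather than raw quantized coordinates — but it shows why a mesh can be fed to an autoregressive model at all:

```python
import numpy as np

def mesh_to_token_sequence(vertices, faces, n_bins=128):
    """Toy tokenizer: each triangle becomes 9 quantized coordinate
    tokens (3 vertices x 3 coords), and triangles are sorted so a
    mesh maps to one canonical sequence for autoregressive models."""
    v = np.asarray(vertices, dtype=np.float64)
    # Normalize into the unit cube, then quantize to n_bins levels.
    v = (v - v.min(axis=0)) / (np.ptp(v, axis=0).max() + 1e-9)
    q = np.clip((v * n_bins).astype(np.int64), 0, n_bins - 1)

    # One row of 9 integers per triangle.
    tris = q[np.asarray(faces)].reshape(len(faces), 9)

    # Lexicographic sort gives a canonical triangle ordering.
    order = np.lexsort(tris.T[::-1])
    return tris[order].reshape(-1)

# A unit square split into two triangles:
verts = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
faces = [[0, 1, 2], [0, 2, 3]]
tokens = mesh_to_token_sequence(verts, faces)
print(len(tokens))  # 2 triangles x 9 coordinates = 18 tokens
```

The canonical ordering matters: without it, the same mesh would correspond to many different sequences, which makes next-token prediction much harder to learn.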
It's cool, but it's also par for the course in 3D reconstruction today. I wouldn't describe this paper as particularly innovative or exceptional.
What do I think is really compelling in this field (given that it's my profession)?
This has me star-struck lately -- 3D meshing from a single image, a very large 3D reconstruction model trained on millions of all kinds of 3D models... https://yiconghong.me/LRM/
Another thing to note here is this looks to be around seven total days of training on at most 4 A100s. Not all really cutting edge work requires a data center sized cluster.
I mean, I don't see a strong reason to turn away from attention either, but I also don't think anyone has thrown a billion-parameter MLP or conv model at a problem. We've put a lot of work into attention, transformers, and scaling them. Thousands of papers each year! We definitely don't see that for other architectures. The ResNet Strikes Back paper is great partly because it reminds us all not to get lost in the hype, and that our advancements are coupled. We've learned a lot of training techniques since the original ResNet days, and applying those to ResNets makes them a lot better too, really closing the gap, at least in vision (where I research). It is easy to get railroaded in research when we have publish-or-perish and hype-driven reviewing.
As a machine learning engineer who dabbles with Blender and hobby gamedev, this is pretty impressive, but not quite to the point of being useful in any practical manner (as far as the limited furniture examples are concerned).
A competent modeler can make these types of meshes in under 5 minutes, and you still need to seed the generation with polys.
I imagine the next step will be to have the seed generation controlled by an LLM, and to start adding image models to the autoregressive parts of the architecture.
> A competent modeler can make these types of meshes in under 5 minutes.
I don't think this general complaint about AI workflows is that useful. Most people are not a competent <insert job here>. Most people don't know a competent <insert job here>, or can't afford to hire one. Even something that takes longer than a professional would, at worse quality, is for many things better than _nothing_, which is the realistic alternative for most people who would use something like this.
> A competent modeler can make these types of meshes in under 5 minutes
Sweet. Can you point me to these modelers who work on-demand and bill for their time in 5 minute increments? I’d love to be able to just pay $1-2 per model and get custom <whatever> dropped into my game when I need it.
> A competent modeler can make these types of meshes in under 5 minutes
It's not about competent modellers, any more than SD is for expert artists.
It's about giving tools to the non-experts. And also about freeing up those competent modellers to work on more interesting things than the 10,000 chair variants needed for future AAA games. They can work on making unique and interesting characters instead, or novel futuristic models that aren't in the training set and require real imagination combined with their expertise.
The mesh topology here would see these rejected as assets in basically any professional context. A competent modeler could make much higher quality models, more suited to texturing and deformation, in under five minutes. A speed modeler could make the same in under a minute. And a procedural system like Blender's geometry nodes can already spit out an endless variety of such models. But the pace of progress is staggering.
Just like a competent developer can use LLMs to bootstrap workflows, a competent modeler will soon have tools like this as part of their normal workflow. A casual user will be able to do things they otherwise wouldn't have been able to, but an expert in the ML model's knowledge domain can really make it shine.
I really believe that the more experienced you are in a particular use case, the more use you can get out of an ML model.
Unfortunately, it's those very same people who seem to be the most resistant to adopting this without really giving it the practice required to get somewhere useful with it. I suppose part of the problem is that we expect it to be a magic wand. But it's really just the new Photoshop, or Blender, or Microsoft Word, or PowerPoint...
Most people open those apps, click mindlessly for a bit, and promptly leave, never to return. And so it is with "AI".
I can imagine one use case in typical architectural design, where the architect creates a design and always faces this stumbling block when wanting to make it look as lively as possible: sprinkling lots of convincing assets everywhere.
Since they are generated, variations are much easier to come by than buying a couple of asset packs.
This is a very underrated comment. As with any tech demo, if they don't show it, it can't do it. It is very, very easy to imagine a generalization of these things to other purposes which, if the model could actually do it, would have made for a different presentation.
Perhaps one way to look at this could be auto-scaffolding. The typical modelling and CAD tools might include this feature to get you up and running faster.
Another massive benefit is composability. If the model can generate a cup and a table, it also knows how to generate a cup on a table.
Think of all the complex gears and machine parts this could generate in the blink of an eye, while staying relevant to the project, rotated and positioned exactly where you want them. Very similar to how GitHub Copilot works.
I don't see that LLMs have come much further in 3D animation than in programming in this regard: they can spit out bits and pieces that look okay in isolation, but a human needs to solve the puzzle. And often solving the puzzle means rewriting or redoing most of the pieces.
We're safe for now but we should learn how to leverage the new tech.
So you're probably familiar with the role of a bidding producer; imagine the difficulty they are facing: on one side they have filmmakers saying they just read that such-and-such is now created by AI, while that is news to the bidding producer, and their VFX/animation studio clients are scrambling as everything they do is new again.
I don't know; 3D CGI has already been moving at breakneck speed for the last three decades without any AI. Today's tools are qualitatively different (sculpting, simulation, auto-rigging, etc.).
It looks like the input is itself a 3D mesh? So the model is doing "shape completion" (e.g. they show generating a chair from just some legs)... or possibly generating "variations" when the input shape is more complete?
But I guess it's a starting point... maybe you could use another model that does worse quality text-to-mesh as the input and get something more crisp and coherent from this one.
It sure feels like every remaining hard problem (i.e., the ones where we haven't made much progress since the 90s) is in line to be solved by transformers in some fashion. What a time to be alive.
The next breakthrough will be the UX to create 3d scenes in front of a model like this, in VR. This would basically let you _generate_ a permanent, arbitrary 3D environment, for any environment for which we have training data.
Diffusion models could be used to generate textures.
edit edit: Maybe credit LeCun or something? Mark going all in on the metaverse was definitely not because he somehow predicted deep learning would take off. Even the people who trained the earliest models weren't sure how well it would work.
Even if this is “only” mesh autocomplete, it is still massively useful for 3D artists. There’s a disconnect right now between how characters are sculpted and how characters are animated. You’d typically need a time consuming step to retopologize your model. Transformer based retopology that takes a rough mesh and gives you clean topology would be a big time saver.
Another application: take the output of your gaussian splatter or diffusion model and run it through MeshGPT. Instant usable assets with clean topology from text.
Lol, "for 3D artists"? This will be used 99% by people who have never created a mesh by hand in their lives, to replace their need to hire a 3D artist: programmers who don't want to (or can't) pay a designer, architects who never learned anything other than CAD, Fiverr jobs, et al.
I don't think people here realize how we are inching toward automating the automation itself, and the programmers who will be able to make a living out of this will be a tiny fraction of those who can make a living out of it today.
What you have to understand is that these methods are very sensitive to what is in distribution and out of distribution. If you just plug in user data, it will likely not work.
Dang, this is getting so good! Still got a ways to go, with the weird edges, but at this point, that feels like 'iteration details' rather than an algorithmic or otherwise complex problem.
It's really going to speed up my pipeline to not have to pipe all of my meshes into a procgen library with a million little mesh modifiers hooked up to drivers. Instead, I can just pop all of my meshes into a folder, train the network on them, and then start asking it for other stuff in that style, knowing that I won't have to re-topo or otherwise screw with the stuff it makes, unless I'm looking for more creative influence.
Of course, until it's all the way to that point, I'm still better served by the procgen; but I'm very excited by how quickly this is coming together! Hopefully by next year's Unreal showcase, they'll be talking about their new "Asset Generator" feature.
Games, and pretty much any other experience, being generated by AI is obvious to anyone paying attention at this point. But how would it work? Are current AI-generated images and videos using rasterisation? Will they use rasterisation, path tracing, or any other traditional rendering technique, or will it be an entirely different thing?
I'm not a 3D artist, but why are we still, for lack of a better word, "stuck" with having / wanting to use simple meshes? I appreciate the simplicity, but isn't this an unnecessary limitation of mesh generation? It feels like an approach that imitates the constraints of having both limited hardware and artist resources. Shouldn't AI models help us break these boundaries?
Fantastic, but still useless from a professional perspective. E.g. a mesh that represents a cube as 12 triangles is a better representation of the form than previous efforts, but barely more usable.
Whilst it might not be the solution I'm waiting for, I can now see it as possible. If an AI model can handle triangles, it might handle edge loops and NURBS curves.
This is fantastic! You can sketch the broad strokes of the shape you want, and this will generate some "best" matches around that.
What I really appreciate about this is that they took the concept (transformers) and applied it in a quite different-from-usual domain. Thinking outside of the (triangulated) box!
So you train it with vector sequences that represent furniture and it predicts the next token (triangles). How is this different from ChatGPT being trained on the same sequences, outputting all the 3D locations and triangle sizes/lengths in sequence, and having a 3D program piece it together?
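Mechanically it isn't that different: the same next-token sampling loop works regardless of the vocabulary. A minimal sketch (all names here are illustrative, with a uniform distribution standing in for a trained model) of why the sampling machinery is shared while the tokenizer and decoder differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tokens(next_dist, vocab_size, n_tokens):
    """Generic autoregressive sampling: the loop is identical whether
    the vocabulary is subwords (a chatbot) or quantized mesh
    coordinates. What differs is the tokenizer, the learned model
    behind next_dist, and the decoder back to geometry."""
    seq = []
    for _ in range(n_tokens):
        probs = next_dist(seq)  # model's next-token distribution
        seq.append(int(rng.choice(vocab_size, p=probs)))
    return seq

# Stand-in "model": uniform over a 128-level coordinate vocabulary.
uniform = lambda seq: np.full(128, 1.0 / 128)

tokens = sample_tokens(uniform, vocab_size=128, n_tokens=18)
# Decode: every 9 tokens form one triangle (3 vertices x 3 coords).
triangles = np.array(tokens).reshape(-1, 3, 3)
print(triangles.shape)  # (2, 3, 3)
```

The practical difference the paper engineers is upstream of this loop: a vocabulary learned from mesh geometry and topology, which a text-trained chatbot's tokenizer has no notion of.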
Great work. But I don't get from the demo how it knows what object to autocomplete the mesh with - if you give it four posts as an input, how does it know to autocomplete as a table and not a dog?
So maybe the next step is something like CLIP, but for meshes? CLuMP?
It would be nice to see, and be part of, a field doing work that humans could not do, instead of creating work that just replaces what humans already know how to do.
First, you use the word "transformers" to mean "autoregressive models", they are not synonymous, second, this model beats Polygen on every metric, it's not even close.
godelski|2 years ago
Do we have strong evidence that other models don't scale or have we just put more time into transformers?
Convolutional resnets look to scale on vision and language: (cv) https://arxiv.org/abs/2301.00808, (cv) https://arxiv.org/abs/2110.00476, (nlp) https://github.com/HazyResearch/safari
MLPs also seem to scale: (cv) https://arxiv.org/abs/2105.01601, (cv) https://arxiv.org/abs/2105.03404
WhitneyLand|2 years ago
Being able to model something is very different from being able to do it with the fewest triangles and/or without losing detail.
sram1337|2 years ago
edit: Seems like mesh completion is the main input-output method, not just a neat feature.
j7ake|2 years ago
So much more refreshing than the dense abstract, intro, results paper style.
stuckinhell|2 years ago
Indie games already seem pretty derivative these days. I think this tech will kill them in the mid-term as big companies adopt it.