If this model is so good at estimating depth from a single image, shouldn't it also be able to take multiple images as input and produce an even better estimate? But from a bit of searching, it looks like this is meant to be single-image-to-3D only. I don't understand why it does not (cannot?) work with multiple images.
I also feel like a heavily multimodal model could be very nice for this: allow multiple images from various angles, optionally some true depth data even if imperfect (like what a basic phone LiDAR would output), and why not even photos of the same place from other sources at other times (just to gather more data). Based on that, generate a 3D scene you can explore, using generative AI to fill in plausible content where data is missing.
If you have multiple images you could use photogrammetry.
In the end, if you want to "fill in the blanks", an LLM will always "make up" stuff based on its training data.
With a technology like photogrammetry you can get much better results. Therefore, if you have images from multiple angles and don't really need to make anything up, it's better to use that instead.
I'm going to guess this is because the image-to-depth output, while good, is not perfectly accurate and therefore cannot serve as a shared ground truth between multiple images. At that point what you want is a more traditional structure-from-motion workflow, which already exists and does a decent job.
Tried a few random images and scenes, overall wasn't that impressive. Maybe I'm using the wrong kinds of input images or something, but for the most part once I moved more than a small amount, the rendering was mostly noise. To be fair, I didn't really expect much more.
Neat demo, but feels like things need to come quite a ways to make this interesting.
My understanding of JavaScript is cursory, but my reading of that webpage is that the UI is just smoke and mirrors: it is simply waiting for the whole thing to be processed in a single remote API call to some back-end system. If the back-end is down, it will always stop at 90%. The crawling progress bar is fake, with canned messages updated on Math.random() delays. Gives you something to look at, I guess, but seems a little misleading. Might be wrong ...
That's a pretty well-solved problem at this point, if you want to do it yourself. You'll want some kind of NeRF tool and a way to calculate the camera poses of the photos you took. COLMAP is the tool most people use for the latter.
I'd recommend trying Instant Neural Graphics Primitives (https://github.com/NVlabs/instant-ngp) from NVIDIA. It's a couple years old, so not state-of-the-art, but it runs on just about anything and is extremely fast.
Yeah, I think you're right. It does call out (in really tiny footer text) that it's leveraging ml-sharp.
It's pretty trivial to get running locally and generating the PLY files. Spark's a pretty good renderer for it after you've generated the gaussian splats.
It's funny, it always gets stuck at 90% until it fails with an error saying another big image may be keeping the server busy.
I mean, OK, it's a "demo", but the funny thing is that if you actually check the CLI and the requests, you can clearly see that the three stages the image walks through during "processing" are fake. It's just making one POST request to the backend, which runs while the UI traverses the stages, and at 90% it stops until (in theory) the request ends.
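The pattern described above (canned stage messages advanced on random timers, progress parked at 90% until the one real request returns) can be sketched in a few lines of JavaScript. All names here (`fakeProgress`, the stage strings) are hypothetical illustrations, not taken from the site's actual code:

```javascript
// Sketch of a "fake" progress bar: crawls upward through canned stage
// messages on a timer, never passes 90%, and only jumps to 100% once the
// single real backend request resolves.
function fakeProgress(backendRequest, onUpdate) {
  const stages = ["Analyzing image...", "Estimating depth...", "Generating splats..."];
  let pct = 0;
  let stage = 0;

  const timer = setInterval(() => {
    // Advance by a random amount, capped at 90% -- the UI has no real
    // progress signal to report until the request finishes.
    pct = Math.min(90, pct + Math.random() * 5);
    if (stage < stages.length - 1 && pct > 30 * (stage + 1)) stage++;
    onUpdate(pct, stages[stage]);
  }, 200);

  // The one real POST; everything above is theater.
  return backendRequest.then((result) => {
    clearInterval(timer);
    onUpdate(100, "Done");
    return result;
  });
}
```

The 90% cap is the tell: with no per-stage signal from the server, the front end has nowhere honest to go, so it parks just short of done and waits.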
This is the heavy lifting: https://github.com/apple/ml-sharp
Previous discussion: https://news.ycombinator.com/item?id=46284658
https://github.com/sparkjsdev/spark