stfwn | 5 years ago
The photogrammetry method could process ~80GB worth of 24MP photos into a micrometer-level accurate 3D model in about 8 hours, while the fastest NeRF implementations available took the same time to train a model on just 46 pictures at 0.2MP. A funny extrapolation from a handful of datapoints was that it would have taken 1406 hours or about two months to train a NeRF at a resolution of 24MP, assuming it would converge at all. PixelNeRF improves an aspect that was already great (the number of photos required) but does not seem to tackle this complexity problem.
Another problem is this: the representation of the learned scene is entirely abstract, contained within the weights of the neural networks that make up the NeRF. The space itself cannot be meaningfully inspected -- it must be probed and examined through its input/output pairs. The NeRF takes as its input a 3D location plus a viewing direction, and the output is a color radiance in that direction and a density (which depends only on the 3D location). So to generate a 2D image you emit camera rays into the NeRF from a specific hypothetical camera position, direction, focal length and sensor resolution, get the NeRF's output at many points along each camera ray, and compute an image from those samples (volume rendering).
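To make the probing concrete, here is a minimal numpy sketch of rendering one camera ray, with a hand-written toy field (`toy_field` is a hypothetical stand-in for the trained networks, not from any NeRF codebase):

```python
import numpy as np

def toy_field(points, view_dir):
    """Stand-in for a trained NeRF: maps 3D points plus a viewing
    direction to (rgb, density). Here: an opaque unit sphere around the
    origin, colored by position; density ignores the view direction."""
    dist = np.linalg.norm(points, axis=-1)
    density = np.where(dist < 1.0, 10.0, 0.0)      # opaque inside the sphere
    rgb = np.clip(0.5 + 0.5 * points, 0.0, 1.0)    # color from 3D location
    return rgb, density

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    """Volume rendering along one ray: sample the field at many depths,
    turn densities into per-segment opacities, alpha-composite front to back."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction       # (n_samples, 3)
    rgb, density = toy_field(points, direction)
    delta = (far - near) / n_samples               # approx. sample spacing
    alpha = 1.0 - np.exp(-density * delta)         # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)    # composited pixel color

# One ray from a camera at z = -3 looking down +z through the sphere:
color = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```

A full image repeats this for every pixel, which is why rendering cost scales with resolution times samples-per-ray.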
This is fine as long as the NeRF is available and there are no time constraints, but does not seem workable for real-time graphics rendering like in gaming/VR. So the NeRF should probably be rendered into a traditional 3D model ahead of time. Afaik this is an open problem that I've only seen solved by using a combination of marching cubes to extract the scene geometry and then rendering colors from normal vectors. In this process, continuity, spatial density and directional color radiance, three of the most important contributions of the NeRF design, are entirely lost.
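The extraction step can be sketched as sampling the NeRF's density on a regular grid; marching cubes (e.g. skimage.measure.marching_cubes) would then take that grid plus an iso-level and emit a triangle mesh. A numpy-only sketch of the sampling, with a toy sphere standing in for the network's density output (hypothetical, for illustration):

```python
import numpy as np

def density(points):
    """Hypothetical stand-in for the NeRF's density head: a unit sphere."""
    return np.where(np.linalg.norm(points, axis=-1) < 1.0, 10.0, 0.0)

# Sample the density on a regular 3D grid covering the scene volume.
n = 32
xs = np.linspace(-1.5, 1.5, n)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
density_grid = density(grid)            # (n, n, n) scalar field
occupied = density_grid > 1.0           # voxels the extracted mesh would enclose
```

Note what is already gone at this point: the grid is a discrete threshold of a continuous field, and the view-dependent color never enters the extraction at all.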
I would be very interested to see papers that tackle higher resolution spaces at feasible training times and faster novel view rendering times. It would be amazing to have NeRF-based graphics engines that can make up spaces out of layers of NeRFs, all probed in real-time.
fxtentacle|5 years ago
Also kind of related, the new UE5 game engine is introducing a novel in-GPU compression method which will allow them to handle more geometry. Not only in games but also in photogrammetry, memory tends to be one of the scarcest resources.
In summary, unless the AI uses relatively little memory and has relatively few layers, it won't stand a chance against traditional ways of handling geometry.
That said, the promise I see in NeRF and related methods is their ability to make up plausible things. For everyday objects, these techniques can learn to predict reasonably well how an apple would look if you rotated it. That is valuable for robotics, where you need to make sure that you still recognize your environment even after you drive around a corner.
stfwn|5 years ago
It works best if you play into the algorithm used to find the point correspondences. A commonly used one is SIFT [1]. It's a multi-step process where each step introduces some invariances, like scale invariance through convolution with Gaussian kernels at different standard deviations to create a 'scale space', then doing blob detection in that space by looking at second-derivative maxima and minima.
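A numpy-only sketch of that scale-space idea: blur at increasing standard deviations, take differences of adjacent scales (DoG, the usual approximation of the second-derivative Laplacian-of-Gaussian), and look for extrema across space and scale. This is only the blob-detection step of SIFT, not the full pipeline, and the helper names are made up:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable 2D Gaussian blur built from 1D convolutions."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, out)

# A tiny image with one bright blob.
img = np.zeros((64, 64))
img[30:34, 30:34] = 1.0

# Scale space: the same image blurred at geometrically increasing sigmas,
# then differenced pairwise (difference of Gaussians).
sigmas = [1.0, 1.6, 2.56, 4.1]
scales = [gaussian_blur(img, s) for s in sigmas]
dog = [b - a for a, b in zip(scales, scales[1:])]

# An extremum across space *and* scale marks a blob candidate; the scale
# index where it peaks says how large the blob is (the scale invariance).
stack = np.stack([np.abs(d) for d in dog])
scale_idx, y, x = np.unravel_index(stack.argmax(), stack.shape)
```

Here the strongest response lands on the blob's center, at whichever scale best matches its size.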
The matching process does a lot of convolution, which is linear (so you can combine a Gaussian and a Laplacian kernel and do both in one shot) and can be nicely parallelized. The 8 hours of processing ~80GB of 24MP images was on a GTX 1080.
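The "one shot" trick is just associativity of convolution: blurring with a Gaussian and then applying a Laplacian gives the same result as convolving once with the precombined Laplacian-of-Gaussian kernel. A 1D numpy demonstration:

```python
import numpy as np

# Small Gaussian and Laplacian kernels (1D for simplicity).
gauss = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
lap = np.array([1.0, -2.0, 1.0])
signal = np.random.default_rng(0).random(100)

# Two passes: blur, then second derivative.
two_pass = np.convolve(np.convolve(signal, gauss, mode="full"), lap, mode="full")

# One pass: combine the kernels first (Laplacian of Gaussian), convolve once.
log_kernel = np.convolve(gauss, lap, mode="full")
one_pass = np.convolve(signal, log_kernel, mode="full")
```

The two results are identical, but the one-pass version touches the image only once, which is what makes it attractive on a GPU.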
I wouldn't say that it's particularly slow considering the amount of data and complexity of the operations, but surely a speedup would be very welcome and useful. It would become much more accessible to game companies, movie studios and even industries that (afaik) don't make much use of 3D models yet -- perhaps archaeology or anthropology would jump at the opportunity of scanning and sharing super high res models.
[1]: https://en.wikipedia.org/wiki/Scale-invariant_feature_transf...
visarga|5 years ago
Say you pre-train a network to predict (r, g, b) = net(x, y). Then you fine-tune it to do something else, let's say, predict if a pixel is object or stuff.
Do you think the implicit model could encode in net_backbone(x, y) information about its context, like a CNN does? I mean, does it just learn point-wise or does it collect context information?
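A numpy sketch of the point-wise nature of such an implicit model: unlike a CNN, the forward pass sees exactly one coordinate at a time, so any context has to be baked into the shared weights during training rather than gathered through a receptive field (`net`, `W1`, `W2` are made-up names with random weights, just to show the shape of the computation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny random coordinate MLP net(x, y) -> (r, g, b).
W1, b1 = rng.normal(size=(2, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 3)), np.zeros(3)

def net(xy):
    """Forward pass; applies the same function independently per coordinate."""
    h = np.tanh(xy @ W1 + b1)
    return np.tanh(h @ W2 + b2)

pixel = np.array([0.25, 0.75])
batch = np.stack([pixel, np.array([0.9, 0.1])])

# No receptive field: the prediction for a pixel is identical whether it is
# evaluated alone or alongside other pixels.
alone = net(pixel)
in_batch = net(batch)[0]
```

So architecturally the answer is no; whatever "context" such models exhibit comes from the training signal shaping the shared weights, not from neighboring inputs at inference time.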