stfwn | 5 years ago
The photogrammetry method could process ~80GB worth of 24MP photos into a micrometer-level accurate 3D model in about 8 hours, while the fastest NeRF implementations available took the same time to train a model on just 46 pictures at 0.2MP. A funny extrapolation from a handful of datapoints was that it would have taken 1406 hours or about two months to train a NeRF at a resolution of 24MP, assuming it would converge at all. PixelNeRF improves an aspect that was already great (the number of photos required) but does not seem to tackle this complexity problem.
Another problem is this: the representation of the learned scene is entirely abstract, contained within the weights of the neural networks that make up the NeRF. The space itself cannot be meaningfully inspected -- it must be probed and examined through its input/output pairs. The NeRF takes as its input a 3D location plus a viewing direction, and the output is a color radiance in that direction and a density (which depends only on the 3D location). So to generate a 2D image you emit camera rays into the NeRF from a specific hypothetical camera position, direction, focal length and sensor resolution, get the NeRF's output at many points along each camera ray, and compute an image from those samples (volume rendering).
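To make the probing concrete, here is a minimal numpy sketch of rendering one camera ray, with a hand-written toy field (`toy_field` is a hypothetical stand-in for the trained networks, not from any NeRF codebase):

```python
import numpy as np

def toy_field(points, view_dir):
    """Stand-in for a trained NeRF: maps 3D points plus a viewing
    direction to (rgb, density). Here: an opaque unit sphere around the
    origin, colored by position; density ignores the view direction."""
    dist = np.linalg.norm(points, axis=-1)
    density = np.where(dist < 1.0, 10.0, 0.0)      # opaque inside the sphere
    rgb = np.clip(0.5 + 0.5 * points, 0.0, 1.0)    # color from 3D location
    return rgb, density

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    """Volume rendering along one ray: sample the field at many depths,
    turn densities into per-segment opacities, alpha-composite front to back."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction       # (n_samples, 3)
    rgb, density = toy_field(points, direction)
    delta = (far - near) / n_samples               # approx. sample spacing
    alpha = 1.0 - np.exp(-density * delta)         # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)    # composited pixel color

# One ray from a camera at z = -3 looking down +z through the sphere:
color = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```

A full image repeats this for every pixel, which is why rendering cost scales with resolution times samples-per-ray.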
This is fine as long as the NeRF is available and there are no time constraints, but does not seem workable for real-time graphics rendering like in gaming/VR. So the NeRF should probably be rendered into a traditional 3D model ahead of time. Afaik this is an open problem that I've only seen solved by using a combination of marching cubes to extract the scene geometry and then rendering colors from normal vectors. In this process, continuity, spatial density and directional color radiance, three of the most important contributions of the NeRF design, are entirely lost.
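The extraction step can be sketched as sampling the NeRF's density on a regular grid; marching cubes (e.g. skimage.measure.marching_cubes) would then take that grid plus an iso-level and emit a triangle mesh. A numpy-only sketch of the sampling, with a toy sphere standing in for the network's density output (hypothetical, for illustration):

```python
import numpy as np

def density(points):
    """Hypothetical stand-in for the NeRF's density head: a unit sphere."""
    return np.where(np.linalg.norm(points, axis=-1) < 1.0, 10.0, 0.0)

# Sample the density on a regular 3D grid covering the scene volume.
n = 32
xs = np.linspace(-1.5, 1.5, n)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
density_grid = density(grid)            # (n, n, n) scalar field
occupied = density_grid > 1.0           # voxels the extracted mesh would enclose
```

Note what is already gone at this point: the grid is a discrete threshold of a continuous field, and the view-dependent color never enters the extraction at all.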
I would be very interested to see papers that tackle higher resolution spaces at feasible training times and faster novel view rendering times. It would be amazing to have NeRF-based graphics engines that can make up spaces out of layers of NeRFs, all probed in real-time.
fxtentacle|5 years ago
Also kind of related, the new UE5 game engine is introducing a novel in-GPU compression method which will allow them to handle more geometry. Not only in games but also in photogrammetry, memory tends to be one of the scarcest resources.
In summary, unless the AI uses relatively little memory and has relatively few layers, it won't stand a chance against traditional ways of handling geometry.
That said, the promise I see in NeRF and related methods is their ability to make up plausible things. For everyday objects, these techniques can learn to predict reasonably well how an apple would look if you rotated it. That is valuable for robotics, where you need to make sure that you still recognize your environment even after you drive around a corner.
stfwn|5 years ago
It works best if you play into the algorithm used to find the point correspondences. A commonly used one is SIFT [1]. It's a multi-step process where each step introduces some invariances, like scale invariance through convolution with Gaussian kernels at different standard deviations to create a 'scale space', then doing blob detection in that space by looking at second-derivative maxima and minima.
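A numpy-only sketch of that scale-space idea: blur at increasing standard deviations, take differences of adjacent scales (DoG, the usual approximation of the second-derivative Laplacian-of-Gaussian), and look for extrema across space and scale. This is only the blob-detection step of SIFT, not the full pipeline, and the helper names are made up:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable 2D Gaussian blur built from 1D convolutions."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, out)

# A tiny image with one bright blob.
img = np.zeros((64, 64))
img[30:34, 30:34] = 1.0

# Scale space: the same image blurred at geometrically increasing sigmas,
# then differenced pairwise (difference of Gaussians).
sigmas = [1.0, 1.6, 2.56, 4.1]
scales = [gaussian_blur(img, s) for s in sigmas]
dog = [b - a for a, b in zip(scales, scales[1:])]

# An extremum across space *and* scale marks a blob candidate; the scale
# index where it peaks says how large the blob is (the scale invariance).
stack = np.stack([np.abs(d) for d in dog])
scale_idx, y, x = np.unravel_index(stack.argmax(), stack.shape)
```

Here the strongest response lands on the blob's center, at whichever scale best matches its size.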
The matching process does a lot of convolution, which is linear (so you can combine a Gaussian and a Laplacian kernel and do both in one shot) and can be nicely parallelized. The 8 hours of processing ~80GB of 24MP images was on a GTX 1080.
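The "one shot" trick is just associativity of convolution: blurring with a Gaussian and then applying a Laplacian gives the same result as convolving once with the precombined Laplacian-of-Gaussian kernel. A 1D numpy demonstration:

```python
import numpy as np

# Small Gaussian and Laplacian kernels (1D for simplicity).
gauss = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
lap = np.array([1.0, -2.0, 1.0])
signal = np.random.default_rng(0).random(100)

# Two passes: blur, then second derivative.
two_pass = np.convolve(np.convolve(signal, gauss, mode="full"), lap, mode="full")

# One pass: combine the kernels first (Laplacian of Gaussian), convolve once.
log_kernel = np.convolve(gauss, lap, mode="full")
one_pass = np.convolve(signal, log_kernel, mode="full")
```

The two results are identical, but the one-pass version touches the image only once, which is what makes it attractive on a GPU.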
I wouldn't say that it's particularly slow considering the amount of data and complexity of the operations, but surely a speedup would be very welcome and useful. It would become much more accessible to game companies, movie studios and even industries that (afaik) don't make much use of 3D models yet -- perhaps archaeology or anthropology would jump at the opportunity of scanning and sharing super high res models.
[1]: https://en.wikipedia.org/wiki/Scale-invariant_feature_transf...
visarga|5 years ago
Say you pre-train a network to predict (r, g, b) = net(x, y). Then you fine-tune it to do something else, let's say, predict if a pixel is object or stuff.
Do you think the implicit model could encode in net_backbone(x, y) information about its context, like a CNN does? I mean, does it just learn point-wise or does it collect context information?
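A numpy sketch of the point-wise nature of such an implicit model: unlike a CNN, the forward pass sees exactly one coordinate at a time, so any context has to be baked into the shared weights during training rather than gathered through a receptive field (`net`, `W1`, `W2` are made-up names with random weights, just to show the shape of the computation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny random coordinate MLP net(x, y) -> (r, g, b).
W1, b1 = rng.normal(size=(2, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 3)), np.zeros(3)

def net(xy):
    """Forward pass; applies the same function independently per coordinate."""
    h = np.tanh(xy @ W1 + b1)
    return np.tanh(h @ W2 + b2)

pixel = np.array([0.25, 0.75])
batch = np.stack([pixel, np.array([0.9, 0.1])])

# No receptive field: the prediction for a pixel is identical whether it is
# evaluated alone or alongside other pixels.
alone = net(pixel)
in_batch = net(batch)[0]
```

So architecturally the answer is no; whatever "context" such models exhibit comes from the training signal shaping the shared weights, not from neighboring inputs at inference time.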