NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

@article{mildenhall2021nerf,
  title={{NeRF}: Representing scenes as neural radiance fields for view synthesis},
  author={Mildenhall, Ben and Srinivasan, Pratul P and Tancik, Matthew and Barron, Jonathan T and Ramamoorthi, Ravi and Ng, Ren},
  journal={Communications of the ACM},
  volume={65},
  number={1},
  pages={99--106},
  year={2021},
  publisher={ACM New York, NY, USA}
}

Using a fully connected deep network (MLP) to learn a function

$$F_\Theta : (x, y, z, \theta, \phi) \longmapsto (\mathbf{c}, \sigma)$$

that produces, for a camera position $(x, y, z)$ and viewing angles $(\theta, \phi)$, an RGB color $\mathbf{c}$ and a volume density $\sigma$.

Similar to SIRENs

Introduction #

  • The network is trained on color images with their known camera positions and directions.

  • The network is always trained on only a single scene (one network per scene).

  • Previous approaches to novel view synthesis were not able to produce photorealistic results.

  • The viewing angles are used for determining the output color $\mathbf{c}$, since it can vary with specular reflections. The volume density $\sigma$ does not depend on the viewing angles.

  • To produce an image, a ray is marched through the volume for each pixel and the colors and densities along it are queried from the network. Front-to-back alpha blending then produces the final color of the pixel.

  • Since the volume rendering is itself differentiable, its result can be used in a loss function and gradient descent can be used to train the network.

  • A positional encoding is used for the 5D input of position and viewing angles.

  • The resulting scene representation (the network) is much smaller than a voxel representation and is furthermore not bound to a grid structure; NeRF’s scene representation is continuous.

  • NeRF outperforms prior neural 3D representations as well as approaches based on deep convolutional networks.

Neural 3D shape representations #

View synthesis and image-based rendering #

Neural Radiance Field Scene Representation #

  • In practice the viewing direction is not represented by $\theta$ and $\phi$ but rather by a Cartesian unit vector $\mathbf{d}$.

  • The network first predicts the volume density $\sigma$ based purely on the position.

    • 8 fully connected layers, 256 channels per layer, using ReLU, to output $\sigma$ and a 256-dimensional feature vector.
    • The feature vector is concatenated with the ray’s viewing direction and passed through one more fully connected layer (128 channels, ReLU) to output the view-dependent RGB color (a sketch follows after this list).
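
A minimal sketch of that architecture, assuming PyTorch (the official implementation uses TensorFlow) and the encoded input sizes from the positional encoding section below (60 for the position, 24 for the direction); the skip connection that the paper uses to re-inject the encoded position at the fifth layer is omitted here.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the MLP described above. Input sizes assume the positional
    encodings from the paper: 60 dims for the position, 24 for the direction."""

    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        # 8 fully connected ReLU layers with 256 channels, position only.
        layers, in_dim = [], pos_dim
        for _ in range(8):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        self.trunk = nn.Sequential(*layers)
        # Volume density sigma and a 256-dimensional feature vector.
        self.sigma_head = nn.Linear(width, 1)
        self.feature = nn.Linear(width, width)
        # Feature vector concatenated with the viewing direction, passed
        # through one 128-channel ReLU layer to get the view-dependent RGB.
        self.rgb_head = nn.Sequential(
            nn.Linear(width + dir_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))  # density never sees the direction
        rgb = self.rgb_head(torch.cat([self.feature(h), d_enc], dim=-1))
        return rgb, sigma
```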

Volume Rendering With Radiance Fields #

For a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ along direction $\mathbf{d}$ between the near bound $t_n$ and the far bound $t_f$, the resulting color is calculated by integrating the colors in the volume along the ray, weighted by the volume densities.

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt \quad \text{with} \quad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$

This integral is approximated on discrete intervals, effectively resulting in front-to-back alpha blending. The sample positions between $t_n$ and $t_f$ are not equidistant, so as not to limit the resolution of the scene’s representation. Instead, the ray is divided into evenly spaced bins and one sample is drawn uniformly from each bin (see the sketch below).
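
A minimal sketch of that quadrature, assuming PyTorch; `model` stands for the (positionally encoded) radiance field, and all names and the near/far bounds are illustrative. Each segment contributes an opacity $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$, and colors are blended front to back with the accumulated transmittance.

```python
import torch

def render_rays(model, rays_o, rays_d, t_near, t_far, n_samples=64):
    """Stratified sampling in evenly spaced bins followed by front-to-back
    alpha blending. `model` maps (points, directions) -> (rgb, sigma);
    positional encoding is omitted for brevity."""
    n_rays = rays_o.shape[0]
    # Evenly spaced bins between t_near and t_far, one uniform sample per bin.
    edges = torch.linspace(t_near, t_far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    t = lower + (upper - lower) * torch.rand(n_rays, n_samples)
    # Query the field at the sample points along each ray r(t) = o + t * d.
    pts = rays_o[:, None, :] + t[..., None] * rays_d[:, None, :]
    dirs = rays_d[:, None, :].expand_as(pts)
    rgb, sigma = model(pts, dirs)                 # (n_rays, n_samples, 3) and (..., 1)
    sigma = sigma[..., 0]
    # Segment lengths; the last segment is treated as (numerically) infinite.
    deltas = torch.cat([t[:, 1:] - t[:, :-1],
                        torch.full((n_rays, 1), 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)      # opacity of each segment
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): front-to-back blending.
    trans = torch.cumprod(
        torch.cat([torch.ones(n_rays, 1), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = trans * alpha
    return (weights[..., None] * rgb).sum(dim=1)  # final pixel colors, (n_rays, 3)
```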

Optimizing a Neural Radiance Field #

Uses hierarchical sampling (one coarse and one fine network). The coarse network is used to find the geometry, and the fine network is sampled near that geometry. The loss is calculated over both networks independently and both are optimized (a sketch of the fine-sample placement follows below).
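
A minimal sketch of the fine-sample placement, assuming PyTorch; the bin edges and compositing weights are assumed to come from a coarse pass like the rendering sketch above, and all names and shapes are illustrative. The normalized coarse weights are treated as a piecewise-constant PDF along the ray, and new sample positions are drawn by inverting its CDF.

```python
import torch

def sample_fine(edges, weights, n_fine=128):
    """Inverse transform sampling along the ray: `edges` are the coarse bin
    edges, shape (n_rays, n_coarse + 1), and `weights` the coarse compositing
    weights, shape (n_rays, n_coarse)."""
    # Normalize the coarse weights into a piecewise-constant PDF and build its CDF.
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-10)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)  # (n_rays, n_coarse + 1)
    # Draw uniform samples and find the bin whose CDF interval contains each one.
    u = torch.rand(cdf.shape[0], n_fine)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)
    cdf_lo, cdf_hi = torch.gather(cdf, -1, idx - 1), torch.gather(cdf, -1, idx)
    bin_lo, bin_hi = torch.gather(edges, -1, idx - 1), torch.gather(edges, -1, idx)
    # Place each sample proportionally inside its bin.
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-10)
    return bin_lo + frac * (bin_hi - bin_lo)       # (n_rays, n_fine) sample positions
```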

Positional encoding #

  • Needed for high-resolution complex scenes

  • Deep networks are biased toward learning lower-frequency functions.

    • To combat this, transform the input into a higher-dimensional space using high-frequency functions, and train the network on the transformed input (the transformed input is of course also used for inference later):
    $$\gamma(p) = \left(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\right)$$
    • $\gamma$ is then applied to each of the 3 components of the position $\mathbf{x}$ and each of the 3 components of the viewing direction $\mathbf{d}$.
    • $\mathbf{x}$ is normalized such that each component lies in $[-1, 1]$.
  • The paper uses $L = 10$ for the position $\gamma(\mathbf{x})$ and $L = 4$ for the viewing direction $\gamma(\mathbf{d})$.

    • This results in encoded inputs of size $3 \cdot 2 \cdot 10 = 60$ for the position and $3 \cdot 2 \cdot 4 = 24$ for the viewing direction (see the sketch below).
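
A minimal sketch of the encoding $\gamma$, assuming PyTorch; the sin/cos terms are grouped rather than interleaved as in the paper, which makes no difference to the network.

```python
import math
import torch

def positional_encoding(p, n_freqs):
    """Map each component of p to (sin(2^k * pi * p), cos(2^k * pi * p)) for
    k = 0 .. n_freqs - 1. For a (..., 3) input this returns (..., 3 * 2 * n_freqs)."""
    freqs = (2.0 ** torch.arange(n_freqs)) * math.pi        # 2^k * pi
    angles = p[..., None] * freqs                            # (..., 3, n_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)

# Hypothetical usage for a batch of positions normalized to [-1, 1]:
#   x = torch.rand(4096, 3) * 2 - 1
#   gamma_x = positional_encoding(x, n_freqs=10)    # shape (4096, 60)
#   gamma_d = positional_encoding(d, n_freqs=4)     # shape (4096, 24)
```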

Implementation details #

The loss function is just the total squared error over the resulting RGB colors for the rays $\mathbf{r}$ in the batch $\mathcal{R}$.

$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \left\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 + \left\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 \right]$$

The paper uses a batch size of 4096 rays, each sampled at $N_c = 64$ coarse and $N_f = 128$ fine positions, and the Adam optimizer.
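
A minimal sketch of this loss, assuming PyTorch; `c_coarse` and `c_fine` stand for the renderings $\hat{C}_c(\mathbf{r})$ and $\hat{C}_f(\mathbf{r})$ of the coarse and fine network (e.g. produced by the snippets above), and all names are illustrative.

```python
import torch

def nerf_loss(c_coarse, c_fine, c_true):
    """Total squared error over a batch of rays: both the coarse and the fine
    rendering are compared against the ground-truth pixel color, so the
    coarse network also keeps receiving gradients."""
    return (((c_coarse - c_true) ** 2).sum()
            + ((c_fine - c_true) ** 2).sum())

# Hypothetical usage with a batch of 4096 rays (tensors of shape (4096, 3)):
#   loss = nerf_loss(c_coarse, c_fine, target_rgb)
#   loss.backward()          # the volume rendering above is differentiable
#   optimizer.step()         # e.g. torch.optim.Adam with lr = 5e-4
```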

July 21, 2023 (Updated October 22, 2023)