A Critical Analysis of NeRF-Based 3D Reconstruction

Abstract: This paper presents a critical analysis of image-based 3D reconstruction using neural radiance fields (NeRFs), with a focus on quantitative comparisons with respect to traditional photogrammetry. The aim is, therefore, to objectively evaluate the strengths and weaknesses of NeRFs and provide insights into their applicability to different real-life scenarios, from small objects to heritage and industrial scenes. After a comprehensive overview of photogrammetry and NeRF methods, highlighting their respective advantages and disadvantages, various NeRF methods are compared using diverse objects with varying sizes and surface characteristics, including texture-less, metallic, translucent, and transparent surfaces. We evaluated the quality of the resulting 3D reconstructions using multiple criteria, such as noise level, geometric accuracy, and the number of required images (i.e., image baselines). The results show that NeRFs exhibit superior performance over photogrammetry in terms of non-collaborative objects with texture-less, reflective, and refractive surfaces. Conversely, photogrammetry outperforms NeRFs in cases where the object's surface possesses cooperative texture. Such complementarity should be further exploited in future works.


Introduction
In the fields of computer vision and photogrammetry, high-quality 3D reconstruction is an important topic with many applications, such as quality inspection, reverse engineering, structural monitoring, and digital preservation. Low-cost, portable, and flexible 3D measuring techniques that provide high geometric accuracy and high-resolution details have been in great demand for some years. Existing methods for 3D reconstruction can be broadly categorized as either contact or non-contact techniques [1]. In order to determine the precise 3D shape of an object, contact-based techniques often employ physical tools like a caliper or a coordinate measurement machine. While precise geometrical 3D measurements are feasible and well-suited for many applications, they do have some drawbacks, such as the length of time required to acquire data and perform sparse 3D reconstruction, the limitations of the measuring system, and/or the need for expensive instrumentation, which limits their use to specialized laboratories and projects with unique metrological specifications. Non-contact technologies, on the other hand, allow for accurate 3D reconstruction without the associated drawbacks. Most researchers have focused on passive image-based approaches due to their low cost, portability and flexibility over a wide range of application fields, including industrial inspection and quality control [2][3][4][5] as well as heritage 3D documentation [6][7][8][9].
Among image-based 3D reconstruction approaches, photogrammetry is a widely recognized method that can create a dense and geometrically accurate 3D point cloud of a real-world scene from a set of images taken from different perspectives. Photogrammetry can handle a wide range of scenes, from indoor to outdoor environments, and has a proven track record in multiple projects with many commercial and open-source tools available [10,11]. However, photogrammetry has its limitations, particularly when it comes to the 3D measurement of non-collaborative surfaces due to its sensitivity to textural properties of objects, and it can struggle with the generation of highly detailed 3D reconstructions. The presence of specular reflections in images, for instance, can result in noisy results for highly reflective and weakly textured objects, while transparent objects can pose significant challenges due to texture changes induced by refraction and mirror-like reflections [12][13][14][15].
More recently, a novel approach for 3D reconstruction from image datasets based on Neural Radiance Fields (NeRFs) has attracted significant attention in the research community [16][17][18][19][20][21][22]. This approach is capable of producing novel views of complex scenes by optimizing a continuous scene function from a set of oriented images. NeRF works by training a fully connected network, referred to as a neural radiance field, to replicate the input views of a scene through the use of a rendering loss (Figure 1).
Remote Sens. 2023, 15, x FOR PEER REVIEW
As shown in Figure 1, the neural network takes as input a set of continuous 5D coordinates consisting of spatial locations (x, y, z) and viewing directions (θ, ϕ), and it outputs the volume density (σ) and view-dependent emitted radiance (RGB) in each direction at each point. The NeRF is then rendered from a certain perspective, and 3D geometry can be derived, e.g., in the form of a mesh by marching camera rays [23].
Despite their recent popularity, however, there remains a need for a critical analysis of NeRF-based methods in comparison to the more conventional photogrammetry in order to objectively quantify the quality of resulting 3D models and to fully understand their strengths and limitations.

Aims of This Research
NeRF methods have recently emerged as a promising alternative to photogrammetry and computer vision in the field of image-based 3D reconstruction. Therefore, this research seeks to thoroughly analyze NeRF approaches for 3D reconstruction purposes. We evaluate the accuracy of the 3D reconstructions generated using NeRF-based techniques and via photogrammetry on a wide variety of objects ranging in size and surface characteristics (well-textured, texture-less, metallic, translucent and transparent). We examine the data generated by each technique in terms of surface deviation (noise level) and geometric accuracy. The final aim is to assess the applicability of the NeRF method in a real-world scenario and to provide objective evaluation metrics regarding the advantages and limitations of NeRF-based 3D reconstruction approaches.
The paper is organised as follows: an overview of previous research activities for 3D reconstruction using both photogrammetric-based and NeRF-based approaches is presented in Section 2. Section 3 presents the proposed quality evaluation pipeline and employed datasets, whereas Section 4 reports the evaluation and comparison results. Finally, conclusions and future research plans are provided in Section 5.

The State of the Art
In this section, a comprehensive overview of previous research on 3D reconstruction is conducted, incorporating both photogrammetric and NeRF-based approaches and considering their application to non-collaborative surfaces (reflective, textureless, etc.).

Photogrammetric-Based Methods
Photogrammetry is a widely accepted method for 3D modeling of well-textured objects and is capable of accurately and reliably recovering the 3D shape of an object through multi-view stereo (MVS) methods. Photogrammetric-based methods [19,24–30] either rely on feature matching for depth estimation [27,28] or use voxels to represent shapes [24,29,31,32]. Learning-based MVS methods can also be used, but they typically replace certain parts of the classic MVS pipeline, such as feature matching [33][34][35][36], depth fusion [37,38], or multi-view image depth inference [39][40][41]. However, objects with texture-less, reflective, or refractive surfaces are challenging to reconstruct because all photogrammetric methods require matching correspondences across multiple images [14]. To address this, various photogrammetric methods have been developed to reconstruct these non-collaborative objects. For texture-less objects, solutions such as random pattern projection [13,42,43] or synthetic patterns [14,44] have been suggested. However, these methods struggle with highly reflective surfaces with strong specular reflections or interreflections [43]. Other methods like cross polarisation [7,45] and image pre-processing [46,47] have been used for reflective or non-collaborative surfaces, but some techniques can potentially smooth off surface roughness and affect texture consistency across views [48,49]. Photogrammetry is also utilized in hybrid methods [50][51][52][53], where MVS approaches are used to generate a sparse 3D shape that can serve as a base for high-resolution measurements using Photometric Stereo (PS). Conventional [52,54,55] and learning-based [56][57][58] PS methods are also used to understand the image irradiance equation and retrieve the geometry of the imaged object, but specular surfaces are still challenging for all image-based methods.

NeRF-Based Methods
Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research [59]. Neural rendering is a learning-based class of image and video generation approaches that allows control over scene properties (e.g., illumination, camera parameters, pose, geometry, appearance, etc.). It combines deep learning methods with physical knowledge from computer graphics to achieve controllable and photo-realistic (3D) models of scenes. Among these approaches, NeRF, first proposed by Mildenhall et al. in 2020, is a method for rendering new views and reconstructing 3D scenes using an implicit representation (Figure 1). In the NeRF approach, a neural network is employed to learn the 3D shape of an object from 2D images. The radiance field, as defined in Equation (1), captures the color and volume density for each point in the scene from every possible viewing direction:

$$F(X, d) \rightarrow (c, \sigma) \tag{1}$$

The NeRF model utilizes a neural network representation in which X represents the 3D coordinates of a point in the scene, d represents the azimuthal and polar viewing angles, c represents color and σ represents the volume density of the scene. In order to ensure multi-view consistency, the prediction of σ is designed to be independent of the viewing direction, while the color c can vary based on both the viewing direction and position. To achieve this, a Multi-Layer Perceptron (MLP) is employed in two steps. In the first step, the MLP takes X as input and outputs both σ and a high-dimensional feature vector. The feature vector is then combined with the viewing direction d and passed through an additional MLP, which produces the color representation c.
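The two-step MLP described above can be illustrated with a minimal NumPy sketch. It is not the architecture of any specific NeRF implementation: the class name `TinyRadianceField`, the layer sizes, the random (untrained) weights, and the choice of passing the viewing direction as a 3D unit vector instead of two angles are all assumptions made for illustration. The sketch only shows the data flow that makes σ independent of the viewing direction while c depends on it.

```python
import numpy as np


def init_layer(rng, n_in, n_out):
    # Small random weights; a real model would learn these by gradient descent.
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def relu(x):
    return np.maximum(x, 0.0)


class TinyRadianceField:
    """Toy two-stage MLP (hypothetical sizes, untrained weights)."""

    def __init__(self, feature_dim=16, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        # Stage 1: position -> (sigma, feature vector)
        self.w1, self.b1 = init_layer(rng, 3, hidden)
        self.w2, self.b2 = init_layer(rng, hidden, 1 + feature_dim)
        # Stage 2: (feature vector, viewing direction) -> RGB color
        self.w3, self.b3 = init_layer(rng, feature_dim + 3, hidden)
        self.w4, self.b4 = init_layer(rng, hidden, 3)

    def __call__(self, X, d):
        h = relu(X @ self.w1 + self.b1)
        out = h @ self.w2 + self.b2
        sigma = relu(out[:, 0])      # density: a function of position only
        feature = out[:, 1:]
        h2 = relu(np.concatenate([feature, d], axis=1) @ self.w3 + self.b3)
        c = 1.0 / (1.0 + np.exp(-(h2 @ self.w4 + self.b4)))  # sigmoid -> (0, 1)
        return c, sigma
```

Querying the same positions with two different viewing directions returns identical densities but, in general, different colors, which is the multi-view consistency property discussed in the text.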
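The original NeRF implementation, as well as subsequent methods, utilized a non-deterministic stratified sampling approach, which is described by Equations (2)-(4). This method involves dividing the ray into N equally spaced bins and uniformly drawing a sample from each bin:

$$\hat{C}(r) = \sum_{i=1}^{N} \alpha_i T_i c_i \tag{2}$$

$$T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \tag{3}$$

$$\alpha_i = 1 - \exp(-\sigma_i \delta_i) \tag{4}$$

where $\hat{C}(r)$ is the rendered color of ray r, $T_i$ is the transmittance accumulated along the ray up to sample i, $\delta_i$ denotes the distance between consecutive samples (i and i + 1), and $\sigma_i$ and $c_i$ represent the estimated density and color values at sample point i. The opacity $\alpha_i$ at sample point i is computed using Equation (4).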
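The stratified sampling and the compositing of samples along a ray can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code of any NeRF framework: the function names `stratified_samples` and `composite` are hypothetical, and the last interval is closed at the far bound `t_far`, which is an assumption (implementations differ in how they treat the final sample).

```python
import numpy as np


def stratified_samples(t_near, t_far, n_bins, rng):
    # Split [t_near, t_far] into n_bins equal bins and draw one uniform sample per bin.
    edges = np.linspace(t_near, t_far, n_bins + 1)
    return edges[:-1] + rng.uniform(size=n_bins) * (edges[1:] - edges[:-1])


def composite(sigma, color, t, t_far):
    # delta_i: distance between consecutive samples (assumption: last one ends at t_far).
    delta = np.diff(t, append=t_far)
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i).
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: T_i = exp(-sum_{j<i} sigma_j * delta_j), with T_1 = 1.
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma[:-1] * delta[:-1])]))
    weights = T * alpha
    # Rendered color: sum_i alpha_i * T_i * c_i.
    return weights @ color, weights
```

A ray passing through empty space (all densities equal to zero) yields zero weights, whereas the weights of a ray hitting dense material sum to a value close to one.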
Successive methods [60][61][62] have also incorporated the estimated depth, as expressed in Equation (5), to impose restrictions on densities, making them resemble delta-like functions at the surfaces of the scene, or to enforce smoothness in depth:

$$\hat{D}(r) = \sum_{i=1}^{N} \alpha_i t_i T_i \tag{5}$$

where $t_i$ is the distance of sample i along the ray, and $\alpha_i$ and $T_i$ are the opacity and transmittance of Equations (4) and (3). To optimize the MLP parameters, a squared-error photometric loss is used for each pixel:

$$L = \sum_{r \in R} \left\lVert \hat{C}(r) - C_{gt}(r) \right\rVert_2^2 \tag{6}$$

where $C_{gt}(r)$ represents the ground truth color of the pixel in the training image that corresponds to the ray r, while R refers to the batch of rays associated with the image to be synthesized. It should be noted that the learned implicit 3D representation of a NeRF is designed for view rendering. To obtain the explicit 3D geometry, depth maps for different views need to be extracted by taking the maximum likelihood of the depth distribution for each ray. These depth maps can then be fused to derive point clouds or fed into the Marching Cubes [23] algorithm to derive 3D meshes.
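As a small worked example of the depth and loss terms, the sketch below computes the expected depth of a ray, the most likely depth (the sample with the largest weight), and the squared-error photometric loss over a batch of rays. The function names are hypothetical, and `weights` is assumed to hold the per-sample products of opacity and transmittance.

```python
import numpy as np


def expected_depth(weights, t):
    # Weighted average of sample distances along the ray.
    return float(np.sum(np.asarray(weights) * np.asarray(t)))


def most_likely_depth(weights, t):
    # Distance of the sample that carries the largest rendering weight.
    return float(np.asarray(t)[int(np.argmax(weights))])


def photometric_loss(pred_rgb, gt_rgb):
    # Sum over rays of the squared L2 distance between rendered and true colors.
    diff = np.asarray(pred_rgb, dtype=float) - np.asarray(gt_rgb, dtype=float)
    return float(np.sum(diff ** 2))
```

For a ray whose weight is concentrated on a single sample, both depth estimates coincide; they diverge when the weight distribution is spread out, which is one reason extracted depth maps can be noisy.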
Although NeRF provides an alternative solution for 3D reconstruction compared to traditional photogrammetry methods and can produce promising results in situations where photogrammetry may fail to deliver accurate results, it still faces several limitations, as reported by different authors [63][64][65][66][67][68]. Some of the main issues from a 3D metrological perspective that need to be considered include:
(1) The resolution of the generated neural renderings (afterward converted into a 3D mesh) can be limited by the quality and resolution of the input data. In general, higher-resolution input data will result in a higher-resolution 3D mesh, but the tradeoff is increased computational requirements.
(2) Generating a neural rendering (and then a 3D mesh) using NeRF can be computationally intensive, requiring significant amounts of computing power and memory.
(3) The general inability to accurately model the 3D shape of non-rigid objects.
(4) The original NeRF model is optimized based on a per-pixel RGB reconstruction loss, which can result in a noisy reconstruction as an infinite number of photo-consistent explanations exist when using only RGB images as input.
(5) NeRF generally requires a large number of input images with small baselines to generate an accurate 3D mesh, especially for scenes with complex geometry or occlusions. This can be a challenge in situations where images are difficult to acquire or when computational resources are limited.
To face the above issues, researchers have proposed several modifications and extensions to the original NeRF method in order to improve performance and 3D results. Tancik et al. [69] and Sitzmann et al. [70] applied positional encoding with different frequencies to NeRFs in order to improve the resolution of the neural rendering outcome, since the high-frequency representation capacity of NeRFs is otherwise insufficient. Following this, other approaches have focused on improving the efficiency and resolution of the neural rendering outcome in different ways, including model acceleration [20,71], compression [72][73][74], relighting [75][76][77], View-Dependence Normalization [78], or high-resolution 2D feature planes [68]. Müller et al. [20] introduced the concept of instant Neural Graphics Primitives with a Multiresolution Hash Encoding, which allows for fast and efficient generation of 3D models. Barron et al. [64,79] proposed Mip-NeRF, a modified version of the original NeRF that allows for the representation of scenes on continuously valued scales. Mip-NeRF greatly increases the capacity of NeRF to represent fine details by efficiently rendering anti-aliased conical frustums instead of rays. However, limitations of the method may include the difficulty in training and issues with computational efficiency. Chen et al. [72] presented a new method called TensoRF for modeling and reconstructing the radiance fields of a scene as a 4D tensor. This approach represents a 3D voxel grid with per-voxel multi-channel features. In addition to providing superior rendering quality, this method achieves much lower memory usage compared to previous and contemporary methods. Yang et al. [80] presented a fusion-based approach called PS-NeRF that combines the strengths of NeRF with photometric stereo methods. This method aims to address the limitations of traditional photometric stereo techniques by utilizing NeRF's capability to reconstruct a scene, ultimately leading to an improved resolution of the resultant mesh. Reiser et al. [68] introduced the Memory-Efficient Radiance Field (MERF) representation, which allows for the fast rendering of large-scale scenes by utilizing a sparse feature grid and high-resolution 2D feature planes. Li et al. [21] introduced Neuralangelo, a method that utilizes multi-resolution 3D hash grids and neural surface rendering to recover dense 3D surface structures from multi-view images, enabling highly detailed large-scale scene reconstruction from RGB video captures.
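To make the idea of frequency-based positional encoding concrete, the sketch below maps each input coordinate to sines and cosines at octave-spaced frequencies, in the form used by the original NeRF paper. It is given only as an illustration: the function name `positional_encoding` and the ordering of the output are assumptions, and the specific encodings of [69] and [70] differ in their details.

```python
import math


def positional_encoding(p, num_freqs):
    """Encode coordinates p with sin/cos at frequencies 2^k * pi, k = 0..num_freqs-1."""
    encoded = []
    for k in range(num_freqs):
        scale = (2.0 ** k) * math.pi
        encoded.extend(math.sin(scale * x) for x in p)
        encoded.extend(math.cos(scale * x) for x in p)
    return encoded
```

A 3D point encoded with 10 frequencies becomes a 60-dimensional vector, which gives the MLP direct access to high-frequency variations of the input.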
Some approaches [67,81–85] have been proposed that extend NeRF to a dynamic domain. These approaches make it possible to reconstruct and render images of objects while they are undergoing rigid and non-rigid motions from a single camera that is moving around the scene. For example, Yan et al. [84] introduced a surface-aware dynamic NeRF (NeRF-DS) and a mask-guided deformation field. By incorporating surface position and orientation as conditioning factors in the neural radiance field function, NeRF-DS improves the representation of complex reflectance properties in specular surfaces. Additionally, the use of a mask-guided deformation field enables NeRF-DS to effectively handle large deformations and occlusions occurring during object motion.
To improve the accuracy of 3D reconstruction in the presence of noise, particularly for smooth and texture-less surfaces, some studies incorporated various priors into the optimization process. These priors include semantic similarity [86], depth smoothness [60], surface smoothness [87,88], Manhattan world assumptions [89], and monocular geometric priors [90]. In contrast, the NoPe-NeRF method proposed by Bian et al. [91] uses mono-depth maps to constrain the relative poses between frames and regularize NeRF's geometry. This method results in better pose estimation, which improves the quality of novel view synthesis and geometry reconstruction. Rakotosaona et al. [92] introduced a versatile architecture for 3D surface reconstruction, which efficiently distills volumetric representations from NeRF-driven approaches into a Signed Surface Approximation Network. This approach enables the extraction of accurate 3D meshes and appearance while maintaining real-time rendering capabilities across various devices. Elsner et al. [93] presented Adaptive Voronoi NeRFs, a technique that enhances the efficiency of the process by employing Voronoi diagrams to partition the scene into cells. These cells are subsequently subdivided to effectively capture and represent intricate details, leading to improved performance and accuracy. Similarly, Kulhanek and Sattler [94] introduced a new radiance field representation called Tetra-NeRF, which adapts to 3D geometry priors given as a sparse point cloud in order to exploit more details. However, it is worth noting that the quality of rendered scenes may differ depending on the density of the point cloud in various regions.
Some works aimed to reduce the number of input images [60,70,78,86,90,95]. Yu et al. [95] presented an architecture that conditions NeRF on image inputs in a fully convolutional manner, enabling the network to learn a scene prior by being trained across multiple scenes. This allows it to perform feed-forward view synthesis from a small number of viewpoints, even as few as one. Similarly, Niemeyer et al. [60] introduced a method to sample unseen views and regularize the appearance and geometry of patches generated from these views. Jain et al. [86] proposed DietNeRF to enhance few-shot quality via an auxiliary semantic consistency loss that boosts realistic renderings of new positions. DietNeRF learns from individual scenes to accurately render input images from the same position and to match high-level semantic features across diverse, random poses.
In the field of cultural heritage, only a limited number of publications have explicitly investigated and recognized the potential of NeRFs for 3D reconstruction, digital preservation and conservation purposes [96,97].

Analysis and Evaluation Methodology
The main goal is to conduct a critical evaluation of NeRF-based methods with respect to conventional photogrammetry by objectively measuring the quality of resulting 3D data. To accomplish this, a variety of objects and scenes with different sizes and surface characteristics, including well-textured, texture-less, metallic, translucent, and transparent, are considered (Section 3.3). The proposed evaluation strategy and metrics (Sections 3.1 and 3.2) should help researchers to understand the strengths and limitations of each approach and could be adopted for quantitative evaluations of newly proposed methods. All experiments are based on the SDFStudio [98] and Nerfstudio [22] frameworks. It is worth recalling that the NeRF output is a neural rendering; therefore, a marching cubes approach [23] is used to create a mesh model from the different depth maps of each view. A point cloud is then extracted from the mesh vertices for the quantitative evaluations using the Open3D library [78].

Proposed Methodology
Firstly, various NeRF methods available in dedicated frameworks [22,98] are applied to two datasets in order to understand their performances and choose the best-performing method (Section 4.1). Then, this method is applied to other datasets to run evaluations and comparisons (Sections 4.2-4.7) with respect to conventional photogrammetry and the available ground truth (GT) data.
Figure 2 shows the general overview of the proposed procedure to quantitatively assess the performance of a NeRF-based 3D reconstruction. All collected images or videos require camera poses in order to generate 3D reconstructions, either with conventional photogrammetry or NeRF-based methods. Starting from the available images, camera poses are retrieved using Colmap. Then, a multi-view stereo (MVS) or NeRF method is applied to generate 3D data. Finally, a common and robust evaluation environment is set up to allow an objective geometric comparison. To achieve this, 3D data produced with photogrammetry and NeRF are co-registered and rescaled with respect to the available ground truth (GT) data in Cloud Compare (using an Iterative Closest Point (ICP) algorithm [99]), and a quality evaluation is performed. To provide an unbiased evaluation of geometric accuracy, different well-known criteria are applied [13,43,100–102], including best plane fitting, cloud-to-cloud comparison, profiling, accuracy, and completeness. For the first two criteria, metrics such as Standard Deviation (STD), Mean Error (Mean_E), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are used (Section 3.2).
Best plane fitting is accomplished by using a Least Squares Fitting (LSF) algorithm that defines a best-fitted plane on an area of the object, which is assumed to be planar. This criterion allows us to evaluate the level of noise in the 3D data generated by photogrammetry or NeRF methods.
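A plane fit of this kind can be sketched with a singular value decomposition. The sketch below fits the plane that minimizes the orthogonal distances of the points (a total least-squares fit), which is an assumption: the exact LSF variant used by the evaluation software is not specified here. The function names `fit_plane` and `plane_residuals` are hypothetical.

```python
import numpy as np


def fit_plane(points):
    """Return (centroid, unit normal) of the orthogonal least-squares plane."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The right singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    return centroid, vt[-1]


def plane_residuals(points, centroid, normal):
    # Signed orthogonal distance of each point from the fitted plane.
    return (np.asarray(points, dtype=float) - centroid) @ normal
```

The spread of the residuals (for example, their standard deviation) then serves as a noise estimate for a surface patch that is known to be planar.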
Profiling is carried out by extracting a cross-section from the 3D data to highlight complex geometric details of the reconstructed surface. An inspection of profiles allows us to evaluate the performance of a method in preserving geometric details, such as edges and corners, and avoiding smoothing effects.
Cloud-to-cloud (C2C) comparison refers to the measurement of the nearest neighboring distance between corresponding points in two point clouds.
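For small point clouds, the nearest-neighbor distances behind a C2C comparison can be computed by brute force, as in the sketch below. The function name `cloud_to_cloud` is hypothetical, and the approach is only illustrative: dedicated software typically relies on spatial indexing (e.g., octrees or k-d trees) to handle millions of points, whereas this version needs memory proportional to the product of the two cloud sizes.

```python
import numpy as np


def cloud_to_cloud(source, reference):
    """Distance from each source point to its nearest neighbor in the reference cloud."""
    src = np.asarray(source, dtype=float)
    ref = np.asarray(reference, dtype=float)
    # Pairwise squared distances, shape (len(src), len(ref)).
    d2 = ((src[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1))
```

Note that the result is not symmetric: distances from the reconstruction to the reference generally differ from those computed in the opposite direction.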

Metrics
Despite the increasing popularity and widespread application of NeRF to 3D reconstruction purposes, there is still a shortage of information on quality assessment based on a specified standard or criteria (e.g., the VDI/VDE 2634 Blatt 3). Following the co-registration process and criteria mentioned before, the following metrics are used (in particular for cloud-to-cloud and plane fitting processes): Standard Deviation (STD), Mean Error (Mean_E), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). In their definitions, N denotes the number of observed points, $X_j$ denotes the closest distance of each point to the corresponding reference point or surface, and $\bar{X}$ denotes the average observed distance.
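The four metrics can be computed from a list of point-to-reference distances as sketched below. The formulas follow the textbook definitions and are an assumption with respect to this paper; in particular, the use of the sample standard deviation (division by N - 1) and the function name `error_metrics` are choices made for illustration only.

```python
import math


def error_metrics(distances):
    """Mean_E, STD, RMSE and MAE of signed point-to-reference distances."""
    n = len(distances)
    mean_e = sum(distances) / n
    # Assumption: sample standard deviation (N - 1 in the denominator).
    std = math.sqrt(sum((x - mean_e) ** 2 for x in distances) / (n - 1))
    rmse = math.sqrt(sum(x * x for x in distances) / n)
    mae = sum(abs(x) for x in distances) / n
    return {"Mean_E": mean_e, "STD": std, "RMSE": rmse, "MAE": mae}
```

With signed distances, Mean_E reveals a systematic offset between the two datasets, while RMSE and MAE summarize the overall magnitude of the deviations.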
Accuracy and completeness, also known as precision and recall, respectively [101,102], involve measuring the distance between two models. When assessing accuracy, the distance is computed from the computed data to the ground truth (GT). Conversely, to evaluate completeness, the distance is computed from the GT to the computed data. These distances can be either signed or unsigned, depending on the specific evaluation method. Accuracy reflects how closely the reconstructed points align with the ground truth, while completeness indicates the degree to which all GT points are covered. Typically, a threshold distance is employed to determine the fraction or percentage of points that fall within the acceptable threshold. The threshold value is determined based on factors such as data density and noise levels.
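A threshold-based version of the two measures can be sketched as follows, using unsigned nearest-neighbor distances computed by brute force. The function names and the "less than or equal to" convention for the threshold are assumptions made for illustration.

```python
import numpy as np


def nearest_distances(a, b):
    # Unsigned distance from each point of a to its nearest neighbor in b.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1))


def accuracy_completeness(reconstruction, ground_truth, threshold):
    # Accuracy: fraction of reconstructed points lying close to the GT.
    accuracy = np.mean(nearest_distances(reconstruction, ground_truth) <= threshold)
    # Completeness: fraction of GT points covered by the reconstruction.
    completeness = np.mean(nearest_distances(ground_truth, reconstruction) <= threshold)
    return float(accuracy), float(completeness)
```

A reconstruction that covers only half of the object with very precise points would score high in accuracy and low in completeness, which is why the two values are reported together.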

Testing Objects
To achieve the work objectives, different datasets are used (Figure 3): they feature objects of different dimensions, materials, and surface types, captured under different lighting conditions, camera networks, scales, and resolutions.

The Ignatius and Truck datasets are derived from the Tanks and Temples benchmark [101], where GT data (acquired with laser scanning) are also available.
The other datasets (Stair, Synthetic, Industrial, Bottle_1 and Bottle_2) were created at FBK. The Stair dataset offers a flat, reflective, and well-textured surface with sharp edges. GT is provided by ideal planes of the step surfaces. The Synthetic 3D object, created using Blender v3.2.2 (for the geometric model, UV texture and material) and Quixel Mixer v2022 (for PBR textures), has a well-textured surface featuring complex geometry, including edges and corners. A virtual camera with specific parameters (focal length: 50 mm; sensor size: 36 mm; image size: 1920 × 1080 pixels) is used to create a sequence of images that follows a spiral path around the object. The 3D model generated in Blender is used as GT for the accuracy assessment. The Industrial object has a texture-less and highly reflective metallic surface, which raises problems for all passive 3D methods. Its GT data are acquired with a Hexagon/AICON Primescan active scanner with a nominal accuracy of 63 µm. Two bottles are also included, featuring transparent and refractive surfaces: their GT data are generated using photogrammetry after powdering/spraying the surfaces.
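As a worked example, the virtual camera parameters of the Synthetic dataset can be converted into a focal length in pixels and a horizontal field of view. The sketch assumes that the 36 mm sensor size refers to the sensor width (the horizontal dimension matching the 1920-pixel image width); the function names are hypothetical.

```python
import math


def focal_length_px(focal_mm, sensor_width_mm, image_width_px):
    # Pinhole model: pixels per millimetre times focal length in millimetres.
    return focal_mm / sensor_width_mm * image_width_px


def horizontal_fov_deg(focal_mm, sensor_width_mm):
    # Full horizontal angle subtended by the sensor as seen from the projection centre.
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_mm)))
```

Under this assumption, a 50 mm lens on a 36 mm wide sensor imaged at 1920 pixels corresponds to a focal length of about 2667 pixels and a horizontal field of view of about 39.6 degrees.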
A specific benchmark for NeRF methods is under preparation by the authors and will be available at https://github.com/3DOM-FBK/NeRFBK [103], containing many more datasets with ground truth data.

Comparisons and Analyses
This section presents experiments that evaluate and compare the performance of NeRF-based techniques versus standard photogrammetry (Colmap). After comparing multiple state-of-the-art methods (Section 4.1), Instant-NGP was selected as the NeRF-based method to be fully assessed, as it delivered superior results with respect to the other methods. The NeRF training was executed using an Nvidia A40 GPU, while the geometric comparisons of the 3D results were performed on a standard PC.

State-of-the-Art Comparison
The primary objective is to conduct a comprehensive analysis of multiple NeRF-based methods. To achieve this goal, the SDFStudio unified framework developed by Yu et al. [98] is used as it incorporates multiple neural implicit surface reconstruction approaches into a single framework. SDFStudio is built upon the Nerfstudio framework [22]. Among the implemented approaches, ten were chosen in order to compare their performances: Nerfacto and TensoRF from Nerfstudio; Mono-Neus, Neus-Facto, MonoSDF, VolSDF, NeuS, Mono-Unisurf and UniSurf from SDFStudio; and Instant-NGP from its original implementation in Müller et al. [20].
The comparison results with respect to GT data are reported in Figure 4. Results in terms of RMSE, MAE, and STD show that the Instant-NGP and Nerfacto methods achieved the best outcomes, outperforming all other methods. In terms of processing time, Instant-NGP required less than a minute for both datasets to train the model, whereas Nerfacto required about 15 min. It should be noted that for the Ignatius sequence (Figure 4b), despite the neural rendering for MonoSDF, VolSDF, and Neus-facto being visually satisfactory, the marching cubes step to export a mesh model failed; hence, no evaluation was possible.
Therefore, based on the achieved accuracies and processing time, Instant-NGP was chosen and employed for the successive experiments in this paper.
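For reference, the three error metrics used in this comparison can be computed from a set of per-point deviations as in the following minimal sketch. It assumes that the deviations between a reconstruction and the GT data are already available as an array (in mm); the function name is illustrative and the STD is taken as the population standard deviation, which is an assumption.

```python
import numpy as np


def error_metrics(deviations):
    """Return RMSE, MAE and STD of an array of per-point deviations."""
    d = np.asarray(deviations, dtype=float)
    rmse = float(np.sqrt(np.mean(d ** 2)))  # root mean square error
    mae = float(np.mean(np.abs(d)))         # mean absolute error
    std = float(np.std(d))                  # population standard deviation (assumption)
    return {"RMSE": rmse, "MAE": mae, "STD": std}


# Hypothetical deviations in mm, for illustration only.
example = error_metrics([0.10, -0.20, 0.05, 0.30, -0.15])
print(example)
```

With this definition of STD, the RMSE combines the systematic and random parts of the error, since RMSE² equals the squared mean deviation plus STD².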

Image Baseline's Evaluation
This section reports the assessment of NeRF-based methods when the number of input images is reduced (i.e., the baseline increases). A comparative evaluation is performed between Instant-NGP, identified as the best-performing method (Section 4.1), and Mono-Neus, a well-established approach for sparse image scenarios [66,90]. The experiment utilizes the Synthetic dataset consisting of four subsets of input images, ranging from 200 to 20 images (Figure 5), progressively reducing the number of input images (i.e., approximately doubling the image baselines). For every set of input images, both NeRF methods are used to generate 3D results, keeping a similar number of epochs. For each subset, the RMSE is estimated through a point-to-point comparison with the GT data, as reported in Figure 5.
The findings show that Instant-NGP exhibits superior performance compared to Mono-Neus when a large number of input images is available. However, Mono-Neus outperforms Instant-NGP in scenarios where the number of images is low. Nevertheless, it is important to note that neither Instant-NGP nor Mono-Neus is able to successfully generate a 3D reconstruction using only 10 input images.
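One simple way to build such image subsets is to keep evenly spaced frames along the acquisition path, so that the baseline between consecutive images grows as the subset shrinks. The sketch below illustrates this idea; the selection strategy and the intermediate subset sizes (100 and 50) are assumptions made for illustration, since only the largest and smallest subsets (200 and 20 images) are stated above.

```python
import numpy as np


def evenly_spaced_indices(n_total, n_keep):
    """Return n_keep image indices spread evenly over a sequence of n_total."""
    if not 0 < n_keep <= n_total:
        raise ValueError("n_keep must be in the range 1..n_total")
    # Integer arithmetic keeps the indices strictly increasing and unique.
    return (np.arange(n_keep) * n_total) // n_keep


# Hypothetical subset sizes; only 200 and 20 are given in the text.
for size in (200, 100, 50, 20):
    idx = evenly_spaced_indices(200, size)
    step = 200 / size
    print(f"{size:3d} images, one frame every {step:.0f} along the path, first: {idx[:3]}")
```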

Monte Carlo Simulation
The aim is to evaluate the quality of NeRF-based 3D results when the camera poses are perturbed. Therefore, a Monte Carlo simulation [104] is employed to randomly perturb the rotation and translation of the camera parameters within a limited range. After the perturbation, a 3D reconstruction is generated using Instant-NGP and compared to the reference data. A total of 30 iterations (runs) are performed within two scenarios: (A) translation and rotation are randomly disturbed in the range of ±20 mm and ±2 degrees, respectively; (B) translation and rotation are randomly disturbed in the range of ±40 mm and ±4 degrees, respectively. The Ignatius dataset is used to run this simulation, and results are reported in Figure 6 and Table 1. The findings clearly show the importance of having accurate camera parameters. In scenario A, on average, the estimated RMSE is 19.72 mm, with an uncertainty of 2.95 mm. In scenario B, the average estimated RMSE stayed almost the same (19.97 mm), whereas the uncertainty doubled (5.87 mm) due to the larger perturbation range.
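The perturbation step of such a simulation can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes 4 × 4 camera-to-world pose matrices with translations in mm, independent uniform noise on each axis, and an Rz·Ry·Rx Euler convention. The NeRF training and the comparison to the reference data are outside the scope of the sketch. The summary function follows the definitions given with Table 1 (Error Range as Max minus Min RMSE, Uncertainty as half of the Error Range).

```python
import numpy as np


def rotation_from_euler_deg(rx, ry, rz):
    """Rotation matrix R = Rz @ Ry @ Rx from angles in degrees (assumed convention)."""
    ax, ay, az = np.radians([rx, ry, rz])
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    rot_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rot_z = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return rot_z @ rot_y @ rot_x


def perturb_pose(pose, max_trans_mm, max_rot_deg, rng):
    """Return a copy of a 4x4 pose with uniform random noise on rotation and translation."""
    noisy = np.array(pose, dtype=float)
    angles = rng.uniform(-max_rot_deg, max_rot_deg, size=3)
    noisy[:3, :3] = rotation_from_euler_deg(*angles) @ noisy[:3, :3]
    noisy[:3, 3] += rng.uniform(-max_trans_mm, max_trans_mm, size=3)
    return noisy


def summarize_runs(rmse_values):
    """Mean RMSE, Error Range (max - min) and Uncertainty (half of the range)."""
    r = np.asarray(rmse_values, dtype=float)
    error_range = float(r.max() - r.min())
    return {"mean": float(r.mean()), "error_range": error_range,
            "uncertainty": error_range / 2.0}


rng = np.random.default_rng(0)
pose = np.eye(4)                              # placeholder pose, for illustration
noisy_a = perturb_pose(pose, 20.0, 2.0, rng)  # scenario A limits
noisy_b = perturb_pose(pose, 40.0, 4.0, rng)  # scenario B limits
print(noisy_a.round(3))
```

In a full run, every camera pose of the dataset would be perturbed in this way, the model retrained, and the resulting RMSE stored; `summarize_runs` would then be applied to the 30 RMSE values of each scenario.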

Plane Fitting
A plane-fitting approach can be used to measure the level of noise on reconstructed flat surfaces. In the first experiment with the Stair dataset (Figure 7a), a photogrammetric point cloud and a NeRF-based reconstruction are derived, employing the same number of images and camera poses. Two horizontal planes and three vertical planes are identified and analyzed based on a best-fitting process (Figure 7b). The derived metrics are presented in Table 2. In a similar way, the Synthetic dataset was used, with 200 images for Instant-NGP and 24 images for the photogrammetric processing. Five vertical and five horizontal planes were selected, as shown in Figure 8, to perform a surface deviation analysis by fitting an ideal plane to the reconstructed object surfaces. Derived metrics are reported in Table 3.
From both results (Tables 2 and 3), it is clear that for these two objects, photogrammetry outperforms NeRF and delivers less noisy results. NeRF RMSEs are, in general, at least 2-3 times higher than those of photogrammetry.
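The best-fitting process used for this kind of noise analysis can be illustrated with a least-squares plane fit, as in the minimal sketch below. It assumes that the points belonging to one flat patch have already been segmented, and it uses a singular value decomposition of the centred points; the specific fitting routine used in the experiments is not stated, so this is only one possible realisation.

```python
import numpy as np


def fit_plane(points):
    """Least-squares plane through a point set: returns (centroid, unit normal)."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # direction of least variance, i.e. the plane normal.
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    return centroid, vt[-1]


def plane_residuals(points, centroid, normal):
    """Signed orthogonal distances of the points from the fitted plane."""
    return (np.asarray(points, dtype=float) - centroid) @ normal


# Synthetic noisy patch (hypothetical values, in mm) for illustration.
rng = np.random.default_rng(0)
xy = rng.uniform(0.0, 100.0, size=(500, 2))
z = rng.normal(0.0, 0.2, size=500)            # simulated surface noise
patch = np.column_stack([xy, z])

centroid, normal = fit_plane(patch)
res = plane_residuals(patch, centroid, normal)
print(f"plane-fit RMSE: {np.sqrt(np.mean(res ** 2)):.3f} mm")
```

The RMSE of the residuals is then a direct measure of the noise level on the patch, which is how the values for the two methods can be compared plane by plane.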

Profiling
The extraction of cross-section profiles is useful to demonstrate the capability of a 3D reconstruction method to retrieve geometric details or to reveal smoothing effects applied to the 3D geometry. The results of the Synthetic dataset presented in Section 4.4 are processed using CloudCompare: several cross-sections are extracted (Figure 9) at predefined distances and geometrically compared against the reference data using different metrics, as reported in Table 4.
The obtained findings for individual cross-sectional profiles, as well as the average of all profiles, show that photogrammetry outperforms NeRF, which generally produces noisier results (Figure 9a-c). For instance, the average estimated RMSE and STD for photogrammetry are around 0.09 mm and 0.08 mm, respectively, while the corresponding values for NeRF are larger than 0.13 mm.
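The idea behind a cross-section extraction can be sketched as follows: points lying within a thin slab around a cutting plane are selected and projected onto that plane. This is a simplified stand-in for the tool used above, with a hypothetical function name and slab thickness, and it operates on point clouds rather than meshes.

```python
import numpy as np


def extract_profile(points, plane_point, plane_normal, thickness):
    """Return the points within thickness/2 of a cutting plane, projected onto it."""
    pts = np.asarray(points, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Signed distance of every point from the cutting plane.
    signed = (pts - np.asarray(plane_point, dtype=float)) @ n
    mask = np.abs(signed) <= thickness / 2.0
    # Remove the out-of-plane component so the profile is strictly planar.
    return pts[mask] - np.outer(signed[mask], n)


# Hypothetical cloud and cutting plane (z = 0), for illustration only.
cloud = np.array([[0.0, 0.0, 0.00],
                  [0.0, 1.0, 0.04],
                  [1.0, 1.0, 0.50]])
profile = extract_profile(cloud, plane_point=[0, 0, 0], plane_normal=[0, 0, 1],
                          thickness=0.1)
print(profile)
```

Profiles extracted at the same locations from the reference, photogrammetric and NeRF data can then be compared point by point.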

Cloud-to-Cloud Comparison
A cloud-to-cloud comparison refers to the assessment of relative Euclidean distances between corresponding 3D samples in a dataset with respect to the reference data. Different objects with different characteristics are considered (Figure 3): Ignatius, Truck, Industrial, and Synthetic. They are small and large-scale objects with texture-less, shiny, and metallic surfaces. For each dataset, 3D data are produced using photogrammetry (Colmap) and Instant-NGP and then co-registered to the available GT (Figure 10). Finally, metrics are derived as reported in Table 5. It is worth noting that the number of images employed is not always the same across the performed tests: indeed, for the Synthetic, Ignatius and Truck datasets, photogrammetry already provided accurate results with a lower number of images; hence, adding more images did not lead to further improvements. On the other hand, for NeRF, all available images were used, as fewer images (or an enlargement of the baseline) did not lead to good results (see also Section 4.2). From the provided results, it can be seen that for the metallic and highly reflective object (Industrial dataset), NeRF performs better than photogrammetry, whereas, for the other scenarios, photogrammetry produces more accurate results.
Two other translucent and transparent objects are considered: Bottle_1 and Bottle_2 (Figure 3). Glass objects do not diffusely reflect the incoming light and do not have a texture of their own for photogrammetric 3D reconstruction tasks. Their appearance depends on the object's shape, surrounding background and lighting conditions. Therefore, photogrammetry can easily fail or produce very noisy results in such a situation. On the other hand, NeRF, as stated by Mildenhall et al. [16], can learn to properly generate the geometry associated with transparency due to the view-dependent nature of the NeRF model. For both objects, the photogrammetric and NeRF-based 3D results are co-registered to the GT data and metrics are computed (Figure 11 and Table 6). The findings show that NeRF performed better than photogrammetry for transparent objects. For example, the estimated RMSE, STD and MAE for photogrammetry on Bottle_1 are 6.5 mm, 7.1 mm and 7.5 mm, respectively. In contrast, the NeRF values were dramatically reduced to 1.3 mm, 1.7 mm, and 2.1 mm, respectively.
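The core of a cloud-to-cloud comparison, i.e., the distance from each point of a co-registered reconstruction to its nearest neighbour in the reference cloud, can be sketched as below. This is a brute-force illustration that assumes small, already co-registered clouds; for clouds of realistic size, a spatial index (e.g., a k-d tree or octree) would be needed, and the function name and chunk size are hypothetical.

```python
import numpy as np


def cloud_to_cloud_distances(source, reference, chunk=1024):
    """Nearest-neighbour Euclidean distance from each source point to the reference cloud."""
    src = np.asarray(source, dtype=float)
    ref = np.asarray(reference, dtype=float)
    out = np.empty(len(src))
    # Process the source cloud in chunks to bound the size of the distance matrix.
    for start in range(0, len(src), chunk):
        block = src[start:start + chunk]
        d2 = ((block[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
        out[start:start + chunk] = np.sqrt(d2.min(axis=1))
    return out


# Hypothetical, tiny clouds in mm, for illustration only.
reference = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
reconstruction = np.array([[0.0, 0.0, 1.0], [1.0, 2.0, 0.0]])
dist = cloud_to_cloud_distances(reconstruction, reference)
print("distances:", dist)
print("RMSE:", float(np.sqrt(np.mean(dist ** 2))))
```

Note that these nearest-neighbour distances are unsigned, so statistics derived from them describe the magnitude of the deviation and not its direction with respect to the surface.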

Accuracy and Completeness
Three different datasets are used to compare photogrammetry and NeRF in terms of accuracy and completeness: Ignatius, Industrial and Bottle_1. For both NeRF (Instant-NGP) and photogrammetry, the two metrics are computed with respect to the available ground truth data. The results, presented in Figure 12, reveal the following insights: (i) for the Ignatius dataset, photogrammetry exhibits higher accuracy and completeness compared to NeRF; (ii) for the Industrial and Bottle_1 datasets, NeRF shows slightly better results. These findings quantitatively confirm the results of Section 4.6 and indicate that NeRF-based approaches excel when dealing with objects featuring non-collaborative surfaces, particularly those that are transparent or shiny. In contrast, photogrammetry faces challenges in capturing the intricate details of such surfaces, making NeRF a more suitable or complementary choice.
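The exact definitions of the two metrics are not restated in this section. One commonly used formulation, adopted here purely as an assumption, takes accuracy as the fraction of reconstructed points lying within a distance threshold τ of the ground truth, and completeness as the fraction of ground truth points lying within τ of the reconstruction. A minimal brute-force sketch under this assumption, with hypothetical function names and threshold:

```python
import numpy as np


def nearest_distances(a, b):
    """Distance from each point in a to its nearest neighbour in b (brute force)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1))


def accuracy_completeness(reconstruction, ground_truth, tau):
    """Threshold-based accuracy and completeness (assumed definitions), in [0, 1]."""
    accuracy = float(np.mean(nearest_distances(reconstruction, ground_truth) <= tau))
    completeness = float(np.mean(nearest_distances(ground_truth, reconstruction) <= tau))
    return accuracy, completeness


# Hypothetical data: the reconstruction covers only half of the reference.
gt = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0]])
rec = np.array([[0.0, 0, 0.01], [1.0, 0, 0.01]])
acc, comp = accuracy_completeness(rec, gt, tau=0.05)
print(f"accuracy = {acc:.2f}, completeness = {comp:.2f}")
```

In this toy case every reconstructed point is close to the reference, while half of the reference has no nearby reconstructed point, which illustrates why the two metrics need to be read together.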

Figure 12. Accuracy and completeness for NeRF and photogrammetry on three different objects (Ignatius, Industrial, Bottle_1).

Conclusions
This paper provides a comprehensive analysis of image-based 3D reconstruction using neural radiance field (NeRF) methods. Comparisons with conventional photogrammetry were performed, reporting quantitative and visual results to understand advantages and disadvantages while dealing with multiple types of surfaces and scenes. The study has objectively evaluated the strengths and weaknesses of NeRF-generated 3D data and provided insights into their applicability to different real-life scenarios and applications. The study employed a range of well-textured, texture-less, metallic, translucent, and transparent objects, imaged using different scales and sets of images. The quality of the generated NeRF-based 3D data was evaluated using various evaluation approaches and metrics, including noise level, surface deviation, geometric accuracy, and completeness.
The reported results indicate that NeRF outperforms photogrammetry in scenarios where conventional photogrammetric approaches fail or produce noisy results, such as with texture-less, metallic, highly reflective, and transparent objects. This is because NeRF-based methods are capable of generating geometry associated with reflectivity and transparency, owing to the view-dependent nature of the NeRF model. In contrast, photogrammetry still performs better with well-textured and partially textured objects.
The study provides valuable insights into the applicability of NeRF for different real-life scenarios, particularly for heritage and industrial scenes, where surfaces can be particularly challenging. More datasets are in preparation and will be shared soon at

Figure 2. Overview of the proposed procedure to assess the performance of NeRF-based 3D reconstruction with respect to conventional photogrammetry.

Figure 3. Set of objects, with different surface characteristics, used to evaluate NeRF methods.

Figure 4. The comparison results of the various NeRF-based methods on the Synthetic (a) and Ignatius (b) datasets with 200 and 263 images, respectively.

Figure 5. Comparative performance evaluation of Instant-NGP and Mono-Neus on subsets of the Synthetic dataset.

Figure 7. An image of the Step dataset (a) and the horizontal and vertical planes used for evaluating the level of noise in the photogrammetric and NeRF 3D reconstructions (b).

Figure 8. The Synthetic object with some horizontal and vertical planes used for the evaluation.

Figure 9. Close view of the generated meshes for GT (a), photogrammetry (b) and NeRF (c). The different locations of the profiles on the Synthetic object (d). An example of a profile on the reference 3D data (black line), photogrammetry (red line) and NeRF (blue line) results (e).

Figure 10. Color-coded cloud-to-cloud comparisons for both Instant-NGP and photogrammetry methods with respect to the ground truth data [unit: mm].

Figure 11. Color-coded cloud-to-cloud comparisons for both Instant-NGP and photogrammetry on the two transparent objects [unit: mm].

Table 1. Summary of the Monte Carlo simulation results on the Ignatius dataset. The Error Range is the difference between the Max and Min RMSE, while the Uncertainty is computed as half of the Error Range.

Figure 6. The results of the Monte Carlo simulation for perturbing the camera parameters. A summary of the statistics is reported in Table 1.

Table 2. Evaluation of the noise level in the 3D surfaces of the Step dataset processed using photogrammetry and Instant-NGP [unit: mm].

Table 3. Evaluation metrics for photogrammetry and NeRF results on the Synthetic object [unit: mm].

Table 5. Metrics of cloud-to-cloud comparisons for Instant-NGP and photogrammetry methods [unit: mm]. For all but the Industrial object, photogrammetry was used with a smaller number of images, as the achieved accuracy was already better than that of NeRF.

Table 6. Statistics of cloud-to-cloud comparisons on transparent objects [unit: mm].