On 3D Reconstruction Using RGB-D Cameras

The representation of the physical world is an issue of growing concern to the scientific community studying computer vision. Recently, research has focused on modern techniques and methods of photogrammetry and stereoscopy, with the aim of reconstructing realistic three-dimensional models with high accuracy and metric information in a short time. In order to obtain data at a relatively low cost, various tools have been developed, such as depth cameras. RGB-D cameras are novel sensing systems that capture RGB images along with per-pixel depth information. This survey aims to describe RGB-D camera technology. We discuss the hardware and the data acquisition process, in both static and dynamic environments. Depth map sensing techniques are described, focusing on their features, pros, cons, and limitations; emerging challenges and open issues are analyzed; and some countermeasures are described. In addition, the advantages, disadvantages, and limitations of RGB-D cameras are critically discussed. This survey will be useful for researchers who want to acquire, process, and analyze the data collected.


Introduction
One of the main tasks that computer vision deals with is the 3D reconstruction of the real world [1]. In computer science, three-dimensional reconstruction means the process of capturing the shape and appearance of real objects. This process is accomplished by active and passive methods, and several tools have been developed and applied for this purpose. In the last decade, an innovative technology has emerged: depth (RGB-D) cameras. The elementary issue is "object detection," which refers to recognizing objects in a scene and is divided into instance recognition and category-level recognition. Object recognition depends highly on RGB cameras and instance-specific information [2], whereas the quality of the recognized category depends on the generalization of the properties or functionalities of the object and the unseen instances of the same category. Although depth cameras provide 3D reconstruction models in real time, one of the main issues for researchers is robustness and accuracy [3]. In addition, the representation of the object can undergo changes such as scaling, translation, occlusion, or other deformations, which make category-level recognition a difficult topic. Object detection also has weaknesses due to illumination, camera viewpoint, and texture [4]. The recovered information is expressed in the form of a depth map, that is, an image or image channel that contains information relating to the distance of objects' surfaces from the capturing camera. Depth maps are invariant to texture and illumination changes [5].

The modeling of objects can be categorized into geometric and semantic. Geometric modeling provides an accurate model related to geometry, whereas semantic modeling analyzes objects so that they can be understood by humans. A typical example of semantic information is the integration of RGB-D in people's daily life for space estimation (odometry), object detection, and classification (doors, windows, walls, etc.). RGB-D cameras provide continuous, detailed feedback about the surrounding area, which is especially helpful for visually impaired people; the RGB-D camera is a navigational aid both indoors and outdoors, in addition to classic aids.

The remainder of this paper analyzes emerging challenges and proposes countermeasures. Section 6 presents the pros, cons, and limitations of RGB-D technology in various aspects. Finally, Section 7 evaluates RGB-D cameras and presents the conclusions.

Motivation
The representation of the physical world through realistic three-dimensional models has gained much interest in recent years and is therefore one of the active subjects of research in computer science. The models are reconstructed with various modern, automated technologies. Relatively recently, depth cameras have been developed that provide real-time 3D models.

Scope and Contribution
The research carried out to date refers to various techniques and methods of 3D object reconstruction, while other works discuss the differences between depth cameras and optimizations to texture mapping algorithms for static scenes. This paper, through a bibliographic review, clarifies terms such as depth cameras and depth maps, and describes the construction and technology under which the RGB-D camera operates, as well as its applications. In addition, emphasis is placed on the research conducted, and the advantages and limitations of this technology are analyzed. The contribution of this paper is to discuss the state of the art and current status of RGB-D cameras. Hence, the novelty of this paper, compared to other reviews, lies in its holistic approach to depth camera technology. In particular, in this research, the concepts (terms) of the RGB-D camera are clarified; its technology and operation are described in detail; the functions of the incorporated sensors are categorized; the process of reconstructing 3D models is explained in detail; the weaknesses and limitations of RGB-D cameras (arising both from their technology and from their principle of operation) are highlighted; some important datasets are listed; and all of the above are evaluated. In addition, appropriate countermeasures are proposed, and finally, challenges and open issues that require further study and research are identified.
The research questions are summarized as follows:
• What is RGB-D camera technology?
• What kind of data is acquired from the RGB-D camera?
• What algorithms are applied for applications?
• What are the benefits and limitations of the RGB-D camera?
• Why is a depth map important?

Methodology of Research Strategy
The literature search was conducted by applying criteria used in comparable surveys: papers published in peer-reviewed journals or peer-reviewed conference proceedings, retrieved with English keywords related to the title.

Data Sources and Search
The desired information was extracted from existing surveys via the online bibliographic database Scopus, which indexes the most important digital libraries (Elsevier, Springer, IEEE), provides refined search, and facilitates the export of files. The following query was performed in Scopus for the period 2011-2022:
TITLE-ABS-KEY (rgb-d AND cameras) AND TITLE-ABS-KEY (depth AND cameras)
The query results were saved in a CSV file. The process revealed 1869 documents. Of these, however, only journal articles, conference papers, and book chapters were selected, thus reducing the number to 1827. Of the remaining, only 1758 were written in English, and some of these were not ultimately relevant to the subject, while for others, only the abstract was available. Finally, 124 publications were selected as the most representative and influential and discussed in this study. Figure 1 shows the process by which information was obtained from the literature database, while Figure 2 shows the percentage of each type of publication in the papers selected.

History of RGB and 3D Scene Reconstruction
The technology of time-of-flight (ToF) cameras with 3D imaging sensors that provide a depth image and an amplitude image with a high frame rate has developed rapidly in recent years. Figure 5 shows the historical course of the development and evolution of depth cameras.

Depth cameras were developed in the last decade; however, the foundations were laid in 1989. Some milestones in the history of 3D reconstruction and RGB-D technology are as follows. In the 1970s, the idea of 3D modeling and the significance of object shape were introduced. In the 1980s, researchers focused on the geometry of objects. From the 2000s, various techniques and methods related to the features and textures of objects and scenes were developed. In the 2010s, appropriate algorithms were developed and implemented in applications, mainly involving dynamic environments and robotics. In this framework, deep learning methods are used that provide satisfactory models with high accuracy. Nowadays, research focuses on ways of overcoming the existing limitations associated with high-quality fusion of scene data.


Hardware and Basic Technology of RGB-D
Depth cameras are vision systems that actively modify their environment, mainly its visual characteristics, in order to capture 3D scene data from their field of view. These systems use structured lighting, projecting a known pattern onto the scene and measuring its distortion when viewed from a different angle. The light source may operate at a wavelength in the visible range but is more commonly chosen from the infrared range. More sophisticated systems implement the time-of-flight (ToF) technique, in which the return time of a light pulse reflected by an object in the scene yields depth information (typically over short distances) from the scene of interest [71].
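For reference, the ToF principle can be stated compactly. In pulsed ToF, the distance follows directly from the round-trip time of the light pulse, while continuous-wave ToF sensors recover distance from the phase shift of a modulated signal; these are the standard textbook relations, not formulas specific to any particular camera discussed here:

```latex
% Pulsed ToF: distance from the round-trip time \Delta t of a light pulse
d = \frac{c \,\Delta t}{2}
% Continuous-wave ToF: distance from the phase shift \Delta\varphi of a signal
% modulated at frequency f_{\mathrm{mod}}, where c is the speed of light
d = \frac{c}{4\pi f_{\mathrm{mod}}}\,\Delta\varphi
```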
A typical depth camera (RGB-D) incorporates an RGB camera, a microphone, and a USB port for connection to a computer. In addition, it includes a depth sensor, which uses infrared structured light to calculate the distance (depth) of each point of the observed object from the camera's horizontal optical axis. Some cameras also have an infrared emitter (IR emitter), consisting of an IR laser diode, that beams modulated IR light into the field of view; the reflected light is collected by the depth sensor and an infrared receiver (IR sensor) mounted anti-diametrically. RGB-D sensors are a specific type of depth-sensing device that works in association with an RGB (red, green, blue color) camera sensor. They augment the conventional image with depth information (related to the distance to the sensor) on a per-pixel basis: the depth information obtained from the infrared measurements is combined with the RGB image to yield an RGB-D image. The IR sensor is combined with an IR camera and an IR projector. This sensor system is highly mobile and can be attached to a mobile instrument such as a laptop [72] (see Figure 6).

As for how it works, the camera emits a pre-defined pattern of infrared light rays. The light is reflected by the objects in the scene and measured by the depth sensors. Since the distance between the emitter and the sensor is known, the depth of each pixel with respect to the RGB sensor is obtained from the difference between the observed and expected pattern positions, using trigonometric relationships (see Figure 7).

Figure 7 illustrates the operation of an RGB-D camera. The RGB and infrared cameras capture the same scene at the same time. Through this process, the information is visualized, and a 2D monochrome digital image is created, in which the color of each pixel indicates the distance of the homologous point (key point) from the camera: dark shades indicate objects near the camera, while light shades indicate distant ones.
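As an illustration of the triangulation step described above, the sketch below converts the per-pixel pattern offset (disparity) into metric depth using the classic relation Z = f·b/d, where f is the focal length in pixels and b is the emitter-sensor baseline. This is a minimal sketch assuming a rectified setup; the function name and example values are hypothetical, not taken from any specific camera.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Convert a disparity map (pixels) to a depth map (meters).

    Assumes a rectified emitter/sensor pair: Z = f * b / d, where f is the
    focal length in pixels and b the baseline in meters. Pixels with
    (near-)zero disparity carry no depth information and are marked NaN.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > min_disp
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical example: 580 px focal length, 7.5 cm baseline.
disp = np.array([[58.0, 29.0], [11.6, 0.0]])
print(disparity_to_depth(disp, focal_px=580.0, baseline_m=0.075))
# -> depths of 0.75 m, 1.5 m, 3.75 m, and NaN for the invalid pixel
```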
Like all technologies, RGB-D camera technology has certain limitations, and various techniques and methods have been devised to address them. For example, it often suffers from specific noise characteristics and data distortions [73,74]. In general, although depth cameras operate on the same technology, they differ in their resilience against background and structured light [75]; almost all of them, however, have a low cost.

Conceptual Framework of 3D Reconstruction
The 3D representation of the physical world is the core purpose of computer vision. The term refers to the mapping and three-dimensional reconstruction of a region (scene), including its objects. This can be achieved through various techniques and methods: active, passive, or hybrid (i.e., a combination of active and passive); monocular or stereoscopic (stereo vision); and techniques based either on the depth or on the content of the image. Depth cameras are a valuable tool for this purpose, as they capture indoor and outdoor scenes in real time.

How is the reconstruction achieved? Firstly, the information is collected by the depth camera; then, with the help of appropriate algorithms, it is processed to obtain the final result: a reconstructed object that combines disparate information into a textured 3D triangle mesh. Many algorithms have been developed for this purpose, among which the most frequently used are bundle fusion [76], voxel hashing [77], SIFT, SURF [78], FAST [79], ORB [80], RANSAC [81,82], MVS [83], ICP [84], and the signed distance function (SDF) [85,86]. Each algorithm contributes to a different stage of the scene reconstruction. Table 1 presents the functions of these algorithms.

A standard procedure for recovering the structure of a three-dimensional object with depth cameras begins with the combination of the camera's sensors. The RGB-D sensors simultaneously capture color (RGB) and depth (D) data, and the features of the considered scene are then detected and extracted. In the next step, homologous (key) points are sought between the camera frames and matched. Initially, a sparse point cloud is created, and its local coordinates are then transformed into the global coordinate system using the collinearity equations. The sparse point cloud has a low resolution. From the sparse cloud emerges the dense cloud, which has metric value and whose density depends on the number of frames. In the next step, a triangle grid is created among the points of the dense cloud, and the resolution of the depth map is determined by the number of triangular surfaces. The triangular grid creation process is called triangulated irregular network (TIN) spatial interpolation; a TIN is represented as a continuous surface consisting of triangular faces. A texture is assigned to each triangular surface. Finally, the position of the camera is evaluated, and the three-dimensional reconstructed model is extracted. For 3D reconstruction of scenes in real time, the same procedure is followed, except that a calibration procedure is required to calculate the position and orientation of the camera in the desired reference system; in addition, errors are identified and corrected so that the camera operates accurately (see Figure 8).

Figure 8a shows a general workflow for 3D object reconstruction from RGB-D cameras, while Figure 8b shows a typical live 3D reconstruction. The individual algorithms may differ, but the core technique is the same.
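As a concrete illustration of the fusion stage, the following sketch integrates depth and color frames into a signed-distance volume and extracts a triangle mesh using the open-source Open3D library. The file names, intrinsics, and camera poses are placeholders: this is a minimal sketch of the volumetric-fusion idea, not the exact pipeline of any system cited above (a real pipeline would estimate the poses via feature matching and ICP, as in Figure 8).

```python
import numpy as np
import open3d as o3d

intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)

# TSDF volume: the signed-distance-function representation mentioned above.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01,   # 1 cm voxels
    sdf_trunc=0.04,      # truncation distance of the SDF
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# Placeholder trajectory: identity pose for one frame. In practice these
# world-to-camera transforms come from the registration stage.
camera_poses = [np.eye(4)]

for i, extrinsic in enumerate(camera_poses):
    color = o3d.io.read_image(f"color_{i:05d}.jpg")   # placeholder file names
    depth = o3d.io.read_image(f"depth_{i:05d}.png")
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, depth_trunc=3.0,
        convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, extrinsic)

# Extract the textured triangle mesh described in the text.
mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh("reconstruction.ply", mesh)
```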

Approaches to 3D Reconstruction (RGB Mapping)
Researchers have approached the subject of 3D reconstruction with various techniques and methods. Table 2 presents some of these.


Table 2. Approaches and characteristics of 3D reconstruction.

Techniques and Methods — Characteristics
• Align the current frame to the previous frame with the ICP algorithm [47] — For large-scale scenes, creates error propagation [33]
• Weighted average of multi-image blending [78] — Motion blur and sensitivity to light change
• Sub-mapping-based BA [79] — High reconstruction accuracy and low computational complexity
• Design of a global 3D model, updated and combined with live depth measurements for the volumetric representation of the reconstructed scene [87] — High memory consumption
• Visual and geometry features; combines SFM without camera motion and depth [88] — Accuracy is satisfactory; cannot be applied to real-time applications
• System that provides feedback and is tolerant of human errors and alignment failures [89] — Scans a large area (50 m) and preserves details about accuracy
• System that aligns and maps large indoor environments in near real time and handles featureless corridors and dark rooms [47] — Estimates the appropriate color; the RGB-D mapping implementation is not real time

According to the literature, for the 3D reconstruction of various scenes, depth cameras are used in combination with various techniques and methods to extract more accurate, qualitative, and realistic models. However, when dealing with footage in mostly dynamic environments, there are some limitations that require solutions. Table 3 describes these limitations and suggests possible solutions.

Table 3. Limitations and proposed solutions for 3D reconstruction in dynamic scenes.

Limitation of RGB-D in Dynamic Scenes — Proposed Solutions
• High-quality surface modeling — Surface modeling, no points

Multi-View RGB-D Reconstruction Systems That Use Multiple RGB-D Cameras
Three-dimensional reconstruction of a scene from a single RGB-D camera carries risks that should be taken seriously, because there are certain limitations: for example, in complex, large scenes it has low performance and requires a high memory capacity. To address these issues, multiple-RGB-D-camera systems were developed. With this method, data are acquired independently from each camera and then placed in a single reference frame to form a holistic 3D reconstruction of the scene; this requires estimating the relative poses of all cameras in the system. Therefore, in these systems, calibration is necessary [90].
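Bringing each camera's point cloud into a single reference frame amounts to applying each camera's calibrated rigid transform. The sketch below shows this with plain homogeneous transforms; the transforms (which calibration would supply) and the point data are hypothetical placeholders.

```python
import numpy as np

def to_world(points_cam: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Map an (N, 3) point cloud from camera coordinates to the shared
    world frame using a 4x4 rigid transform obtained from calibration."""
    n = points_cam.shape[0]
    homog = np.hstack([points_cam, np.ones((n, 1))])   # (N, 4)
    return (T_world_cam @ homog.T).T[:, :3]

# Hypothetical calibration result: camera 2 sits 0.5 m to the right of camera 1.
T_world_cam1 = np.eye(4)
T_world_cam2 = np.eye(4)
T_world_cam2[0, 3] = 0.5

cloud1 = np.random.rand(100, 3)   # placeholder clouds from each camera
cloud2 = np.random.rand(100, 3)

# Fuse: one holistic cloud in a single reference frame.
merged = np.vstack([to_world(cloud1, T_world_cam1),
                    to_world(cloud2, T_world_cam2)])
```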

RGB-D SLAM Methods for 3D Reconstruction
3D reconstruction can be used as a platform to monitor the performance of activities on a construction site [91]. The development of navigation systems is one of the major issues in robotic engineering. A robot needs information about the environment, the objects in space, and its own position; therefore, various methods of navigation have been developed, based on odometry [92], inertial navigation, magnetometers, active labels (GPS) [93], and label and map matching. Simultaneous localization and mapping (SLAM) is one of the most promising of these methods. It is an advanced technique in the robotics community, originally designed for a mobile robot to consistently build a map of an unknown environment and simultaneously estimate its location in this map [94]; recent progress in visual SLAM makes it possible to reconstruct a 3D map of a construction site in real time. When a camera is used as the only exteroceptive sensor, the technique is called visual SLAM or VSLAM [95]. Modern SLAM solutions provide mapping and localization in an unknown environment [96], and some of them can be used to update a map that has been made before. SLAM is the general methodology for solving two problems [97,98]: (1) environment mapping and 3D model construction, and (2) localization using a generated map and trajectory processing [99].
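At its core, the mapping half of an RGB-D SLAM loop accumulates frame-to-frame motion estimates into a global pose while growing the map. The skeleton below captures only this structure; estimate_relative_pose and add_to_map are hypothetical stand-ins for the frame registration (e.g., ICP) and map-update components of a real SLAM system, which would also include loop closure and global optimization.

```python
import numpy as np

def rgbd_slam_loop(frames, estimate_relative_pose, add_to_map):
    """Skeleton of the two SLAM subproblems: (1) mapping and
    (2) localization by chaining relative pose estimates."""
    T_world_cam = np.eye(4)          # current camera pose in the world frame
    trajectory = [T_world_cam]
    prev = None
    for frame in frames:             # each frame: an RGB image + depth map
        if prev is not None:
            # Localization: relative motion between consecutive frames,
            # e.g., from ICP on the depth data or from visual features.
            T_prev_curr = estimate_relative_pose(prev, frame)
            T_world_cam = T_world_cam @ T_prev_curr
            trajectory.append(T_world_cam)
        # Mapping: fuse the frame into the global model at its pose.
        add_to_map(frame, T_world_cam)
        prev = frame
    return trajectory
```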

RGB-D Sensors and Evolution
The data acquisition from depth cameras plays an important role in further processing of data in order to produce a qualitative and accurate 3D reconstructed model of the physical world. Therefore, the contribution of depth cameras' incorporated sensors is of major importance. Nowadays, the sensors have many capabilities and continue to evolve. The rapid evolution is due to the parallel development of technologies, and this is to be expected considering that depth cameras work with other devices or software. In short, there are two main types of sensors, active and passive, which complement each other in various implementations [100]. Although sensors provide many benefits, they also present errors [101] and inaccurate measurements [102]. In general, to achieve a high degree of detail, depth cameras should be calibrated. RGB-D cameras were developed in the last decade, but the foundations were laid in 1989. Figure 9 illustrates the evolutionary history of RGB-D sensors.

Sensing Techniques of RGB-D Cameras
There are different techniques to acquire data from depth cameras. These techniques fall into two categories, active and passive sensing, plus the recently developed monocular depth estimation. The techniques of the first category use structured energy emission to capture an object in a static environment [103] and can capture the whole scene at the same time; with active techniques, 3D reconstruction becomes simpler. This category has two subcategories: time-of-flight (ToF) and structured light (SL) cameras [104]. The second category is based on the triangulation principle [105,106] and, through epipolar geometry, on the correspondence of key points. In the third category, the depth estimation for the 3D reconstruction of an object is done from two-dimensional images [107]. Figure 10 shows the categories of techniques for data acquisition from RGB-D cameras.

Figure 10. Techniques for RGB-D to acquire data and information.

Depth Image Processing (Depth Map)
The depth of the scene, combined with the color information, composes the RGB-D data, and the result is a depth map. A depth map is a metric-value image that provides information on the distance of the surfaces of the scene objects from the camera. In fact, it is through depth estimation that the geometric relationships of objects within a scene are understood [108]. This process relies on epipolar geometry (i.e., the geometry of stereoscopic vision), which describes a scene viewed by two cameras placed at different angles, or simply by the same camera shifted to different viewing angles (see Figure 11).

Figure 11. Epipolar geometry (source: https://en.wikipedia.org/wiki/Epipolar_geometry (accessed on 12 July 2022)).
According to Figure 11, a point P with world coordinates (X, Y, Z) is projected onto the camera sensor at a point x = K[R|T]X, with x = (u, v, 1)ᵀ and X = (X, Y, Z, 1)ᵀ, where K is the camera calibration matrix, and R and T are the rotation and translation matrices of size 3 × 3 and 3 × 1, respectively. From a single view, the only information that can be obtained is the half-line on which this point is located, starting from the camera's center of projection and extending away from the camera. Therefore, if a second camera covers the same scene from a different part of space, it is possible, through trigonometry, to calculate the exact 3D coordinates of the point, as long as the points of one camera can be mapped to the points of the other [109]. Solving the problem is simple, as it requires solving a system of three equations with three unknowns: the pixel position in the frame of each camera, as well as the transformation between the coordinate systems of the two cameras, is available as data. The mapping of pixels to a frame plane is done through the algorithms discussed above.
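In standard multiple-view-geometry notation, the relation above and the two-view triangulation it enables can be written as follows (this restates the textbook formulation, with symbols matching those used in the text):

```latex
% Projection of a homogeneous world point X onto the image plane:
x = K\,[R \mid T]\,X,
\qquad x = (u, v, 1)^{\top},\; X = (X, Y, Z, 1)^{\top}
% Each calibrated view constrains P to a ray; intersecting the two rays
% recovers depth. For a rectified stereo pair with baseline b and focal
% length f (in pixels), triangulation reduces to the disparity relation
Z = \frac{f\,b}{u_{\mathrm{left}} - u_{\mathrm{right}}}
```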
Depth maps are produced using the methods described in Section 5.2, and they are directly affected by environment lighting, object reflectance, and spatial resolution. For example, bright lighting is responsible for creating outliers [110]. In addition, depth maps suffer from reflective surfaces at certain viewing angles, occlusion boundaries [111], quantization levels, and random noise (mainly at indoor scene distances) [112], which are related to the distance of the object and the pixel position. Some of these disadvantages can be addressed to a certain extent: for example, the fusion of frames from different viewpoints, shape-from-shading (SfS) and shape-from-polarization (SfP) techniques, or bilateral filtering help to repair the noise and smooth the depth map [113]. Qualitative depth maps have been an important concern for researchers, who have devised techniques to solve the problems created during the 3D reconstruction process. Table 4 illustrates some problems with depth images and the technical countermeasures tested.

Table 4. Problems of the depth maps and countermeasures.

Cons of Depth Maps — Countermeasures
• Low accuracy — Apply a bilateral filter [106]
• Noise — Convolutional deep autoencoder denoising [107]
• High-resolution (HR) RGB but low-resolution (LR) depth images — Super-resolution techniques with high-resolution color images [83]; CNN combining an HR color image with the LR depth image [87]
• Featureless regions — Polarization-based methods (reveal surface normal information) [102]
• Shiny, bright, or transparent surfaces — TSDF to voxelize the space [105]; ray-voxel pairs [106]

Depth maps are of great importance for extracting 3D reconstruction models; however, there are still limitations that pose challenges to the scientific community, and some issues remain open and need to be explored in the future. The main limitations are as follows (a denoising sketch follows this list):
• Recording only the first surface seen, so that no information can be obtained for refracted surfaces;
• Noise from reflective surfaces at certain viewing angles, while occlusion boundaries blur the edges of objects;
• Single-channel depth maps cannot convey multiple distances when multiple objects fall in the location of the same pixel (grass, hair);
• They may represent the perpendicular distance between an object and the plane of the scene camera, so the actual distances from the camera to surfaces seen in the corners of the image appear greater than the distances to the central area;
• In the case of missing depth data, many holes are created; to address this, a median filter is used, but sharp depth edges are corrupted;
• A cluttered spatial configuration of objects can create occlusions and shadows.
From the above limitations emerge several challenges, such as occlusions, camera calibration errors, low resolution, high levels of ambient light (ToF), and unsuitability for outdoor operation (structured light). In addition, depth noise increases quadratically with distance (SL). Moreover, issues such as the correspondence between stereo or multi-view images, multiple depth cues, computational complexity, spatial resolution, angle of projection, and multiple-camera interference in dynamic scenarios remain open to investigation.
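The quadratic growth of structured-light depth noise follows from propagating the disparity error through the triangulation relation; this is a standard result for triangulation-based sensors (derived, e.g., in analyses of the Kinect), with generic symbols rather than device-specific values:

```latex
% Propagating a disparity error \sigma_d through Z = f b / d gives a depth
% uncertainty that grows with the square of the distance Z:
\sigma_Z = \frac{Z^{2}}{f\,b}\,\sigma_d
```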

RGB-D Datasets
RGB-D data are essential for solving certain problems in computer vision. Nowadays, there are open databases containing large datasets of both indoor and outdoor scenes collected by RGB-D cameras and different sensors. The data relate to scenes and objects, human activities, gestures, and the medical field, and are used for applications such as simultaneous localization and mapping (SLAM) [114], representation [115], object segmentation [116], and human activity recognition [117]. Table 5 lists some of the most well-known datasets and the applications in which they are used, such as semantic segmentation (SS), object detection (OD), pose (P), normal maps (NM), 3D semantic-voxel segmentation (3D SvS), instance segmentation (IS), and diffuse reflectance (DR).

The NYU Depth dataset is the most popular for RGB-D indoor segmentation. It was created using a Microsoft Kinect v1 sensor, is composed of aligned RGB and depth images, and consists of labeled data containing semantic segmentation as well as raw data [118]. There are two versions: NYUv1, with fewer scenes and total frames (64 scenes, 108,617 frames, and 2347 labeled RGB-D frames), and NYUv2, with 464 scenes (407,024 frames) and 1449 labeled, aligned RGB-D images at 640 × 480 resolution [119]. The NYUv2 dataset is split into a training set of 795 images and a testing set of 654 images. NYUv2 originally had 13 different categories; however, recent models mostly evaluate their performance in the more challenging 40-class setting [120].
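For readers who want to experiment with NYUv2, the labeled subset is distributed as a single MATLAB v7.3 (HDF5) file, which can be read with h5py as sketched below; the file name and key layout are as commonly documented for this release, but should be verified against the copy actually downloaded.

```python
import h5py
import numpy as np

# The labeled NYUv2 release is one HDF5 file (commonly named as below).
with h5py.File("nyu_depth_v2_labeled.mat", "r") as f:
    # Keys as documented for this release; verify against your copy.
    images = np.array(f["images"])   # RGB frames
    depths = np.array(f["depths"])   # per-pixel depth in meters
    labels = np.array(f["labels"])   # semantic segmentation labels

# MATLAB stores arrays column-major, so axes typically need reordering
# before use, e.g., to (frame, height, width, channel) for the images.
images = images.transpose(0, 3, 2, 1)
depths = depths.transpose(0, 2, 1)
labels = labels.transpose(0, 2, 1)
print(images.shape, depths.shape, labels.shape)  # 1449 labeled frames
```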
The SUN RGB-D dataset [110] belongs to the same category as NYU. The data were acquired with structured light and ToF sensors and are used for semantic segmentation, object detection, and pose estimation. This dataset provides 10,335 RGB-D images with the corresponding semantic labels. It contains images captured by different depth cameras (Intel RealSense, Asus Xtion, Kinect v1/2), since they are collected from previous datasets; therefore, the image resolutions vary depending on the sensor used. SUN RGB-D has 37 object classes. The training set consists of 5285 images, and the testing set consists of 5050 images [121].
The Stanford2D3D dataset consists of indoor scene images, taken with a structured light sensor, which are used for semantic segmentation. It is a large-scale dataset consisting of 70,496 RGB images with the associated depth maps. The images are at 1080 × 1080 resolution and are collected in a 360° scan fashion. The usual class setting employed is 13 classes [122].
The ScanNet dataset is an indoor dataset collected by a structured light and contains over 2.5 million frames from 1513 different scenes. It is used for 3D semantic-voxel segmentation [115].
The Hypersim dataset consists of indoor scenes that are captured synthetically and used for normal maps, instance segmentation, and diffuse reflectance [123].

Advantages and Limitations of RGB-D
RGB-D camera technology, as mentioned above, is increasingly being used in a variety of applications. However, as with any technology, apart from the advantages it provides, it also has some limitations, especially in terms of data collection. For instance, real-time performance, dense models, the absence of drift under local optimization, and robustness to scene changes are camera innovations that do not work for large areas (voxel grid), far-away objects (active ranging), or outdoors (IR). Moreover, the technology requires a powerful graphics card and consumes considerable battery power (active ranging). Table 6 shows the advantages and limitations of RGB-D cameras based on their sensors. In particular, the advantages and limitations of active and passive sensors in general are listed, and then specified for the subcategories of active sensors. In addition, there is a focus on sensor errors and the inaccuracies that arise from the measurement procedure.

Conclusions
In this report, through a literature review, the main aspects of the 3D reconstruction of scenes and objects, in both static and dynamic environments, using RGB-D cameras, are gathered, compared, discussed, and critically analyzed. In addition, approaches, methodologies, and techniques applied to date are summarized. Depth cameras are powerful tools for researchers, as this technology provides real-time stereoscopic models. On the other hand, this technology presents serious limitations, which need to be solved. For example, their infrared operation prevents the reconstruction of the model outdoors, and this is an issue that is still being studied. To eliminate the resulting problems, in both the process of 3D reconstruction of the objects and the final products, it may be necessary to devise stronger techniques and algorithms that cover all the weak points, resulting in more reliable objects in terms of their geometry. In addition, a very important issue that needs to be resolved concerns the amount of memory this process requires. Hence, technology should take into account both the integrity and reliability of 3D models and the performance of the systems concerned.
Some future research goals in this scientific field are to perform experiments with new CNN architectures, to create methods with smaller memory requirements, to use machine learning, and to combine UAVs with depth cameras so that it is possible to capture scenes throughout the day and night.