1. Introduction
The use of techniques and sensors for the 3D reconstruction of structures and sites of historical, artistic, and cultural interest in different parts of the world is a compelling research topic for the conservation and preservation of cultural heritage. Digitizing and documenting cultural heritage requires identifying suitable techniques and strategies capable of building 3D models in the shortest possible time while documenting the state of conservation from both a geometric and a qualitative point of view.
The 3D models can be generated using image-based 3D modelling (IBM) or range-based modelling (RBM) [
1].
IBM methods use measurements derived from 2D images (generated by a passive sensor) in order to obtain 3D models. In recent years, a very successful approach to the construction of 3D models has been the one based on the
Structure from Motion (SfM) and
Multi-View Stereo (MVS) algorithms [
2,
3,
4,
5]. In order to speed up the image acquisition phase, the use of video is a challenging problem and has been an important research topic in the areas of photogrammetry and computer vision in recent years [
6,
7]. This stems from the great potential of this technique to acquire and cover large areas in a short time and from the significant progress made in photogrammetry and computer vision for the 3D reconstruction of objects from images. In addition, the development of high-performance, low-cost sensors has enabled its use in various fields of application. The process of building 3D models for photogrammetric purposes using a sequence of images extracted from video is known as videogrammetry [
8]. Over the years, sensors and their formats have undergone rapid improvement. In 2005, Digital Cinema Initiatives (DCI) published the Digital Cinema System Specification, which establishes the standardized 2K and 4K container formats, with resolutions of 2048 × 1080 pixels and 4096 × 2160 pixels, respectively. The resolution of the video content inside follows the SMPTE 428-1 standard, which establishes the following resolutions for 4K distribution: (i) 4096 × 2160 pixels, full frame, 256:135 or ≈1.90:1 aspect ratio; (ii) 3996 × 2160 pixels, flat crop, 1.85:1 aspect ratio; and (iii) 4096 × 1716 pixels, CinemaScope crop, ≈2.39:1 aspect ratio. Cameras with 6K and 8K video resolution are beginning to appear on the market at the professional level, and consumer-grade 6K and 8K cameras are expected to become popular in the near future [
9]. Brilakis et al. [
10] wrote about 3D as-built documentation; the proposed framework includes the following five steps: (i) stereo camera set calibration, (ii) feature set detection and matching, (iii) structure and motion recovery, (iv) stereo mapping, and (v) intelligent data smoothing. Singh et al. [
11] explored the potential of a standard Sony DSC HX7V digital video camera, combined with Agisoft Photoscan software, for 3D virtual city modelling. Alsadik et al. [
12] developed a method to find the minimal significant number of video images in terms of object coverage and blur effect; this reduction in video images is convenient to decrease processing time and to create a reliable textured 3D model compared with models produced by still imaging. Murtiyoso et al. [
13] showed how, with the right workflow, low-cost imaging sensors integrated with the modern sensors found on smartphones make it possible, through a videogrammetric approach, to scan and reconstruct three-dimensional models useful for cultural heritage documentation. Through a comparison with digital single lens reflex (DSLR) cameras, this latter study showed that a good compromise between geometric quality and overall cost can be achieved in the context of the 3D documentation and reconstruction of CH. In order to document three different areas of the archaeological site “House of the Mithraeum” in the city of Mérida (Spain), Ortiz-Coder et al. [14] developed a prototype consisting of two cameras (a high-resolution camera and a video graphics array (VGA) camera); this prototype was connected to a tablet implementing a guidance system to control the trajectory and allow highly flexible and long-lasting movements. The results of this latter experimentation showed accuracies similar to, and acquisition times shorter than, those of terrestrial laser scanning in the 3D reconstruction of the point cloud.
The development of immersive video, i.e., a video involving capturing a live-action scene that presents a 360° field of view, made it possible to cover large areas in a short amount of time. Kwiatek and Tokarczyk [
15] discussed two applications of immersive video in photogrammetry. Firstly, the creation of a low-cost mobile mapping system based on Ladybug
®3 and a Global Positioning System (GPS) device was discussed. The second approach was the generation of video-based 3D reconstructions of heritage sites from immersive video (structure from immersive video); a mobile camera mounted on a tripod dolly was used to record the interior setting, and the immersive video, separated into thousands of still panoramas, was converted into 3D objects using Agisoft Photoscan Professional. Sun and Zhang [
16] used BLK360 and photogrammetry to create 3D models in order to assess the accuracy of videogrammetry applied to small settings of architectural heritage; the results show that the relative accuracy (median absolute errors/object dimensions) of spherical camera videogrammetry ranged from 1/500 to 1/2000, catering to the surveying and mapping of architectural heritage with medium accuracy and resolution.
Another technique used for the construction of the 3D point cloud is the one based on the direct measurement of three-dimensional geometric information of the object using active sensors [
17,
18]. An example of range-based technique is the Terrestrial Laser Scanner (TLS), which provides multiple advantages: (i) high accuracy; (ii) a large number of points forming an almost continuous surface; (iii) a high level of automation of the measurements; (iv) the possibility of recording the reflectance intensity of the laser beam, which could be used to investigate the properties of the analyzed object; and (v) the possibility of measuring unstructured areas [
19]. However, TLS is rather expensive and obtains data statically [
20]. In recent years, in order to increase the acquisition speed and find a system cheaper than TLS, a new range-based technique called Simultaneous Localization and Mapping (SLAM) has been developed [
21]. Using SLAM algorithms, a device can simultaneously localize (locate itself on the map) and map (create a virtual map of the location). SLAM devices are fast for data collection because they are mostly handheld or backpack-mounted, and data collection only requires walking around the setting [
22]. In general, there are two types of technological components used to implement SLAM: front- and back-end processes. Front-end processes include sensors, whereas back-end processes include mapping, localization, data fusion, and actuation, as shown in
Figure 1 [
23].
Two different methods of SLAM, distinguished by the front-end processing component, are visual SLAM and LiDAR (Light Detection and Ranging) SLAM. Visual SLAM (or vSLAM) uses images acquired from cameras and other image sensors. Visual SLAM can use simple cameras (wide-angle, fisheye, and spherical cameras), compound eye cameras (stereo and multi cameras), and RGB-D cameras (depth and ToF cameras). LiDAR SLAM, instead, is based on LiDAR measurement, which mainly uses a distance sensor. A LiDAR-based SLAM system uses a laser sensor to generate a 3D map of the setting. LiDAR measures the distance to an object by illuminating the object with an active laser “pulse” [
24]. As regards accuracy, Maboudi et al. [
25] showed that the standard deviation of the distances between the point cloud generated by a SLAM sensor (Zeb-Revo, which is a handheld 3D mapping scanner from GeoSLAM with a scan rate higher than 40,000 points per second and 30 m range indoors) and the one generated by Leica P20 TLS (reference of the comparison) was about 11 mm. Similarly, Oniga et al. [
26] found a standard deviation of 1.6 cm from the comparison of the point clouds, one generated by TLS and another by GeoSLAM ZEB Revo RT; in addition, through the comparison between the cross-sections extracted from the point clouds, the authors found that 80% of the sigma values were less than or equal to 1 cm. In the field of CH, Hess et al. [
27]—exploiting SLAM technology capable of using the sequence of data acquired during movement to estimate relative position in real time and, in particular, using the ZEB Horizon GeoSLAM sensor—analyzed geometrically and quantitatively the architectural typologies of Cistercian gardens in the context of a designated cultural landscape, located in Franconia (Bavaria, Germany). The experiments conducted in this research showed that this methodology has great potential not only in 3D reconstruction but also to provide valuable technical and scientific support in the monitoring, digital conservation, and sharing of cultural heritage. Zhang et al. [
28], for the documentation of a cultural heritage site (Turkish palace at the Seraya site—Nazareth, Israel), investigated methods to reduce noisy responses in order to improve data quality and highlight the underlying structure. Indeed, by using bilateral filtering based on point cloud normals and introducing new concepts of normal-based preservation, the authors showed the possibility of producing a more visually pleasing entity description and performing subsequent processing, including feature extraction and semantic segmentation.
1.1. Aim of Paper
The field of investigation concerns the use of latest-generation sensors capable of generating a 3D point cloud in the field of CH.
Regarding the use of images obtained from passive sensors, a photogrammetric approach, based on the SfM and MVS algorithms, was investigated by processing and comparing a series of datasets obtained from a 360° spherical camera.
Concerning the active sensors, a mobile SLAM sensor was tested in order to produce a metrically accurate point cloud in the shortest possible time. Given its reliability and metric accuracy in three-dimensional reconstruction, a phase-shift TLS was also used; the point cloud generated by the TLS represents the reference in the process of comparing the point clouds generated by the active and passive sensors.
The use of appropriate statistical indices makes it possible to investigate the metric quality of the point clouds acquired by the different sensors. Therefore, the purpose of the paper is to assess the performance (quality, accuracy, acquisition, and processing times) of the latest sensors in the 3D reconstruction of elements belonging to cultural heritage.
1.2. Organization of the Article
This paper is organized as follows.
The first part describes the active and passive sensors used in the experiment and the methodological approach adopted for the generation of the point cloud using the different acquisition technologies, with particular regard to the technique and principles of photogrammetric data acquisition (
Section 2.1 and
Section 2.2). The statistical indicators used to compare the different point clouds are described in
Section 2.3.
In the third section, after the description of a 3D test field set up in the Geomatics Laboratory of the Polytechnic of Bari (
Section 3.1), the case study, represented by a structure of important architectural and historical interest in the field of CH (
Section 3.2), is illustrated. This structure was surveyed using active sensors such as TLS and GeoSLAM (
Section 3.3 and
Section 3.4) and passive sensors such as spherical cameras (
Section 3.5).
The experimental results and validation of the multi-sensor approach are described in
Section 4.
The discussion and conclusions (
Section 5 and
Section 6) are summarized at the end of the article.
3. Point Cloud Generation by Several Sensors and Techniques
For the construction of 3D point clouds of a site of historical and architectural interest, three different sensors were used in this experiment: the Ricoh Theta Z1, the GeoSLAM ZEB Revo RT, and the Terrestrial Laser Scanner mod. Z + F IMAGER® 5010C. Specific datasets were therefore generated for each sensor used; in the case of the spherical camera, preliminary laboratory tests were also carried out. The purpose of this preliminary experimentation was to assess the quality of the point cloud obtained from the 4K images extracted from the video.
3.1. Point Cloud Quality Assessment from 7K to 4K Images
This test, conducted within the Geomatics Laboratory of the Polytechnic of Bari, allowed the evaluation of the difference between the point cloud generated from the raw images and the one generated from the frames extracted from the video.
The raw images were acquired with a tripod in different positions and on two height levels in order to achieve a rigid geometric acquisition configuration. Subsequently, a video reproducing the trajectory deduced from the positions acquired in photo mode was produced. The dataset made up of equirectangular raw images and that made up of the frames generated by the video were processed in the Agisoft Metashape environment in order to generate a dense point cloud. This software allows the dataset to be processed according to the size of the images. For example, with the “High” accuracy setting the software works with photos of the original size, the “Medium” setting causes image downscaling by a factor of 4 (2 times for each side), and at “Low” accuracy the source files are downscaled by a factor of 16.
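To make these accuracy settings concrete, the short Python sketch below converts the total downscaling factor of each setting into the per-side scale and the effective image size actually processed; the frame sizes used are assumed values for a 7K equirectangular still and a 4K video frame, introduced here only for illustration.

```python
# Minimal sketch: effective image size processed at each Metashape accuracy setting,
# given that "High" keeps the original size, "Medium" downscales by a factor of 4
# (2x per side), and "Low" by a factor of 16 (4x per side).

ACCURACY_FACTORS = {"High": 1, "Medium": 4, "Low": 16}  # total pixel-count factors

# Assumed frame sizes, for illustration only (not taken from the experiment).
FRAME_SIZES = {"7K still": (6720, 3360), "4K video frame": (3840, 1920)}


def effective_size(width: int, height: int, total_factor: int) -> tuple[int, int]:
    """Per-side downscaled size for a given total pixel-count factor."""
    per_side = int(round(total_factor ** 0.5))  # 4 -> 2x per side, 16 -> 4x per side
    return width // per_side, height // per_side


if __name__ == "__main__":
    for label, (w, h) in FRAME_SIZES.items():
        for setting, factor in ACCURACY_FACTORS.items():
            ew, eh = effective_size(w, h, factor)
            print(f"{label}, '{setting}' setting: {ew} x {eh} px")
```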
In this case study, the “High” setting was used for the alignment process and the “Low” setting to generate the dense point cloud. The two point clouds were compared with each other in CloudCompare software, version 2.11; the comparison showed that the two point clouds were consistent with each other, i.e., they presented a maximum difference of 0.01 m (
Figure 5).
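The cloud-to-cloud comparison performed in CloudCompare can be approximated, in its simplest form, by nearest-neighbour distances between the two clouds. The following Python sketch is a minimal illustration under the assumption that both clouds have been exported as plain-text XYZ files (file names are hypothetical); it does not reproduce the exact local-modelling options of the software.

```python
import numpy as np
from scipy.spatial import cKDTree


def c2c_distances(reference: np.ndarray, compared: np.ndarray) -> np.ndarray:
    """Nearest-neighbour (C2C-like) distance from each point of 'compared'
    to the 'reference' cloud; both inputs are (N, 3) arrays of XYZ coordinates."""
    tree = cKDTree(reference)
    distances, _ = tree.query(compared, k=1)
    return distances


if __name__ == "__main__":
    # Hypothetical exports of the two laboratory point clouds.
    ref_cloud = np.loadtxt("cloud_raw_7k_images.xyz")[:, :3]
    cmp_cloud = np.loadtxt("cloud_4k_video_frames.xyz")[:, :3]
    d = c2c_distances(ref_cloud, cmp_cloud)
    print(f"mean = {d.mean():.3f} m, max = {d.max():.3f} m")
```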
Furthermore, the two point clouds of the entire structure, generated from the 4K and 7K images, were compared with one obtained from a TLS survey performed with an HDS3000 scanner, which has a position accuracy of 6 mm.
The latter comparison showed the quality of the point cloud obtained by extracting frames from the video and the level of accuracy achievable using equirectangular images (
Figure 6 and
Figure 7).
Comparing the point cloud generated by the TLS (reference) with the point clouds generated from the 4K and 7K equirectangular images, an average distance of 0.04 m was observed, with maximum distances of 0.029 m and 0.026 m, respectively.
3.2. Experimentation on Cultural Heritage Site: Buziaș Colonnade
The “Buziaș Colonnade” is a site of significant historical interest, located in a park of about 20 hectares in the homonymous town of Buziaș (
Figure 8) and about 30 km from the city of Timișoara (capital of the Timiș district—Romania).
The colonnade was commissioned by Emperor Franz Joseph of Austria around 1875 and was intended as a place for his wife, Empress Elisabeth, better known as Sisi, to stroll during her stays for spa treatment.
The Buziaș Colonnade is built in the Byzantine architectural style, with wood carvings that give the impression of huge lace; with a total length of 533 m, it is the only one of its kind in Europe.
3.3. Terrestrial Laser Scanner
The TLS Z + F IMAGER
® 5010C used for the experiment has a range of about 187 m and is capable of acquiring point data with a vertical FOV of 320° and a horizontal FOV of 360°. With the instrumentation used, four different quality levels can be set for the TLS survey, depending on the resolution and measurement quality. The resolution was set to “High” (6 mm at 10 m), with a balanced, normal quality setting. To scan the entire architectural structure, 54 scans were carried out. Post-processing was performed using the in-house Z + F LaserControl software; the scans were aligned manually using flat targets as reference points that had been accurately positioned within the investigated site. The alignment of the scans was assessed by means of the final RMSE; in total, approximately 280 million points were acquired. The following images show several details of the point cloud obtained from the TLS survey processing (
Figure 9).
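As a minimal illustration of how such a registration check can be performed, the sketch below computes the RMSE of the 3D residuals on the flat targets, assuming their coordinates are available both in the registered scans and in the reference frame; the file names and layout (one target per row, X Y Z columns, same order in both files) are hypothetical.

```python
import numpy as np


def registration_rmse(targets_scan: np.ndarray, targets_ref: np.ndarray) -> float:
    """RMSE of the 3D residuals between target coordinates measured in the
    registered scans and their reference coordinates ((N, 3) arrays, same order)."""
    residuals = targets_scan - targets_ref
    return float(np.sqrt(np.mean(np.sum(residuals ** 2, axis=1))))


if __name__ == "__main__":
    # Hypothetical target coordinate files (X Y Z per row), for illustration only.
    scan_targets = np.loadtxt("targets_registered_scans.txt")
    ref_targets = np.loadtxt("targets_reference.txt")
    print(f"registration RMSE = {registration_rmse(scan_targets, ref_targets) * 1000:.1f} mm")
```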
3.4. GeoSLAM
The survey of the structure under consideration was carried out using the GeoSLAM ZEB Revo RT. After remaining stationary for about 15 s in a barycentric position with respect to the area to be surveyed, the operator began to walk around in order to acquire the LiDAR data. In particular, in order to reconstruct the entire structure, several closed paths were designed. Starting from the left front, the outer side was acquired first, followed by the inner side of the structure; the left side was then acquired in a similar way.
Figure 10 shows some details of the acquired point cloud as well as, in red, the paths followed for the data acquisition. Subsequently, the 3D data acquired from the laser scan were processed using the desktop software GeoSLAM Hub 6.1.0, whose different steps are rather automated. The format of the output point cloud was set to “LAS” and the density of the point cloud was set to 100% with shaded colors.
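For a quick check of the exported product before further processing, the “LAS” point cloud produced by GeoSLAM Hub can be read, for instance, with the laspy library; the sketch below is a minimal example with a hypothetical file name.

```python
import numpy as np
import laspy  # pip install laspy


def load_las_points(path: str) -> np.ndarray:
    """Read a LAS file and return its points as an (N, 3) array of XYZ coordinates."""
    las = laspy.read(path)
    return np.vstack((las.x, las.y, las.z)).T


if __name__ == "__main__":
    # Hypothetical output file from GeoSLAM Hub, for illustration only.
    points = load_las_points("colonnade_zeb_revo_rt.las")
    print(f"{points.shape[0]} points")
    print("bounding box min:", points.min(axis=0))
    print("bounding box max:", points.max(axis=0))
```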
3.5. Spherical Videogrammetry
3.5.1. Point Cloud of the CH Site Using the Ricoh Theta Camera along the Axis Line of the Colonnade
A spherical video was acquired along the center line of the structure; the time of acquisition was 15 min and 46 s for a total video of 5.2 GB.
In Agisoft Metashape, it is possible to transform the video into a sequence of frames that will then be used as source images for the 3D reconstruction.
Moreover, it is possible to set an automatic frame pitch, which is useful to skip similar sequential frames. Three values are available (small, medium, large): the “small” value corresponds to a displacement of about 3% of the image width, the “medium” value to a displacement of 7%, and the “large” value to a displacement of 14% of the image width. In this case, the “small” displacement value of 3% was used. From the video acquired along the axis line of the colonnade, 25,216 frames were extracted; considering the number of frames and the high overlap between images, a dataset of 2295 frames was created by selecting one frame every 10 (about 3 frames per second) from the initial 25,216. The extracted frames were aligned in Agisoft Metashape with the “High” setting. The processing of this dataset did not provide good results, as only 928 images were correctly aligned; this was due to the weakness of the geometric configuration of the spherical images.
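The same decimation (one frame kept every ten) can also be reproduced outside Metashape; the following OpenCV sketch, with hypothetical file and folder names, extracts every tenth frame from the equirectangular video and saves it as a still image.

```python
import os
import cv2  # pip install opencv-python


def extract_frames(video_path: str, out_dir: str, step: int = 10) -> int:
    """Save every 'step'-th frame of a video as a JPEG and return the number saved."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video (or read error)
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    n = extract_frames("colonnade_axis_line.mp4", "frames_axis_line", step=10)
    print(f"{n} frames extracted")
```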
3.5.2. Point Cloud of the CH Site Using the Ricoh Theta Camera Following a Sinusoidal Pattern
Due to the poor results obtained, a new acquisition following a sinusoidal pattern was carried out, according to the sketch reported in
Figure 5b. The acquisition of the entire structure lasted 25 min and 37 s, for a total video of about 10 GB.
As described in the previous section, from the video a new dataset of 4153 images was built. In order to simplify the alignment process, the dataset was split into three datasets (see
Figure 11).
All the datasets were processed with the “High” setting in the alignment process, whereas the “Low” setting was used when building the dense 3D point cloud. The results of the photogrammetric processing are summarized in
Table 1.
Lastly, the 3D point cloud was georeferenced using GCPs derived from the TLS survey. The average total error on the GCPs across the three datasets was about 0.02 m. Subsequently, the three datasets were merged; in this way, a point cloud of about 12 million points was generated.
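Once the three blocks share the same reference system defined by the GCPs, merging them reduces to a simple concatenation; the sketch below assumes the three georeferenced dense clouds have been exported as plain-text files with identical column layouts (hypothetical file names).

```python
import numpy as np


def merge_clouds(paths: list[str]) -> np.ndarray:
    """Concatenate georeferenced point clouds (same coordinate system) stored as
    text files with identical column layouts into a single array."""
    return np.vstack([np.loadtxt(p) for p in paths])


if __name__ == "__main__":
    # Hypothetical exports of the three photogrammetric blocks, for illustration only.
    merged = merge_clouds(["block_1.xyz", "block_2.xyz", "block_3.xyz"])
    np.savetxt("colonnade_videogrammetry_merged.xyz", merged, fmt="%.3f")
    print(f"merged cloud: {merged.shape[0]} points")
```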
5. Discussion
The three sensors used were able to reconstruct a three-dimensional point cloud. From a statistical point of view, the point cloud generated by videogrammetry, compared with the TLS reference cloud, showed an average difference of about 0.20 m and a variance of about 0.16 m. When comparing GeoSLAM with the TLS reference cloud, the mean difference and variance were of the order of a few centimeters. Further statistical parameters were analyzed in the descriptive comparison of the processed three-dimensional models. In particular, the kurtosis, i.e., the parameter describing the deviation from distributional normality, was approximately 1.80 in both cases, indicating a flattening of the distribution curve (outliers less extreme than in the normal distribution). As regards the skewness, the values obtained were close to zero, indicating an almost symmetric distribution with a slight tendency towards left skewness for both sensors. The calculated MAD values made it possible to identify the presence of outliers; in the case of the TLS-GeoSLAM comparison, a value of approximately 0.05 m was obtained, whereas for the dataset obtained from spherical videogrammetry the MAD was approximately 0.15 m. This means that the GeoSLAM point cloud, in contrast to that of the survey carried out with a videogrammetric approach using a spherical sensor, was closer to the TLS point cloud. The values obtained from the statistical analysis were confirmed graphically by the extraction of a profile from a significant part of the structure. The profile confirmed, on the one hand, the noisiness of the dense point cloud obtained by spherical videogrammetry and, on the other, the high quality of the point cloud obtained with the GeoSLAM sensor. However, it is necessary to take into account the complexity of the operations related to both the surveying and data-processing activities; therefore, to analyze the performance and peculiarities of each sensor, a series of additional quantitative and qualitative parameters must be considered. The evaluation of the performance of the sensors used in this case study showed that each sensor has different characteristics regarding use, processing, and quality of the final results. It is not possible to express an unambiguous evaluation of these aspects; to this end, various aspects characterizing each sensor were taken into consideration, as shown below (
Table 3).
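Before turning to these additional aspects, it may help to note that the statistical indices used above (mean, variance, skewness, kurtosis, and MAD of the C2C distances) can be computed with a few lines of code; the sketch below assumes that the C2C distances of each comparison have been exported as one-column text files (hypothetical file names).

```python
import numpy as np
from scipy import stats


def describe_distances(d: np.ndarray) -> dict:
    """Descriptive statistics of a 1D array of C2C distances."""
    return {
        "mean": float(np.mean(d)),
        "variance": float(np.var(d)),
        "skewness": float(stats.skew(d)),
        "kurtosis": float(stats.kurtosis(d, fisher=False)),  # equals 3 for a normal distribution
        "MAD": float(stats.median_abs_deviation(d)),
    }


if __name__ == "__main__":
    # Hypothetical exports of the two comparisons against the TLS reference cloud.
    for label, path in [("TLS vs. spherical videogrammetry", "c2c_tls_sphere.txt"),
                        ("TLS vs. GeoSLAM", "c2c_tls_geoslam.txt")]:
        distances = np.loadtxt(path)
        print(label, describe_distances(distances))
```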
For each parameter in
Table 3, a value from 1 to 10 was assigned. Some parameters are objective, i.e., they take into account directly measurable and quantifiable aspects; others, on the contrary, require a subjective evaluation. For this reason, parameters that are not directly measurable were obtained through a questionnaire submitted to the Geomatics research team of experts in the field of surveying and 3D modelling.
The indicators taken into consideration describe the quantitative and qualitative aspects of the point cloud in terms of survey operations, processing, accuracy, detail, relative costs of the instrumentation, and the required specialization of the operators. In particular, for the “Acquisition” indicator, two parameters were taken into consideration: the acquisition time required for the surveying activities and the ease with which they were carried out. Similarly, for the “Processing” indicator, the processing times and the ease of use of the specialized software necessary for the production and final processing of the dense cloud were analyzed. For both of these indicators, the quantification of the grade with reference to time required normalizing the “time” figure, assigning a lower grade in correspondence with long execution and processing times (and vice versa). Regarding the “Results” indicator, the point clouds produced with the different sensors were analyzed in terms of the final accuracy achieved; the geometric resolution and the chromatic quality of the resulting three-dimensional model were also introduced as comparison and evaluation parameters. The last indicator analyzed was “Costs,” relating both to device purchase and maintenance and to the costs of the specialized operators in charge of the survey operations; in this case, lower grades were assigned in correspondence with higher costs.
The grades were translated into a ranking from 1 to 5 stars for each of the indicators analyzed, in order to give a simpler and quicker reading of the potential of the sensors; one star corresponds to the lowest level of classification, whereas five stars correspond to the highest. The results of this ranking are shown in
Table 4, where the star rating is rounded up or down.
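As an illustration of how the time normalization and the grade-to-star conversion could be implemented, a minimal Python sketch follows; the linear scaling and the example acquisition times are assumptions for illustration, not the exact scheme or values used for Table 3 and Table 4.

```python
def time_to_grade(t: float, t_min: float, t_max: float) -> float:
    """Map a time to a 1-10 grade, assigning higher grades to shorter times
    (illustrative linear scaling between the fastest and slowest values)."""
    if t_max == t_min:
        return 10.0
    return 1.0 + 9.0 * (t_max - t) / (t_max - t_min)


def grade_to_stars(grade: float) -> int:
    """Convert a 1-10 grade to a 1-5 star rating, rounding to the nearest star."""
    return max(1, min(5, round(grade / 2.0)))


if __name__ == "__main__":
    # Assumed acquisition times (minutes), for illustration only.
    times = {"TLS": 240.0, "GeoSLAM": 40.0, "Spherical camera": 26.0}
    t_min, t_max = min(times.values()), max(times.values())
    for sensor, t in times.items():
        g = time_to_grade(t, t_min, t_max)
        print(f"{sensor}: grade {g:.1f} -> {grade_to_stars(g)} stars")
```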
Table 4 also shows that, although TLS provides a high-quality point cloud, the costs and the skills required are considerable in terms of both data acquisition and processing. The GeoSLAM sensor showed high versatility in data acquisition but required a high level of professionalism and medium-high costs for the three-dimensional reconstruction of the data. As far as the spherical camera is concerned, despite the low instrumental and professional costs, the ease of use, and the reduced acquisition times, it showed a lower level of accuracy and density in the resulting point cloud.
6. Conclusions
The paper reports the comparison of point clouds generated by three different sensors: active sensors such as TLS and GeoSLAM and passive sensors such as the spherical camera.
The point cloud generated by the TLS survey represents the reference model for comparison with other models obtained from other sensors thanks to its ability to describe the architectural elements of the structure with high density and accuracy.
Regarding the Ricoh Theta Z1 spherical camera, the potential of spherical videogrammetry was analyzed; this approach allowed numerous equirectangular images to be generated in 4K. From these images, it was possible to create a point cloud in the SfM and MVS environment. A key role in the construction of the point cloud was played by the acquisition technique of the video and, consequently, of the extracted images; by using a sinusoidal acquisition pattern, it was possible to obtain a 3D point cloud of the structure under investigation. However, in some parts of the structure, the point cloud was rather noisy. In fact, as shown in the profile in Figure 12, the point cloud generated by the spherical camera was very noisy and unable to accurately describe the architectural elements of the structure. The comparison between the model generated by the spherical camera and the one obtained by TLS showed an average deviation of about 20 cm, due to the noisiness of the point cloud and to the numerous outliers present.
The point cloud generated by the GeoSLAM sensor provided a detailed and accurate three-dimensional reconstruction of the site. In fact, the C2C comparison with the TLS point cloud yielded an average distance of a few centimeters. In contrast to the model reconstructed from the equirectangular images, the point cloud obtained from the GeoSLAM sensor was more detailed and accurate in the reconstruction of the finest architectural elements. Compared to the model generated by the TLS, however, the density of the point cloud was lower, and it showed slight noise on the different elements of the structure. Although GeoSLAM presented some difficulties in acquiring the data and handling rather long structures, it made it possible to obtain a detailed point cloud suitable for the 3D reconstruction of the structure.
The application of the spherical videogrammetry technique, on the other hand, made it possible to rapidly acquire even structures with complex geometry; however, the quality of the processed point cloud was barely suitable for the 3D reconstruction of the individual architectural elements. Therefore, research must be directed towards the development of algorithms that can improve the quality of the point cloud and, at the same time, reduce the noise characteristic of spherical sensors. In addition, the development of sensors capable of generating higher-resolution frames, such as 7K, could improve the 3D reconstruction process by providing better-quality architectural detail.