1. Introduction
In recent years, structure-from-motion coupled with multi-view stereo (SFM-MVS) techniques have been a focus of research in photogrammetry (image-based three-dimensional reconstruction) and computer vision, owing to the availability of low-cost vision sensors and methods for automated image processing [
1,
2]. SFM-MVS has influenced various areas, such as medicine (e.g., to examine changes in muscle tissue [
3]), cultural heritage (e.g., to document digital museum archives [
4], and visualization of museums in a virtual environment [
5,
6]), criminal investigation (e.g., forensic infography [
7], and inspection for concealed weapons [
8]), reverse engineering (e.g., investigation of industrial components [
9]), forestry and ecology [
10,
11], virtual and augmented reality [
12,
13,
14], applications in the entertainment and gaming industries [
15,
16], as well as other fields.
SFM is a non-contact measuring technique used to find a set of feature points that appear in multiple images. These feature points are used to recover the captured scene and estimate the position and orientation of the camera stations. Using the orientation parameters, the three-dimensional (3D) coordinates of the camera stations and a sparse point cloud can be estimated in 3D space. MVS techniques can then be employed to group images that share common viewpoints and add more points to the SFM cloud [
17]. However, surfaces with poor visual texture (uniform, monotone, or repetitive textures) pose a problem for the extraction of the feature points required for the correspondence search (tie point detection) between different images; the lack of visual features results in holes in the point clouds. Typically, features are identified using a feature detector-descriptor algorithm. Surfaces with strong morphological features and high-frequency color changes enable easy and unambiguous identification of feature points and are considered SFM-MVS friendly. Conversely, surfaces with uniform, monotone visual textures degrade the quality of 3D reconstruction. Texture analysis has therefore been widely investigated in computer vision and pattern recognition because of its ability to extract discriminative features [
18].
Accuracy, time, and cost are important factors when choosing a 3D scanning approach. Compared to other 3D scanning techniques (e.g., laser scanning and time-of-flight), photogrammetry is time efficient and cheaper, but it lacks the accuracy needed for a range of applications. In particular, objects with texture-less surfaces challenge SFM-MVS pipelines due to the lack of visual features. The problem of extracting features on texture-less surfaces is often addressed by artificially enriching surfaces with image patterns. In [
3], Chang et al. proposed a multi-view image capturing system with projectors in order to monitor the deformity of a human spine to diagnose scoliosis. Because the torso surface is uniform and texture-less, synthetic image patterns were added to the surface while using video projectors before the image acquisition process. Ahmadabadian et al. implemented an automatic portable system for image acquisition with a novel projection system [
19,
20]. A total of 60 colored LED lights were located around a turntable to project a pattern on the object placed on the turntable. The structural resolution evaluation of the complex object resulted in an error of 0.3 mm for the system. With the application of noise function-based patterns (NFBPs), Koutsoudis et al. [
21] presented a data collection variant in order to improve the reconstruction quality using a commercial software package. The performance of the NFBPs was verified on a Cycladic figurine, where the wavelet noise pattern achieved a minimum root mean square (RMS) error of 0.24 mm. Recently, Santosi et al. [
22] introduced the idea of noise patterns generated from irrational numbers and random numbers of various hues. The performance evaluation of the noise patterns involved the reconstruction of an aluminum test model; the highest-ranked pattern achieved a standard deviation of 0.173 mm and a mean distance of 0.016 mm. In our previous research [
23,
24,
25], we have shown that the structural resolution and elements in a pattern affect the quality of reconstruction. A random dense mesh of artificial elements on a texture-less surface results in a higher accuracy when compared with a uniform and repetitive texture.
In this study, we evaluate the performance of feature extraction methods on texture-less surfaces by applying synthetic noise patterns. The patterns (images) are generated from three known sources that have been proven to be effective, namely, noise functions, irrational numbers, and random numbers. Based on the related literature previously reviewed, we assume that the arrangement and intensity values (pixels) of the pattern images have an impact on the ability of the feature extraction methods to detect and identify the feature points. The aim is to evaluate the noise patterns with different feature extraction methods while using the results of real and virtual planar surface 3D reconstruction. Additionally, the aim is to find a combination of a noise pattern and a feature extraction method that gives the highest accuracy by evaluating the reconstructed polygonal models of a texture-less object. Furthermore, we develop a MATLAB-based plug-in that encompasses state-of-the-art methods for feature extraction (HARRIS, Shi-Tomasi, MSER, SIFT, SURF, KAZE, and BRISK) and feature matching.
The remainder of this paper is organized, as follows. In
Section 2, we present a brief overview of the SFM-MVS pipeline and different feature extraction methods. In
Section 3, the generation of noise patterns, outline data collection, and a feature extraction and matching plug-in are described. In
Section 4, we present and explain the evaluation results, which are discussed in
Section 5. We conclude the paper in
Section 6 by outlining important findings.
3. Proposed Methodology
As explained above, the morphological features of the object surfaces to be reconstructed are very important. Poorly textured and repetitive structures may lead to an erroneous or incomplete representation of the original scene due to a lack of visual features. A feature detector requires a number of distinct features to function properly; if the surface offers none, it fails. To avoid this problem, the examined surface must be artificially enriched by projecting synthetic image patterns, which add a very dense network of interest points. In this study, we therefore investigate which feature detection method is most suitable under these very specific conditions.
3.1. Pattern Generation
For the purpose of this research, we chose seven different types of noise patterns generated from three different sources: noise functions, irrational numbers, and random number generators. The first three patterns (Salt & Pepper, Gauss, and Speckle), categorized as NFBPs, were produced using the imnoise MATLAB function. NFBPs are generated by deterministic algorithms known for their pseudo-randomness and ability to provide irregularities. The second category of patterns is based on the digits of the irrational numbers π = 3.14…, the golden ratio φ = 1.61…, and Euler's number e = 2.71…; hence, they are known as irrational number patterns (INPs). The digits for the INPs were obtained using the number-crunching program "y-cruncher". These numbers were chosen because of their unique feature: the randomness of the digits after the decimal point. The last two patterns (Random and Random Eq) are categorized as random number patterns (RNPs) because they share the same generation method. The random numbers for these patterns were generated using the MATLAB function randi, a pseudo-random number generator.
For the INPs, the strings of digits obtained from the y-cruncher software consist of numbers from 0 to 9, where digits 1 to 8 represent particular gray levels, and digits 0 and 9 represent black and white, respectively. The digit strings of each INP were equalized to the range 0–255, resulting in a uniform intensity histogram. Similarly, the Random Eq pattern was produced by applying histogram equalization to the Random pattern to form a uniform intensity histogram. The number of digits required for both INPs and RNPs depends on the resolution of the projecting device; a total of 786,432 digits were generated to match the 1024 × 768 pixel resolution of the native projection system. Once a sufficient number of digits were calculated, the bitmap representations (synthetic images) of the patterns were generated in MATLAB.
Figure 1 shows the partial bitmaps and intensity histograms of the generated synthetic images.
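The digit-to-intensity mapping and the equalization step described above can be sketched as follows. This is a minimal Python illustration of our reading of the procedure: the function names are ours, and NumPy's pseudo-random generator stands in for MATLAB's randi.

```python
import numpy as np

def digits_to_pattern(digits, width, height):
    """Map a string of decimal digits to an 8-bit grayscale bitmap:
    digits 1-8 become intermediate gray levels, 0 and 9 black and white."""
    vals = np.frombuffer(digits.encode("ascii"), dtype=np.uint8).astype(np.int32) - ord("0")
    img = (vals[: width * height] * 255 // 9).astype(np.uint8)
    return img.reshape(height, width)

def equalize(img):
    """Histogram equalization to a near-uniform intensity histogram in 0-255
    (the step that turns the Random pattern into Random Eq)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf[0]) * 255.0 / (cdf[-1] - cdf[0])
    return cdf[img].astype(np.uint8)

# RNP: 786,432 pseudo-random digits to match a 1024 x 768 projector
rng = np.random.default_rng(0)
digits = "".join(map(str, rng.integers(0, 10, size=1024 * 768)))
random_pattern = digits_to_pattern(digits, 1024, 768)
random_eq_pattern = equalize(random_pattern)
```

Feeding the y-cruncher digit strings through the same two functions would yield the equalized INP bitmaps in the same way.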
3.2. Experiment Setup and 3D Reconstruction Outline
The experiment was performed in two phases. In the first phase, the 3D digitization of a planar surface was accomplished in order to examine the performance of noise patterns using various feature extraction methods. The planar surface reconstruction assessment led to the selection of the best performing pattern that was used for the reconstruction of a complex featureless 3D object in the second phase of the experiment.
3.2.1. Phase One: Planar Surface Data Collection
A monochromatic planar surface is a worst-case scenario for an SFM-MVS pipeline because it contains none of the distinctive morphological features required by the feature extraction process. This characteristic makes it useful for evaluating the synthetically generated images (patterns) as well as the feature extraction methods, and it offers several other benefits. First, the experimental setup is easy to adjust, and no special equipment is required. Second, there is no occlusion problem during image acquisition, and the noise patterns are visible in all images. Third, only a minimum number of images is required, reducing the time needed for 3D reconstruction. Finally, planar surface 3D reconstruction experiments can be performed using two approaches, real and virtual.
For real planar surface 3D reconstruction, a Sony
VPL-DX120 video projector (resolution of 1024 × 768 pixels) along with a Canon
PowerShot G11 digital camera (10 megapixels, 6–30 mm lens) were used. A planar white wall surface served as the sample material (
Figure 2 (left)). During image capture, the projector was positioned on a table at a throw distance of 3 m, giving a projected image size of 2.02 m × 1.51 m. Next to the projector, a camera mounted on a tripod was positioned so that the optical axes of the camera and projector remained convergent. For all of the projected noise patterns, a total of eight image sequences were captured at a low camera resolution of 1600 × 1200 pixels to avoid time-consuming processing. Each sequence contained five images captured with vertical camera motion while maintaining convergence between the camera and projector optical axes. The angle between two consecutive camera stations was approximately 10 degrees.
A virtual image data collection environment (scene) was created to examine the pattern behavior under ideal conditions. To realize the virtual environment, Blender (3D computer graphics software) [
43] was used (
Figure 2 (right)). A virtual planar surface was created with the same size as the projected image patterns in the real environment. The virtual scene was organized in the same way as the real environment in terms of the number and positions of the camera stations, the resolution of the rendered images, and the focal length of the camera. Finally, image data were obtained from the virtual camera stations for each image pattern.
3.2.2. Phase Two: Complex Surface Image Acquisition
In phase two, the best performing pattern from the phase one experiments was used to reconstruct a polygonal 3D model of a test object using all of the feature extraction methods. The test object was produced from a CAD model using a 3D printer, in white. The printed object measured 260 × 150 × 60 mm and contained planar, arc, cylindrical, and spherical shapes, as shown in
Figure 3. Owing to its unicolor visual texture, the printed test object is a very unfavorable object for 3D reconstruction by means of SFM-MVS pipelines.
For SFM-MVS pipelines to work, the subject must remain fixed in relation to the capturing device (or vice versa) during the image acquisition process. However, image acquisition becomes difficult when the subject has to be artificially enriched with image texture patterns. Two strategies can be adopted to overcome this dilemma: (i) the subject is partially enriched by projecting a pattern with a single projector; or (ii) the subject is fully enriched with patterns using multiple projectors while the image data are collected. The first approach is cumbersome because it requires post-processing steps, including scaling and the alignment of partial scans into a single mesh, which makes it difficult for nonprofessionals. In addition, the alignment of partial scans can introduce errors that affect the quality of the 3D reconstructed model. For these reasons, we selected the second strategy: only the camera was moved in relation to the test object, while the video projectors remained fixed during the entire process (
Figure 4).
The test object was placed on a static surface throughout the image acquisition process. Two video projectors were used to completely enrich the test object with the best performing noise pattern from the phase one experiments (Random Eq), which was always projected from the same static position. Both projectors were positioned at a distance of 1 m and at an angle of ≈45° to the object, but from opposite directions. The images were shot at two levels to cover the entire surface of the test object at all possible angles. At the first level, 36 images were captured by manually rotating the camera around the object, while 30 images were captured at the second level. Thus, a total of 66 images were taken, ensuring an overlap of more than 60% between images. In addition, the test object was also captured under daylight conditions without any pattern applied. The image resolution was set to the camera's maximum of 3648 × 2736 pixels. Following some preliminary tests, the focal length was set to 14 mm and the shutter speed to 1/80 s, with the f-stop chosen to suit the lighting conditions.
3.3. 3D Reconstruction Using Feature Extraction and Matching Plug-In
The SFM-MVS-based 3D reconstruction pipeline allows for the generation of 3D models starting from a set of images captured from multiple viewpoints. The SFM-MVS is a sequential pipeline that consists of an initial stage of feature extraction from the images, as discussed in
Section 2. The overall performance of the pipeline strongly depends on the quality of the initial feature extraction stage. In computer vision, there are various feature extraction methods for detecting and describing feature points in an image. Therefore, determining which feature extraction method (
Section 2.1) delivers the most discriminative power and robustness is significant. However, the SFM-MVS pipelines that are offered by different research groups only include popular methods, such as SIFT or SURF [
2]. On the other hand, black-box SFM-MVS-based software such as Agisoft Metashape [
44] does not provide any information about the methods used. To this end, we developed a MATLAB-based plug-in that includes state-of-the-art algorithms for feature extraction and feature matching, and generates SIFT files written in D.G. Lowe's ASCII format for further processing in the pipeline.
The SFM-MVS pipeline is implemented in two segments, as shown in
Figure 5. In the first segment, feature detection and description algorithms identify potential keypoints and localize their positions in the 2D images using all of the methods described in
Section 2.1 (except SIFT). Next, the extracted feature points are saved in SIFT files (feature descriptors) in Lowe's ASCII format. The header of each SIFT file is composed of the number of detected points and the size of the descriptor; the subsequent lines describe the 2D position and descriptor vector of each feature point identified in the image. For the SIFT, SURF, KAZE, and BRISK feature detection methods, their designated feature descriptors are used, while the SURF descriptor is employed for the HARRIS, Shi-Tomasi, and MSER feature detectors. The parameters used for feature detection and description for each method are indicated in
Appendix A. Subsequently, the extracted feature points are matched between image pairs in the sequence using one of the methods described in
Section 2.2. In this work, we used an approximate nearest neighbor search to match large sets of image features with a threshold-based strategy. The match threshold was set to 5 for binary (BRISK) and 1 for nonbinary (SURF, KAZE, and SIFT) feature descriptors to return only strong matches. Two feature vectors are matched only when the distance between them is less than the threshold, which represents a percentage of the distance from a perfect match. The matched feature points are then saved in a text file that contains the names of the image pairs and the total number and indices of the matched features for all image pairs. The SIFT and feature matching files are written in this way as a VisualSfM prerequisite for the bundle adjustment (BA) process. VisualSfM is an academic program that offers a complete SFM-MVS pipeline [
45]. For the SIFT feature extraction method, the entire SFM-MVS pipeline is implemented in VisualSfM for all image sets.
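As an illustration of this file layout, a writer for Lowe's ASCII keypoint format might look as follows. This is a Python sketch, not the MATLAB plug-in itself; it assumes the conventional layout of Lowe's demo software: a header with the keypoint count and descriptor length, one line of row, column, scale, and orientation per keypoint, and descriptor values wrapped at 20 per line.

```python
def write_lowe_sift(path, keypoints, descriptors):
    """Save feature points in D.G. Lowe's ASCII keypoint format.

    keypoints:   list of (row, col, scale, orientation) tuples
    descriptors: list of equal-length integer vectors in [0, 255]
    """
    with open(path, "w") as f:
        # header: number of detected points and descriptor size
        f.write(f"{len(keypoints)} {len(descriptors[0])}\n")
        for (row, col, scale, ori), desc in zip(keypoints, descriptors):
            # 2D position, scale, and orientation of the feature point
            f.write(f"{row:.2f} {col:.2f} {scale:.2f} {ori:.3f}\n")
            # descriptor values, wrapped 20 per line as in Lowe's files
            for i in range(0, len(desc), 20):
                f.write(" " + " ".join(str(int(v)) for v in desc[i : i + 20]) + "\n")
```

A file written this way can be imported into VisualSfM alongside the corresponding image.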
In the second part of the SFM-MVS pipeline, the image sets, along with their respective SIFT files, are imported into VisualSfM for further processing. The text files comprising the feature matches between image pairs in each image set are also imported. The imported feature matches are filtered using the RANSAC algorithm to remove outliers and to find a transformation function in the form of a homography matrix, which provides the perspective transform of the second image with respect to the first. The locations and orientations of the camera stations and the 3D coordinates of tie points in 3D space are estimated, and sparse point clouds are obtained as a result of the bundle adjustment. Finally, applying the CMVS-PMVS methods to the sparse point cloud produces a dense reconstruction of the model, which can be transformed into a polygonal 3D model.
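The threshold rule used for descriptor matching can be illustrated with a brute-force nearest-neighbor sketch. This is a simplified Python rendering of our understanding of the strategy, not the approximate search actually employed; the descriptor arrays are assumed to be L2-normalized, so a perfect match has distance 0 and the largest possible distance is 2.

```python
import numpy as np

def match_features(desc1, desc2, match_threshold=1.0):
    """Return index pairs (i, j) whose nearest-neighbor distance is below
    match_threshold percent of the maximum possible descriptor distance."""
    # pairwise Euclidean distances between all descriptor pairs (brute force)
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nn = d.argmin(axis=1)            # nearest neighbor of each desc1 row in desc2
    max_dist = 2.0                   # maximum L2 distance between unit-norm vectors
    keep = d[np.arange(len(desc1)), nn] < (match_threshold / 100.0) * max_dist
    return np.column_stack([np.nonzero(keep)[0], nn[keep]])
```

With the paper's settings, a threshold of 1 retains only near-identical nonbinary descriptors, while the looser threshold of 5 accommodates binary descriptors.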
4. Results
The dense point clouds that are reconstructed from the image sets collected through real and virtual experiments in phase one (
Figure 6) are quantitatively and qualitatively analyzed. The quality of dense point clouds can be evaluated using different criteria, including the number of vertices (
Table 1) and RMS reprojection error, also known as the standard deviation (
Table 2). A reprojection error is the distance between a keypoint detected in an image and the corresponding 3D point reprojected into the same image. This distance is usually calculated using the best-fit method, and it represents the surface deviations in the point clouds. The number of vertices provides a quantitative measure and directly reflects the quality of the 3D point clouds. Although the reprojection error calculated for a point cloud provides a qualitative measure, it cannot be directly compared between point clouds, because the number of vertices, which varies between clouds, is an input to the calculation. Therefore, observing each criterion separately cannot lead to a valid conclusion regarding the quality of 3D reconstruction. The ratio of the number of vertices to the standard deviation is therefore introduced to compare the RMS (standard deviation) of each point cloud generated with the different feature extraction methods and to evaluate the performance of the image patterns. The ratio R between the number of vertices V and the standard deviation σ, expressed in pixels, is calculated using Equation (11). The obtained ratio values are then normalized in Equation (12), where the normalizing value is the highest ratio obtained by a feature extraction method for each image pattern. Finally, a quality score, Q, for each image pattern is calculated as the average of the normalized percentage values over all feature extraction methods (seven in total) through Equation (13).
The quality score, Q, only provides a relative comparison of the synthetically generated image patterns with employed feature extraction methods from the aspect of 3D reconstruction quality of point clouds. The highest quality score, Q, for an image pattern makes it the best performing among all other image patterns with the assumption that it performed best, on average, with all feature extraction methods.
Using Equations (11)–(13), the 3D reconstruction quality score, Q, was calculated for each synthetic image pattern.
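Since the displayed equations are not reproduced here, the following Python sketch gives our reading of Equations (11)–(13): the ratio R = V/σ per pattern and method, normalization by each pattern's highest ratio, and Q as the average normalized percentage over the seven methods. The array shapes and the function name are our own assumptions.

```python
import numpy as np

def quality_scores(vertices, sigma):
    """Quality score Q per image pattern.

    vertices: (n_patterns, n_methods) vertex counts (Table 1)
    sigma:    (n_patterns, n_methods) RMS reprojection errors in pixels (Table 2)
    """
    R = vertices / sigma                              # Eq. (11): ratio V / sigma
    R_pct = 100.0 * R / R.max(axis=1, keepdims=True)  # Eq. (12): percent of each pattern's best ratio
    return R_pct.mean(axis=1)                         # Eq. (13): average over the methods
```

A pattern whose methods all perform close to its best method thus scores near 100%, matching the interpretation that the top pattern performed best, on average, across all feature extraction methods.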
Figure 7 provides a graphical representation of the quality score measured for each pattern using the values in
Table 1 and
Table 2. From the graph, it is evident that the Random Eq pattern obtained the highest quality score in the real experiments. It can be noted that the quality score
Q is smaller for the image sets that were collected in the virtual environment, and this behavior can be explained by the lack of noise in virtual images.
In the second phase of the experiments, the accuracy achievable with the synthetic image pattern is tested on a texture-less 3D object to find the best feature extraction method. The test object was captured under the projection of the Random Eq pattern, the highest-ranked pattern in phase one, and polygonal 3D models of the test object were generated using all of the feature extraction methods. The raw 3D polygonal models were evaluated without any post-processing. To evaluate the reconstruction quality, the surface deviation between the polygonal models and the CAD model (reference model) was computed using CloudCompare [
46] software. For the evaluation, the polygonal models were registered with the reference model using the ICP algorithm [
47], which attempts to minimize the alignment error between the two meshes. As a result of the comparison, the mean distance and standard deviation are calculated.
Figure 8 shows the heatmap representation of the calculated distances, where red, blue, and green represent positive, negative, and zero deviations, respectively, with deviations presented in the range of ±0.5 mm. The gray color in the scale bar reflects values outside this range. In addition, the Gaussian distribution of the signed distances for all polygonal 3D models is shown in
Figure 9. Furthermore, 3D reconstruction without any pattern proved problematic, as the SFM-MVS pipeline failed to spatially align the images due to the lack of features.
Table 3 indicates the quantitative measures for each polygonal model of the test object obtained with the different feature extraction methods. The Shi-Tomasi method achieved the highest number of vertices in the polygonal models, which can be explained by the fact that the algorithm is designed to detect the maximum number of corners in the images; the projection of the image patterns produces many surface features that are well suited to the Shi-Tomasi algorithm in the feature extraction process. The KAZE method exhibited the lowest mean distance and standard deviation for the polygonal model when compared with the reference model. This behavior can be explained by the fact that, when blurring the images, the KAZE algorithm preserves the boundaries of objects, which results in more accurate matching in the subsequent process and hence a better-quality 3D reconstruction.
5. Discussion
Two types of surfaces are used to evaluate the behavior of the feature detection methods with synthetic noise patterns. In the first phase of experiments, the projection of the pattern is captured on a planar wall in the real environment. Because the surface of the observed object is planar, the pattern shown in the captured image closely resembles the projected pattern; the structure and randomness of the pattern are maintained, as is the focus of the projector. Consequently, the 3D reconstruction of the surface is essentially the reconstruction of the pattern itself. These factors demonstrate the suitability of this method for examining the efficiency of the image patterns and determining their accuracy.
In phase one experiments, the point clouds of virtual planar surfaces generally resulted in more vertices than the real planar surfaces and the quality graph also tends to be more consistent. This can be explained by the fact that in the case of real experiments, the pixels of the pattern image can lose or change their intensity when projected on the wall. Additionally, the surface color, lighting conditions, and limitations of the projection system (such as color aberration) can affect the projected image. However, these factors are not involved in the case of virtual experiments. Therefore, synthetic image patterns can behave differently in real and virtual environments. As shown in
Figure 7, the Random Eq pattern, the highest-ranked pattern under real conditions, achieved a quality score about 20% higher than the worst (Salt & Pepper) pattern, while it scored 5% lower than the same pattern in the simulated environment.
The polygonal models that are generated in the second phase of the experiments are analyzed in order to evaluate the performance of the feature extraction methods. Because the experiments were performed under real conditions, only the highest ranked pattern from the phase one real environment is used during the image acquisition process. All of the feature detection methods were able to recreate 3D polygonal models of the test object with varying accuracy. Because the Gaussian distribution of the measured distances (
Figure 9a) is the most widespread, the Harris corner detector exhibited the maximum deviation. Compared with the Harris detector, efficiency gains in terms of standard deviation were observed for the other feature extraction methods (Shi-Tomasi 43%, MSER 58%, SIFT 55%, SURF 54%, KAZE 64%, and BRISK 48%). Because 3D printing technology (Sindoh 2X Series) was used to create the test object, dimensional inaccuracies can occur during the printing process. To determine this error, the printed object was measured at 12 different locations using a digital vernier caliper, and the differences from the corresponding locations on the CAD model were computed. Subsequently, descriptive statistics of those differences, including the mean, standard deviation, median, minimum, maximum, and RMS, were calculated. Although the dimensional inaccuracies of the printed object are very small, they remain constant for all of the evaluated polygonal 3D models.
6. Conclusions
In this paper, we evaluated the performance of feature extraction methods with synthetic noise patterns on texture-less surfaces. Seven state-of-the-art feature detection algorithms were included for feature extraction and tested on a planar surface and a challenging object. Experiments were conducted in two phases: the first identified the best performing pattern, which was then used to reconstruct a texture-less object in the second phase to find the best feature extraction method. The best performing pattern was the one with the highest average quality score over all feature detection algorithms. We found that the Random Eq pattern achieved a higher quality score (63%) than the other noise patterns, followed by the Gaussian (61%) and Euler (56%) noise patterns. A distinguishing characteristic of the Random Eq pattern is its uniform intensity histogram.
Owing to its performance, the Random Eq pattern was applied to generate polygonal models of the test object for each feature extraction method. Compared to the CAD model, the polygonal model generated with the KAZE algorithm showed minimal deviations. Furthermore, a high number of mesh vertices does not ensure higher accuracy: the KAZE detector produced 86K mesh vertices (with a 64% gain), while the Shi-Tomasi detector obtained 90K (with a 43% gain). For applications where surface quality is important, such as preserving cultural heritage artifacts or creating replicas, it is vital to choose the right feature detector for high-quality 3D reconstruction. The choice of a feature detection method and a synthetic image pattern can substantially increase the accuracy of the reconstructed polygonal 3D model.
While 3D reconstruction of texture-less objects using synthetic image patterns is a viable solution where other digitization approaches are not possible or applicable, it is largely limited to indoor conditions. Although the use of multiple projectors eliminates the need for post-processing, the image acquisition stage is still time consuming and requires a very experienced camera operator. A compact system, such as a multi-camera rig with multiple projectors, could overcome some of these limitations and reduce the duration of image acquisition. Further research will focus on achieving a greater degree of automation in image acquisition for texture-less 3D surface reconstruction using SFM-MVS pipelines.