Unsupervised 3D Reconstruction with Multi-Measure and High-Resolution Loss

Multi-view 3D reconstruction technology based on deep learning is developing rapidly. Unsupervised learning has become a research hotspot because it does not need ground truth labels. Current unsupervised methods mainly use a 3DCNN to regularize the cost volume and regress image depth. This approach results in high memory requirements and long computing time. In this paper, we propose an end-to-end unsupervised multi-view 3D reconstruction network framework based on PatchMatch, Unsup_patchmatchnet. It dramatically reduces memory requirements and computing time. We propose a feature point consistency loss function and incorporate various self-supervised signals, such as photometric consistency loss and semantic consistency loss, into the loss function. At the same time, we propose a high-resolution loss method, which improves the reconstruction of high-resolution images. Experiments show that the memory usage of the network is reduced by 80% and the running time is reduced by more than 50% compared with networks using the 3DCNN method. The overall error of the reconstructed 3D point cloud is only 0.501 mm, which is superior to most current unsupervised multi-view 3D reconstruction networks. We also test on different datasets and verify that the network generalizes well.


Introduction
Three-dimensional reconstruction refers to the establishment of a mathematical model suitable for computer representation and processing of three-dimensional objects. It has been widely used in the fields of virtual reality, robotics, automatic driving, and land resource utilization [1,2]. Multi View Stereo (MVS) [3] restores dense representations of three-dimensional scenes from multi-view images and calibrated cameras. This has always been an important subject in 3D reconstruction. MVS has been extensively studied for decades.
There are many traditional methods, including voxel-based [4,5], feature point diffusion [6,7], and depth map fusion [8,9]. Voxel-based methods divide the 3D space into regular grids, and then estimate whether each voxel is attached to the surface. The disadvantages of this representation are spatial discretization errors and high memory consumption, and its accuracy mainly depends on the resolution of the voxel. In the method based on feature point diffusion, the 3D coordinates of feature points are first generated from the matched image. Then, the 3D points are expanded to the spatial neighborhood according to the principle of luminosity consistency and visibility. In this way, the blank area may be seriously affected by the texture-free problem. Depth map fusion decomposes the complex MVS problem into a relatively small per-view depth map estimation problem, focusing on one reference and several source image depth maps at a time, and then all depth maps are fused together to form the final point cloud.
Sensors 2023, 23, 136
The traditional methods are mature and achieve good results in scene reconstruction from Lambertian surfaces. However, in the case of illumination variation, low-texture areas, and non-Lambertian surfaces, the reconstructed matches are not reliable. Aanaes et al. [10] and Knapitsch et al. [11] evaluated some mainstream MVS algorithms and found that although the most advanced traditional methods performed very well in accuracy, there was still a lot of room for improvement in the completeness of reconstruction.
In recent years, with the rapid development of deep learning, many researchers have tried to introduce deep learning into MVS. Early supervised MVS methods include SurfaceNet [12] and LSM [13]; however, both are limited to small-scale reconstruction due to the common shortcomings of voxel representation. They either employ a divide-and-conquer strategy or are only applicable to synthetic data with low-resolution inputs. Recent deep learning-based methods mainly use a Convolutional Neural Network (CNN) to infer the depth map of each view and then carry out a separate multi-view fusion process to build a 3D model. The most representative one is MVSNet, an end-to-end deep learning architecture proposed by Yao et al. in 2018 [14]. The main idea is to build a cost volume based on the plane sweep process, and then apply 3DCNN regularization to regress the depth. It computes one depth map at a time instead of the entire 3D scene. The accuracy and completeness of the reconstructed point cloud and the generalization ability of the network are better than those of most 3D reconstruction methods of the same period. Therefore, this method is widely used for depth estimation in most deep learning MVS networks [15-20]. However, to ensure the accuracy of depth calculation, the memory requirement grows cubically with the image resolution. It is therefore difficult to process high-resolution images, which is a common problem of 3DCNN regularization. To solve this problem, Yao et al. [21] replaced the 3DCNN with recurrent regularization based on GRUs in 2019. They sequentially regularized 2D cost maps along the depth direction using gated recurrent units (GRUs). This greatly reduced memory consumption and made high-resolution reconstruction possible. However, the memory reduction comes at the cost of increased running time.
In 2019, Luo et al. [22] improved the MVSNet cost volume by using piecewise confidence aggregation to improve the matching accuracy and robustness. In 2020, Gu et al. [23] and Yang et al. [16] both used the coarse-to-fine method to construct the cost volume pyramid. The result is a compact, lightweight network. The network can predict high-resolution depth maps to obtain better reconstruction results. The network computing speed has been greatly improved. Different from the cost-volume method of MVSNet, Chen et al. [24] proposed a point-based deep network framework. It directly processed the target scene as a point cloud to predict depth in a coarse-to-fine way. Although this method has a good reconstruction effect, its run time increases almost linearly with the iteration level.
Different from previous methods, Wang et al. [25] introduced the idea of PatchMatch [26] into the MVS network and proposed PatchmatchNet, a supervised network. In addition, a learnable and adaptive module was adopted to improve the propagation and cost evaluation of PatchMatch. The results show that, compared with previous methods that regularize a 3D cost volume, the proposed method is faster and has lower memory requirements. It can also process high-resolution images, so it is more suitable for resource-limited devices.
Supervised learning MVS networks need to use the true depth of the image as a baseline and compare it with the depth predicted by the network to calculate the error. This approach requires accurate image depth information in the training set. This depth information needs to be either obtained by high-precision instruments such as laser scanners or calculated using traditional geometric methods. The former needs a lot of manpower and financial resources, while the latter is prone to larger errors. Because of this, it is difficult to produce high-quality MVS datasets, and not many MVS datasets are available today. Unsupervised learning can solve these problems well. It does not require image depth information. Instead, it constructs the loss function from the photometric, feature, and semantic relations among images from different perspectives. Therefore, unsupervised learning can be adapted to more datasets. It also reduces the difficulty of making MVS datasets.
Unsupervised learning MVS usually uses photometric consistency or geometric consistency as self-supervised signals. In 2019, Khot et al. [17] first proposed an unsupervised MVS framework, Unsup_MVS. It used a method like MVSNet to predict depth and used the photometric consistency among multiple views as the supervised signal. Most previous MVS infer depth maps only for reference images, while the MVS2 network proposed by Dai et al. [18] is symmetric for all views. It treats each view equivalently while predicting the depth map of each view. In 2020, Huang et al. [19] proposed an unsupervised multi-metric network framework M3VSNet. In this architecture, multi-scale pyramid feature aggregation is used to construct a 3D cost volume with more context information, and the loss function combines pixel loss and feature loss. In 2021, Xu et al. [20] combined data augmentation and semantic segmentation as self-supervised signals, making the reconstruction effect comparable to that of the most advanced supervised learning networks. Yang et al. [27] comprehensively used various methods such as deep fusion, mesh generation and deep rendering in unsupervised networks to optimize the pseudo depth.
Inspired by the above research, we propose an unsupervised multi-view 3D reconstruction network framework named Unsup_patchmatchnet based on PatchMatch. The network depth estimation module is the same as the main part of the supervised network PatchmatchNet [25], both of which adopt multi-scale depth prediction. The self-supervised method adopts the multi-metric union. It integrates the consistency of photometric, semantic, and feature points. Photometric consistency has always been considered the most basic supervision signal. It is assumed that the photometric of the same point or patch from different perspectives will hardly change. The current unsupervised MVS networks mainly adopt this method. Semantic consistency can provide abstract matching clues to guide supervision and can enhance robustness to color fluctuations. In this paper, the non-negative matrix factorization method used in reference [20] is adopted for semantic consistency. Therefore, this part does not need to be pre-trained and has good robustness to different scenes. Feature point detection and matching is a common method in traditional 3D reconstruction. The feature points generally have good robustness and can adapt to the influence of rotation, scale, and illumination. Feature point consistency utilizes the alignment of feature points matched between different perspectives to guide depth refinement. Compared with a large number of image pixels, there are fewer feature points between images. The image depth of the location of feature points can be estimated quickly by using the consistency of feature points. PatchMatch can quickly spread the correct depth to the surrounding points. Therefore, the network will be more efficient and accurate. In addition, depth maps with multiple resolutions are generated in the process of multi-scale depth estimation, and higher resolution images are input in model testing or practical applications. Therefore, we adopt the high-resolution loss method. 
We up-sample the depth maps estimated at each stage of the network to the maximum resolution, and then calculate the loss of various measures.
Our main contributions are summarized as follows:
(1) In this paper, we propose for the first time an end-to-end unsupervised multi-view 3D reconstruction network Unsup_patchmatchnet based on PatchMatch. The network incorporates multi-metric self-supervised signals.
(2) Based on the detection and matching of feature points in traditional methods, we propose the feature point consistency loss as the self-supervised signal of unsupervised MVS for the first time.
(3) For the first time, we propose a method to use high-resolution loss instead of multi-scale loss. The ablation experiment in Section 3.2.1 verifies that it can further improve the network performance.

Materials and Method
The proposed Unsup_patchmatchnet network structure is shown in Figure 1. The network can be divided into two main parts. The first part is a multi-scale depth estimation based on PatchMatch. The second part is the multi-metric high-resolution joint loss, which fuses the photometric consistency loss, semantic consistency loss, and feature point consistency loss.

Multi-Scale Depth Estimation Based on PatchMatch
The depth estimation module adopts the backbone part of PatchmatchNet [25] without significant changes. The module is divided into four stages that estimate depth maps at four different scales, gradually refining the depth from coarse to fine. The multi-scale features are extracted first, and then the depth is estimated at each scale. The process at each scale is basically the same. Starting from the smallest scale (stage 3), depth initialization is performed first, that is, depth hypotheses are sampled within the image depth range [d_min, d_max]. In each subsequent iteration, and after upsampling to a new scale, the depth is not re-initialized; instead, a small random perturbation is added to the current depth. Adaptive propagation adds, to the depth hypotheses of a pixel, the current depth values of pixels that are expected to share its depth. Adaptive evaluation calculates a weight for each depth hypothesis of each pixel and regresses the depth of each pixel. The depth map generated at the lower scale is upsampled and used as the initial depth at the higher scale. In stage 0, propagation is no longer carried out; instead, the multi-scale guided convolution network MSG-Net [28] is used to upsample the depth map and generate the final depth map.
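The initialization and perturbation steps described above can be illustrated with a small standard-library sketch. It is not the PatchmatchNet implementation: the function names, the stratified sampling scheme, the number of hypotheses, and the perturbation radius are all assumptions chosen for illustration. Only the depth range of 425 mm to 935 mm is taken from the implementation details reported later in the paper.

```python
import random


def init_depth_hypotheses(d_min, d_max, num_hypotheses, rng):
    """Draw one random depth from each of num_hypotheses equal sub-intervals.

    Assumption: stratified sampling, so that the hypotheses cover the whole
    range [d_min, d_max]. The paper only states that depths are sampled
    within this range.
    """
    step = (d_max - d_min) / num_hypotheses
    return [d_min + (k + rng.random()) * step for k in range(num_hypotheses)]


def perturb_depth(depth, d_min, d_max, radius, rng):
    """Add a small random perturbation and clamp the result to the valid range."""
    noisy = depth + rng.uniform(-radius, radius)
    return min(max(noisy, d_min), d_max)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
hypotheses = init_depth_hypotheses(425.0, 935.0, 8, rng)
refined = [perturb_depth(d, 425.0, 935.0, 5.0, rng) for d in hypotheses]
```

In later iterations only `perturb_depth` would be applied, which mirrors the statement that the depth is not re-initialized after the first stage.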

Multi-Measure and High-Resolution Loss
Unsup_patchmatchnet integrates a variety of measurement losses such as photometric consistency loss, semantic consistency loss, and feature point consistency loss.

Photometric Consistency Loss
Photometric consistency [29] means that the same object should have the same color value from different perspectives. Photometric consistency is the most common method for calculating losses in unsupervised MVS networks.
As shown in Figure 2, assuming that pixel $p_j$ in the reference image corresponds to pixel $p'_j$ in the source image, $p'_j$ can be calculated by the following formula:

$$p'_j = K\,T\left(D(p_j)\,K^{-1}\,p_j\right) \tag{1}$$

where $j$ represents the pixel index, $K$ is the camera intrinsic matrix, $T$ is the transformation matrix from the reference view to the source view, and $D$ is the depth. Thus, the image $I'_i$, which is warped from the source image to the reference view, can be constructed:

$$I'_i(p_j) = I_i(p'_j)$$

where $i$ is the index of the source view. The photometric consistency loss is the sum of the photometric losses between the reference image and all related source images:

$$L_{PC} = \sum_{i=2}^{N+1} \frac{\left\|(I'_i - I_1)\odot M_i\right\|_2 + \left\|(\nabla I'_i - \nabla I_1)\odot M_i\right\|_2}{\left\|M_i\right\|_1}$$

where $I_1$ is the reference image, $\nabla$ represents the gradient, $\odot$ is the element-wise product, $N$ represents the number of source images, and $M_i$ represents the binary validity mask of the i-th source image. The mask value of the j-th pixel $p'_{i,j}$ of the i-th source image is calculated as follows:

$$M_{i,j} = \begin{cases} 1, & \text{if } p'_{i,j} \text{ lies inside the image} \\ 0, & \text{otherwise} \end{cases}$$
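To make the reprojection step concrete, the following sketch projects one reference pixel into a source view and checks its validity. It is a simplified illustration, not the network code: it assumes a pinhole camera whose intrinsics (fx, fy, cx, cy) are shared by both views, a transformation given as a rotation matrix and a translation vector, and hypothetical function names. A real implementation would operate on whole depth maps and sample the source image bilinearly.

```python
def reproject(pixel, depth, intrinsics, rotation, translation):
    """Project a reference pixel with known depth into the source view.

    intrinsics is (fx, fy, cx, cy); rotation is a 3x3 nested list and
    translation a 3-vector, together mapping reference-camera coordinates
    to source-camera coordinates. Returns None if the point falls behind
    the source camera.
    """
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    # Back-project: X = D(p) * K^{-1} * [u, v, 1]^T
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rigid transform into the source camera frame: X_s = R * X + t
    xs, ys, zs = (
        sum(r * c for r, c in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    if zs <= 0:
        return None
    # Project with K and dehomogenize
    return (fx * xs / zs + cx, fy * ys / zs + cy)


def is_valid(projected, width, height):
    """Binary validity mask entry: 1 if the projection lies inside the image."""
    if projected is None:
        return 0
    x, y = projected
    return int(0 <= x <= width - 1 and 0 <= y <= height - 1)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (1000.0, 1000.0, 320.0, 256.0)
same = reproject((100.0, 50.0), 500.0, K, identity, (0.0, 0.0, 0.0))
shifted = reproject((320.0, 256.0), 500.0, K, identity, (10.0, 0.0, 0.0))
```

With an identity transformation the pixel maps to itself, and a 10 mm sideways translation at 500 mm depth moves the principal point by 20 pixels, which is the usual inverse relation between depth and displacement.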

Semantic Consistency Loss
In reality, due to different illumination conditions and the light reflected by the object, the same object will appear with inconsistent brightness from different viewing angles. Figure 3 shows the images of a scene in the DTU dataset from three different perspectives under the same lighting condition. The same object has different brightness at different angles. The difference is even more pronounced if the lighting conditions are different. Therefore, we introduce a semantic consistency loss to compensate for the shortcomings of the photometric consistency loss.
Semantic consistency means that the corresponding pixels of different images should have the same semantic type. In this paper, the method described in [20] is adopted. First, the pre-trained VGG16 network is used to extract image features, and then the non-negative matrix factorization method is used to cluster image pixels to generate a semantic segmentation map. As shown in Figure 4, the calculation is similar to that of the photometric consistency loss. According to Equation (1), the pixel $p'_j$ in the source image corresponding to pixel $p_j$ in the reference image is calculated. Then the warped segmentation map $S'_i$ from the i-th source view is reconstructed by bilinear sampling.
Finally, the per-pixel cross-entropy loss between the warped segmentation map $S'_i$ and the reference segmentation map $S_1$ is calculated as the semantic consistency loss:

$$L_{SC} = -\sum_{i=2}^{N+1}\left[\frac{1}{\left\|M_i\right\|_1}\sum_{j} f(S_{1,j})\,\log\left(S'_{i,j}\right) M_{i,j}\right]$$

where $f(S_{1,j}) = \mathrm{onehot}(\arg\max(S_{1,j}))$, the outer sum runs over the $N$ source views, and $M_i$ represents the binary validity mask of the i-th source image, which is calculated in the same way as in Section 3.2.1.
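The masked cross-entropy for a single source view can be sketched as follows. This is an illustrative simplification under stated assumptions: segmentation maps are flattened into lists of per-pixel class probabilities, the function name and the small constant `eps` are hypothetical, and the warped map is assumed to have been produced already.

```python
import math


def semantic_consistency(ref_probs, warped_probs, mask, eps=1e-8):
    """Masked cross-entropy between reference and warped segmentation maps.

    ref_probs and warped_probs are lists with one class-probability list per
    pixel; mask holds 0/1 validity flags. The reference label is the one-hot
    vector of the arg-max class, so only that class contributes to the loss.
    """
    valid = sum(mask)
    if valid == 0:
        return 0.0
    total = 0.0
    for ref, warped, m in zip(ref_probs, warped_probs, mask):
        if not m:
            continue
        label = max(range(len(ref)), key=ref.__getitem__)  # arg-max class
        total -= math.log(warped[label] + eps)
    return total / valid


ref = [[0.9, 0.1], [0.2, 0.8]]
warped = [[1.0, 0.0], [0.5, 0.5]]
loss_all = semantic_consistency(ref, warped, [1, 1])
loss_masked = semantic_consistency(ref, warped, [1, 0])
```

The first pixel agrees perfectly and contributes almost nothing, while the second is undecided and contributes about log 2, so the mean over both valid pixels is close to 0.347. Masking out the second pixel removes its contribution.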

Feature Point Consistency Loss
The boundary of pixel semantic clustering in the image is not accurate enough, while the position of image feature points is relatively accurate. The network can be guided to predict more accurate image depth by the position offset of the matched feature point pairs between the reference image and the source image.
Feature point consistency requires that the position of a feature point projected from the reference image into the source image coincide with the position of its matched feature point in the source image. As shown in Figure 5, the feature point $p_1$ in the reference image matches the feature point $p_i$ in the source image. The position of $p_1$ projected into the source image through Equation (1) is $p'_i$. According to the requirement of feature point consistency, $p_i$ and $p'_i$ should be at the same position. However, in practice, there is usually a deviation between the two points. The distance between the two points is the feature point consistency loss. In this paper, we use the SIFT [30] algorithm to extract feature points. The SIFT algorithm is rotation invariant and scale invariant and is robust to illumination changes. The quality of SIFT feature point matching is relatively high. Although it is relatively slow, it is only used in model training and does not affect the speed of 3D reconstruction.
The quality of feature point detection and matching is directly related to the texture of the image. As shown in Figure 6, the two scenes "Scan2" and "Scan92" are from the DTU dataset. The "Scan2" image is richly textured, while the "Scan92" image is relatively smooth. All scenes in DTU share 49 camera perspectives. In both scenes, we selected images from two camera perspectives for SIFT feature point detection and matching. The number of matched feature point pairs in Scan2 is much higher than that in Scan92. Therefore, if the sum of the position offsets of all feature points is taken as the error, the impact of high-texture scenes on the network will be much greater than that of low-texture scenes. This will reduce the generalization of the network. In addition, as shown in Figure 7, SIFT also produces wrong feature point matches for some images with repeated textures. If the average position deviation of the feature points is taken as the error, the accuracy will also be reduced by these outliers. Therefore, we take the median of the position deviations of the feature points as the feature point consistency error of two images.
When calculating the feature point consistency error, SIFT feature points are first detected and matched on the reference image (denoted as image 0) and the i-th source image, which gives the matched point pairs $(p_{0,j}, p_{i,j})$. According to Equation (1), the projected point $p'_{i,j}$ of $p_{0,j}$ in the i-th source image can be calculated. Then, the feature point consistency error of the j-th feature point is:

$$e_{i,j} = \left\|p'_{i,j} - p_{i,j}\right\|_2$$

The feature point consistency error between the reference image and the i-th source image is:

$$E_i = \operatorname*{median}_{1 \le j \le N_S}\left(e_{i,j}\right)$$

where $N_S$ represents the number of matched SIFT feature points between the reference image and the i-th source image. Therefore, the feature point consistency loss function is defined as:

$$L_{FC} = \frac{1}{N}\sum_{i=1}^{N} E_i$$

where $N$ represents the number of source images.
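The motivation for using the median can be checked with a short standard-library sketch. The point coordinates below are made up for illustration, and the function name is hypothetical. The example contains four consistent matches and one wrong match of the kind that repeated textures produce.

```python
import math
from statistics import mean, median


def feature_point_error(projected, matched):
    """Median Euclidean offset between projected and matched feature points."""
    offsets = [math.dist(p, q) for p, q in zip(projected, matched)]
    return median(offsets)


# Projections of reference feature points into the source image (made up).
projected = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
# Matched source feature points: four are 5 px away, the last is a wrong match.
matched = [(3.0, 4.0), (4.0, 5.0), (5.0, 6.0), (6.0, 7.0), (304.0, 404.0)]

robust_error = feature_point_error(projected, matched)
naive_error = mean(math.dist(p, q) for p, q in zip(projected, matched))
```

Here the median stays at the 5-pixel offset of the consistent matches, whereas the mean is pulled above 100 pixels by the single outlier. The median also does not grow with the number of matches, so richly textured scenes do not dominate the loss.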

High-Resolution Loss
At present, there are some multi-scale MVS networks similar to the one proposed in this paper [16,25]. All of them adopt the sum of losses at different scales as the total loss of the network. However, in the test process, the image resolution is relatively high. Therefore, in this paper, the depth images predicted in each stage are upsampled to high-resolution images, and then the loss of various measures is calculated.
The photometric consistency loss and semantic consistency loss are calculated at each stage. At each stage, the predicted depth map is first upsampled to the resolution of stage 0 by bilinear interpolation, and then the loss is calculated. The total loss is the sum of the losses at each stage.
The total photometric consistency loss is calculated as follows:

$$L_{PC}^{total} = \sum_{k=0}^{3}\sum_{t=1}^{n_k} L_{PC}^{k,t}$$

where $k$ is the stage index, $n_k$ represents the number of iterations of stage $k$, and $L_{PC}^{k,t}$ is the photometric consistency loss of the t-th iteration of stage $k$. The total semantic consistency loss is calculated as follows:

$$L_{SC}^{total} = \sum_{k=0}^{3}\sum_{t=1}^{n_k} L_{SC}^{k,t}$$

where $k$ and $n_k$ have the same meaning as above. The feature point consistency loss is based on image feature point detection and matching. In the experiment, we find that the accuracy of feature point detection and matching decreases greatly when the image resolution is low. Therefore, the feature point consistency loss is calculated only at stage 0. The total feature point consistency loss is calculated as follows:

$$L_{FC}^{total} = L_{FC}^{0}$$

Finally, the total loss of the network is calculated as follows:

$$L = \lambda_1 L_{PC}^{total} + \lambda_2 L_{SC}^{total} + \lambda_3 L_{FC}^{total}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 0.8, 0.1, and 0.1, respectively.
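The way the stage losses are combined can be sketched in a few lines. The weights 0.8, 0.1, and 0.1 are the values given above. Everything else is assumed for illustration: the function name, the dictionary layout, the per-stage iteration counts, and the loss values, which stand in for losses already computed on depth maps upsampled to the stage 0 resolution.

```python
def total_loss(pc_losses, sc_losses, fc_stage0, weights=(0.8, 0.1, 0.1)):
    """Weighted sum of the photometric, semantic, and feature point losses.

    pc_losses and sc_losses map a stage index to the list of per-iteration
    losses of that stage. fc_stage0 is the feature point consistency loss,
    which is evaluated at stage 0 only.
    """
    l_pc = sum(sum(per_iteration) for per_iteration in pc_losses.values())
    l_sc = sum(sum(per_iteration) for per_iteration in sc_losses.values())
    w_pc, w_sc, w_fc = weights
    return w_pc * l_pc + w_sc * l_sc + w_fc * fc_stage0


# Hypothetical iteration counts and loss values, for illustration only.
pc = {3: [1.0, 1.0], 2: [1.0, 1.0], 1: [1.0], 0: [1.0]}
sc = {3: [0.5, 0.5], 2: [0.5, 0.5], 1: [0.5], 0: [0.5]}
loss = total_loss(pc, sc, fc_stage0=2.0)
```

In this example the photometric terms sum to 6.0 and the semantic terms to 3.0, so the weighted total is 0.8 × 6.0 + 0.1 × 3.0 + 0.1 × 2.0 = 5.3.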

Results
Firstly, we conducted a comparative experiment based on the DTU dataset [10] to comprehensively evaluate the performance of Unsup_patchmatchnet, including the 3D reconstruction effect, running memory usage, and time consumption. Secondly, the ablation experiment is conducted to analyze the influence of each module on the network performance. Finally, the generalization performance of the network is verified based on the Tanks and Temples dataset.

Performance Evaluation Based on DTU
The DTU dataset is an indoor multi-view stereo dataset containing 124 different scenes. All scenes share the same 49 camera views, and each view is captured under seven lighting conditions. The split into training, validation, and test sets used in this paper is the same as that used in most previous MVS networks [13,17–19,24].

Implementation Details
We implemented the network in PyTorch and trained it using only the DTU training set. The parameter settings of the depth estimation module are the same as those of PatchmatchNet. The number of source images is set to 4. During model training, the input image resolution is 640 × 512. These images are obtained by downsampling the original images and then center cropping them. The depth sampling range is 425 mm to 935 mm. We trained in parallel on four Nvidia GTX 1080Ti GPUs with a batch size of four for a total of 40 epochs. During testing, the input image resolution is 1600 × 1200. After depth prediction, the 3D point cloud of each scene is reconstructed for network performance evaluation.
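The downsample-then-crop preprocessing can be illustrated with a small calculation. The sketch below is hedged: the helper name `center_crop_box` is hypothetical, and the downsampling factor of 0.5 applied to a 1600 × 1200 original is an assumption, since the text states only the final 640 × 512 training resolution.

```python
def center_crop_box(src_w, src_h, scale, out_w, out_h):
    """Return the downsampled size and the centered crop box.

    The box is (left, top, right, bottom) in downsampled pixel coordinates.
    """
    w, h = round(src_w * scale), round(src_h * scale)
    if out_w > w or out_h > h:
        raise ValueError("crop size exceeds the downsampled image size")
    left = (w - out_w) // 2
    top = (h - out_h) // 2
    return (w, h), (left, top, left + out_w, top + out_h)


# Assumed factor 0.5: 1600 x 1200 -> 800 x 600, then a centered 640 x 512 crop.
size, box = center_crop_box(1600, 1200, 0.5, 640, 512)
```

When images are resized and cropped in this way, the camera intrinsics need the same scaling and offset applied so that the projections between views remain consistent.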

Result on DTU Dataset
The reconstructed 3D point clouds were evaluated according to the evaluation criteria provided with the DTU dataset, which mainly include accuracy (Acc.), completeness (Comp.), and overall. Accuracy is measured by the distance from the reconstructed 3D point cloud to the ground-truth point cloud of the object and indicates how accurate the reconstructed points are. Completeness indicates how completely the surface of the object is reconstructed. Overall is the average of accuracy and completeness and serves as a comprehensive measure of error. For all three criteria, a smaller value means a smaller error in the reconstructed point cloud.
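The three criteria can be illustrated with a brute-force sketch on toy point clouds. This is a simplified approximation rather than the official DTU evaluation: the function names are hypothetical, distances are plain nearest-neighbour distances, and any masking, resampling, or outlier handling that an official evaluation script may apply is omitted.

```python
import math


def mean_nn_distance(src, dst):
    """Mean distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def dtu_style_metrics(reconstructed, ground_truth):
    """Return (accuracy, completeness, overall) as mean distances."""
    acc = mean_nn_distance(reconstructed, ground_truth)   # reconstruction -> GT
    comp = mean_nn_distance(ground_truth, reconstructed)  # GT -> reconstruction
    return acc, comp, (acc + comp) / 2.0


# Toy clouds: the reconstruction is exact but misses one ground-truth point.
recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
acc, comp, overall = dtu_style_metrics(recon, gt)
```

In this toy case the accuracy error is zero because every reconstructed point lies on the ground truth, while the missing point raises the completeness error and therefore the overall error.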
In this paper, we compare our network with several recent traditional geometric methods and with supervised and unsupervised learning MVS networks. The results are shown in Table 1. The overall performance of Unsup_patchmatchnet is still some distance from that of the best supervised learning networks, but it exceeds that of the traditional geometric methods and the other unsupervised learning networks. Compared with supervised learning networks, Unsup_patchmatchnet shows a significant gap in both accuracy and completeness. For example, compared with PatchmatchNet, its accuracy error and completeness error are higher by 0.086 mm and 0.212 mm, respectively. Compared with traditional geometric methods, the accuracy error of Unsup_patchmatchnet is not the lowest: it is 0.23 mm higher than that of Gipuma and 0.171 mm higher than that of Tola. However, its completeness has a clear advantage: the completeness error is 0.384 mm lower than that of Gipuma and 0.701 mm lower than that of Tola. Therefore, its overall performance exceeds that of the traditional geometric methods. Compared with other unsupervised learning networks, our network performs better in both accuracy and completeness. The improvement in completeness is relatively small; compared with MVS2, the completeness error is reduced by only 0.026 mm. The accuracy, however, stands out: the accuracy error is more than 0.123 mm lower than that of the other unsupervised networks. Ultimately, Unsup_patchmatchnet outperforms several other MVS networks in terms of overall performance. Figure 8 compares some of the reconstructed 3D point clouds. It can be seen that the point clouds reconstructed by Unsup_patchmatchnet are richer in detail and more complete.

Memory and Run-Time Comparison
In practical applications, an MVS network is mainly run in its test process, that is, the 3D reconstruction process. Therefore, we compare the memory usage and time consumption of Unsup_patchmatchnet with those of other networks during testing. Memory usage is the maximum memory required during testing, and time consumption is the average time needed to estimate one depth map.
We compared several state-of-the-art supervised MVS networks (MVSNet [14], CVP-MVSNet [16], PatchmatchNet [25]) and unsupervised MVS networks (M3VSNet [19], JDACS-MS [20]). The input image resolution of all networks is set to 1600 × 1200. The results are shown in Figure 9. In terms of memory usage, MVSNet, CVP-MVSNet, M3VSNet, and JDACS-MS all use more memory because of 3DCNN regularization. PatchmatchNet and Unsup_patchmatchnet adopt the PatchMatch approach, which significantly reduces memory usage to only 1/5 of that of the other four networks. In terms of time consumption, 3DCNN regularization also takes more time. CVP-MVSNet and JDACS-MS additionally adopt a cascade scheme based on 3DCNN regularization, so their time consumption is larger.

Effect of Different Loss Modules
This section analyzes the influence of the different loss function modules in Unsup_patchmatchnet on network performance. We compare the effects of the semantic consistency loss, feature point consistency loss, and high-resolution loss modules in four cases:
(a) None of the three modules (only the photometric consistency loss).
(b) Only semantic consistency loss.
(c) Semantic consistency loss and feature point consistency loss.
(d) All three modules.
Except for the above modules, all other parameter settings are the same as those in Section 3.1. Table 2 compares the reconstruction results on the DTU test set under the four cases. Adding each module to the loss function improves the accuracy and completeness of the reconstruction to some extent. After adding the semantic consistency loss module, the overall error is reduced by 10.61%. This indicates that semantic consistency loss can further optimize the predicted image depth by constraining the semantic consistency between pixels of images from different perspectives. After adding the feature point consistency module, the overall error is reduced by 7.96%. This shows that the accuracy of the predicted depth map can be further improved by constraining the position offset of feature point pairs between images from different perspectives. After adding the high-resolution loss module, the overall error is reduced by 11.48%. The resolution of the images used in testing is higher than that used in training, and the high-resolution loss method enables a network trained on low-resolution images to reconstruct high-resolution images. Figure 10 compares the reconstructed 3D point clouds in the four cases. The point cloud becomes progressively more complete and detailed as the three modules are added in turn. This supports the effectiveness of the semantic consistency loss, feature point consistency loss, and high-resolution loss modules.

Note to Table 2: √ indicates that the module is included, × indicates that the module is not included.

Effect of High-Resolution Loss
The effectiveness of the high-resolution loss method has been demonstrated above. In this section, we vary how the depth maps at different scales are upsampled in order to analyze the impact of different high-resolution loss calculation methods on network performance. As shown in Figure 11, we compare four cases, referred to as modes (a) to (d).
Figure 10. Comparison of reconstruction results after adding each module. (a-d) have the same meaning as in Table 2.
The network performance of the four depth map upsampling modes is compared in Table 3. Compared with no high-resolution loss (mode (a)), the overall reconstruction errors of modes (b), (c), and (d) are reduced by 3.53%, 7.95%, and 11.48%, respectively. Therefore, it can be concluded that the depth map resolution is positively correlated with the accuracy and completeness of the reconstructed 3D point cloud. This confirms that the high-resolution loss method is effective.

Generalization Ability on Tanks and Temples
In this section, the Tanks and Temples dataset is used to test the generalization of the network. We use the model parameters trained on the DTU dataset without any adjustment to the model. The input image size is set to 1920 × 1056, and the number of source images is set to 6. The test results are shown in Table 4. The values in the table are F-scores; the higher the score, the better the network performance. As can be seen from Table 4, in most scenes the reconstruction quality of Unsup_patchmatchnet is better than that of MVS2 and M3VSNet. In particular, the scores for the "Francis" and "Train" scenes are twice and 1.5 times those of the other two network models, respectively. Although the scores for the "M60" and "Panther" scenes are not the highest, the difference is not large: they are 11.34% and 8.37% lower than the highest scores of the other two networks, respectively. Finally, the overall average score of Unsup_patchmatchnet exceeds that of the other two network models. Figure 12 shows the reconstructed point clouds. Both smaller objects (such as "Family" and "Horse") and larger scenes (such as "Playground" and "Lighthouse") are reconstructed well by Unsup_patchmatchnet. Some non-ideal areas, such as the sand in "Playground", are also well presented. In conclusion, these results indicate that Unsup_patchmatchnet has strong generalization properties.
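For readers unfamiliar with the F-score reported above, the following sketch computes it on toy point clouds as the harmonic mean of precision and recall at a distance threshold. It is a brute-force illustration only: the function name `f_score`, the threshold value, and the toy data are assumptions for this example, and the alignment and resampling steps of a full benchmark evaluation are omitted.

```python
import math


def f_score(reconstructed, ground_truth, tau):
    """F-score at distance threshold tau (harmonic mean of precision, recall)."""

    def fraction_within(src, dst):
        # Share of points in src whose nearest neighbour in dst is closer than tau.
        hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) < tau)
        return hits / len(src)

    precision = fraction_within(reconstructed, ground_truth)
    recall = fraction_within(ground_truth, reconstructed)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy clouds: two correct points and two outliers in the reconstruction.
recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
score = f_score(recon, gt, tau=0.5)
```

Here precision is 0.5 and recall is 1.0, so the outliers lower the score even though every ground-truth point is covered.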

Discussion
The results in Table 1 demonstrate the effectiveness of our proposed network, which outperforms most current unsupervised learning methods. The improvement in accuracy is clear, but the improvement in completeness is relatively small. Compared with supervised methods, however, the gaps in accuracy and completeness remain obvious, which shows that unsupervised learning still faces great challenges. In addition, although current deep learning methods achieve better completeness than traditional geometric methods, the accuracy of traditional geometric methods is very high. Our proposed network incorporates a feature point consistency loss, which is essentially a combination of deep learning methods and traditional methods. This is why the accuracy of Unsup_patchmatchnet improves significantly compared with other unsupervised learning methods.
We demonstrate that our method offers fast running time and low memory usage in depth estimation. These comparisons concern the performance of the network at test time, because in actual use the network is mainly run for testing and the model is not retrained. However, our method takes a long time to train. All unsupervised learning methods spend more time calculating the loss function, and the feature point consistency loss we propose requires feature point detection and matching on the images, which consumes additional time. To reduce this cost, we extract and save the feature point information of the images in advance, which greatly reduces the training time.
The results in Table 3 demonstrate the effectiveness of the high-resolution loss method. When the DTU dataset is used for model training, the resolution of the input image is generally 640 × 512, whereas at test time the input image resolution is 1600 × 1200. The resolution of the training images is therefore much lower than that of the test images. We did not change the resolution of the input images, both because keeping it makes comparison with other methods easier and because it is limited by memory. Instead, we upsample the predicted depth map, which is equivalent to increasing the resolution of the training images. However, the depth map resolution is also limited by memory.
We analyzed the reasons for the poorer reconstruction quality of the "M60" and "Panther" scenes. The network designed in this paper adopts the feature point consistency loss function, which makes the depth predicted at feature point pixels more accurate. The network estimates image depth based on PatchMatch, which propagates the depth of the feature points to the surrounding pixels. Because of the complex textures in the "M60" and "Panther" scenes, the feature points are dense. As a result, some smooth areas receive more candidate depths, and the 3D points calculated for the same position on an object from different viewing angles have large errors. Points with large errors are filtered out during reconstruction, so the reconstructed point cloud tends to contain more holes and its completeness drops.
The proposed feature point consistency loss also has some limitations. As can be seen from Figure 5, some low-texture images have few or no feature points. Therefore, in some low-texture scenes, feature point consistency loss is not effective.

Conclusions
Based on the idea of PatchMatch, in this paper we propose Unsup_patchmatchnet, an unsupervised learning network for multi-view 3D reconstruction. The network incorporates a multi-measure, high-resolution joint loss that combines photometric consistency, semantic consistency, and feature point consistency. Experiments on the DTU dataset show that Unsup_patchmatchnet is superior to some current unsupervised multi-view 3D reconstruction networks in accuracy and completeness. Unsup_patchmatchnet also has the advantages of lower memory usage and lower time consumption. Tests on different datasets further verify that Unsup_patchmatchnet has good generalization properties. In the future, we will focus on the 3D reconstruction of low-texture objects. This is a difficult problem for both traditional and deep learning methods, yet low-texture scenes are very common in real life. One option is to use line features and plane features in place of point features in image feature extraction. Such higher-level features exploit more image information and can better adapt to low-texture conditions. In addition, although the network proposed in this paper has fast computing speed and low memory consumption, it is still unable to achieve real-time 3D reconstruction on mobile devices. Therefore, we will continue to make the network structure more lightweight in the future to improve its practicality.