DDL-MVS: Depth Discontinuity Learning for Multi-View Stereo Networks

Multi-view stereo (MVS) techniques have been widely used to obtain dense 3D reconstructions from images. MVS allows aerial images to be converted into accurate 3D models, which provide a more comprehensive representation of large scenes. This 3D information can be used for various applications, such as digital surface modelling [1], landform analysis [2], and urban planning [3]. It provides valuable insights into the shape, structure, and topography of the scene, enabling better understanding and interpretation of remote sensing data.

Traditional MVS techniques [4-6] extract dense correspondences from multiple calibrated views and generate a dense 3D representation (i.e., a point cloud or dense triangle mesh) of the scene. These methods rely on image correspondences in the RGB space, which are sensitive to textureless and non-Lambertian surfaces and to lighting variations. Recent developments in deep learning allow the use of learned feature maps instead of working directly on RGB images to build more robust MVS pipelines [7-17]. By learning feature maps about the objects in the scene, learning-based MVS methods have demonstrated better completeness than traditional methods in reconstructing man-made objects with low texture and non-Lambertian surfaces. Recent learning-based MVS methods learn to reconstruct the depth map from input images by regularizing the 3D cost volume [7,13] or by PatchMatch-based iterative optimization [17,18]. Still, depth estimation remains challenging, and the depth estimated at discontinuities between objects is often erroneous [19,20]. While this kind of error can be alleviated by post-processing filters, filtering often reduces the completeness of the reconstruction.

In MVS pipelines, it is common for a single depth value to be estimated per pixel, accompanied by a smooth surface assumption.
This spatial regularization technique results in higher-quality depth maps, as shown in previous studies such as [21,22], which in turn improves the completeness of the reconstructed 3D model. However, a limitation of this approach is that it tends to oversmooth the true depth discontinuities at object boundaries, as pointed out in recent works such as [20,23]. Furthermore, as illustrated in Fig. 1, pixels near depth discontinuities can be ambiguous as to which side of the depth boundary they belong to.

Figure 1 (axes: probability versus depth). We propose to estimate depth as a bimodal univariate distribution. Using this depth representation, we improve multi-view depth reconstruction, especially across geometric boundaries.

These findings motivate us to pursue two complementary objectives. First, we aim to explicitly detect the geometric edges, instead of relying solely on photometric edges that capture color and texture changes [21,22]. Geometric edges indicate the true locations of object boundaries more accurately than photometric edges (see Fig. 1). We propose to estimate geometric boundary maps jointly with the depth maps, such that smooth depth surfaces can be enforced while considering the local geometry. Second, as shown in Fig. 1, we propose to estimate per-pixel depth as a univariate bimodal distribution rather than as a single depth value. This allows us to explicitly represent the depth ambiguity and avoids over-smoothing the depth discontinuities. We integrate both objectives into a multi-task learning architecture to improve the depth accuracy while avoiding the completeness trade-off of previous approaches.

To confirm the validity of our idea, we integrate it into an existing learning-based multi-view stereo (MVS) pipeline. Extensive experiments on various benchmark datasets (see Sect. 4) demonstrate that our method obtains better results. Moreover, our method generalizes well, which we validated by training our model on one dataset and testing it on other datasets.
In summary, the contributions of this work to multi-view stereo networks are: (1) a novel multi-task learning architecture for joint estimation of depth maps and object boundary maps for learning-based multi-view stereo pipelines; (2) a bimodal depth representation that represents depth as a distribution learned from multi-view images; (3) a general loss formulation for depth discontinuity-based spatial regularization, which helps to learn discontinuities in depth and to regularize the depth maps.

The structure of this article is as follows. In Section 2, we review the existing literature and contrast the most related works with our approach. Section 3 describes our methodology and outlines the MVS pipeline we use to test our approach. Section 4 presents and discusses the experimental results. Finally, Section 5 concludes the paper by summarizing our main findings.

As learning-based MVS networks are inspired by photogrammetry-based MVS algorithms and developed from two-view methods, we review photogrammetry-based MVS algorithms, learning-based two-view methods, and the recent development in learning-based MVS networks.

Multi-view stereo methods built purely upon photogrammetry and multi-view geometry theory are usually referred to as traditional multi-view stereo methods. Janai et al. [24] showed that traditional multi-view stereo methods can be divided into four classes based on their representations of the scene and output. These scene representations are depth maps, point clouds, volumetric representations, and meshes or surfaces.

Volumetric representations use either a discrete occupancy function [25] or level-set-like signed distance functions [26], which limits them to small-scale reconstruction. The most common mesh-based approaches run variations of the marching cubes algorithm [27] on top of a signed distance function based on a volumetric surface representation [28].

The seminal point cloud-based method by Furukawa et al. [4] has shown that, starting with an initial sparse set of point features, it is possible to create an initial set of patches and densify them by iterative greedy expansion and photo-geometric filtering. These methods usually demand a uniformly sampled sparse set of points across the image domain to be able to create point clouds with better completeness.

Depth map-based approaches usually first estimate a 2.5D depth map for each view. Multi-view fusion pipelines [28,29] then consolidate these depth maps into a single geometric model. Although the plane sweeping algorithm [30] has high memory consumption, it was the most commonly used technique for depth map estimation. To use plane sweeping stereo for a large dynamic range of outdoor videos, Pollefeys et al. [31] took advantage of GPS and inertial measurements to place the reconstructed models in geo-registered coordinates. Using random initialization and propagation techniques, the PatchMatch-based MVS algorithms [5,32] were able to estimate the depth map of each view with low memory consumption. In this work, we use a differentiable PatchMatch-based module to achieve a similar goal.

Learning-based two-view methods have introduced the initial building blocks for two-view stereo matching and depth estimation, which were later adapted for multi-view settings. The most common building blocks for learning-based depth map estimation pipelines are feature extraction and depth estimation from the feature space. Shared weight-based feature extraction was introduced by [...]

EdgeStereo [37] uses a pre-trained sub-network for detecting the edges, and the edge cues are then fed into the disparity branch to improve the disparity map. Tosi et al. [20] showed that it is possible to improve the quality of learning-based two-view stereo networks by integrating an MLP-based bimodal mixture density network. In their work, they improved the accuracy of the stereo matching networks [35,36] that were used as a backbone for their mixture density head. Inspired by these works, we also represent depth as a bimodal distribution, and we jointly estimate depth maps and object boundary maps in the multi-view stereo setting using a novel multi-task learning architecture. Our pipeline does not involve any parallel (sub)networks and learns directly from multi-view images to estimate edge-depth pairs jointly.

The continuous disparity network [23] aims to regress the multi-modal depth by jointly estimating a probability volume and an offset volume, minimizing a Wasserstein distance between the ground truth and the distribution estimated from the volumes. The offset volume is used to obtain continuous disparity estimates. Our method avoids regressing the offset values and instead directly estimates the parameters of a bimodal distribution.

In contrast to two-view or multi-view plane sweep stereo [28,29] and cost volume regularization methods [7,34], our pipeline reduces memory consumption by estimating depth maps without creating a cost volume and without using 3D CNNs. For this, we leverage differentiable PatchMatch-based multi-view stereo as part of the internal structure of our pipeline [5,17].

The recent PatchmatchNet [17] showed state-of-the-art results in terms of reconstruction completeness, and we use it as the baseline in this work.

To enhance the quality of scene reconstruction, our proposed method focuses on estimating the geometric boundaries of objects in the scene where depth discontinuities occur. We introduce a technique to regularize the depth map by incorporating an estimated boundary map. Our approach differs from DEF-MVSNet [38] in how edge information is represented and modeled. While DEF-MVSNet primarily focuses on determining flow directions as pixel offsets, our method explicitly learns and smooths the edge map by modeling each pixel as a bimodal distribution.

Similarly, our method deviates from BDE-MVSNet [39], which also aims to find flow directions for edge pixels using gradient information. Instead, we explicitly learn the boundary map, placing emphasis on regularizing smoothness in regions that are not classified as boundaries. In comparison to ElasticMVS [40], which proposes an elastic part representation for encoding physically connected part segmentations, our approach focuses solely on explicitly learning the boundary map. By utilizing the boundary map for regularization, our objective is to enhance smoothness rather than to encode physically connected part segmentations and capture surface connectedness and boundaries within the image.

During the development of our method, we also explored some depth derivative-based loss functions, similar to those utilized in previous works [37,41]. However, we did not observe significant improvements when employing these loss functions. Therefore, we adopted a different approach by explicitly learning the boundary map to regulate smoothness in regions that are not classified as boundaries.

In comparison to the two-view stereo matching pipeline SMD-Nets [20], we employ a mixture density network as an internal structure for depth refinement, feeding it RGB-depth pairs instead of rectified left-right image pairs. Unlike previous methods, we learn the depth and boundary map simultaneously, utilizing the same backbone architecture for estimating the density parameters and the boundary map in parallel. Also unlike previous methods, our pipeline utilizes a 2D CNN-based U-Net architecture [42] to estimate the bimodal depth density parameters for each pixel in discrete space.

In contrast to existing MVS approaches with depth map representations, in which the depth of each pixel is expressed as a single value, our approach takes advantage of a bimodal depth representation that represents depth as a distribution. Our depth map is thus not a common grid of per-pixel scalars, but a grid of per-pixel mixture density parameters. The motivation of this module is to implicitly integrate the notion of uncertainty into our pipeline, which enables us to learn depth discontinuities for spatial regularization of the depth map and to further alleviate the noise gathered in intra-object transitions, foreground-background transitions, and partial occlusions.

The overview of our proposed network architecture is shown in Fig. 2. Our network has three parts, namely, feature extraction, coarse-to-fine PatchMatch Stereo (PMS), and depth discontinuity learning, detailed as follows.

We employ a widely used technique in computer vision known as feature pyramid learning [12,13,16], which enables us to build our algorithm in a coarse-to-fine regression fashion. We adopted Feature Pyramid Networks [43] with residual connections between encoder and decoder, and use three layers of decoder outputs as our extracted features. Each subsequent level has half the resolution of the level before it, and the finest level has half the width and height of the [...]

Our method is agnostic to the backbone: it is independent of the underlying rough depth estimation method. Both cost volume regularization and the PatchMatch-based approach can be used for depth estimation. We follow PatchmatchNet [17], which demonstrates good reconstruction completeness and low memory demands. Our pipeline regresses three levels of initial depth maps in a coarse-to-fine manner.

We randomly initialize the depth values at the coarsest level, and at each finer level we initialize the depth values with the outputs of the coarser level. Following the initialization step, we run an iterative feedback loop between the propagation and evaluation steps. We propagate estimates with good scores to the neighboring pixels. In the evaluation step, we use the candidate depth values for differentiable homography warping and matching cost computation.

The output of the coarse-to-fine PMS is a conventional depth map of half the resolution (half width and half height) of the original input. Hui et al. [44] showed that a low-resolution depth map can be progressively upsampled with the guidance of the associated high-resolution color image. This idea inspires our framework's attempt to match the resolution of the color and depth images.
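The initialization, propagation, and evaluation loop of the coarse-to-fine PMS described above can be illustrated with a toy example. The following is a minimal sketch on a single row of pixels, not the differentiable module used in the pipeline: the function names, the random refinement step, and the `cost(i, d)` callback (a stand-in for the matching cost that the pipeline computes via homography warping) are all assumptions made for illustration.

```python
import random


def patchmatch_1d(cost, n_pixels, d_min, d_max, init=None, iters=4, seed=0):
    """Toy PatchMatch on one pixel row; cost(i, d) is a stand-in matching cost."""
    rng = random.Random(seed)
    # Initialization: random at the coarsest level, otherwise reuse a coarser result.
    if init is None:
        depth = [rng.uniform(d_min, d_max) for _ in range(n_pixels)]
    else:
        depth = list(init)
    history = [sum(cost(i, d) for i, d in enumerate(depth))]
    radius = (d_max - d_min) / 2.0
    for _ in range(iters):
        for i in range(n_pixels):
            # Propagation: gather the current estimate and the neighbours' estimates.
            candidates = [depth[i]]
            if i > 0:
                candidates.append(depth[i - 1])
            if i + 1 < n_pixels:
                candidates.append(depth[i + 1])
            # Assumed random refinement around the current estimate, clamped to range.
            perturbed = depth[i] + rng.uniform(-radius, radius)
            candidates.append(min(max(perturbed, d_min), d_max))
            # Evaluation: keep the candidate with the lowest cost.
            depth[i] = min(candidates, key=lambda d: cost(i, d))
        radius /= 2.0
        history.append(sum(cost(i, d) for i, d in enumerate(depth)))
    return depth, history


def upsample_nearest(depth):
    """Double the resolution by repeating each value (initializes the finer level)."""
    return [d for d in depth for _ in range(2)]


# Hypothetical scene: a depth step from 1.0 to 3.0 in the middle of a 16-pixel row.
true_fine = [1.0] * 8 + [3.0] * 8
true_coarse = true_fine[::2]
coarse, _ = patchmatch_1d(lambda i, d: abs(d - true_coarse[i]), 8, 0.0, 4.0)
fine, history = patchmatch_1d(lambda i, d: abs(d - true_fine[i]), 16, 0.0, 4.0,
                              init=upsample_nearest(coarse))
```

Because the current estimate is always among the candidates, the total cost recorded in `history` never increases from one iteration to the next, which is the property that makes the propagate-and-evaluate loop safe to iterate.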

Contrary to existing learning-based networks [7,45], which revise depth maps using residual networks, we refine depth maps by learning mixture density parameters and geometric edge maps. In contrast to SMD-Nets [20], which take rectified image pairs as input, we use RGB-depth pairs as the input to the depth refinement network and convolutional mixture density networks as the internal structure. To the best of our knowledge, our work is the first learning-based MVS method that explicitly learns depth discontinuity maps (i.e., geometric edge maps) to simultaneously refine the quality and improve the smoothness of the depth maps.

In our pipeline, we use a 2D CNN-based U-Net [42] architecture to estimate the bimodal depth density parameters of each pixel in a discrete space. During the development of our pipeline, we experimented with different network variations to learn separate boundary maps and mixture density parameters, including two parallel network streams and single-encoder, multiple-decoder architectures. However, we found that using multiple subnetworks increased the number of parameters and GPU memory demands without leading to any substantial improvement in results. Therefore, we chose to use a single encoder and decoder architecture for our proposed pipeline. Based on the fact that depth maps are piecewise smooth and that they can be improved by spatial regularization of smooth regions, as shown in earlier works [21,22,46], we propose to refine depth-map quality by learning depth discontinuities.
Previous methods based on pixel-wise single-value estimates implicitly balance the depth estimation error between nearby foreground and background pixels for boundary points. Our refinement network instead regresses the parameters of a bimodal distribution. We use the bimodal Laplacian distribution, inspired by the work of Tosi et al. [20]. During development, we observed that the Laplacian distribution [47] gave slightly better results than the Gaussian. The Laplacian distribution has sharper modes than the Gaussian, and it optimizes the $L_1$ distance instead of the $L_2$ distance between the groundtruth and the estimated mean, which makes it more robust against outliers. The bimodal Laplacian density distribution can be written as

$$p(d) = \frac{\alpha}{2\sigma_1} e^{-\frac{|d - \mu_1|}{\sigma_1}} + \frac{1 - \alpha}{2\sigma_2} e^{-\frac{|d - \mu_2|}{\sigma_2}}, \qquad (1)$$

where $\alpha$ is the mixture weight, which can be seen as the likeliness of each mode. Later in our work (see Sec. 4), we observe that the network learns to assign different $\alpha$ values to different scene parts, and in most cases it performs a binary classification of foreground and background pixels. $\mu_1$ and $\mu_2$ are the two depth estimates of the corresponding modes. $\sigma_1$ and $\sigma_2$ measure the spread (variance) of each depth value. We also treat $\alpha / \sigma_1$ and $(1 - \alpha) / \sigma_2$ as responsibility scores, which aim to determine the mode responsible for the depth of a given pixel.
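To make the bimodal representation concrete, the sketch below evaluates a two-component Laplacian mixture and picks a depth from the responsibility scores. It assumes the standard mixture form with weights $\alpha$ and $1 - \alpha$, and it assumes that the final depth is the mean of the mode with the larger score; the function names are hypothetical.

```python
import math


def laplace_pdf(d, mu, sigma):
    """Density of a Laplacian with location mu and scale sigma, evaluated at d."""
    return math.exp(-abs(d - mu) / sigma) / (2.0 * sigma)


def bimodal_pdf(d, alpha, mu1, sigma1, mu2, sigma2):
    """Two-mode Laplacian mixture; alpha weights the first mode."""
    return (alpha * laplace_pdf(d, mu1, sigma1)
            + (1.0 - alpha) * laplace_pdf(d, mu2, sigma2))


def select_depth(alpha, mu1, sigma1, mu2, sigma2):
    """Assumed selection rule: return the mean of the mode with the higher score."""
    score1 = alpha / sigma1
    score2 = (1.0 - alpha) / sigma2
    return mu1 if score1 >= score2 else mu2


# A pixel near a boundary: foreground at depth 1.0, background at depth 3.0.
foreground_pixel = select_depth(0.8, 1.0, 0.1, 3.0, 0.1)
background_pixel = select_depth(0.2, 1.0, 0.1, 3.0, 0.1)
```

With this rule, a pixel keeps one of the two sharp depth hypotheses instead of an average that would fall between the foreground and the background.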

Besides extending bimodal depth estimation to the multi-view case, our proposed convolutional mixture density network also shows that a single-stream, compact discontinuity learning network architecture can achieve three goals: (1) upsampling; (2) refining; (3) multi-task learning.

Depth-groundtruth loss. This loss term measures the difference between the predicted depth maps and the groundtruth. It is defined as the mean absolute error (MAE) of the estimated depth map, i.e., the $L_1$ distance between the estimated depth and the ground-truth depth across all stages of the PMS and the final reconstructed depth,

$$\mathcal{L}_1 = \sum_{k=0}^{3} \frac{1}{N_k} \sum_{i} \left| \hat{D}_{k,i} - D_{k,i} \right|,$$

where $k \in \{0, 1, 2, 3\}$ denotes the scale index of the coarse-to-fine PMS that estimates the initial low-resolution depth maps, with 0 representing the finest input and output resolution, and 3 to 1 the coarser-to-finer scales of the PMS output. $\hat{D}_k$ and $D_k$ represent the ground-truth depth map and the estimated depth map at resolution level $k$, respectively, and $i$ indexes the pixels. The DTU dataset [48] contains masks that identify pixels with valid ground-truth depth information. $N_k$ represents the number of pixels at each scale.
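As an illustration of the depth-groundtruth term, the following minimal sketch computes a masked mean absolute error per scale and sums the per-scale values. It assumes equal weighting of all scales and depth maps stored as nested lists; the function names are hypothetical.

```python
def masked_mae(pred, gt, mask):
    """Mean absolute error over pixels whose mask entry is truthy."""
    total, count = 0.0, 0
    for p_row, g_row, m_row in zip(pred, gt, mask):
        for p, g, m in zip(p_row, g_row, m_row):
            if m:
                total += abs(p - g)
                count += 1
    return total / count if count else 0.0


def depth_groundtruth_loss(preds, gts, masks):
    """Sum of per-scale masked MAEs (equal scale weights are an assumption)."""
    return sum(masked_mae(p, g, m) for p, g, m in zip(preds, gts, masks))


# Two hypothetical scales: a 2x2 map with one invalid pixel and a 1x1 map.
preds = [[[1.0, 2.0], [3.0, 4.0]], [[2.0]]]
gts = [[[1.5, 2.0], [0.0, 4.0]], [[1.0]]]
masks = [[[1, 1], [0, 1]], [[1]]]
loss = depth_groundtruth_loss(preds, gts, masks)
```

The masked pixel at the second row of the first scale has a large error, but it does not contribute because it has no valid ground truth.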

Edge-depth loss. Geometric edges or boundaries are expected where there are depth discontinuities in the depth map. Thus, the edge-depth loss term measures how well the estimated edges agree with the second-order depth variations (i.e., depth discontinuities). It is defined as the mean squared error (MSE), i.e., the $L_2$ distance, between the estimated edge map $E$ and the groundtruth second-order variations of the depth $\hat{D}$,

$$\mathcal{L}_2 = \frac{1}{N} \sum_{i} \left( E_i - \phi(\hat{D}, \tau)_i \right)^2,$$

where $\phi$ is the function that takes the Laplacian of the depth and a threshold value $\tau$ and returns the mask image in which the Laplacian response [49] of the depth map is higher than $\tau$. $N$ is the number of pixels with valid ground-truth depth according to the masks provided by the DTU dataset [48]. With this term, we explicitly inform the network that we expect geometric edges or boundaries at the pixels where depth discontinuities exist. We calculate depth discontinuities using the Laplacian operator, which measures the second-order depth change.
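The edge-depth term can be sketched as follows. The 4-neighbour Laplacian stencil, the replicated image border, and the use of the absolute Laplacian response in the threshold are assumptions made for this illustration, as are the function names.

```python
def laplacian(depth):
    """4-neighbour discrete Laplacian with a replicated border (assumed stencil)."""
    h, w = len(depth), len(depth[0])

    def at(y, x):
        return depth[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [[at(y - 1, x) + at(y + 1, x) + at(y, x - 1) + at(y, x + 1) - 4 * at(y, x)
             for x in range(w)] for y in range(h)]


def phi(depth, tau):
    """Binary mask: 1 where the absolute Laplacian response exceeds tau."""
    return [[1.0 if abs(v) > tau else 0.0 for v in row] for row in laplacian(depth)]


def edge_depth_loss(edge_pred, gt_depth, valid, tau):
    """Masked MSE between the predicted edge map and the thresholded Laplacian."""
    target = phi(gt_depth, tau)
    total, count = 0.0, 0
    for e_row, t_row, m_row in zip(edge_pred, target, valid):
        for e, t, m in zip(e_row, t_row, m_row):
            if m:
                total += (e - t) ** 2
                count += 1
    return total / count if count else 0.0


# Hypothetical ground truth: a vertical depth step from 1 to 5.
gt_depth = [[1.0, 1.0, 5.0, 5.0] for _ in range(3)]
valid = [[1, 1, 1, 1] for _ in range(3)]
edges = phi(gt_depth, tau=1.0)
no_edges = [[0.0] * 4 for _ in range(3)]
```

For the step image, `phi` marks the two columns adjacent to the step, so a prediction equal to `edges` has zero loss while `no_edges` is penalized.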

Smoothness loss. Except at geometric edges and boundaries with depth discontinuities, real-world objects typically have piecewise smooth surfaces. Thus, we would like to encourage local smoothness in the regions without depth discontinuities. We achieve this by introducing an edge-aware smoothness loss term that penalizes second-order depth variations in non-boundary regions. In this term, the estimated edge value $E_i$ is close to 1 for boundaries and close to 0 for non-boundary pixels. $\omega$ is a weight function that acts as a switch: it returns a value close to 0 for boundaries and close to 1 for non-boundary pixels. Thus, only the second-order depth change in non-boundary regions contributes to our smoothness loss. $\beta$ is a tunable hyper-parameter that controls the sharpness of the change in the $\omega$ function. $N$ denotes the number of pixels in the image space $\Omega$ with a valid groundtruth depth. To the best of our knowledge, this is the first time depth discontinuities are explicitly learned and used for spatial regularization in multi-view stereo networks.
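The switching behaviour of $\omega$ can be illustrated with a small sketch. The exact form of $\omega$ is not given in the text above, so the exponential weight $\omega(E_i) = e^{-\beta E_i}$ used here is purely an assumption chosen to match the described behaviour (close to 1 for $E_i \approx 0$, close to 0 for $E_i \approx 1$); the Laplacian stencil, the absolute-value penalty, and the function names are likewise hypothetical.

```python
import math


def laplacian(depth):
    """4-neighbour discrete Laplacian with a replicated border (assumed stencil)."""
    h, w = len(depth), len(depth[0])

    def at(y, x):
        return depth[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [[at(y - 1, x) + at(y + 1, x) + at(y, x - 1) + at(y, x + 1) - 4 * at(y, x)
             for x in range(w)] for y in range(h)]


def omega(edge_value, beta):
    """Assumed switch: near 1 for non-boundary pixels, near 0 for boundary pixels."""
    return math.exp(-beta * edge_value)


def smoothness_loss(depth_pred, edge_pred, valid, beta):
    """Edge-weighted mean absolute Laplacian of the predicted depth."""
    lap = laplacian(depth_pred)
    total, count = 0.0, 0
    for l_row, e_row, m_row in zip(lap, edge_pred, valid):
        for l, e, m in zip(l_row, e_row, m_row):
            if m:
                total += omega(e, beta) * abs(l)
                count += 1
    return total / count if count else 0.0


# Hypothetical prediction with a sharp depth step from 1 to 5.
depth_pred = [[1.0, 1.0, 5.0, 5.0] for _ in range(3)]
valid = [[1, 1, 1, 1] for _ in range(3)]
edges_at_step = [[0.0, 1.0, 1.0, 0.0] for _ in range(3)]
no_edges = [[0.0] * 4 for _ in range(3)]
with_edges = smoothness_loss(depth_pred, edges_at_step, valid, beta=50.0)
without_edges = smoothness_loss(depth_pred, no_edges, valid, beta=50.0)
```

When the edge map marks the step, the sharp discontinuity is almost free of penalty; without the edge map, the same step is penalized as if it were noise on a smooth surface.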

Bimodal loss. We adopt the common approach of minimizing the negative log-likelihood of the distribution to increase the likelihood of the true depth. Tosi et al. [20] have demonstrated that this loss term for bimodal depth in the two-view stereo setting can produce promising results. Our bimodal loss term is defined as

$$\mathcal{L}_4 = -\frac{1}{N} \sum_{i \in \Omega} \log p(\hat{D}_i; \theta),$$

where $\hat{D}_i$ represents the groundtruth depth measured at pixel $i$, and $\theta$ denotes the parameters of the bimodal distribution introduced in Eq. 1. The distribution $p$ can be computed using Eq. 1. $N$ denotes the number of pixels in the image space $\Omega$ with a valid groundtruth depth.
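A minimal sketch of the negative log-likelihood term is shown below. It assumes the standard two-component Laplacian mixture for $p$, a flat list of per-pixel parameter tuples, and a small `eps` inside the logarithm for numerical safety; these choices and the function names are illustrative only.

```python
import math


def bimodal_pdf(d, alpha, mu1, sigma1, mu2, sigma2):
    """Two-mode Laplacian mixture density evaluated at depth d."""
    first = alpha * math.exp(-abs(d - mu1) / sigma1) / (2.0 * sigma1)
    second = (1.0 - alpha) * math.exp(-abs(d - mu2) / sigma2) / (2.0 * sigma2)
    return first + second


def bimodal_nll(params, gt_depth, valid, eps=1e-12):
    """Mean negative log-likelihood of the ground-truth depth over valid pixels.

    params holds one (alpha, mu1, sigma1, mu2, sigma2) tuple per pixel.
    """
    total, count = 0.0, 0
    for theta, d, m in zip(params, gt_depth, valid):
        if m:
            total -= math.log(bimodal_pdf(d, *theta) + eps)
            count += 1
    return total / count if count else 0.0


# The ground truth lies on the first mode (depth 2.0).
confident = [(0.9, 2.0, 0.1, 5.0, 0.1)]
mistaken = [(0.1, 2.0, 0.1, 5.0, 0.1)]
nll_confident = bimodal_nll(confident, [2.0], [1])
nll_mistaken = bimodal_nll(mistaken, [2.0], [1])
```

Putting most of the mixture weight on the mode that contains the true depth lowers the loss, which is what drives $\alpha$ toward a foreground or background decision.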

Total loss. We use the weighted sum of the aforementioned loss terms as the training criterion for our network and optimize the parameters via backpropagation. The weights $\lambda_1 = 4$, $\lambda_2 = 1.25$, and $\lambda_3 = 0.5$ are hyper-parameters set empirically based on our experiments on the validation set.

We used the same model to quantitatively evaluate the generalization capabilities of our method and to compare it with other methods. All the metric results of the other methods were collected from the corresponding papers, and the 3D point clouds of other papers were reconstructed using the code and pre-trained models provided by the authors.

The DTU dataset [48] is a benchmark with 120 scenes captured by a structured-light sensor under seven different lighting conditions. It has been widely used for developing learning-based MVS methods and evaluating their performance in terms of completeness and accuracy. All the learning-based methods in Tab [...]

The ETH3D dataset [51] is a collection of calibrated images covering various indoor and outdoor environments, including urban scenes, garages, and rooms. The dataset provides ground-truth camera poses, 3D point cloud geometry, and images for each scene, making it suitable for tasks such as camera pose estimation and 3D reconstruction.

In this section, we present our findings based on the DTU benchmark [48], where we evaluated the performance of our method using the accuracy, completeness, and overall metrics. The accuracy metric measures the mean distance from each point in the reconstruction to its closest point in the structured-light reference. The completeness metric measures the mean distance from each point in the reference to its closest point in the reconstruction. The overall metric is the arithmetic mean of accuracy and completeness. Lower scores indicate better performance in this benchmark. The results on the DTU dataset are reported in Tab. 1. For a fair comparison, all techniques were trained on the same dataset and employed the same validation and training split. From the results, we can see that traditional photogrammetry-based methods generally have better accuracy, while learning-based methods have better completeness and overall performance. Furthermore, the completeness gap between learning-based and photogrammetry-based methods is bigger than their gap in accuracy, which motivated us to use a coarse-to-fine PMS to build our initial depth [...]

In this section, we present our findings based on the "Tanks and Temples" dataset [50]. This benchmark has three metrics, namely, recall, precision, and F-score. Recall and precision represent the completeness and accuracy of the reconstruction, respectively, both measured in percentage (%). The F-score combines the two and is defined as the harmonic mean of precision and recall.
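For reference, the harmonic mean used by the F-score can be written in a few lines. This is a generic sketch of the definition given above, not the benchmark's evaluation code; the function name and the zero-division convention are assumptions.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2.0 * precision * recall / (precision + recall)


balanced = f_score(80.0, 80.0)
unbalanced = f_score(50.0, 100.0)
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the lower of the two values, so a reconstruction cannot reach a high F-score by being complete but inaccurate, or accurate but sparse.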

In our experiments, we used our model trained on the DTU dataset for 14 epochs with all the proposed loss terms. We compared the results against those from our baseline method, PatchmatchNet [17]. For both methods, we ran the same depth map fusion algorithm with the same threshold value so that neither gains an advantage in the evaluation process. As can be seen from the statistics reported in Tab. 3, our results on the intermediate set are better on all evaluation metrics. On the advanced set, our results demonstrate better accuracy and F-score, while the results from PatchmatchNet have slightly better completeness. As depicted in Fig. 3, our approach improves on the baseline [17] in accurately capturing the overall geometry and exhibits improved completeness in smooth regions. This is substantiated by both qualitative and quantitative results, which demonstrate that our approach outperforms the baseline in terms of overall reconstruction quality.

In this section, we present our findings based on the ETH3D benchmark [51]. The ETH3D benchmark [51] consists of high-resolution images of scenes with sparse scene coverage, high viewpoint variation, and camera parameter information. The quantitative evaluation of our method and the comparison with PatchmatchNet [17] on the ETH3D dataset [51] are detailed in Tab. 2. Both methods use the same fusion pipeline. Our method demonstrates better accuracy and F-score, while PatchmatchNet has better completeness.

[...] [51]. Following the benchmark, the accuracy and completeness measures are quantified using the percentage of points below a 2 cm error margin (the higher the better).

We have conducted an ablation study to understand and analyze the contributions of the aforementioned loss terms of our architecture. The results are detailed in Tab. 4. Since the edge-depth loss and the smoothness loss terms together strive for edge-aware smoothness, we do not separate them in our experiments. We retrieve the last two metrics from the validation set while tuning our hyper-parameters. "Depth map" represents the accuracy of the estimated depth map, calculated as the mean absolute error (MAE) between the estimated depth map and the groundtruth. "Error > 8 mm" represents the percentage of points in the depth map with an error higher than 8 mm.

Table 4. Ablation study on the point clouds and depth maps from the DTU dataset [48]. $\mathcal{L}_1$: depth-groundtruth loss; $\mathcal{L}_2$: edge-depth loss; $\mathcal{L}_3$: smoothness loss; $\mathcal{L}_4$: bimodal loss. Note that $\mathcal{L}_2$ and $\mathcal{L}_3$ cannot be separated because they work together for edge-aware smoothness.

Table 5. Evaluation of depth map errors in boundary and smooth regions using the DTU dataset [48].

[...] metrics with the bimodal and depth-groundtruth losses, while adding the edge-aware smoothness term results in better accuracy. Our network also improves the arithmetic mean of accuracy and completeness compared with the baseline.

From the above experiments and evaluation, our method demonstrates superior reconstruction quality in terms of completeness and overall quality, which benefits from our depth discontinuity learning. To understand the role of depth discontinuity learning in reconstruction, we visualize the learned depth discontinuities (denoted as edge maps) for a few randomly picked examples in Fig. 4 (a), and compare them with the edge maps predicted using the seminal learning-based edge detection method HED [60]. We can see that by learning depth discontinuities, our network retrieves edges where the true depth discontinuities lie. Thus, as a key component for learning-based MVS pipelines, our discontinuity-aware depth learning is more robust to photometric changes, shadows, and small variations in depth. In the earlier stage of the development of DDL-MVS, we tried to feed the network with the HED [60] output and jointly refine the depth and edge maps, similar to EdgeStereo [37]. It turned out that even after refinement, the edges were too sensitive to photometric changes, leading to higher depth errors.

To reveal how our depth discontinuity learning contributes to depth estimation, we show the $\alpha$ map of each example in the last column of Fig. 4 (a), where $\alpha$ is the mixture weight in the bimodal Laplacian density distribution (see Eq. 1). Interestingly, our network learns to differentiate foreground from background: the $\alpha$ values express a binary classification of foreground and background pixels.

Our framework enhances the quality of depth maps in both smooth and boundary regions, as demonstrated quantitatively in Tab. 5. We computed the mean absolute error (MAE) between the estimated depth and the groundtruth. Boundary pixels are those pixels where the Laplacian of the groundtruth depth is greater than 5; the remaining pixels correspond to smooth areas.

The proposed approach also enhances the quality of the point clouds, as demonstrated qualitatively in Fig. 5, from which we can see that thin structures and smooth regions are captured more completely, and the boundary regions contain less noise.

Figure 6 presents a comparison of our proposed learning-based MVS method with two methods, namely COLMAP [32] and PatchmatchNet [17] (our baseline). COLMAP is a state-of-the-art traditional photogrammetry-based method. We visualized the outcomes of the methods on four different scene parts from the "Tanks and Temples" [50] dataset. The last column of the figure shows the exterior of the courtroom's top, with the lower part of the point cloud clipped to better reveal the ceiling's completeness and accuracy.

To ensure a fair comparison, we provided COLMAP with the ground-truth camera parameters. Our experiments demonstrate that our proposed method generates denser and more complete point clouds than the traditional photogrammetry-based method. However, the traditional method achieves better accuracy, partially due to its sparsity. Our method's results exhibit the highest completeness and are cleaner than those of the other methods. Additionally, our method outperforms PatchmatchNet [17] in terms of reconstruction accuracy. Please refer to the supplementary video for more visual comparisons.

To further evaluate the generalization capabilities of our proposed method, we conducted experiments using aerial images. Aerial images are commonly used in remote sensing applications for tasks such as large-scale 3D reconstruction. For our experiments, we utilized the BlendedMVS dataset [52], which consists of low-resolution (768 × 576) aerial images.

Figure 7 shows some example images from the BlendedMVS dataset together with the qualitative results obtained from our proposed method. These results show that our method can generate 3D reconstructions from aerial images, even with low-resolution input images. This indicates the potential of our approach for remote sensing applications that require large-scale 3D reconstructions from aerial imagery.

On a single RTX 2080, depth inference takes 90 ms per image with 5 neighboring views and 110 ms with 7 neighboring views. As an illustration of the running time, the bottom-left building example in Fig. 7 comprises 77 images; generating a point cloud from the calibrated views takes 79.993 seconds with the default 5 neighboring views.

In Fig. 4, we compare the GPU memory demand of our network with that of existing learning-based MVS networks on the DTU dataset [48]; the memory demand of our network is much lower than that of most existing networks. On the DTU dataset with the default parameters and the 5-view case, the average depth inference time of our model is 345 ms, which is comparable to that of PatchmatchNet [17] (300 ms). We used an NVIDIA GeForce RTX 2080 GPU for the experiments.

Although our method has good completeness and a good overall score (see Tab. 1), it has not yet reached the accuracy of traditional photogrammetry-based algorithms such as Gipuma [5], which is a common weakness of recently developed learning-based MVS methods with high completeness scores. In this paper, our goal is to improve the accuracy of the reconstruction while maintaining a high level of completeness. Although the accuracy of our network is not among the highest compared to some traditional state-of-the-art methods, we emphasize that learning-based approaches currently struggle to achieve state-of-the-art accuracy while maintaining a high completeness score. This is due to the trade-off between accuracy and completeness in the depth map fusion process, which is a key component of the reconstruction pipeline. Such a trade-off implies that increasing completeness also increases the potential for noise.
Although bimodality helps to reduce the noise, we observe that our results, like those of other traditional and learning-based algorithms, still contain noise, especially in sparsely viewed regions, which may need further research. It is also worth noting that in this work we used the same fusion pipeline as in other papers [7,17].
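The accuracy/completeness trade-off discussed above can be illustrated with a simplified version of the two point-cloud distances. The sketch below is an assumption-laden toy, not the DTU evaluation protocol: it uses brute-force nearest-neighbour search, omits the observability masks and distance thresholds that a benchmark evaluation would apply, and all function and variable names are hypothetical. Accuracy is taken as the mean distance from reconstructed points to the ground truth, completeness as the mean distance from ground-truth points to the reconstruction, and the overall score as their average.

```python
import numpy as np


def nearest_distances(src, dst):
    """Distance from each point in src (N, 3) to its nearest point in dst (M, 3).

    Brute force, O(N * M) memory; fine for toy data only.
    """
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def accuracy_completeness(recon, gt):
    """Simplified accuracy, completeness, and overall score (lower is better)."""
    acc = float(nearest_distances(recon, gt).mean())   # reconstruction -> ground truth
    comp = float(nearest_distances(gt, recon).mean())  # ground truth -> reconstruction
    return acc, comp, 0.5 * (acc + comp)


if __name__ == "__main__":
    # Toy ground truth: 10 points on a line.
    gt = np.stack([np.arange(10.0), np.zeros(10), np.zeros(10)], axis=1)

    # Sparse but exact reconstruction: only 3 of the 10 points are recovered.
    sparse = gt[[0, 5, 9]]
    # Dense but slightly noisy reconstruction: every point is recovered, offset by 0.1.
    dense = gt + np.array([0.0, 0.1, 0.0])

    for name, recon in [("sparse", sparse), ("dense", dense)]:
        acc, comp, overall = accuracy_completeness(recon, gt)
        print(f"{name}: acc={acc:.3f} comp={comp:.3f} overall={overall:.3f}")
```

In this toy setting the sparse reconstruction obtains the better accuracy but the worse completeness, which mirrors the observation that a sparser point cloud can score well on accuracy partly because it simply contains fewer points that could be wrong.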

We have presented a strategy for improving a baseline MVS network by learning depth discontinuities. The proposed depth discontinuity learning (DDL) module outperforms the baseline [17]. Our ablation study (Tab. 4) shows that incorporating the DDL module reduces the depth map error by more than 30%. The results in Tab. 5 show improved depth map accuracy in both smooth and boundary regions. Moreover, the visual results in Fig. 3 and Fig. 5 show that the point clouds reconstructed by our approach capture object and scene details more accurately than the baseline model while maintaining completeness.

The results in Fig. 5 and Tab. 5 further support our method, with better qualitative and quantitative results in both smooth and boundary regions on the DTU [48] dataset. These results indicate that our method has strong generalization capabilities and can produce high-quality depth maps with improved accuracy and precision. Furthermore, our experimental results demonstrate the potential of our method for remote sensing applications, such as large-scale point cloud reconstruction from aerial images.

Our experiments have demonstrated that learning depth maps as a mixture distribution and integrating depth discontinuities into the network as prior knowledge for piecewise smoothness regularization lead to improved reconstruction quality, with enhanced accuracy and overall quality of the final reconstruction.