A Confidence-Aware Cascade Network for Multi-Scale Stereo Matching of Very-High-Resolution Remote Sensing Images

Stereo matching plays an essential role in 3D reconstruction using very-high-resolution (VHR) remote sensing images. However, it still faces considerable challenges due to the multi-scale objects in large scenes and the multi-modal probability distributions in challenging regions, especially occluded and textureless areas. Accurate disparity estimation for multi-scale objects has therefore become a hard but crucial task. In this paper, to tackle these problems, we design a novel confidence-aware unimodal cascade and fusion pyramid network for stereo matching. The fused cost volume from the coarsest scale is used to generate the initial disparity map, and learnable confidence maps are then generated to construct the unimodal cost distributions, which are used to narrow down the next-stage disparity search range. Moreover, we design a cross-scale interaction aggregation module to leverage multi-scale information. Smooth-L1 loss and stereo focal loss are applied to regularize the disparity map and the unimodal cost distribution, respectively. Extensive experimental results show that the proposed network outperforms two state-of-the-art stereo matching networks in terms of average endpoint error (EPE) and the fraction of erroneous pixels (D1).


Introduction
Stereo matching, which estimates disparities from stereo image pairs, is one of the most fundamental problems in computer vision tasks and remote sensing applications such as earth observation [1,2], autonomous driving [3], robot navigation [4], SLAM [5], etc. [6]. Owing to the increasing resolution and volume of remote sensing images, precise 3D reconstruction using multi-view VHR remote sensing images has become possible, providing a new way to observe on-ground targets. As the fundamental task of 3D reconstruction, stereo matching finds pixel-wise correspondences in rectified stereo image pairs and estimates horizontal disparities, which can be further used to calculate elevation and construct 3D models. Typically, large-scene remote sensing images contain objects of various sizes and heights, such as skyscrapers, residential buildings, and woods. Multi-scale objects result in different disparity ranges, which makes it difficult for stereo matching methods to extract accurate correspondences.
Traditional stereo matching algorithms can be implemented using a four-step pipeline: matching cost computation, cost aggregation, disparity computation, and refinement [7]. Numerous methods have been proposed during the past decades; they are mainly divided into three categories, i.e., global, local, and semi-global methods. Global methods usually solve an optimization problem by minimizing a global objective function containing some regularization terms [8,9], and suffer from an expensive time cost. On the contrary, local methods [...] Consequently, our proposed method aims to improve the stereo matching accuracy of remote sensing images by absorbing the advantages of multi-stage methods and adapting them to the characteristics of remote sensing images.
Multi-view VHR remote sensing images acquired from pushbroom cameras can be applied to precise 3D reconstruction owing to their growing resolution [32]. However, compared with natural images, VHR remote sensing images contain more difficult scenes [33]. First, the disparities in remote sensing stereo pairs can be both positive and negative, depending on the complicated viewing conditions. Figure 1 illustrates the disparity range of each ground-truth DSP (disparity map) in US3D [34,35], the dataset of the IGARSS 2019 [34] data fusion contest. Second, there are many challenging areas that easily produce ambiguous disparity estimates, including occluded areas and textureless areas with repetitive patterns, where accurate correspondences are difficult to obtain. In addition, the disparity probability distributions in those areas tend to be multi-modal. Last, multi-scale objects in remote sensing images, which exhibit various disparities, further increase the difficulty of finding a suitable disparity search range. As shown in Figure 2, in comparison with the proposed network, CFNet [30] fails to produce good results on US3D. To this end, two motivating and challenging problems arise for remote sensing images: how to estimate precise disparities for multi-scale objects in large scenes, and how to regularize the multi-modal probability distributions in such challenging areas.
In this paper, we propose a novel confidence-aware unimodal cascade and fusion pyramid network for multi-scale stereo matching of VHR remote sensing images. Specifically, to accommodate the varied disparity search ranges in remote sensing images, we modify the group-wise cost volume to cover the whole disparity search range. As for multi-scale cost aggregation, existing cross-scale aggregation algorithms [22,36] adaptively combine the results of cost aggregation at multiple scales. Different from those methods, we build cross-scale cost volume interaction in a cascade framework for remote sensing images. Last but not least, considering the multi-modality of disparity probability distributions, we propose a multi-scale module to generate learnable confidence maps, which are used to generate the next-stage search range, and a multi-scale unimodal distribution loss is applied to regularize the cost distribution.
The rest of this paper is organized as follows. Section 2 first illustrates the overall framework of the proposed confidence-aware cascade network and then introduces each module of the network in detail. Section 3 presents the experimental results of stereo matching for multi-scale objects in remote sensing images, where both qualitative and quantitative analyses demonstrate the superiority of the proposed network. In Section 4, ablation experiments on different settings are conducted to verify the effectiveness of each module in the proposed network. Finally, conclusions are drawn in Section 5.

Method
In this section, the proposed cascade stereo matching network with cross-scale interaction and confidence maps is described in detail. First, the overall structure of the proposed network is illustrated. Then, the fused cost volume for the coarsest scale is introduced, which consists of the reconstructed group-wise cost volume for VHR remote sensing images. Next, the confidence-aware disparity refinement method embedded in the cascade framework is presented. Last, the smooth L1 loss and the unimodal cost distribution regularization loss are elaborated, which are applied to the disparity map and the cost distribution, respectively.

The Architecture of the Proposed Network
The overall architecture of the proposed network is shown in Figure 3. Given a rectified remote sensing image pair I_l and I_r, we first employ a siamese UNet-like [37,38] module to extract multi-scale features, which shares an encoder-decoder architecture with skip connections between multi-scale feature maps (as shown in Figure 4). The encoder is composed of five residual blocks followed by an SPP module to better incorporate multi-scale context information. The encoder is similar to those of HSMNet [37] and CFNet [30], which have proven to be efficient and to capture varied context information. The decoder upsamples the hierarchical feature maps and concatenates them with the feature maps from the skip links of the encoder. Then, the extracted multi-scale features are fed into the multi-scale group-wise cost volume construction.
Different from one-stage cost aggregation methods [23-25], multi-stage cost aggregation methods [28-30] have proven to be more effective, since they reduce the computational complexity and time cost by progressively refining the disparity estimation. We divide the multi-scale feature maps between fused and cascade cost volumes to predict multi-resolution disparities, and we fuse the multi-scale cost aggregation results to capture both low- and high-resolution information. In addition, we build a multi-scale confidence prediction module to regularize the cost distributions and leverage the learned confidence maps to progressively generate the next stage's disparity search range. The training loss functions employed in our method are the stereo focal loss [31] and the smooth L1 loss [24,25], which are used to regularize the cost distribution and the disparity, respectively. The details of these modules are discussed in the following sections.

Fused Cost Volume of the Coarsest-Scale Feature
In cascade multi-scale cost aggregation frameworks, generating low-resolution initial disparity maps is indispensable. Different from existing methods [29,39], which do not use low-resolution feature maps, CFNet fuses the three lowest-resolution feature maps to generate a more accurate initial disparity map in an encoder-decoder process. Noting the effectiveness of this module, we adopt the same cost volume construction as CFNet, which uses both concatenated and group-wise correlation [26] feature maps to generate the low-resolution cost volume. The combined volume is defined in terms of the following quantities: F_l^i and F_r^i are the extracted feature maps at scale i, N_c is the number of feature channels, N_g is the group size of the correlation, ⟨·,·⟩ denotes the inner product, and ⊕ is feature concatenation.
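A plausible written-out form of the combined volume, using the symbols defined above and the usual group-wise correlation formulation [26], is given below. The names V_concat, V_gwc, and V_combine, the group index g, and the reading of N_c / N_g as the number of channels per group are assumptions made for illustration, not notation confirmed by this text.

```latex
\begin{aligned}
V_{\mathrm{concat}}^{i}(d^{i}, x, y) &= F_{l}^{i}(x, y) \oplus F_{r}^{i}(x - d^{i}, y), \\
V_{\mathrm{gwc}}^{i}(d^{i}, x, y, g) &= \frac{1}{N_{c}^{i} / N_{g}}
  \left\langle F_{l}^{ig}(x, y),\; F_{r}^{ig}(x - d^{i}, y) \right\rangle, \\
V_{\mathrm{combine}}^{i} &= V_{\mathrm{concat}}^{i} \oplus V_{\mathrm{gwc}}^{i}.
\end{aligned}
```

Under this reading, the concatenation branch keeps the raw features of both views, while the correlation branch stores one normalized similarity score per feature group and disparity hypothesis.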
By densely sampling the whole disparity range at low resolution, the hypothesis plane interval equals 1, so we can efficiently generate the initial cost volume with size H/2^i × W/2^i × D_max/2^i × F. Then, the improved encoder-decoder architecture with a 3D hourglass aggregation module is used to fuse the three lowest-resolution cost volumes. Specifically, as shown in Figure 5, the combined volumes at the 1/32 and 1/16 scales are each first processed by four 3D convolution layers with skip connections and are then concatenated into the 1/8 scale. In addition, the 3D hourglass network is used to further aggregate the fused cost volume; its intermediate outputs are used for the subsequent cross-scale cost aggregation, and the confidence-aware algorithm is applied to the final output at this scale. As shown in Figure 1, because of the varied disparity ranges in remote sensing images, the cost volume and its disparity regression range need to be reset to accommodate remote sensing images. According to the disparity regression algorithm [23], the probability of each disparity d is calculated from the predicted cost c_d via the softmax operation. Then, the continuous disparity map d̂ is calculated as the probability-weighted sum of d. In this paper, we reconstruct the three initial coarsest-scale cost volumes with the range [−D_max^i, D_max^i], the same as the corresponding disparity regression range. As shown in Figure 6, the coarsest cost volumes have a size of [2D_max^i × F^i × H^i × W^i], which suits the characteristics of remote sensing images, where D_max^i is the maximum disparity at scale i and F^i, H^i, W^i denote the feature, height, and width dimensions at scale i, respectively. Thus, the three coarsest-scale predicted disparity maps d̂^i are obtained as the sum of the candidate disparities d over [−D_max^i, D_max^i], each weighted by softmax(−c_d^i), where softmax denotes the softmax operation, c_d^i denotes the estimated cost volume at scale i, and softmax(−c_d^i) is the discrete disparity probability distribution.
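The disparity regression over a signed range described above can be sketched for a single pixel in plain Python. This is a minimal illustration, not the network code: the function names are hypothetical, and the candidate list is passed explicitly so that the exact number of hypotheses in [−D_max, D_max] is left to the caller.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def regress_disparity(costs, candidates):
    """Soft-argmin for one pixel.

    costs[k] is the matching cost of candidates[k]; a lower cost means a
    better match, so the probabilities are softmax(-cost).
    """
    if len(costs) != len(candidates):
        raise ValueError("costs and candidates must have the same length")
    probs = softmax([-c for c in costs])
    return sum(d * p for d, p in zip(candidates, probs))


# Signed candidates, as needed for positive and negative disparities.
candidates = list(range(-2, 3))
print(regress_disparity([0.0, 50.0, 50.0, 50.0, 50.0], candidates))
```

Because the candidates include negative values, a cost minimum on the negative side yields a negative continuous disparity, which a regression range starting at zero could not represent.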
Figure 6. The reconstruction of the cost volume for remote sensing images. (Panel label: Stacked Feature Map.)

Confidence-Based Unimodal Distribution Regularization
To generate the next-stage disparity search range from the initial fused disparity estimate, existing methods [28] simply sample uniformly within a predefined range. However, such a method cannot adapt to pixel-wise properties: pixels in challenging areas should, empirically, be given wider search ranges. To tackle this problem, refs. [30,31] propose a variance-based disparity range search algorithm and a pixel-wise confidence map to adaptively quantify the various search ranges. The degree of multi-modality of a distribution is highly correlated with the probability of prediction error. Different from the previous works, we embed the learnable confidence estimation network [31] in the multi-scale variance-based disparity refinement framework. The multi-scale confidence estimation network and the cascade disparity refinement framework are presented in the following subsections.
The cost volume usually reflects the similarities between corresponding pixel pairs, where the truly matched pixel should have the lowest cost or the highest similarity [31]. This hypothesis requires that the cost distribution be unimodal, peaked at the true disparity, with the cost increasing with the distance to the true position. In the unimodal ground-truth distribution P(d), σ > 0 is the variance that controls the sharpness of the peak around the true disparity. Clearly, pixels in different challenging areas should have different cost distributions. For example, a pixel at a robust corner usually has a sharp peak, while pixels in textureless regions may have a flat distribution. Consequently, we add a confidence estimation network to adaptively predict σ for each pixel. The predicted confidence maps are then used both to compute the next-stage disparity search range and, combined with P(d), in the multi-scale stereo focal loss.
The confidence estimation network takes the predicted cost volume as input and uses a few layers to predict a confidence value for each pixel. The network consists of a 3 × 3 convolutional layer followed by BN and ReLU, and a 1 × 1 convolutional layer followed by a sigmoid activation, which yields a confidence map f ∈ [0, 1]^(H^i × W^i). A larger confidence value indicates a more robust correspondence. The σ of the ground-truth cost distribution is then computed from f, where α > 0 is the scale factor and β > 0 is the lower bound of σ, which avoids division by zero. Accordingly, the ground-truth cost distribution in Equation (3) should be modified.
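To make the link between confidence and the shape of the unimodal target concrete, the sketch below assumes σ = α(1 − f) + β and a target of the form softmax(−|d − d_gt| / σ). Both forms are assumptions chosen to match the roles described above (α as a scale factor, β as the lower bound of σ, a sharper peak for higher confidence); they are not formulas quoted from this text, and the function names are hypothetical.

```python
import math


def sigma_from_confidence(confidence, alpha=1.0, beta=1.0):
    """Assumed mapping: confidence f in [0, 1] -> sigma in [beta, alpha + beta]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return alpha * (1.0 - confidence) + beta


def unimodal_target(candidates, d_true, sigma):
    """Assumed target: softmax(-|d - d_true| / sigma) over the candidates."""
    logits = [-abs(d - d_true) / sigma for d in candidates]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


candidates = list(range(-3, 4))
sharp = unimodal_target(candidates, 1.0, sigma_from_confidence(1.0))
flat = unimodal_target(candidates, 1.0, sigma_from_confidence(0.0))
print(max(sharp), max(flat))  # the confident pixel gets the higher peak
```

With α = β = 1.0, a fully confident pixel gets σ = 1 and a fully uncertain one gets σ = 2, so low-confidence pixels are supervised with a flatter target.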

Cascade Cost Volume for Disparity Refinement
Given the predicted confidence maps, the variance value σ^i can be computed through Equation (4). It is therefore reasonable to use the confidence-aware variance to determine the disparity search range of the next stage, where a lower confidence value corresponds to a wider search space for correcting wrong estimates [30]. In computing the next stage's disparity search range, δ denotes bilinear interpolation, and s^i and ε^i are two learnable parameters, which are initialized as 0. Learnable parameters have proven to be more robust [30] than hand-selected ones. Finally, the next stage's disparity search range is sampled uniformly, where N^(i−1) is the number of disparity hypothesis planes at stage i − 1. Then, the cost volume of the next stage with size H [...] After that, we employ a 3D hourglass cost aggregation module similar to that at the 1/8 scale to predict the refined disparity map. Thus, the coarse-to-fine cascade disparity estimation framework is built by progressively narrowing down the disparity search range.
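The two steps above (a confidence-dependent search range, then uniform sampling of hypothesis planes) can be sketched for one pixel as follows. The range form d̂ ± ((s + 1)σ + ε) is an assumption modeled on the variance-based range of [30]; bilinear upsampling to the next scale is omitted, and all function names are hypothetical.

```python
def next_stage_range(d_hat, sigma, s=0.0, eps=0.0):
    """Assumed search range around the current estimate d_hat.

    s and eps stand in for the learnable parameters (initialized as 0);
    a larger sigma (lower confidence) widens the range.
    """
    half_width = (s + 1.0) * sigma + eps
    return d_hat - half_width, d_hat + half_width


def sample_hypotheses(d_min, d_max, num_planes):
    """Uniformly sample num_planes disparity hypotheses in [d_min, d_max]."""
    if num_planes < 2:
        raise ValueError("need at least two hypothesis planes")
    step = (d_max - d_min) / (num_planes - 1)
    return [d_min + n * step for n in range(num_planes)]


d_min, d_max = next_stage_range(d_hat=10.0, sigma=2.0)
print(sample_hypotheses(d_min, d_max, 5))
```

With s = ε = 0, the range is simply d̂ ± σ, so a pixel with σ = 2 searches twice as wide an interval as one with σ = 1 using the same number of planes.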

Cross-Scale Cost Aggregation
Cross-scale interaction in stereo matching, introduced not only in traditional algorithms [36] but also in learning-based methods [22,40], has been observed to be beneficial for aggregating multi-scale feature information. In addition, the cross-scale interaction modules in HRNet [41] were proposed to learn sufficient feature representations for human pose estimation. Having observed that, when applied to remote sensing images, recent popular methods pay more attention to large-scale objects (as shown in Figure 2), we add a cross-scale interaction module to further aggregate rich cost information in our cascade framework.
Following the analyses in [22,36], we adopt a combination scheme similar to that of HRNet, which adaptively fuses the cost volumes computed at different scales. In the cross-scale combination, Ĉ^i is the cost volume after cross-scale cost aggregation, while C̃^k denotes the intermediate outputs of the different scales, and f_k defines a general combination function for cost volumes of different scales. Specifically, within f_k, τ denotes the identity function, while ⊕ denotes bilinear upsampling (δ) to the resolution of scale i followed by a 1 × 1 convolution layer.
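One way to write the cross-scale combination consistently with these definitions is sketched below. The summation over all contributing scales k and the split into the cases k = i and k ≠ i are assumptions inferred from the description of τ and ⊕; the exact index range is not recoverable from this text.

```latex
\hat{C}^{i} = \sum_{k} f_{k}\!\left(\tilde{C}^{k}\right),
\qquad
f_{k} =
\begin{cases}
\tau, & k = i, \\
\oplus, & k \neq i \ \text{(coarser scale, upsampled to resolution } i\text{)}.
\end{cases}
```

Read this way, the volume at scale i passes through unchanged, and every coarser volume is brought to the same resolution before the results are summed.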

Loss Function
For the loss function, we first adopt the smooth L1 loss to supervise the multi-scale estimated disparity maps, and we adopt the stereo focal loss proposed in [31] to further regularize the cost distribution based on P_p(d) and P_gt(d). Owing to its lower sensitivity to outliers compared with the L2 loss [42], the smooth L1 loss is widely used in object detection and stereo matching. In its definition, N is the number of valid disparities in the ground truth, d denotes the disparity label, and d̂ is the predicted disparity.
To supervise the discrepancy between the predicted and ground-truth distributions, and considering the severe sample imbalance (each pixel has only one true disparity), the stereo focal loss [31] was proposed to focus on the true disparities; it is inspired by the focal loss [16] designed to solve the sample imbalance problem in object detection [43]. In the stereo focal loss, γ ≥ 0 denotes a focusing parameter: the loss degenerates to the cross-entropy loss when γ = 0, while γ > 0 gives more weight to positive disparities so that they compete with only a few negative ones. Our final loss function combines the aforementioned losses, weighted by two trade-off hyperparameters λ_1 and λ_2. L_SF is used to supervise the cost volume distribution, while L_SL supervises the disparity maps.
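The two losses can be sketched in plain Python for small inputs. The smooth L1 function below is the standard definition; the stereo focal loss form, the sum over d of (1 − P_gt(d))^(−γ) · (−P_gt(d) · log P_p(d)), is an assumption based on the formulation in [31], and the pairing of the weights 1.0 and 0.8 with the two terms follows the loss ablation reported in this paper. All function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Standard smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def smooth_l1_loss(pred, target):
    """Mean smooth L1 over valid pixels; None marks invalid ground truth."""
    pairs = [(p, t) for p, t in zip(pred, target) if t is not None]
    return sum(smooth_l1(t - p) for p, t in pairs) / len(pairs)


def stereo_focal_loss(p_pred, p_gt, gamma=5.0, tiny=1e-12):
    """Assumed per-pixel stereo focal loss over the disparity candidates."""
    loss = 0.0
    for q, p in zip(p_pred, p_gt):
        weight = max(1.0 - p, tiny) ** (-gamma)  # equals 1 when gamma == 0
        loss += weight * (-p * math.log(q + tiny))
    return loss


def total_loss(l_sl, l_sf, lam_sl=1.0, lam_sf=0.8):
    """Weighted sum of the disparity loss and the distribution loss."""
    return lam_sl * l_sl + lam_sf * l_sf


p_gt = [0.1, 0.8, 0.1]
p_pred = [0.2, 0.6, 0.2]
print(stereo_focal_loss(p_pred, p_gt, gamma=0.0), stereo_focal_loss(p_pred, p_gt))
```

Since 1 − P_gt(d) is at most 1, the focal weight is at least 1 and is largest at the true disparity, which is how γ > 0 shifts the emphasis toward the positive candidate.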

Result
In this section, we first introduce the dataset and metrics for evaluation and experimental settings. Then, the experimental results compared to other state-of-the-art networks are presented to evaluate the performance.

Datasets and Evaluation Metrics
We conduct extensive experiments on two datasets: SceneFlow and US3D. Both datasets contain positive and negative disparities, and details are listed in Table 1.
(1) The SceneFlow dataset [20] is a large-scale synthetic dataset including 35,454 positive training images and 4370 positive test images with a resolution of 960 × 540, and likewise for the negative ones. The RGB images in SceneFlow are rendered in cleanpass and finalpass settings: cleanpass includes lighting and shading effects, while finalpass additionally contains motion and defocus blur. We use all the positive and negative finalpass images to pre-train our network. An example is shown in Figure 7, where the objects in the foreground are of similar scale.
(2) The US3D dataset (https://ieee-dataport.org/open-access/data-fusion-contest-2019-dfc2019, accessed on 8 February 2022) is the Track 2 data of the 2019 IEEE Data Fusion Contest [34,35]. The stereo pairs in this dataset come from 69 VHR WorldView-3 multi-view remote sensing images acquired from 2014 to 2016 over Jacksonville and Omaha in the United States. The stereo pairs are geographically non-overlapping, with a rectified size of 1024 × 1024. The whole dataset has 4292 and 50 stereo pairs for training and testing, respectively, covering various landscapes such as residential buildings, skyscrapers, woods, and rivers. An example is shown in Figure 8, which contains more complex multi-scale objects than SceneFlow.
To evaluate the proposed network, two quantitative metrics are used: the average endpoint error in pixels (EPE) and the fraction of erroneous pixels (D1). D1 is robust to outliers with large disparity errors, while EPE measures errors at the sub-pixel level.
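The two metrics can be computed as in the following sketch, which takes D1 to be the fraction of valid pixels whose absolute disparity error exceeds 3 pixels (the threshold given in the comparison section of this paper). The function names and the use of None for pixels without ground truth are assumptions made for illustration.

```python
def _valid_errors(pred, gt):
    """Absolute disparity errors on pixels that have ground truth."""
    return [abs(p - g) for p, g in zip(pred, gt) if g is not None]


def epe(pred, gt):
    """Average endpoint error in pixels over valid pixels."""
    errors = _valid_errors(pred, gt)
    return sum(errors) / len(errors)


def d1(pred, gt, threshold=3.0):
    """Fraction of valid pixels whose error exceeds the threshold."""
    errors = _valid_errors(pred, gt)
    return sum(1 for e in errors if e > threshold) / len(errors)


pred = [0.0, 5.0, -2.0, 7.0]
gt = [0.5, 1.0, None, 7.0]
print(epe(pred, gt), d1(pred, gt))
```

In this toy case, a single 4-pixel error raises EPE to 1.5 pixels but counts only once toward D1, which illustrates why the two metrics are reported together.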

Implementation Details
We use PyTorch to implement our network and employ the Adam optimizer with β_1 = 0.9 and β_2 = 0.999 to train the whole network end to end. The input image size is set to 512 × 512. Asymmetric chromatic augmentation and asymmetric occlusion [30,37] are used for data augmentation.
We use a two-stage strategy to train our network. Specifically, we first pre-train our model on the SceneFlow dataset from scratch for 20 epochs, and then fine-tune the pre-trained model on the US3D dataset for 300 epochs. Throughout training, the initial learning rate is set to 0.001 and is divided by 10 after epoch 200. We normalize pixel values to [0, 1] to reduce the radiometric influence in remote sensing images. Every experiment is conducted on 2 NVIDIA Titan-RTX GPUs with a mini-batch size of 8.
The disparity regression range is set to [−128, 128]. Since the variance σ_p reflects the shape of the unimodal distribution, it is bounded in [β, α + β]. For the best disparity estimation performance on remote sensing images, we set α = 1.0 and β = 1.0. The parameter γ in the stereo focal loss is set to 5.0 to balance positive and negative disparity samples. As for the three weights in the smooth L1 loss, we follow the settings of previous stacked hourglass networks [24,25,30,32] and use 0.5, 0.7, and 1.0 for the three intermediate outputs, respectively. In our final loss function, λ_1 and λ_2 are set to 1.0 and 0.8, respectively, to balance the training losses.

Comparisons with Other Stereo Methods
To further evaluate the effectiveness of the proposed network, we conduct comparative experiments with state-of-the-art stereo matching networks: CasStereo and CFNet, which are also based on cascade disparity refinement frameworks; AANet, proposed for real-time stereo matching; PSMNet, a typical stereo matching network; and AcfNet, which improves PSMNet with cost regularization. The endpoint error (EPE) and 3-pixel error (D1) are used to assess the quantitative performance, where EPE is the mean disparity error in pixels and the 3-pixel error is the percentage of pixels whose endpoint error is larger than 3 pixels. To visualize the improvements, we compute pixel-wise error maps; cold colors in the error map denote small prediction errors, while warm colors denote large prediction errors. The quantitative results are shown in Table 2, which reports the performance of the networks on both the SceneFlow and US3D test sets. Our network, pre-trained on SceneFlow, outperforms the others on remote sensing images, mainly because it is able to estimate accurate disparities for multi-scale objects in remote sensing images. We then show visual results on US3D samples alongside the two multi-stage cascade networks to further illustrate the improvements on multi-scale objects. Figure 9 compares the disparity estimates of our method with those of the other algorithms on scenes containing multiple landscapes at multiple scales. The disparity maps and error maps show that our method brings clear improvements in such multi-scale disparity estimation, while the others perform poorly in either large-scale or small-scale disparity estimation. This supports the effectiveness of our proposed multi-scale module. It is noteworthy that the idea of a cascade disparity refinement framework has also been investigated in CasStereo and CFNet.
CasStereo predicts an initial disparity estimate by constructing a higher-resolution sparse cost volume and progressively samples a predefined range uniformly to generate the next-stage disparity search range. CFNet, in turn, argues that fusing low-resolution cost volumes can generate a more accurate initial disparity map than a higher-resolution sparse cost volume. However, the quantitative and visual results of applying these methods to VHR remote sensing images show that CasStereo cannot perform well on large-scale disparity estimation, since the initial high-resolution sparse cost volume cannot capture a sufficiently large context with the first-stage network. In contrast, CFNet performs well on large-scale disparities at the expense of accuracy on small-scale ones, which demonstrates the effectiveness of the initial cost volume fusion: large-scale disparity information can be well initialized from such low-resolution cost volumes. Nevertheless, in more complex scenarios such cascade disparity refinement frameworks perform worse without multi-scale information interaction. Consequently, our method leverages sufficient multi-scale cost volume interaction to tackle this problem in VHR remote sensing images. As illustrated in Figure 9, our method with cross-scale cost volume interaction performs best in both large-scale and small-scale disparity estimation.

Discussion
In this section, we conduct ablation experiments for evaluating the effectiveness of each module in our proposed network.

Analysis of the Variance-Based Methods
Variance estimation is an important component of the disparity refinement framework, since it automatically adjusts the flatness of the unimodal distribution according to the matching uncertainty. The variance of the unimodal distribution can be set uniformly for all pixels, whereas CFNet proposes adaptive uncertainty maps based on the pixel-wise disparity candidates. Different from these methods, we leverage a learnable cost refinement module to compute confidence maps. For comparison, we implement the uniform setting, the uncertainty-based method of CFNet, and our confidence-based module on the US3D dataset. Figure 10 shows several pixel-wise results from SceneFlow and US3D, where the confidence maps show that the synthetic SceneFlow data contain many simple structures, while the VHR remote sensing images contain more complex scenarios. As expected, in challenging regions (occlusions, thin structures, and textureless patterns), the confidence map in our method yields high variances that flatten the corresponding disparity cost distributions, which balances the cost aggregation across pixels. Table 3 shows the results of the different variance-based methods. The results demonstrate the effectiveness of adaptive variance estimation. Compared with the uncertainty-based method of CFNet, our learnable confidence-based method brings larger improvements.

Analysis of the Cross-Scale Interaction Module
To further evaluate the effectiveness of the cross-scale cost volume aggregation module, we add it to CasStereo and CFNet; the resulting networks are called CasStereo-c and CFNet-c, respectively. The quantitative results are listed in Table 4. Compared with the results in Table 2, CasStereo-c and CFNet-c, equipped with our proposed cross-scale interaction module, both improve considerably in terms of EPE and D1. Figure 11 shows the disparity estimation results of the re-implemented CasStereo-c and CFNet-c and of our method. As expected, the disparity maps and the corresponding error maps both show significant improvements in multi-scale regions, which supports the effectiveness of the proposed cross-scale interaction module in VHR remote sensing scenarios.

Figure 11. Visual comparison of cascade networks combined with the cross-scale interaction module on the US3D dataset (panels: Stereo Pair, CasStereo-c, CFNet-c, Ours, Ground Truth). For each example, the first row shows the disparity map, and the second row shows the error map. Cold colors in the error map denote small prediction errors, while warm colors denote large prediction errors.

Analysis of the Loss Settings
In our proposed network, the stereo focal loss and the smooth L1 loss are employed to supervise the cost volume distribution and the estimated disparity map, respectively. First, to evaluate the effectiveness of the stereo focal loss on remote sensing images, we train our method with the stereo focal loss, with the cross-entropy loss, and with neither. Table 5 shows the results. The stereo focal loss performs better than the cross-entropy loss, which demonstrates the benefit of balancing the weights of positive and negative disparities. To find the optimal setting of λ_1 and λ_2 in our final loss function, we run experiments with different values of λ_1 and λ_2 in [0, 1]. As shown in Table 6, a weight of 0.8 for the stereo focal loss and 1.0 for the smooth L1 loss yields the best performance.

Conclusions
In this paper, we develop a novel confidence-aware unimodal cascade and fusion pyramid network to improve disparity estimation for multi-scale objects in VHR satellite images. We first use the fused cost volume from the coarsest scale to generate an initial disparity map, and then construct the unimodal cost distributions with a learnable confidence prediction network, which allows the next-stage disparity search range to be narrowed down. Moreover, we design a cross-scale interaction aggregation module to leverage multi-scale information. Throughout training, smooth-L1 loss and stereo focal loss are applied to regularize the disparity map and the unimodal cost distribution, respectively. Our network shows a strong ability to handle multi-scale disparity estimation. Experimental results show that our network achieves higher precision than two state-of-the-art stereo matching networks.
With the steady growth in the volume of remote sensing images, it is difficult to annotate enough ground truth to train a deep learning model. A deep model should therefore perform well on unseen scenarios; however, our network cannot yet generate satisfactory results for the domain adaptation task of stereo matching. Therefore, to make the proposed network work on other datasets without ground truth, we plan to explore self-supervised training and the extraction of domain-invariant features.