# Disparity Estimation of High-Resolution Remote Sensing Images with Dual-Scale Matching Network

## Abstract

## 1. Introduction

- Our network learns stereo matching at both low and high scales, which helps disparity estimation in large areas with texture-less and repeating patterns while preserving main structures and details.
- We construct cost volumes from negative to positive values [36], making our network able to regress both negative and nonnegative disparities for remote sensing images.
- A 3D encoder-decoder module built by factorized 3D convolutions [37] is developed for cost aggregation, which can improve the stereo matching at disparity discontinuities and occlusions. Compared with standard 3D CNNs, the computational cost is markedly reduced.
- We employ a refinement module that ensures the network outputs high-quality full-resolution disparity maps.
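
As a concrete sketch of the second point, the shifting logic behind a cost volume covering $[-D_{max}, D_{max})$ can be written in a few lines. This is a simplified, correlation-style version (one similarity score per candidate disparity) in NumPy; the network's actual cost volumes are 4D with 16 feature channels (see Table 2), but the shift direction and the covered range are the same. The function and variable names are illustrative, not from the paper, and out-of-range positions are simply zero-filled.

```python
import numpy as np

def build_cost_volume(left_feat, right_feat, d_max):
    """Correlation cost volume over disparities in [-d_max, d_max).

    left_feat, right_feat: (H, W, C) feature maps. For candidate
    disparity d, the right feature map is shifted by d pixels before
    correlating, so both negative and nonnegative disparities are
    representable.
    """
    H, W, C = left_feat.shape
    volume = np.zeros((H, W, 2 * d_max), dtype=left_feat.dtype)
    for i, d in enumerate(range(-d_max, d_max)):  # covers [-d_max, d_max)
        shifted = np.zeros_like(right_feat)
        if d > 0:
            shifted[:, d:, :] = right_feat[:, :-d, :]
        elif d < 0:
            shifted[:, :d, :] = right_feat[:, -d:, :]
        else:
            shifted = right_feat
        # inner product of left and shifted right feature vectors
        volume[:, :, i] = (left_feat * shifted).sum(axis=-1)
    return volume
```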

## 2. Related Work

**2D architectures.** These architectures usually adopt an encoder-decoder design and follow a general flow: the encoder extracts deep features from the input stereo pair; a correlation layer then encodes similarity into feature channels by computing the inner product of the left and right feature vectors along the spatial and disparity dimensions, forming a 3D cost volume [21]; finally, the decoder parses the cost volume into a disparity map. The pioneering network is DispNet-C [20]. Following it, CRL [23] proposes a two-stage network that combines a DispNet-C model with a second sub-network for cascade learning of the disparity residual. iResNet [19] produces an initial disparity estimate, then iteratively refines it using multiple feature correlations and reconstruction errors. MADNet [24] applies a coarse-to-fine strategy, predicting an initial disparity map at a coarse feature level and then up-sampling it to finer levels with the assistance of warping operations. AANet [25] learns stereo matching at three scales, with an adaptive aggregation module for interaction among the different scales. Our network shares a similar idea with AANet, but we regress disparity maps from the low scale to the high scale in a coarse-to-fine manner, whereas AANet regresses the disparities of the three scales in parallel. These architectures run efficiently thanks to the efficiency of 2D convolution operations on modern GPUs; however, their accuracy is inferior to that of 3D architectures.
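
The warping step used by coarse-to-fine designs such as MADNet can be sketched compactly: the right image is sampled at $x - d$ using the up-sampled disparity, so each finer level only needs to predict a residual. Below is a minimal nearest-neighbour version in NumPy (real implementations use bilinear sampling); the function name is our own.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Warp the right image toward the left view with a disparity map.

    right: (H, W) image; disparity: (H, W) disparities in pixels,
    where left pixel x corresponds to right pixel x - d.
    """
    H, W = right.shape
    xs = np.arange(W)[None, :] - disparity            # sample right at x - d
    xs = np.clip(np.round(xs).astype(int), 0, W - 1)  # nearest neighbour + clamp
    rows = np.arange(H)[:, None]
    return right[rows, xs]
```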

**3D architectures.** These architectures follow a flow similar to the former category, but they encode similarity by computing the difference of the left and right feature vectors [27,29] or by directly concatenating them [22,26,28] to form a 4D cost volume, which is then processed by 3D convolutions. GC-Net [22] is the first attempt. Following this new design, PSMNet [26] utilizes spatial pyramid pooling (SPP) [38] layers in its feature extractor to integrate features at multiple scales and deploys a stack of 3D hourglass modules to regularize the cost volume. ECA [28] introduces an explicit cost aggregation module that improves the 3D optimization with a 3D encoder-decoder. StereoDRNet [29] applies 3D dilated convolutions inside its stacked encoder-decoders to further improve effectiveness and adds a refinement sub-network to enhance the disparity map. Because real geometric context [22] can be explicitly learned by 3D convolutions, 3D architectures achieve better accuracy than 2D architectures in most cases. However, 3D convolutions require more computational effort owing to their larger numbers of parameters and floating-point operations (FLOPs). To make the model run in real time, StereoNet [27] constructs a low-resolution cost volume to produce a coarse prediction, then hierarchically refines it to the original resolution under the guidance of the (resized) left image. MABNet [30] replaces standard 3D convolutions with a multibranch adjustable bottleneck module that demands fewer parameters, making the model lightweight. However, the accuracy of these models decreases compared with the heavy models. We also explore how to make the model lightweight: in our network, we replace conventional 3D convolutions with efficient factorized 3D convolutions to reduce the computational burden.
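
The saving from factorized 3D convolutions is easy to quantify: replacing one 3 × 3 × 3 kernel with a 3 × 1 × 1 kernel followed by a 1 × 3 × 3 kernel (the pairing listed in Table 2) cuts the weight count of a layer by more than half. The arithmetic below assumes the intermediate channel count equals the output count, which matches the channel widths shown in Table 2; biases are ignored for simplicity.

```python
def params_standard(c_in, c_out):
    # one 3x3x3 kernel per (input, output) channel pair
    return 3 * 3 * 3 * c_in * c_out

def params_factorized(c_in, c_out):
    # a 3x1x1 conv, then a 1x3x3 conv, keeping c_out channels in between
    return 3 * c_in * c_out + 3 * 3 * c_out * c_out

print(params_standard(32, 32))    # 27648 weights
print(params_factorized(32, 32))  # 12288 weights, a 2.25x reduction
```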

## 3. Dual-Scale Matching Network

#### 3.1. Overview

#### 3.2. Components

#### 3.2.1. Feature Extraction

#### 3.2.2. Cost Volume Creation

#### 3.2.3. Cost Aggregation

#### 3.2.4. Disparity Calculation

#### 3.2.5. Disparity Refinement

#### 3.3. Loss Function

## 4. Experiments

#### 4.1. Dataset and Metrics

#### 4.2. Implementation Details

#### 4.3. Results and Analyses

#### 4.3.1. Overall Result

#### 4.3.2. Result on Challenging Areas

**Texture-less Regions.** In these regions, pixel intensities vary only slightly, making pixels difficult to distinguish, which can lead to ambiguous results. We list several examples of disparity estimation on texture-less regions in Figure 5; the lawns and highways are texture-less. DSM-Net performs best, outputting disparity maps with less ambiguity in all scenes. Empirically, the elevation of a flat piece of lawn is the same everywhere, and DSM-Net predicts more consistent disparities there than the other networks. The elevation of a sloping highway varies continuously; the disparity map output by DSM-Net is accordingly continuous, while discontinuities appear in those of the other two.

**Repetitive Patterns.** Image patches in these regions have extremely similar appearances, which can cause fuzzy matching. Several examples are depicted in Figure 6. The residences contained in the images exhibit similar textures and colors. In the disparity maps predicted by StereoNet and PSMNet, some residences are joined to the boundaries of their neighbors, whereas in the disparity maps output by our DSM-Net, most of the residences are discriminated and the edges are better maintained, so less confusion occurs.

**Disparity Discontinuities and Occlusions.** Disparity discontinuities can lead to the edge-fattening issue, and in occluded areas there is no match at all. Owing to tall objects, such challenges are ubiquitous in remote sensing images. Figure 7 shows three examples of disparity estimation on areas containing high buildings and other objects. In the first image there is an occluded shadow behind the building: StereoNet gives an obviously wrong disparity prediction that is inconsistent with the surrounding ground, while PSMNet and DSM-Net give consistent disparities, and the disparity map output by DSM-Net is flatter. In the second image, the disparity maps output by PSMNet and DSM-Net show a smooth transition around the building, especially at the facade, and the building edges are more explicit in the map output by DSM-Net. Predictions for the third image also indicate that DSM-Net performs best. We argue that the 3D encoder-decoder module contributes to this improvement. In StereoNet, the cost is simply aggregated by five stacked 3D convolution layers; in DSM-Net, we aggregate the cost with a deeper module containing subsampling and up-sampling operations. This design lets the module rectify wrong cost values using information captured from a larger view, and the deeper layers give the cost aggregation module a more powerful ability to approximate the matching in occlusions. PSMNet adopts a stack of multiple hourglass modules for cost aggregation, which shares a similar idea with ours, and it can be seen that PSMNet also performs better than StereoNet.

## 5. Discussion

#### 5.1. Single-Scale vs. Dual-Scale

#### 5.2. Plain Module vs. Encoder-Decoder Module

#### 5.3. Without Refinement vs. With Refinement

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Kang, J.H.; Chen, L.; Deng, F.; Heipke, C. Context pyramidal network for stereo matching regularized by disparity gradients. ISPRS J. Photogramm. Remote Sens. **2019**, 157, 201–215.
2. Hirschmuller, H. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 807–814.
3. Zhang, L.; Seitz, S.M. Estimating optimal parameters for MRF stereo from a single image pair. IEEE Trans. Pattern Anal. Mach. Intell. **2007**, 29, 331–342.
4. Hirschmueller, H.; Scharstein, D. Evaluation of cost functions for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007.
5. Rhemann, C.; Hosni, A.; Bleyer, M.; Rother, C.; Gelautz, M. Fast Cost-Volume Filtering for Visual Correspondence and Beyond. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011.
6. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. **2002**, 47, 7–42.
7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM **2017**, 60, 84–90.
8. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. **2017**, 39, 640–651.
9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
10. Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
11. Zbontar, J.; LeCun, Y. Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches. J. Mach. Learn. Res. **2016**, 17, 65.
12. Luo, W.J.; Schwing, A.G.; Urtasun, R. Efficient Deep Learning for Stereo Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5695–5703.
13. Park, H.; Lee, K.M. Look Wider to Match Image Patches with Convolutional Neural Networks. IEEE Signal Process. Lett. **2017**, 24, 1788–1792.
14. Seki, A.; Pollefeys, M. Patch based confidence prediction for dense disparity map. In Proceedings of the British Machine Vision Conference, York, UK, 19–22 September 2016.
15. Seki, A.; Pollefeys, M. SGM-Nets: Semi-global matching with neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6640–6649.
16. Shaked, A.; Wolf, L. Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6901–6910.
17. Ye, X.Q.; Li, J.M.; Wang, H.; Huang, H.X.; Zhang, X.L. Efficient Stereo Matching Leveraging Deep Local and Context Information. IEEE Access **2017**, 5, 18745–18755.
18. Poggi, M.; Tosi, F.; Batsos, K.; Mordohai, P.; Mattoccia, S. On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. **2021**.
19. Liang, Z.F.; Feng, Y.L.; Guo, Y.L.; Liu, H.Z.; Chen, W.; Qiao, L.B.; Zhou, L.; Zhang, J.F. Learning for Disparity Estimation through Feature Constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2811–2820.
20. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048.
21. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Haeusser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2758–2766.
22. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-End Learning of Geometry and Context for Deep Stereo Regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 66–75.
23. Pang, J.H.; Sun, W.X.; Ren, J.S.J.; Yang, C.X.; Yan, Q. Cascade Residual Learning: A Two-stage Convolutional Neural Network for Stereo Matching. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 878–886.
24. Tonioni, A.; Tosi, F.; Poggi, M.; Mattoccia, S.; di Stefano, L. Real-time self-adaptive deep stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 195–204.
25. Xu, H.F.; Zhang, J.Y. AANet: Adaptive Aggregation Network for Efficient Stereo Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual Conference, 14–19 June 2020; pp. 1956–1965.
26. Chang, J.R.; Chen, Y.S. Pyramid Stereo Matching Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418.
27. Khamis, S.; Fanello, S.; Rhemann, C.; Kowdle, A.; Valentin, J.; Izadi, S. StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 596–613.
28. Yu, L.D.; Wang, Y.C.; Wu, Y.W.; Jia, Y.D. Deep Stereo Matching with Explicit Cost Aggregation Sub-Architecture. In Proceedings of the Innovative Applications of Artificial Intelligence Conference, New Orleans, LA, USA, 2–7 February 2018; pp. 7517–7524.
29. Chabra, R.; Straub, J.; Sweeney, C.; Newcombe, R.; Fuchs, H. StereoDRNet: Dilated Residual StereoNet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11778–11787.
30. Xing, J.B.; Qi, Z.; Dong, J.Y.; Cai, J.X.; Liu, H. MABNet: A Lightweight Stereo Network Based on Multibranch Adjustable Bottleneck Module. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 340–356.
31. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
32. Menze, M.; Geiger, A. Object Scene Flow for Autonomous Vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070.
33. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešić, N.; Wang, X.; Westling, P. High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. In Proceedings of the German Conference on Pattern Recognition, Münster, Germany, 2–5 September 2014; pp. 31–42.
34. Mayer, N.; Ilg, E.; Fischer, P.; Hazirbas, C.; Cremers, D.; Dosovitskiy, A.; Brox, T. What Makes Good Synthetic Training Data for Learning Disparity and Optical Flow Estimation? Int. J. Comput. Vis. **2018**, 126, 942–960.
35. Ji, S.P.; Liu, J.; Lu, M. CNN-Based Dense Image Matching for Aerial Remote Sensing Images. Photogramm. Eng. Remote Sens. **2019**, 85, 415–424.
36. Tao, R.S.; Xiang, Y.M.; You, H.J. An Edge-Sense Bidirectional Pyramid Network for Stereo Matching of VHR Remote Sensing Images. Remote Sens. **2020**, 12, 4025.
37. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
38. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. **2015**, 37, 1904–1916.
39. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 539–546.
40. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
41. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456.
42. Zeiler, M.D.; Taylor, G.W.; Fergus, R. Adaptive Deconvolutional Networks for Mid and High Level Feature Learning. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2018–2025.
43. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv **2015**, arXiv:1511.07122. Available online: https://arxiv.org/abs/1511.07122 (accessed on 23 November 2015).
44. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448.
45. Bosch, M.; Foster, K.; Christie, G.; Wang, S.; Hager, G.D.; Brown, M. Semantic Stereo for Incidental Satellite Images. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1524–1532.
46. Le Saux, B.; Yokoya, N.; Hansch, R.; Brown, M.; Hager, G. 2019 Data Fusion Contest [Technical Committees]. IEEE Geosci. Remote Sens. Mag. **2019**, 7, 103–105.
47. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv **2016**, arXiv:1603.04467. Available online: https://arxiv.org/abs/1603.04467 (accessed on 14 March 2016).
48. Atienza, R. Fast Disparity Estimation using Dense Networks. In Proceedings of the IEEE International Conference on Robotics and Automation, Brisbane, Australia, 21–25 May 2018; pp. 3207–3212.

**Figure 1.** Overview of DSM-Net, which consists of five components: feature extraction, cost volume creation, cost aggregation, disparity calculation, and disparity refinement.

**Figure 2.** The operation for cost volume creation. The right feature map shifts from $-{D}_{max}$ to ${D}_{max}-1$, so that the resulting cost volume covers the range [$-{D}_{max}$, ${D}_{max}$).

**Figure 3.** The architecture of the refinement module. Dilated convolutions [43] are used within the module; “dilation” denotes the dilation rate and “s” denotes the convolution stride. Each convolution layer is followed by a batch normalization layer and a leaky ReLU activation layer ($\alpha =0.3$), except the 1 × 1 × 3 and 3 × 3 × 1 layers.
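
For reference, the receptive-field arithmetic behind the dilated convolutions in the refinement module: each stride-1 3 × 3 layer with dilation rate $d$ widens the effective receptive field by $2d$ pixels. The rates in the sketch below are illustrative only; the actual rates are those specified in the figure.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d  # each layer adds (k - 1) * dilation pixels
    return rf

print(receptive_field([1, 2, 4, 8]))  # 31 pixels from only four 3x3 layers
```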

**Figure 4.** Disparity maps output by different networks. The image in the first column is from the OMA set, and the others are from the JAX testing set. Predictions of DenseMapNet, StereoNet, PSMNet, and DSM-Net are respectively labeled with yellow, red, green, and blue.

**Figure 5.** Disparity estimation results of different models on texture-less regions. Predictions of StereoNet, PSMNet, and DSM-Net are respectively labeled with red, green, and blue (subsequent figures in this section are also labeled in this way). Tile numbers are JAX-122-019-005, JAX-079-006-007, and OMA-211-026-006, from left to right.

**Figure 6.** Disparity estimation results of different models on repetitive patterns. Tile numbers are JAX-280-021-020, JAX-559-022-002, and OMA-132-027-023, from left to right.

**Figure 7.** Disparity estimation results of different models on disparity discontinuities and occluded areas. Tile numbers are JAX-072-011-022, JAX-264-014-007, and OMA-212-007-005, from left to right.

**Figure 8.** Disparity maps output by networks with single-scale and dual-scale learning schemes (tile number: JAX-072-001-006). The outputs of DSM-Net-v1, DSM-Net-v2, and DSM-Net are labeled with red, green, and blue, respectively.

**Figure 9.** The plain module for cost aggregation in DSM-Net-v3. In this variant, the 4D volume output by the eighth convolution layer at the low scale is up-sampled and added to the initial cost volume at the high scale. Each convolution layer is followed by a batch normalization layer and a leaky ReLU activation layer ($\alpha =0.3$), except for the last two layers.

**Figure 10.** Disparity maps output by networks with different cost aggregation modules (tile numbers: JAX-122-022-002, JAX-156-009-003). The outputs of DSM-Net-v3 and DSM-Net are respectively labeled with purple and blue.

**Figure 11.** Disparity maps output by networks without and with refinement operations (tile numbers: JAX-068-006-012, JAX-113-004-011). The outputs of DSM-Net-v4 and DSM-Net are respectively labeled with purple and blue.

**Table 1.** The architecture of the shared 2D CNN. The construction of residual blocks is designated in brackets with the number of stacked blocks; “s” denotes the stride of convolution. Each convolution layer is followed by a batch normalization [41] layer and a ReLU activation layer, except conv1_3 and conv2_3.

| Name | Setting | Output | Name | Setting | Output |
|---|---|---|---|---|---|
| Input | | H × W × 3 | | | |
| conv0_1 | 5 × 5 × 32, s = 2 | $\frac{H}{2}$ × $\frac{W}{2}$ × 32 | | | |
| conv0_2 | 5 × 5 × 32, s = 2 | $\frac{H}{4}$ × $\frac{W}{4}$ × 32 | | | |
| conv0_3 | [3 × 3 × 32; 3 × 3 × 32] × 6 | $\frac{H}{4}$ × $\frac{W}{4}$ × 32 | | | |
| **Low scale** | | | **High scale** | | |
| conv2_1 | 3 × 3 × 32, s = 2 | $\frac{H}{8}$ × $\frac{W}{8}$ × 32 | conv1_1 | 3 × 3 × 32 | $\frac{H}{4}$ × $\frac{W}{4}$ × 32 |
| conv2_2 | [3 × 3 × 32; 3 × 3 × 32] × 4 | $\frac{H}{8}$ × $\frac{W}{8}$ × 32 | conv1_2 | [3 × 3 × 32; 3 × 3 × 32] × 4 | $\frac{H}{4}$ × $\frac{W}{4}$ × 32 |
| conv2_3 | 3 × 3 × 16 | $\frac{H}{8}$ × $\frac{W}{8}$ × 16 | conv1_3 | 3 × 3 × 16 | $\frac{H}{4}$ × $\frac{W}{4}$ × 16 |

**Table 2.** The architecture of the 3D encoder-decoder module for cost aggregation. Each convolution layer is followed by a batch normalization layer and a leaky ReLU activation layer ($\alpha =0.3$), except conv14 and conv15. Factorized 3D convolutions are designated in brackets; “s” denotes the stride of convolution and “Trans” denotes transpose convolution [42]. Note that we use two independent modules with the same structure to separately aggregate the two cost volumes; the output of conv14 at the low scale is up-sampled and added to the initial high-scale cost volume before the high-scale aggregation. For an input image of size $H\times W$ and a range of $D$ candidate disparities, the cost volume has size $\frac{D}{{2}^{k}}\times \frac{H}{{2}^{k}}\times \frac{W}{{2}^{k}}$ after $k$ subsampling operations.

| Name | Setting | Low Scale | High Scale |
|---|---|---|---|
| Cost volume | | D/8 × H/8 × W/8 × 16 | D/4 × H/4 × W/4 × 16 |
| conv1 | 3 × 3 × 3 × 16 | D/8 × H/8 × W/8 × 16 | D/4 × H/4 × W/4 × 16 |
| conv2 | [3 × 1 × 1 × 16; 1 × 3 × 3 × 16] | D/8 × H/8 × W/8 × 16 | D/4 × H/4 × W/4 × 16 |
| conv3 | [3 × 1 × 1 × 16; 1 × 3 × 3 × 16] | D/8 × H/8 × W/8 × 16 | D/4 × H/4 × W/4 × 16 |
| conv4 | 3 × 3 × 3 × 32, s = 2 | D/16 × H/16 × W/16 × 32 | D/8 × H/8 × W/8 × 32 |
| conv5 | [3 × 1 × 1 × 32; 1 × 3 × 3 × 32] | D/16 × H/16 × W/16 × 32 | D/8 × H/8 × W/8 × 32 |
| conv6 | [3 × 1 × 1 × 32; 1 × 3 × 3 × 32] | D/16 × H/16 × W/16 × 32 | D/8 × H/8 × W/8 × 32 |
| conv7 | 3 × 3 × 3 × 64, s = 2 | D/32 × H/32 × W/32 × 64 | D/16 × H/16 × W/16 × 64 |
| conv8 | [3 × 1 × 1 × 64; 1 × 3 × 3 × 64] | D/32 × H/32 × W/32 × 64 | D/16 × H/16 × W/16 × 64 |
| conv9 | [3 × 1 × 1 × 64; 1 × 3 × 3 × 64] | D/32 × H/32 × W/32 × 64 | D/16 × H/16 × W/16 × 64 |
| conv10 | Trans 3 × 3 × 3 × 32, s = 2; add conv6 | D/16 × H/16 × W/16 × 32 | D/8 × H/8 × W/8 × 32 |
| conv11 | [3 × 1 × 1 × 32; 1 × 3 × 3 × 32] | D/16 × H/16 × W/16 × 32 | D/8 × H/8 × W/8 × 32 |
| conv12 | Trans 3 × 3 × 3 × 16, s = 2; add conv3 | D/8 × H/8 × W/8 × 16 | D/4 × H/4 × W/4 × 16 |
| conv13 | [3 × 1 × 1 × 16; 1 × 3 × 3 × 16] | D/8 × H/8 × W/8 × 16 | D/4 × H/4 × W/4 × 16 |
| conv14 | 3 × 3 × 3 × 16 | D/8 × H/8 × W/8 × 16 | D/4 × H/4 × W/4 × 16 |
| conv15 | 1 × 1 × 1 × 1 | D/8 × H/8 × W/8 × 1 | D/4 × H/4 × W/4 × 1 |
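
The tensor sizes in Table 2 follow directly from the formula in its caption, $\frac{D}{{2}^{k}}\times \frac{H}{{2}^{k}}\times \frac{W}{{2}^{k}}$ after $k$ subsampling operations. A quick sanity check with illustrative values ($D = 64$ and $H = W = 1024$ are assumptions for the example, not the experimental settings):

```python
def cost_volume_size(D, H, W, k):
    """Volume size after k subsampling operations: D/2^k x H/2^k x W/2^k."""
    s = 2 ** k
    return (D // s, H // s, W // s)

D, H, W = 64, 1024, 1024
# The low-scale volume enters at 1/8 resolution (k = 3); conv4 and conv7
# each halve it again, giving the 1/16 and 1/32 rows of the Low Scale column.
print(cost_volume_size(D, H, W, 3))  # (8, 128, 128)
print(cost_volume_size(D, H, W, 5))  # (2, 32, 32)
```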

**Table 3.** The usage of the dataset in our experiments. “JAX” represents Jacksonville and “OMA” represents Omaha (the same below).

| Stereo Pair | Mode | Training/Validation/Testing | Usage |
|---|---|---|---|
| JAX | RGB | 1500/139/500 | Training, validation, and testing |
| OMA | RGB | -/-/2153 | Testing |

**Table 4.** Quantitative results of different methods on the JAX testing set and the whole OMA set. The best results are in bold.

| Model | JAX EPE (Pixel) | JAX D1 (%) | OMA EPE (Pixel) | OMA D1 (%) | Time (ms) |
|---|---|---|---|---|---|
| DenseMapNet | 1.7405 | 14.19 | 1.8581 | 14.88 | 81 |
| StereoNet | 1.4356 | 10.00 | 1.5804 | 10.37 | 187 |
| PSMNet | 1.2968 | 8.06 | 1.4937 | 8.74 | 436 |
| Bidir-EPNet | **1.2764** | 8.03 | 1.4899 | 8.96 | - |
| DSM-Net | 1.2776 | **7.94** | **1.4757** | **8.73** | 168 |

**Table 5.** Quantitative results of different models on individual stereo pairs from the JAX testing set and OMA set. The best results are in bold.

| Tile | EPE (Pixel), DenseMapNet | EPE, StereoNet | EPE, PSMNet | EPE, DSM-Net | D1 (%), DenseMapNet | D1, StereoNet | D1, PSMNet | D1, DSM-Net |
|---|---|---|---|---|---|---|---|---|
| JAX-122-019-005 | 1.5085 | 1.4992 | 1.4815 | **1.2292** | 7.84 | 8.17 | 4.52 | **3.65** |
| JAX-079-006-007 | 1.6281 | 1.3743 | 1.2158 | **1.2082** | 12.42 | 9.71 | 8.46 | **8.16** |
| OMA-211-026-006 | 1.9534 | 1.5783 | 1.5739 | **1.4830** | 15.92 | 10.67 | 9.23 | **9.05** |
| JAX-280-021-020 | 1.3427 | 1.1412 | 1.0413 | **0.9772** | 10.42 | 6.75 | 6.26 | **5.43** |
| JAX-559-022-002 | 1.5756 | 1.5323 | 1.3536 | **1.2977** | 15.03 | 13.56 | 10.55 | **10.14** |
| OMA-132-027-023 | 1.5421 | 1.4018 | 1.3657 | **1.3054** | 11.84 | 9.61 | 9.21 | **8.45** |
| JAX-072-011-022 | 1.6813 | 1.3914 | 1.1675 | **1.0718** | 17.42 | 8.22 | 6.71 | **5.27** |
| JAX-264-014-007 | 1.6105 | 1.3083 | 1.0688 | **1.0528** | 15.54 | 6.67 | 4.65 | **3.81** |
| OMA-212-007-005 | 1.6740 | 1.3359 | 1.2587 | **1.1720** | 11.51 | 7.79 | 6.90 | **5.26** |

**Table 6.** Configurations and comparisons of the variants. The best results are in bold; a checkmark indicates that the network has the corresponding configuration.

| Model | Low Scale | High Scale | Plain Aggregation | Encoder-Decoder Aggregation | Without Refinement | With Refinement | JAX EPE (Pixel) | JAX D1 (%) | OMA EPE (Pixel) | OMA D1 (%) | Time (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DSM-Net-v1 | ✓ | | | ✓ | | ✓ | 1.3788 | 9.23 | 1.5327 | 9.77 | 78 |
| DSM-Net-v2 | | ✓ | | ✓ | | ✓ | 1.3195 | 8.34 | 1.4984 | 8.75 | 149 |
| DSM-Net-v3 | ✓ | ✓ | ✓ | | | ✓ | 1.3554 | 8.73 | 1.5078 | 8.91 | 469 |
| DSM-Net-v4 | ✓ | ✓ | | ✓ | ✓ | | 1.2817 | 8.03 | 1.4951 | 8.98 | 160 |
| DSM-Net | ✓ | ✓ | | ✓ | | ✓ | **1.2776** | **7.94** | **1.4757** | **8.73** | 168 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

He, S.; Zhou, R.; Li, S.; Jiang, S.; Jiang, W.
Disparity Estimation of High-Resolution Remote Sensing Images with Dual-Scale Matching Network. *Remote Sens.* **2021**, *13*, 5050.
https://doi.org/10.3390/rs13245050

**AMA Style**

He S, Zhou R, Li S, Jiang S, Jiang W.
Disparity Estimation of High-Resolution Remote Sensing Images with Dual-Scale Matching Network. *Remote Sensing*. 2021; 13(24):5050.
https://doi.org/10.3390/rs13245050

**Chicago/Turabian Style**

He, Sheng, Ruqin Zhou, Shenhong Li, San Jiang, and Wanshou Jiang.
2021. "Disparity Estimation of High-Resolution Remote Sensing Images with Dual-Scale Matching Network" *Remote Sensing* 13, no. 24: 5050.
https://doi.org/10.3390/rs13245050