# Matching Large Baseline Oblique Stereo Images Using an End-to-End Convolutional Neural Network


## Abstract


## 1. Introduction

- Complex geometric and radiometric distortions prevent these algorithms from extracting sufficient invariant features with a good repetition rate, which increases the probability of outliers;
- Repetitive textures, common in images, may yield numerous non-matching descriptors with very similar Euclidean distances, because the minimized loss functions consider only the matching descriptor and the closest non-matching descriptor;
- Because feature detection and matching are carried out independently, the feature points matched by the above methods achieve only pixel-level accuracy.

## 2. Methodology

#### 2.1. IHesAffNet for Feature Extraction

Algorithm 1. IHesAffNet Region Extraction.

1. Divide the image into grids.
2. For one grid, extract Hessian points and compute the average information entropy ${{\rm Y}}_{i}$ of the grid by Equation (1).
3. Set the threshold ${T}_{i}$ to ${{\rm Y}}_{i}/2$, and remove all Hessian points that fall below ${T}_{i}$.
4. Repeat Steps 2 and 3 until all grids are processed; then save the Hessian points.
5. For each Hessian point, use AffNet6 to extract an affine-invariant region; once all Hessian points are processed, save the Hessian affine-invariant regions.
6. Select the stable regions by the dual criteria $\left(W+H\right)/{\tau}_{1}\le \left(\epsilon+\vartheta \right)/2\le \left(W+H\right)/{\tau}_{2}$ and $\epsilon/\vartheta \le {e}_{\mathrm{T}}$.
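As a rough illustration of Steps 2–3, the sketch below computes a Shannon-entropy stand-in for Equation (1) and thresholds Hessian points against ${T}_{i}={{\rm Y}}_{i}/2$. The function names are ours, and comparing the Hessian response directly against the entropy-based threshold is our reading of the algorithm, not the authors' code:

```python
import numpy as np

def grid_entropy(cell, bins=256):
    """Shannon entropy of a grayscale cell's intensity histogram;
    a stand-in for the average information entropy Y_i of Equation (1)."""
    hist, _ = np.histogram(cell, bins=bins, range=(0, 256))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def filter_hessian_points(image, points, responses, grid=64):
    """Keep a Hessian point only if its response reaches the grid's
    threshold T_i = Y_i / 2 (an illustrative simplification)."""
    kept = []
    for (x, y), r in zip(points, responses):
        gx, gy = int(x) // grid * grid, int(y) // grid * grid
        t_i = grid_entropy(image[gy:gy + grid, gx:gx + grid]) / 2.0
        if r >= t_i:
            kept.append((x, y))
    return kept
```

On a textured grid cell the entropy is high, so only points with a sufficiently strong response survive, which matches the intent of removing weak points in information-rich regions.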

#### 2.2. Descriptor Generating by MTHardNets

The distance ${D}_{1}$ between matching descriptors $\mathbf{R}_{i}$ and $\mathbf{P}_{i}$ can be calculated, and similarly the distances ${D}_{2}$ and ${D}_{3}$ between non-matching descriptors can be computed. The purpose of the multiple networks is to pull the distance ${D}_{1}$ as close as possible and simultaneously push the distances ${D}_{2}$ and ${D}_{3}$ as far as possible, emphasizing in-class similarity and across-class discrimination.

The distance matrix $\mathbf{D}$ is estimated from m pairs of corresponding descriptors $({\mathit{R}}_{i},{\mathit{P}}_{i})$ based on HardNet by using Equation (3):

Algorithm 2. MTHardNets Descriptor.

1. Given $\mathit{\tau}=\left({\mathit{r}}_{i},{\mathit{p}}_{i}\right)$, select the $K$ closest non-matching patches $\left({\mathit{n}}_{i}^{1\mathrm{st}},{\mathit{n}}_{i}^{2\mathrm{nd}},\cdots ,{\mathit{n}}_{i}^{K\mathrm{th}}\right)$ for $\mathit{\tau}$.
2. Transform the current $K+2$ patches into 128-D unit descriptors, and compute the distance ${D}_{1}$ by Equation (3); similarly, compute the distances ${D}_{2}$ and ${D}_{3}$.
3. Estimate a distance matrix $\mathbf{D}$ by Equation (4), then generate $\mathit{S}$ by Equation (5).
4. Generate the EWLF using $\mathit{S}$ and Equation (6), and build the EWLF model by Equation (7).
5. For each weight group, train the MTHardNets and compute the MDD by Equation (8).
6. Use the highest MDD to simplify the EWLF model as in Equation (9).
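Under our reading of Steps 1–2, the distances ${D}_{1}$, ${D}_{2}$, ${D}_{3}$ and a HardNet-style hinge loss can be sketched as below. The pairing of ${D}_{2}$/${D}_{3}$ with each descriptor's hardest negative, the margin value, and the EWLF weights are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def l2(a, b):
    return float(np.linalg.norm(a - b))

def mthardnets_losses(r, p, negatives, margin=1.0):
    """D1: distance between the matching descriptors r and p.
    D2/D3: distance from r (resp. p) to its closest (hardest) negative.
    Returns the three distances and a HardNet-style hinge loss that pulls
    D1 down while pushing D2 and D3 past the margin."""
    d1 = l2(r, p)
    d2 = min(l2(r, n) for n in negatives)
    d3 = min(l2(p, n) for n in negatives)
    loss = max(0.0, margin + d1 - min(d2, d3))
    return d1, d2, d3, loss

def ewlf(losses, weights):
    """EWLF sketch: a weighted fusion of per-branch losses with
    illustrative weights w_u (Equations (6)-(7) in the text)."""
    return float(sum(w * L for w, L in zip(weights, losses)))
```

With unit descriptors, a perfect match (${D}_{1}=0$) and a distant negative produce zero loss, while a hard negative closer than the margin keeps the loss positive and drives training.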

#### 2.3. Match Optimizing by DLT-LSM

Algorithm 3. DLT-LSM.

1. For one correspondence $\mathit{x}$ and ${\mathit{x}}^{\prime}$, determine the initial affine transform $\mathit{B}$ by Equation (10) and initialize $\mathit{H}$ using $\mathit{B}$. Set the maximum number of iterations to ${N}_{\mathrm{T}}$.
2. Build the correlation windows $\mathit{\Omega}$ and ${\mathit{\Omega}}^{\prime}$ by Equations (11) and (12), then compute $\rho$ by Equation (13).
3. Build the LSM error equation as in Equation (14), then compute $\mathit{X}$ and update $\mathit{H}$. If the number of iterations $N$ is less than ${N}_{\mathrm{T}}$, go to Step 2; otherwise, correct ${\mathit{x}}^{\prime}$ by Equation (15).
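The initial affine transform of Step 1 can be estimated by a direct linear transformation in the least-squares sense. This numpy sketch (function names are ours) recovers a 2×3 affine matrix from point correspondences; the iterative LSM refinement of Steps 2–3 is omitted:

```python
import numpy as np

def affine_dlt(src, dst):
    """Least-squares (DLT-style) estimate of the 2x3 affine transform B
    mapping src points to dst points: for each pair, u = a*x + b*y + c
    and v = d*x + e*y + f, stacked into a linear system."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0]); b.append(u)
        A.append([0, 0, 0, x, y, 1]); b.append(v)
    params, *_ = np.linalg.lstsq(np.asarray(A, float),
                                 np.asarray(b, float), rcond=None)
    return params.reshape(2, 3)

def apply_affine(B, pts):
    """Apply the 2x3 affine transform B to an array of 2D points."""
    pts = np.asarray(pts, float)
    return pts @ B[:, :2].T + B[:, 2]
```

Three non-collinear correspondences already determine the six parameters; extra points make the estimate a least-squares fit, which is what suppresses localization noise before the LSM iterations take over.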

#### 2.4. Training Dataset and Implementation Details

## 3. Results

#### 3.1. Test Data and Evaluation Criteria

- Number of correct correspondences ${n}_{{\epsilon}_{0}}$ (pairs): the matching error is calculated using ${\mathit{H}}_{0}$ and Equation (16). If the error of a corresponding point is less than the given threshold ${\epsilon}_{0}$ (1.5 pixels in our experiments), it is regarded as an inlier and counted toward ${n}_{{\epsilon}_{0}}$.
- Correct ratio β (%) of matches: computed as $\beta ={n}_{{\epsilon}_{0}}/num$, where num (pairs) is the total number of matches.
- Root mean square error ${\epsilon}_{\mathrm{RMSE}}$ (pixels) of matches: calculated by$${\epsilon}_{\mathrm{RMSE}}=\sqrt{\frac{1}{num}{\displaystyle \sum}_{i=1}^{num}{\left\Vert {\mathit{y}}_{i}^{\prime}-{\mathit{H}}_{0}{\mathit{y}}_{i}\right\Vert}^{2}},$$
- Matching distribution quality MDQ: a lower MDQ value indicates better geometric homogeneity of the Delaunay triangles generated from the matched points, so MDQ serves as a metric of how uniformly the matches are distributed. This index is estimated by Equation (2).
- Matching efficiency η (seconds per pair of matching points): computed from the average run time for one pair of corresponding points, namely $\eta =t/num$, where t (seconds) denotes the total test time of the algorithm.
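A minimal sketch of these indices, assuming matches are given as pairs of pixel coordinates and ${\mathit{H}}_{0}$ is the 3×3 reference homography (the helper name `match_metrics` is ours):

```python
import numpy as np

def match_metrics(matches, H0, eps0=1.5, total_time=None):
    """Compute n_eps0 (correct matches), beta (%), RMSE (pixels), and,
    if a total runtime is given, the efficiency eta (s per match)."""
    errs = []
    for (x, y), (xp, yp) in matches:
        w = H0 @ np.array([x, y, 1.0])       # project via H0
        proj = w[:2] / w[2]                   # back to inhomogeneous coords
        errs.append(np.linalg.norm(proj - np.array([xp, yp])))
    errs = np.asarray(errs)
    num = len(errs)
    n_eps0 = int((errs < eps0).sum())
    beta = 100.0 * n_eps0 / num
    rmse = float(np.sqrt((errs ** 2).mean()))
    eta = None if total_time is None else total_time / num
    return n_eps0, beta, rmse, eta
```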

#### 3.2. Experimental Results of the Key Steps of the Proposed Method

#### 3.3. Experimental Results of Comparison Methods

- The proposed method.
- Detone’s method [26]: This approach uses a fully convolutional neural network (MagicPoint) trained on an extensive synthetic dataset, which limits its transfer to real scenes. A homographic adaptation (HA) strategy transforms MagicPoint into SuperPoint, which boosts detector performance and generates repeatable feature points. The method also couples SuperPoint with a descriptor subnetwork that produces 256-dimensional descriptors, and matching uses the NNDR metric. Although HA outperforms classical detectors, the random nature of the HA step limits the invariance of this technique to geometric deformations.
- Morel’s method [10]: This method samples stereo images by simulating discrete poses in 3D affine space, applies the SIFT algorithm to the simulated image pairs, and transforms all matches back to the original image pair. It has been shown to find correspondences in image pairs with large viewpoint changes; however, false positives often occur on repeating patterns.
- Matas’s method [12]: The approach extracts features using MSER and estimates SIFT descriptors after normalizing the feature regions. It uses the NNDR metric to obtain matching features.
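The NNDR metric shared by Detone’s and Matas’s pipelines can be sketched as follows; the 0.8 ratio threshold is a common default, not necessarily the value used in those papers:

```python
import numpy as np

def nndr_match(desc1, desc2, ratio=0.8):
    """Nearest-neighbor distance ratio matching: accept a match only when
    the closest descriptor in desc2 is clearly better than the second
    closest, which suppresses ambiguous matches on repetitive textures."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        if len(order) > 1 and dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, int(order[0])))
    return matches
```

This is exactly why repetitive textures hurt NNDR-based methods: when the two nearest candidates have near-equal distances, the ratio test rejects the match even if one of them is correct.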

## 4. Discussion

#### 4.1. Discussion on Experimental Results of the Key Steps of the Proposed Method

#### 4.2. Discussion on Experimental Results of Comparison Methods

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- Qin, R. A critical analysis of satellite stereo pairs for digital surface model generation and a matching quality prediction model. ISPRS J. Photogramm. Remote Sens. 2019, 154, 139–150.
- Liu, W.; Wu, B. An integrated photogrammetric and photoclinometric approach for illumination-invariant pixel-resolution 3D mapping of the lunar surface. ISPRS J. Photogramm. Remote Sens. 2020, 159, 153–168.
- Zhang, H.; Ni, W.; Yan, W.; Xiang, D.; Bian, H. Registration of multimodal remote sensing image based on deep fully convolutional neural network. IEEE J. Stars 2019, 12, 3028–3042.
- Gruen, A. Development and status of image matching in photogrammetry. Photogramm. Rec. 2012, 27, 36–57.
- Song, W.; Jung, H.; Gwak, I.; Lee, S. Oblique aerial image matching based on iterative simulation and homography evaluation. Pattern Recognit. 2019, 87, 317–331.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359.
- Mur-Artal, R.; Montiel, J.; Tardos, J. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163.
- Ma, W.; Wen, Z.; Wu, Y.; Jiao, L.; Gong, M.; Zheng, Y.; Liu, L. Remote sensing image registration with modified SIFT and enhanced feature matching. IEEE Geosci. Remote Sens. Lett. 2017, 14, 3–7.
- Morel, J.M.; Yu, G.S. ASIFT: A new framework for fully affine invariant image comparison. SIAM J. Imaging Sci. 2009, 2, 438–469.
- Mikolajczyk, K.; Schmid, C. Scale & affine invariant interest point detectors. Int. J. Comput. Vis. 2004, 60, 63–86.
- Matas, J.; Chum, O.; Urban, M.; Pajdla, T. Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 2004, 22, 761–767.
- Liu, X.; Samarabandu, J. Multiscale edge-based text extraction from complex images. In Proceedings of the IEEE International Conference on Multimedia and Expo, Toronto, ON, Canada, 9–12 July 2006; pp. 1721–1724.
- Tuytelaars, T.; Van Gool, L. Matching widely separated views based on affine invariant regions. Int. J. Comput. Vis. 2004, 59, 61–85.
- Yao, G.; Man, X.; Zhang, L.; Deng, K.; Zheng, G. Registrating oblique SAR images based on complementary integrated filtering and multilevel matching. IEEE J. Stars 2019, 12, 3445–3457.
- Mikolajczyk, K.; Tuytelaars, T.; Schmid, C.; Zisserman, A.; Matas, J.; Schaffalitzky, F.; Kadir, T.; Van Gool, L. A comparison of affine region detectors. Int. J. Comput. Vis. 2005, 65, 43–72.
- Mikolajczyk, K.; Schmid, C. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. 2005, 27, 1615–1630.
- Lenc, K.; Vedaldi, A. Learning covariant feature detectors. In Proceedings of the ECCV Workshop on Geometry Meets Deep Learning, Amsterdam, The Netherlands, 31 August–1 September 2016; pp. 100–117.
- Zhang, X.; Yu, F.X.; Karaman, S. Learning discriminative and transformation covariant local feature detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4923–4931.
- Doiphode, N.; Mitra, R.; Ahmed, S. An improved learning framework for covariant local feature detection. In Proceedings of the Asian Conference on Computer Vision (ACCV), Perth, Australia, 2–6 December 2018; pp. 262–276.
- Yi, K.M.; Verdie, Y.; Fua, P.; Lepetit, V. Learning to assign orientations to feature points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016.
- Detone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236.
- Tian, Y.; Fan, B.; Wu, F. L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6128–6136.
- Mishchuk, A.; Mishkin, D.; Radenovic, F. Working hard to know your neighbor’s margins: Local descriptor learning loss. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4826–4837.
- Mishkin, D.; Radenovic, F.; Matas, J. Repeatability is not enough: Learning affine regions via discriminability. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 287–304.
- Wan, J.; Yilmaz, A.; Yan, L. PPD: Pyramid patch descriptor via convolutional neural network. Photogramm. Eng. Remote Sens. 2019, 85, 673–686.
- Han, X.; Leung, T.; Jia, Y.; Sukthankar, R. MatchNet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3279–3286.
- Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 84–92.
- Tian, Y.; Yu, X.; Fan, B. SOSNet: Second order similarity regularization for local descriptor learning. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11016–11025.
- Ebel, P.; Trulls, E.; Yi, K.M.; Fua, P.; Mishchuk, A. Beyond Cartesian representations for local descriptors. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 253–262.
- Luo, Z.; Shen, T.; Zhou, L.; Zhu, S.; Zhang, R.; Yao, Y.; Fang, T.; Quan, L. GeoDesc: Learning local descriptors by integrating geometry constraints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 170–185.
- Luo, Z.; Shen, T.; Zhou, L.; Zhu, S.; Zhang, R.; Yao, Y.; Fang, T.; Quan, L. ContextDesc: Local descriptor augmentation with cross-modality context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2527–2536.
- Yi, K.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 467–483.
- Ono, Y.; Trulls, E.; Fua, P. LF-Net: Learning local features from images. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 6234–6244.
- Revaud, J.; Weinzaepfel, P.; De, S. R2D2: Repeatable and reliable detector and descriptor. arXiv 2019, arXiv:1906.06195.
- Jin, Y.; Mishkin, D.; Mishchuk, A.; Matas, J.; Fua, P.; Yi, K.M.; Trulls, E. Image matching across wide baselines: From paper to practice. Int. J. Comput. Vis. 2020, 1–31.
- Sedaghat, A.; Mokhtarzade, M.; Ebadi, H. Uniform robust scale-invariant feature matching for optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2011, 49, 4516–4527.
- Zhu, Q.; Wu, B.; Xu, Z.X. Seed point selection method for triangle constrained image matching propagation. IEEE Geosci. Remote Sens. Lett. 2006, 3, 207–211.
- Brown, M.; Lowe, D.G. Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis. 2007, 74, 59–73.
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the Advances in Neural Information Processing Systems Workshop, Long Beach, CA, USA, 4–9 December 2017.
- Podbreznik, P.; Potočnik, B. A self-adaptive ASIFT-SH method. Adv. Eng. Inform. 2013, 27, 120–130.

**Figure 2.** Matching performance of multiple AffNet versions with different configurations, where AffNet6 provides the best result.

**Figure 3.** Comparison between the AffNet and IHesAffNet on the Graf1-6 stereo image dataset. (**a**,**c**) are the detection results of the AffNet and IHesAffNet, respectively; (**b**,**d**) are the verification tests based on the detections in (**a**,**c**), respectively. The yellow and cyan ellipses denote the detected and matched features, respectively. Note that the bottom row, which corresponds to our approach, is better.

**Figure 4.** Schematic of the MTHardNets with K nearest negative samples using the EWLF, where (K + 2) image patches are fed in parallel into the same HardNet, which outputs the matching descriptors $({\mathit{R}}_{i},{\mathit{P}}_{i})$ and non-matching descriptors $\left({\mathit{N}}_{i}^{1\mathrm{st}},\cdots ,{\mathit{N}}_{i}^{K\mathrm{th}}\right)$. The ${L}_{u}$ and ${w}_{u}$ represent the loss and weight, respectively, and the EWLF can be expressed by L as given in Equations (6) and (7).

**Figure 6.** Discrimination comparison between the proposed MTHardNets and HardNet descriptors using five pairs of matching points. PDD stands for positive descriptor distance and NDD for negative descriptor distance.

**Figure 7.** Comparison of different matching methods in terms of (**a**) the correct ratio β (%) of matches (higher is better) and (**b**) the root mean square error ${\epsilon}_{\mathrm{RMSE}}$ (pixels) of matches (lower is better).

**Figure 8.** (**a**–**f**) Matching results of our approach on a group of stereo pairs. The ellipses represent the matched regions around each feature point; the lines represent the matches.

**Figure 9.** (**a**–**f**) Red points show the matching results of the proposed method on groups of stereo images. The yellow lines represent the epipolar lines estimated from the matched points.

**Figure 10.** (**a**–**f**) Red points show the matching results of Detone’s method on groups of stereo images. The yellow lines represent the epipolar lines estimated from the matched points.

**Figure 11.** (**a**–**f**) Red points show the matching results of Morel’s method on groups of stereo images. The yellow lines represent the epipolar lines estimated from the matched points.

**Figure 12.** (**a**–**f**) Red points show the matching results of Matas’s method on groups of stereo images. The yellow lines represent the epipolar lines estimated from the matched points.

**Figure 13.** Chessboard registration results of the four comparison approaches for image pair A: (**a**) proposed, (**b**) Detone’s, (**c**) Morel’s, and (**d**) Matas’s. Subregions I and II display detailed views.

**Figure 14.** Contrast of the affine transform error before and after DLT-LSM for the proposed approach. (**a**,**b**) each present 10 randomly sampled matched features, extracted from image pairs A and D, respectively.

**Table 1.** Number of dimensions in each layer for different AffNet versions.

| AffNet Version | 1st Layer | 2nd Layer | 3rd Layer | 4th Layer | 5th Layer | 6th Layer | 7th Layer |
|---|---|---|---|---|---|---|---|
| AffNet1 | 32 | 32 | 64 | 64 | 128 | 128 | 3 |
| AffNet2 | 28 | 28 | 56 | 56 | 112 | 112 | 3 |
| AffNet3 | 24 | 24 | 48 | 48 | 96 | 96 | 3 |
| AffNet4 | 20 | 20 | 40 | 40 | 80 | 80 | 3 |
| AffNet5 | 16 | 16 | 32 | 32 | 64 | 64 | 3 |
| AffNet6 | 12 | 12 | 24 | 24 | 48 | 48 | 3 |
| AffNet7 | 8 | 8 | 16 | 16 | 32 | 32 | 3 |
| AffNet8 | 4 | 4 | 8 | 8 | 16 | 16 | 3 |

**Table 2.**Comparison of the AffNet and IHesAffNet in terms of feature distribution and repeatability on the Graf1-6 stereo image dataset.

| Method | Left MDQ | Right MDQ | Matched Features |
|---|---|---|---|
| HesAffNet | 1.19 | 1.32 | 93 |
| Our IHesAffNet | 0.84 | 0.89 | 136 |

**Table 3.** Conjugate neighborhoods, affine transform error $\epsilon $ (pixels), and correlation coefficient $\rho $ as corrected over DLT-LSM iterations. (The window-border and correlation-window image cells are not reproduced here.)

| Point | Index | Initial | First Iteration | Second Iteration | Third Iteration | Fourth Iteration |
|---|---|---|---|---|---|---|
| 1 | ε | 2.635 | 2.244 | 1.290 | 0.837 | 0.516 |
| 1 | ρ | 0.569 | 0.638 | 0.805 | 0.919 | 0.957 |
| 2 | ε | 1.278 | 0.860 | 0.553 | 0.375 | 0.241 |
| 2 | ρ | 0.832 | 0.915 | 0.949 | 0.957 | 0.965 |

**Table 4.**List of six different methods, namely, the AMD, the ISD, the IPD, the IHD, the IMN, and the proposed method.

| Steps | AMD | ISD | IPD | IHD | IMN | Proposed Method |
|---|---|---|---|---|---|---|
| Feature extracting | AffNet [25] | IHesAffNet | IHesAffNet | IHesAffNet | IHesAffNet | IHesAffNet |
| Descriptor generating | MTHardNets | SIFT [6] | PPD [26] | HardNet [24] | MTHardNets | MTHardNets |
| Match optimizing | DLT-LSM | DLT-LSM | DLT-LSM | DLT-LSM | Null | DLT-LSM |

**Table 5.**Number of correct correspondences obtained by the AMD, ISD, IPD, IHD, IMN, and proposed method based on six groups of stereo images. Bold font denotes the best results.

| Image Pair | AMD | ISD | IPD | IHD | IMN | Proposed Method |
|---|---|---|---|---|---|---|
| A | 78 | 21 | 86 | 92 | 41 | **152** |
| B | 219 | 35 | 203 | 235 | 87 | **349** |
| C | 313 | 85 | 409 | 402 | 164 | **662** |
| D | 329 | 90 | 339 | 348 | 97 | **397** |
| E | 912 | 120 | 937 | 921 | 196 | **972** |
| F | 375 | 42 | 401 | 398 | 93 | **408** |

**Table 6.**The total test time (the unit is second) of the AMD, ISD, IPD, IHD, IMN, and proposed method based on six groups of stereo images. Bold font denotes the best results.

| Image Pair | AMD | ISD | IPD | IHD | IMN | Proposed Method |
|---|---|---|---|---|---|---|
| A | 29.02 | **24.09** | 29.36 | 30.45 | 29.14 | 31.49 |
| B | 32.79 | **27.49** | 33.45 | 34.58 | 33.20 | 36.55 |
| C | 39.60 | **36.52** | 43.10 | 42.70 | 39.05 | 45.73 |
| D | 40.14 | **35.60** | 42.08 | 41.83 | 40.57 | 44.61 |
| E | 47.05 | **41.65** | 48.74 | 49.06 | 43.74 | 52.32 |
| F | 40.33 | **35.94** | 40.16 | 41.63 | 40.92 | 43.80 |

**Table 7.** Quantitative comparison of four methods based on six groups of image pairs. In this table, ${n}_{{\epsilon}_{0}}$ is the number of correct matches (pairs), β is the correct ratio (%) of matches, ${\epsilon}_{\mathrm{RMSE}}$ is the matching error (pixels), MDQ is the matching distribution quality, t is the total test time (seconds), and η is the matching efficiency (seconds per pair of matching points). The best score of each index is displayed in bold.

| Method | Index | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|
| Proposed | ${n}_{{\epsilon}_{0}}$ | 152 | 349 | **662** | **397** | **972** | **408** |
| | β | **94.43** | **87.70** | **87.45** | **92.98** | **94.71** | **90.66** |
| | ${\epsilon}_{\mathrm{RMSE}}$ | **0.64** | **0.61** | **0.93** | **0.54** | **0.59** | **0.91** |
| | MDQ | **0.893** | **1.021** | **1.130** | **0.988** | **0.764** | **1.104** |
| | t | 31.49 | 36.55 | 45.73 | 44.61 | 52.32 | 43.80 |
| | η | 0.196 | 0.092 | **0.060** | 0.105 | 0.051 | **0.097** |
| Detone’s | ${n}_{{\epsilon}_{0}}$ | 6 | 56 | 0 | 0 | 0 | 0 |
| | β | 46.15 | 43.75 | 0 | 0 | 0 | 0 |
| | ${\epsilon}_{\mathrm{RMSE}}$ | 3.32 | 2.28 | 221.83 | 53.62 | 50.25 | 247.41 |
| | MDQ | 1.706 | 1.290 | 3.379 | 2.561 | 3.027 | 2.643 |
| | t | 16.84 | 20.33 | 27.89 | 24.30 | 30.95 | 25.19 |
| | η | 1.295 | 0.159 | 3.984 | 3.038 | 3.095 | 1.679 |
| Morel’s | ${n}_{{\epsilon}_{0}}$ | **257** | **408** | 86 | 66 | 532 | 31 |
| | β | 36.25 | 41.17 | 23.50 | 21.71 | 38.89 | 13.78 |
| | ${\epsilon}_{\mathrm{RMSE}}$ | 0.91 | 0.92 | 1.80 | 0.96 | 0.88 | 0.98 |
| | MDQ | 0.974 | 1.083 | 1.384 | 1.920 | 1.005 | 1.766 |
| | t | 31.38 | 33.20 | 33.90 | 29.13 | 32.07 | 33.85 |
| | η | **0.044** | **0.034** | 0.093 | **0.096** | **0.023** | 0.150 |
| Matas’s | ${n}_{{\epsilon}_{0}}$ | 10 | 48 | 7 | 6 | 4 | 0 |
| | β | 34.48 | 78.69 | 41.18 | 42.86 | 66.67 | 0 |
| | ${\epsilon}_{\mathrm{RMSE}}$ | 2.74 | 1.39 | 3.61 | 8.35 | 1.93 | 25.26 |
| | MDQ | 1.562 | 1.495 | 2.279 | 1.981 | 1.596 | 2.885 |
| | t | **2.78** | **3.82** | **5.99** | **6.72** | **6.37** | **6.38** |
| | η | 0.095 | 0.063 | 0.352 | 0.480 | 1.062 | 0.798 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yao, G.; Yilmaz, A.; Zhang, L.; Meng, F.; Ai, H.; Jin, F.
Matching Large Baseline Oblique Stereo Images Using an End-to-End Convolutional Neural Network. *Remote Sens.* **2021**, *13*, 274.
https://doi.org/10.3390/rs13020274
