Matching Large Baseline Oblique Stereo Images Using an End-to-End Convolutional Neural Network

Abstract: The available stereo matching algorithms produce a large number of false positive matches, or only a few true positives, across oblique stereo images with a large baseline. This undesired result is caused by the complex perspective deformation and radiometric distortion across such images. To address this problem, we propose a novel affine invariant feature matching algorithm with subpixel accuracy based on an end-to-end convolutional neural network (CNN). In our method, we adopt and modify a Hessian affine network, which we refer to as IHesAffNet, to obtain affine invariant Hessian regions using a deep learning framework. To improve the correlation between corresponding features, we introduce an empirical weighted loss function (EWLF) based on the negative samples using K nearest neighbors, and then generate deep learning-based descriptors with high discrimination, realized with our multiple hard network structure (MTHardNets). Following this step, the conjugate features are produced by using the Euclidean distance ratio as the matching metric, and the accuracy of the matches is optimized through deep learning transform based least square matching (DLT-LSM). Finally, experiments on large baseline oblique stereo images acquired by ground close-range photography and an unmanned aerial vehicle (UAV) verify the effectiveness of the proposed approach, and comprehensive comparisons demonstrate that our matching algorithm outperforms the state-of-the-art methods in terms of accuracy, distribution and correct ratio. The main contributions of this article are: (i) the proposed MTHardNets can generate high quality descriptors; and (ii) the IHesAffNet can produce substantial affine invariant corresponding features with reliable transform parameters.


Introduction
Large baseline oblique stereo images play a crucial role in achieving realistic three-dimensional (3D) reconstruction [1], topographic mapping [2] and object extraction [3], owing to the advantages of stable image geometry, extensive coverage and abundant textures across the images. However, practitioners significantly change the viewpoints by increasing the baseline and oblique angle between the cameras, which leads to significant geometric deformation, radiometric distortion and local surface discontinuity across the acquired images. Therefore, large baseline oblique image matching remains an open problem for practical use in both photogrammetry [4] and computer vision [5].
In the past few decades, researchers have proposed numerous feature matching algorithms invariant to certain transformations [6][7][8][9], most of which target wide baseline stereo images. The popular feature matching methods, including but not limited to the scale invariant feature transform (SIFT) algorithm [6] and its modifications [7][8][9], mainly suffer from the following limitations:
1. Complex geometric and radiometric distortions inhibit these algorithms from extracting sufficient invariant features with a good repetition rate, which increases the probability of outliers;
2. Universally repetitive textures in images may result in numerous non-matching descriptors with very similar Euclidean distances, because the minimized loss functions only consider the matching descriptor and the closest non-matching descriptor;
3. Because feature detection and matching are carried out independently, the feature points matched by the above methods can only achieve pixel-level accuracy.
In order to address these problems, this article first generates affine invariant regions based on a modified version of the Hessian affine network (IHesAffNet). Following this step, we construct the MTHardNets and generate robust 128-dimensional deep learning descriptors. Afterwards, corresponding regions are identified using the nearest neighbor distance ratio (NNDR) metric. Furthermore, the positioning error of each match is effectively compensated by deep learning transform based least square matching (DLT-LSM), where the initial iterating parameters of DLT-LSM are provided by the covariance matrices of the deep learning regions. We conducted a comprehensive set of experiments on real large baseline oblique stereo image pairs to verify the effectiveness of our proposed end-to-end strategy, which outperforms the available state-of-the-art methods.
Our main contributions are summarized as follows. First, the improved IHesAffNet can obtain a sufficient number of affine regions with better repeatability and distribution. Second, the proposed MTHardNets can generate feasible descriptors with high discriminability. Third, subpixel matching accuracy can be achieved by the DLT-LSM strategy for large baseline oblique images.
The remainder of this article is organized as follows. In Section 2, we present our approach in detail. In Section 3, we present the results. Discussion on the experimental results is given in Section 4. Section 5 concludes this article and presents future work.

Methodology
The purpose of this article is to automatically obtain a sufficient number of precise matches from large baseline oblique stereo images. Our proposed matching method is illustrated in Figure 1 and involves two main stages. In the first stage, the objective is to detect adequate affine invariant features with a uniform spatial distribution and generate distinctive descriptors. This stage is the basis and key for matching large baseline oblique images. In the second stage, the objective is to produce corresponding features and compensate for the position errors of the matches. This stage further verifies the matches achieved in the first stage. We detail the methodology in the following sections.

IHesAffNet for Feature Extraction
The quantity and quality of matches are directly determined by feature detection. Many different deep learning structures for feature extraction were studied in [36], which concludes that it is difficult to select a learned detector network that can adapt to all images in all cases. However, AffNet has been shown to be more robust than other methods, especially in wide baseline matching. Therefore, AffNet is exploited to detect local invariant features in this article. There are seven convolutional layers in the AffNet architecture, which is inherited from HardNet [24], but the number of dimensions in the first six layers is halved and the final 128-D descriptor output layer is replaced by a 3-D output layer. For more details, please refer to [25].
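For concreteness, the following is a minimal PyTorch sketch of an AffNet-style regressor consistent with the description above (HardNet's seven-layer architecture with channel counts halved and a 3-D output); the exact channel widths and strides are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AffNetSketch(nn.Module):
    """Sketch of an AffNet-style regressor: HardNet's 7-layer CNN with
    channel counts halved and a 3-D output (affine shape parameters).
    Channel widths and strides are assumptions, not the authors' code."""
    def __init__(self, out_dim: int = 3):
        super().__init__()
        def block(cin, cout, stride=1):
            return [nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
                    nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
        self.features = nn.Sequential(
            *block(1, 16), *block(16, 16),
            *block(16, 32, stride=2), *block(32, 32),
            *block(32, 64, stride=2), *block(64, 64),
            nn.Dropout(0.25),
            nn.Conv2d(64, out_dim, kernel_size=8, bias=True),  # final 8x8 conv -> 1x1
        )
    def forward(self, patches):  # patches: (B, 1, 32, 32)
        return self.features(patches).view(patches.size(0), -1)  # (B, 3)
```

The AffNet variants of Table 1 (AffNet1-AffNet6) would differ only in their layer dimensions; the same skeleton applies with the corresponding channel widths.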
Based on our tests, we verified that the original AffNet can cope with viewpoint and illumination variation across stereo images. Nevertheless, we also found that it often failed to generate well-distributed and consistent results when both the baseline and the oblique angle between the image pair are large. To overcome this problem, we propose the following optimization strategy, namely IHesAffNet.
First, a moderate number of Hessian features are extracted from each image grid cell, and we only keep the Hessian points in each cell that satisfy a local information entropy threshold. More specifically, the average information entropy $Y_i$ of an arbitrary grid cell is estimated by

$$Y_i = -\frac{1}{a} \sum_{k=1}^{a} \sum_{v=1}^{b} \psi_v \log_2 \psi_v, \qquad (1)$$

where $a$ is the number of Hessian features in the grid cell, $b$ is the number of distinct pixel values in a feature region, and $\psi_v$ is the proportion of pixels taking the $v$th value in the feature region. Then, we set the threshold $T_i$ to $Y_i/2$ for each grid cell and adaptively remove the features with relatively low information entropy, thereby obtaining local Hessians with a globally uniform distribution. Second, we improve AffNet by selecting the optimal number of dimensions and obtain the affine invariant Hessian regions. Specifically, we search for the best parameter set using multiple versions of AffNet with different dimensions (see Table 1) on the Graf1-6 stereo image dataset. The test results for these variants are plotted in Figure 2, which reveals that AffNet5, namely the original AffNet, can reliably obtain a certain number of matches but is slightly outperformed by AffNet6 in most epochs. Thus, in the remainder of the paper, we use AffNet6 as the improved version to extract affine invariant regions.
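A minimal sketch of the entropy-based filtering within one grid cell, assuming Equation (1) is the Shannon entropy of the pixel-value distribution averaged over the cell's feature regions; region_entropy and filter_grid_features are hypothetical helper names, not the authors' code.

```python
import numpy as np

def region_entropy(patch: np.ndarray) -> float:
    """Shannon entropy (bits) of the pixel-value distribution in a feature region."""
    values, counts = np.unique(patch, return_counts=True)
    psi = counts / counts.sum()              # proportion of each distinct value
    return float(-(psi * np.log2(psi)).sum())

def filter_grid_features(features, patches):
    """Keep features whose region entropy reaches half the cell average,
    mirroring the threshold T_i = Y_i / 2 described above."""
    entropies = np.array([region_entropy(p) for p in patches])
    threshold = entropies.mean() / 2.0       # T_i = Y_i / 2 for this grid cell
    keep = entropies >= threshold
    return [f for f, k in zip(features, keep) if k]
```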

The pseudo-code of the IHesAffNet feature extraction is outlined in Algorithm 1.

Begin
(1) Divide the image into regular grid cells.
(2) For one grid cell, extract Hessian points and compute the average information entropy Y_i of the cell by Equation (1).
(3) Set the threshold T_i to Y_i/2, and remove all the Hessian points whose entropy is lower than T_i.
(4) Go to Steps (2) and (3) until all the grid cells are processed. Then, save the Hessian points.
(5) For one Hessian point, use AffNet6 to extract the affine invariant region, until all the Hessian points are processed. Then, save the Hessian affine invariant regions.
(6) Select the stable regions by the dual criteria (W + H)/τ_1 ≤ (φ + ϑ)/2 ≤ (W + H)/τ_2 and φ/ϑ ≤ e_T, where φ and ϑ denote the two axes of the affine region, and W and H are the image width and height.
End

In order to verify the superiority of IHesAffNet over the original AffNet, we conducted comparison tests on numerous pairs of large baseline oblique stereo images. The first aim is to improve the distribution quality of features in image space; the second aim is to increase the repeatability score of the detection. Therefore, the first criterion is the distribution quality (MDQ), detailed in [37,38], which can be calculated by Equation (2), where m denotes the total number of Delaunay triangles generated from the feature points with a recursive strategy, and E_i and max(θ_i) respectively represent the area and the maximum angle (in radians) of the ith triangle; a lower MDQ value indicates a better distribution of features. Normally, more matched features indicate a higher repeatability score, so we use the number of matched features as the second criterion. For a fair assessment, both AffNet and IHesAffNet employ the SIFT descriptor and the NNDR metric to generate matched features. Due to limited space, we only present the comparison results on the Graf1-6 dataset in Figure 3 and Table 2. The results show that our IHesAffNet produces more well-distributed features with higher repeatability than the original AffNet.

Figure 3. Comparison between the AffNet and IHesAffNet on the Graf1-6 stereo image dataset. (a,c) are the detection results of AffNet and IHesAffNet, respectively; (b,d) are the verification tests based on the detections in (a,c), respectively. The yellow and cyan ellipses respectively denote the detected and matched features. Note that the bottom row, which corresponds to our approach, is better.
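The per-triangle quantities used by the MDQ criterion can be computed as sketched below. Since Equation (2) is not reproduced above, the final aggregation (here a mean of E_i · max(θ_i)) is only an assumed stand-in; only the per-triangle area and maximum angle follow the text.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangle_terms(points: np.ndarray):
    """Yield (E_i, max(theta_i)) for each Delaunay triangle of the points."""
    tri = Delaunay(points)
    for simplex in tri.simplices:
        a, b, c = points[simplex]
        ab, ac = b - a, c - a
        area = 0.5 * abs(ab[0] * ac[1] - ab[1] * ac[0])
        la = np.linalg.norm(b - c)   # side lengths opposite each vertex
        lb = np.linalg.norm(a - c)
        lc = np.linalg.norm(a - b)
        # interior angles via the law of cosines
        angles = [np.arccos(np.clip((lb**2 + lc**2 - la**2) / (2 * lb * lc), -1, 1)),
                  np.arccos(np.clip((la**2 + lc**2 - lb**2) / (2 * la * lc), -1, 1)),
                  np.arccos(np.clip((la**2 + lb**2 - lc**2) / (2 * la * lb), -1, 1))]
        yield area, max(angles)

def mdq(points: np.ndarray) -> float:
    """Assumed aggregation: mean of E_i * max(theta_i); Equation (2) may differ."""
    terms = np.array(list(triangle_terms(points)))
    return float((terms[:, 0] * terms[:, 1]).mean())
```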

Descriptor Generating by MTHardNets
In addition to the feature detection discussed above, descriptor extraction is another key factor for obtaining a sufficiently large number of correct matches from stereo images. The experiments in [24] have shown that HardNet possesses the most reasonable network architecture and outperforms existing deep learning-based descriptors. However, the sampling strategy adopted in HardNet only pulls the descriptor of one negative patch away from the reference and positive patches in the feature space, and it thus yields negative-patch descriptors with very close distances when there are extensive repetitive textures across stereo images. Therefore, to effectively avoid matching ambiguities, we design a multiple hard network (MTHardNets) scheme with the K nearest negative samples and an empirical weighted loss function (EWLF), as illustrated in Figure 4.
In the proposed network architecture, a batch of matching local patches τ = (r_i, p_i), i = 1, ..., m, is generated, where m is the number of samples in the batch, and r_i and p_i stand for the reference and positive patches, respectively. A sequence of closest non-matching patches $n_i^{1st}, n_i^{2nd}, \cdots, n_i^{Kth}$ is selected for the current matching pair (r_i, p_i), where the superscripts 1st, 2nd and Kth respectively denote the first, second and Kth closest distances from (r_i, p_i). The current (K + 2) image patches are passed in parallel through the same HardNet and transformed into 128-D unit descriptors. The distance D_1 between the matching descriptors R_i and P_i is calculated by Equation (3).

Figure 4. Schematic of the proposed MTHardNets. The (K + 2) patches are fed in parallel into the same HardNet, which outputs the matching descriptors (R_i, P_i) and the non-matching descriptors $N_i^{1st}, \cdots, N_i^{Kth}$; L_u and w_u represent the loss and weight, respectively, and the EWLF can be expressed by L as given in Equations (6) and (7).
Similarly, the distances D_2 and D_3 between non-matching descriptors can also be computed. The purpose of the multiple networks is to make the distance D_1 as small as possible and simultaneously push the distances D_2 and D_3 as far as possible, emphasizing in-class similarity and across-class discrimination.
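A sketch of the distance computations used in this subsection, under the usual HardNet conventions (L2-normalized descriptors, so the squared distance is 2 − 2 times the dot product), together with the K nearest negative mining described next; the function names are ours, not the authors'.

```python
import torch

def distance_matrix(R: torch.Tensor, P: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pairwise Euclidean distances between two batches of unit descriptors.
    Entry (i, j) = ||R_i - P_j||; the diagonal holds the matching distances D_1."""
    gram = R @ P.t()                                   # (m, m) dot products
    return (2.0 - 2.0 * gram).clamp_min(eps).sqrt()

def k_nearest_negatives(D: torch.Tensor, K: int) -> torch.Tensor:
    """For each matching pair (R_i, P_i), return the distances to its K closest
    non-matching descriptors, searched along both row i and column i of D."""
    m = D.size(0)
    mask = torch.eye(m, dtype=torch.bool, device=D.device)
    big = D.max() + 1.0
    row = D.masked_fill(mask, big)                     # D(R_i, P_j), j != i
    col = D.t().masked_fill(mask, big)                 # D(R_j, P_i), j != i
    both = torch.cat([row, col], dim=1)                # candidate negatives
    return both.topk(K, dim=1, largest=False).values   # (m, K), sorted ascending
```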
The proposed K nearest negative sampling strategy with EWLF proceeds by first estimating a distance matrix D, built from the Equation (3) distances between the m pairs of corresponding descriptors (R_i, P_i) output by HardNet, as given in Equation (4). This matrix provides the structure to select the K nearest negative descriptors. For one pair of corresponding descriptors (R_i, P_i), the respective distance arrays $D_{R_i P}$ and $D_{R P_i}$ are computed, and their values provide the K nearest negative descriptors $N_i^{1st}, \cdots, N_i^{Kth}$. The selection considers all K nearest negative descriptors and generates the set S as given in Equation (5).

Based on the K nearest negative descriptors, the empirical weighted loss function (EWLF) is generated using Equation (6), where $D\big((R_i, P_i), N_i^{Kth}\big)$ is the distance between the non-matching and matching descriptors and can be computed by

$$D\big((R_i, P_i), N_i^{Kth}\big) = \min\big(D(R_i, N_i^{Kth}),\; D(P_i, N_i^{Kth})\big).$$

Finally, the EWLF model is established as Equation (7), where $w_u$ represents the empirical weight and $\sum_{u=1}^{K} w_u = 1$.

The key task of the EWLF model is to enhance the discrimination of deep learning descriptors among repetitive patterns. In the following, we empirically determine acceptable parameters for the EWLF model based on extensive tests. Both theory and experiments demonstrate that an increase in K improves the discriminability of the descriptor for a large batch size. However, considering the limited GPU memory, we set K to 3 for the EWLF calculations, which simplifies the weight group set to

$$\{(w_1, w_2, w_3) \mid w_1 \geq 0.50,\; w_1 > w_2 > w_3 > 0,\; w_1 + w_2 + w_3 = 1\}.$$

Applying spatially uniform sampling, we obtain 564 groups of (w_1, w_2, w_3), and the descriptor discrimination of each weight group can be computed by Equation (8), whose terms are the average distances of matching and non-matching descriptors. MDD is the metric of descriptor discrimination: the higher the MDD value, the better the discrimination of the descriptor. For each weight group, we train the MTHardNets based on the EWLF and compute the MDD. The statistical and comparative result over all weight groups is presented in Figure 5. According to Figure 5, weight group number 366, namely (0.68, 0.22, 0.10), achieves the highest MDD. Therefore, Equation (7) can be specifically written as Equation (9). In this article, the EWLF is obtained based on extensive tests, which is also the reason we refer to it as the empirical weighted loss function (EWLF).
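To make Equations (6), (7) and (9) concrete, the following sketch evaluates an EWLF with K = 3 and the reported weights (0.68, 0.22, 0.10). The triplet-margin form of each term L_u (margin 1.0, as in HardNet) is our assumption; the paper's exact per-term loss is not reproduced above.

```python
import torch

def ewlf_loss(D1: torch.Tensor, neg: torch.Tensor,
              weights=(0.68, 0.22, 0.10), margin: float = 1.0) -> torch.Tensor:
    """Empirical weighted loss over the K nearest negatives.
    D1:  (m,) distances between matching descriptors R_i and P_i,
         e.g. the diagonal of distance_matrix() above.
    neg: (m, K) distances to the K nearest non-matching descriptors,
         e.g. from k_nearest_negatives() above.
    Assumes a HardNet-style triplet margin loss for each term L_u."""
    w = torch.tensor(weights, device=D1.device)
    losses = torch.clamp(margin + D1.unsqueeze(1) - neg, min=0.0)  # (m, K)
    return (losses * w).sum(dim=1).mean()
```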
The pseudo-code of the MTHardNets descriptor is outlined in Algorithm 2.
Begin
(1) Given τ = (r_i, p_i), select the K closest non-matching patches $n_i^{1st}, n_i^{2nd}, \cdots, n_i^{Kth}$ for τ.
(2) Pass the (K + 2) patches through HardNet to obtain the descriptors (R_i, P_i) and $N_i^{1st}, \cdots, N_i^{Kth}$.
(3) Estimate a distance matrix D by Equation (4), then generate S by Equation (5).
(4) Generate the EWLF using S and Equation (6), and build the EWLF model by Equation (7).
(5) For each weight group, train the MTHardNets and compute the MDD by Equation (8).
(6) Use the highest MDD to simplify the EWLF model as Equation (9).
End

To further verify the superiority of our MTHardNets descriptor, we compare our approach with HardNet on five pairs of matching points. The results of this experiment are shown in Figure 6, which reveals that the proposed MTHardNets estimates smaller distances between matching descriptors and larger distances between non-matching descriptors. In other words, the proposed MTHardNets descriptor has better discrimination than HardNet.

Figure 6. The discrimination comparison between the proposed MTHardNets and HardNet descriptors using five pairs of matching points. PDD is the abbreviation for positive descriptor distance and NDD for negative descriptor distance.
In our pipeline, the IHesAffNet features are used as input to the MTHardNets, which returns 128-dimensional descriptors. Following this step, the corresponding features are obtained using the NNDR metric, and random sample consensus (RANSAC) is applied to remove outliers that do not satisfy the underlying geometric relation between the images. Let τ = (x, A) and τ′ = (x′, A′) represent an arbitrary pair of matching features, where x and x′ denote the centroids of two corresponding affine invariant regions E(x) and E′(x′); A and A′ are the second moment affine matrices learned by IHesAffNet, and they respectively determine E(x) and E′(x′). Thus, the geometric transformation between E(x) and E′(x′) can be expressed as Equation (10).
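A minimal sketch of this matching stage, assuming descriptors stored as numpy arrays and OpenCV's RANSAC-based fundamental matrix estimation for outlier rejection; the 0.8 ratio threshold is illustrative, not the paper's tuned value.

```python
import numpy as np
import cv2

def nndr_match(desc1: np.ndarray, desc2: np.ndarray, ratio: float = 0.8):
    """Nearest neighbor distance ratio test between two descriptor sets."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)  # (n1, n2)
    order = np.argsort(d, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    idx = np.arange(len(desc1))
    keep = d[idx, nearest] < ratio * d[idx, second]
    return np.flatnonzero(keep), nearest[keep]

def ransac_filter(pts1: np.ndarray, pts2: np.ndarray, thresh: float = 1.5):
    """Reject matches inconsistent with the epipolar geometry via RANSAC."""
    F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, thresh, 0.999)
    return inliers.ravel().astype(bool)
```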

Match Optimizing by DLT-LSM
The feature points matched across images by the above deep learning pipeline can only achieve pixel-level accuracy, because feature detection and matching are carried out independently. In order to mitigate this shortcoming, we employ a deep learning transform based least square matching (DLT-LSM) strategy. Let the neighborhood centered at a feature point x contain (2λ + 1) × (2λ + 1) pixels, and suppose that the geometric deformation H between the corresponding small neighborhoods of x and x′ can be well represented by Equation (11), where $B = \begin{bmatrix} b_1 & b_2 \\ b_4 & b_5 \end{bmatrix}$ represents the affine deformation and $T = \begin{bmatrix} b_3 & b_6 \end{bmatrix}^T$ is the translation. We write a pair of correlation windows Ω and Ω′, respectively centered at x and x′, as in Equation (12), where I and I′ are the intensity values of the left and right images. Using these definitions, the correlation coefficient ρ between Ω and Ω′ can be calculated by

$$\rho = \frac{\sum_{i,j}\big(\Omega_{ij} - \mu(\Omega)\big)\big(\Omega'_{ij} - \mu(\Omega')\big)}{\sqrt{\sum_{i,j}\big(\Omega_{ij} - \mu(\Omega)\big)^2 \sum_{i,j}\big(\Omega'_{ij} - \mu(\Omega')\big)^2}}, \qquad (13)$$

where µ(Ω) and µ(Ω′) are the mean pixel values of the corresponding windows; the pixel value Ω_ij is directly obtained from the left image, and the pixel value Ω′_ij is produced by local bilinear interpolation of the right image, which is a good trade-off between efficiency and accuracy. According to Equation (12) and introducing the linear radiometric distortion parameters h_0 and h_1, we establish the affine transform model based LSM equation, which can be further linearized into the LSM error equation (14), where $X = \begin{bmatrix} dh_0 & dh_1 & db_1 & \cdots & db_6 \end{bmatrix}^T$, $C = \begin{bmatrix} 1 & g' & g'_u & ug'_u & vg'_u & g'_v & ug'_v & vg'_v \end{bmatrix}$, g′ is the pixel value of x′, and g′_u and g′_v represent the gradients in the horizontal and vertical directions, respectively. The LSM error equation is applied to all pixels within the neighborhoods of x and x′. By minimizing this cost, the error correction vector X, which includes eight parameter corrections, can be iteratively estimated. Compared with the affine distortion, the radiometric distortion and translation deformation are small, so we may set their initial values as $h_0^0 = 0$, $h_1^0 = 1$ and $b_3^0 = b_6^0 = 0$. Then, the initial affine deformation matrix B can be set according to the deep learning transform (DLT) of Equation (10), namely $B^0 = A'A^{-1}$. The DLT-LSM procedure is described in Algorithm 3.

Begin
(1) For one correspondence x and x′, determine the initial affine transform B by Equation (10). Initialize H using B. Set the maximum number of iterations to N_T.
(2) Resample the correlation window Ω′ by bilinear interpolation according to the current H, and compute the correlation coefficient ρ by Equation (13).
(3) Build the LSM error equation as Equation (14), then compute X and update H. If the number of iterations N is less than N_T, go to Step (2); otherwise, correct x′ by Equation (15).
End

Using this formulation, the matching regions around the feature points can be optimized by DLT-LSM, provided it is supplied with good initial values for the distortion parameters. Given one pair of corresponding features, the affine transform matrix H is initialized by the DLT strategy and then iteratively updated. Furthermore, the original matching point x′ can be precisely compensated by Equation (15). The parameter λ, which specifies the neighborhood around x, is set to 25 pixels, and a maximum of ten iterations is used in our DLT-LSM optimization. Two randomly selected pairs of conjugate neighborhoods, whose affine parameters have been obtained by our deep learning method, are presented in Table 3. The table lists the affine transform error ε, which is estimated with respect to the ground truth affine matrix H_0 following Equation (16). Table 3 shows that the DLT-LSM iteration converges very rapidly for two different image patches with significant geometric and radiometric deformations. It further indicates that our DLT strategy provides good initial affine parameter values for DLT-LSM, and thus the corresponding feature points are optimized to subpixel accuracy.
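The DLT-LSM iteration can be sketched as follows, assuming Equation (14) takes the standard eight-parameter LSM form; bounds checking and the convergence test on ρ are omitted, the geometric design-matrix columns use the approximation h_1 ≈ 1, and compensating x′ by the converged translation part is our reading of Equation (15).

```python
import numpy as np

def bilinear(img, u, v):
    """Bilinear interpolation of img at float coordinates (u = column, v = row)."""
    u0 = np.floor(u).astype(int); v0 = np.floor(v).astype(int)
    du = u - u0; dv = v - v0
    return ((1 - du) * (1 - dv) * img[v0, u0] + du * (1 - dv) * img[v0, u0 + 1] +
            (1 - du) * dv * img[v0 + 1, u0] + du * dv * img[v0 + 1, u0 + 1])

def dlt_lsm(left, right, x, xp, B0, lam=25, n_iter=10):
    """Eight-parameter LSM around one correspondence (x, x'), initialized with
    h0 = 0, h1 = 1, b3 = b6 = 0 and the deep learning transform B0 = A' A^{-1}."""
    left = left.astype(float); right = right.astype(float)
    # parameter vector p = [h0, h1, b1, b2, b3, b4, b5, b6]
    p = np.array([0.0, 1.0, B0[0, 0], B0[0, 1], 0.0, B0[1, 0], B0[1, 1], 0.0])
    u, v = np.meshgrid(np.arange(-lam, lam + 1.0), np.arange(-lam, lam + 1.0))
    g = left[(x[1] + v).astype(int), (x[0] + u).astype(int)]   # template window
    for _ in range(n_iter):
        h0, h1, b1, b2, b3, b4, b5, b6 = p
        uq = xp[0] + b1 * u + b2 * v + b3        # warped sampling positions
        vq = xp[1] + b4 * u + b5 * v + b6
        gp = bilinear(right, uq, vq)
        gu = (bilinear(right, uq + 1, vq) - bilinear(right, uq - 1, vq)) / 2.0
        gv = (bilinear(right, uq, vq + 1) - bilinear(right, uq, vq - 1)) / 2.0
        # design matrix columns ordered to match p (h1 ~ 1 for geometric terms)
        C = np.stack([np.ones_like(gp), gp, u * gu, v * gu, gu,
                      u * gv, v * gv, gv], axis=-1).reshape(-1, 8)
        r = (g - (h0 + h1 * gp)).reshape(-1)     # residuals of the LSM model
        dX, *_ = np.linalg.lstsq(C, r, rcond=None)
        p = p + dX                               # correction vector X
    # compensate x' by the converged translation part (b3, b6)
    return np.array([xp[0] + p[4], xp[1] + p[7]])
```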

Training Dataset and Implementation Details
We used the open dataset UBC Phototour [39] for training. It includes six subsets: Liberty, Liberty_harris, Notredame, Notredame_harris, Yosemite and Yosemite_harris, and there are 2 × 400 k normalized 64 × 64 patches in each subset. All of the matching patches are verified by a 3D reconstruction model. Both the IHesAffNet and the MTHardNets models are implemented using the PyTorch library [40]. In IHesAffNet training, there are 1024 triplet samples in a batch, and all image patches are resized to 32 × 32 pixels. Optimization is done by stochastic gradient descent with a learning rate of 0.005, momentum of 0.9 and weight decay of 0.0001, and the model is trained for 20 epochs on an RTX 2080Ti GPU for every training subset. MTHardNets training is similar to that of IHesAffNet, but we use the proposed EWLF to train the model with a learning rate of 10, and 1024 quintuple samples are prepared in a batch. Additionally, data augmentation is applied in both trainings to prevent over-fitting.
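A minimal sketch of the IHesAffNet optimizer setup described above (SGD, learning rate 0.005, momentum 0.9, weight decay 0.0001, batches of 1024 samples at 32 × 32); the random batches and TripletMarginLoss are placeholders for the real UBC Phototour loader and the paper's loss, and AffNetSketch refers to the earlier sketch.

```python
import torch

model = AffNetSketch()  # from the earlier sketch; any net with .parameters() works
criterion = torch.nn.TripletMarginLoss(margin=1.0)     # stand-in for the paper's loss
opt = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=1e-4)

for epoch in range(20):                                # 20 epochs per subset
    for _ in range(400_000 // 1024):                   # ~one pass over a 400k subset
        # Placeholder batch: in the real pipeline these are 1024 triplets of
        # 32 x 32 UBC Phototour patches with data augmentation applied.
        a, p, n = (torch.randn(1024, 1, 32, 32) for _ in range(3))
        opt.zero_grad()
        loss = criterion(model(a), model(p), model(n))
        loss.backward()
        opt.step()
```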

Test Data and Evaluation Criteria
In order to test the performance of the proposed method, we selected six groups (A-F) of stereo images with large differences in viewpoints. Image pairs A-C are large baseline images acquired from the ground, and D-F are large baseline oblique images taken from an unmanned aerial vehicle (UAV). All groups of image data have severe geometric and radiometric distortions and cover various scenes with poor or repetitive textures. The image-to-image affine transform matrix $H_0$ was estimated by the strategy presented in [41] for each group of stereo images and used as the ground truth. Additionally, five indexes were chosen to comprehensively evaluate the effectiveness of each method (a short sketch of computing indexes 1-3 and 5 follows this list):
1. Number of correct correspondences $n_{\varepsilon_0}$ (in pairs): The matching error is calculated using $H_0$ and Equation (16). If the error of a corresponding point is less than the given threshold $\varepsilon_0$ (1.5 pixels in our experiment), it is regarded as an inlier and used to compute $n_{\varepsilon_0}$.
2. Correct ratio β (in percentage, %) of matches: It is computed by $\beta = n_{\varepsilon_0}/num$, where $num$ (in pairs) is the total number of matches.
3. Root mean square error $\varepsilon_{RMSE}$ (in pixels) of matches: It is calculated by $\varepsilon_{RMSE} = \sqrt{\frac{1}{n_{\varepsilon_0}} \sum_{i=1}^{n_{\varepsilon_0}} \varepsilon_i^2}$, where $\varepsilon_i$ is the matching error of the $i$-th correct correspondence under $H_0$.
4. Matching distribution quality MDQ: A lower MDQ value indicates better geometric homogeneity of the Delaunay triangles generated from the matched points; thus, MDQ serves as a metric of the spatial uniformity of matches. This index is estimated by Equation (2).
5. Matching efficiency η (in seconds per pair of matching points): We compute η from the average run time for one pair of corresponding points, namely $\eta = t/num$, where $t$ (in seconds) denotes the total test time of the algorithm.
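The sketch below shows how indexes 1-3 and 5 can be computed from a set of putative matches, assuming $H_0$ is given in homogeneous 3 × 3 form and that $\varepsilon_{RMSE}$ is taken over the correct matches; MDQ is omitted because it depends on Equation (2), which is defined elsewhere. The function name and argument layout are illustrative.

```python
import numpy as np

def evaluate_matches(pts_l, pts_r, H0, t_total, eps0=1.5):
    """Compute four of the five indexes from matched points.
    pts_l, pts_r: (num, 2) arrays of corresponding points; H0: 3x3
    ground-truth affine matrix (last row [0, 0, 1], so no division
    by the homogeneous coordinate is needed)."""
    num = len(pts_l)
    ones = np.ones((num, 1))
    proj = (np.hstack([pts_l, ones]) @ H0.T)[:, :2]   # left points mapped by H0
    err = np.linalg.norm(proj - pts_r, axis=1)        # per-match error (pixels)

    inlier = err < eps0
    n_eps0 = int(inlier.sum())                        # correct correspondences
    beta = 100.0 * n_eps0 / num                       # correct ratio (%)
    rmse = (float(np.sqrt(np.mean(err[inlier] ** 2)))  # RMSE over correct matches
            if n_eps0 else float("nan"))
    eta = t_total / num                               # seconds per match pair
    return n_eps0, beta, rmse, eta
```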

Experimental Results of the Key Steps of the Proposed Method
Considering that the proposed approach includes three key steps, namely feature extraction, descriptor generation, and matching, we conduct comparison experiments to check the validity of each key step using six different methods: AMD, ISD, IPD, IHD, IMN, and the proposed method.
For intuitive understanding, we present each key step of the six methods in Table 4, where "Null" denotes that the current step is not applicable. For a fair and objective evaluation, we exploit a unified strategy to reject inconsistent matches. Table 5 records the number of correct correspondences achieved by the six methods, and Table 6 reports their total test time. Comparative results of the correct ratio (%) and root mean square error (pixels) of matches are displayed in Figure 7 to inspect the effectiveness and accuracy of these methods.

Experimental Results of Comparison Methods
We further verify the performance of the proposed method by contrasting four methods on oblique image matching (a minimal sketch of the NNDR test used by several of these methods follows the list):
1. The proposed method.
2. Detone's method [26]: This approach uses a fully convolutional neural network (MagicPoint) trained on an extensive synthetic dataset, which poses a liability in real scenarios. The homographic adaptation (HA) strategy is employed to transform MagicPoint into SuperPoint, which boosts the performance of the detector and generates repeatable feature points. This method also combines SuperPoint with a descriptor subnetwork that generates 256-dimensional descriptors. Matching is achieved using the NNDR metric. While the use of HA outperforms classical detectors, the random nature of the HA step limits the invariance of this technique to geometric deformations.
3. Morel's method [10]: This method samples stereo images by simulating discrete poses in 3D affine space. It applies the SIFT algorithm to the simulated image pairs and transforms all matches back to the original image pair. This method was shown to find correspondences in image pairs with large viewpoint changes. However, false positives often occur for repeating patterns.
4. Matas's method [12]: The approach extracts features using MSER and estimates SIFT descriptors after normalizing the feature points. It uses the NNDR metric to obtain matching features.
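Since the matching step in several of the compared pipelines accepts a correspondence via the nearest neighbor distance ratio (NNDR) test on Euclidean descriptor distances, a minimal sketch of that test is given below. The 0.8 ratio threshold is a common illustrative value, not necessarily the setting used in any of the cited papers.

```python
import numpy as np

def nndr_match(desc_l, desc_r, ratio=0.8):
    """Nearest-neighbor distance ratio matching: accept a left descriptor's
    nearest right neighbor only if it is clearly better than the second
    nearest. desc_l, desc_r: (n, d) arrays of L2-comparable descriptors."""
    matches = []
    for i, d in enumerate(desc_l):
        dists = np.linalg.norm(desc_r - d, axis=1)  # Euclidean distances
        j1, j2 = np.argsort(dists)[:2]              # two nearest neighbors
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, j1))
    return matches
```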
In our comparisons, for a fair and objective evaluation, a unified strategy was employed for all four methods to eliminate controversial mismatches. The results are organized as separate figures for each technique. In Figure 8, we provide additional visualization of the feature matching results of our method, where the corresponding features and the regions around them are denoted by cyan ellipses and lines. Figures 9-12 show the final matching results of the proposed, Detone's, Morel's, and Matas's methods, respectively, where the red points are matches and the yellow lines are the estimated epipolar lines. To visually check the matching accuracy of all four methods, we superimpose the stereo images based on the affine transform and present the chessboard registration of image pair A for the four methods in Figure 13. In Figure 14, we show the contrast of matching error before and after DLT-LSM. Based on the four methods, Table 7 presents the quantitative comparison results, including the number of correct matches, correct ratio of matches, matching error, matching distribution quality, matching runtime, and matching efficiency. Considering that training and testing are independent of each other in deep learning, we only count the runtime of testing for our method and Detone's method in Table 7.

Table 7. Quantitative comparison of four methods based on six groups of image pairs. In this table, $n_{\varepsilon_0}$ is the number of correct matches (pairs), β is the correct ratio (%) of matches, $\varepsilon_{RMSE}$ is the matching error (pixels), MDQ is the matching distribution quality, t is the total test time (seconds), and η is the matching efficiency (seconds per pair of matching points). The best score of each index is displayed in bold.

Discussion on Experimental Results of the Key Steps of the Proposed Method
Our overall goal is to automatically produce adequate matches at a subpixel level from large baseline oblique stereo images. The proposed method reaches this goal through three key steps. First, the proposed IHesAffNet extracts abundant and well-distributed affine invariant regions, laying a good foundation not only for the feature description but also for the DLT-LSM. Second, we design the MTHardNets descriptor based on HardNet, but our improved version has better discrimination than HardNet. Third, each correspondence is further optimized by the DLT-LSM iteration, which is the key to achieving subpixel matching. As can be observed from Table 5, the proposed method outperforms all other methods, including AMD and IHD. This is because we apply IHesAffNet in the feature detection stage instead of AffNet. The quality of the feature descriptor also significantly affects the matching performance. Table 5 also shows that ISD produces the fewest correct correspondences among the six compared methods. This is attributed to the fact that ISD adopts the SIFT method to extract descriptors, while the others generate deep learning-based descriptors; from this observation we can conjecture that the deep learning strategy generally provides better descriptors than the handcrafted SIFT method. The comparison between IHD and the proposed method in Table 5 reveals that our MTHardNets outperforms HardNet in producing correct matches. Table 6 reveals that ISD takes the least time among the six methods; this is because a deep learning-based descriptor costs more time than the handcrafted SIFT descriptor. Figure 7 verifies that IMN, which omits the DLT-LSM step, yields errors above 1.5 pixels and has the lowest correct ratio of matches. This means that our DLT-LSM effectively decreases the matching error of the deep learning pipeline from pixel level to subpixel level. In short, according to the comparison of the key steps, our proposed method obtains the largest number of correct matches and the highest correct ratio while simultaneously achieving the best matching accuracy, which can be attributed to the fact that we adopt a relatively effective strategy in each stage of the proposed pipeline.

Discussion on Experimental Results of Comparison Methods
A large number of correct corresponding features with good spatial distribution can be achieved by the proposed deep learning affine invariant feature matching. In the six groups of large baseline oblique stereo images with repetitive patterns, the numbers of correct corresponding deep learning features are 165, 404, 1504, 545, 1233, and 710, respectively (see Figure 8). These matches therefore lay a good foundation for estimating the geometric transform between the two images. However, the positional errors of these correspondences before applying the DLT-LSM operation are generally more than one pixel (see Figure 14), so it is difficult to produce subpixel matches by the deep learning pipeline alone. Table 7 shows that our matching accuracy on the six groups of image pairs is at a subpixel level. As a result, the subpixel accuracy provides better registration between matching features (see Figure 13a). Figure 14 depicts that the feature matching error can be effectively compensated by our DLT-LSM iteration in spite of severe distortion between corresponding neighborhoods. This is because our deep learning feature correspondences provide a good initial transform for the DLT-LSM step. However, there are still a few feature-point matches for which DLT-LSM does not work (see Figure 14). Our investigation shows that the main reason is that these outliers are often located in image regions with poor texture.
The proposed method obtains a sufficient number of conjugate points with the best spatial distribution among the four methods. By visual inspection of Figure 9, our method yields more evenly distributed results than the other three methods. According to the quantitative comparison given in Table 7, our method has an advantage in terms of matching distribution quality. This is because we have integrated IHesAffNet with MTHardNets, which contributes to a better matching distribution. The loss function used in Detone's method only considers the distance between positive samples, which limits the discrimination of the deep learning descriptors and results in poorly distributed feature matches (see Figure 10). Table 7 shows that the proposed method is superior in terms of matching correct ratio. Moreover, the detailed view in Figure 13 reveals that our registration is more precise than that of the other three approaches. Table 7 also shows that Detone's method fails to obtain correct matches from image pairs C, D, E, and F. The main reason may be that the homographic adaptation of SuperPoint is random, which limits its invariance to diverse geometric deformations.
Observing the matching efficiency in Table 7, our proposed method is more efficient than Detone's. The main reason is that our method stably gains a large number of matches, while Detone's method almost fails to obtain matches from large oblique stereo images. However, the matching efficiency of our method and Detone's is not as good as that of the handcrafted methods of Morel and Matas. This is because the deep learning methods involve numerous convolution operations in the process of feature detection and description.
According to the aforementioned qualitative and quantitative results, the proposed approach is superior in terms of the number of correct matches, correct ratio of matches, matching accuracy, and distribution quality. The contribution of our method includes three aspects. First, the proposed IHesAffNet can detect more well-distributed features with higher repeatability than the original AffNet. Second, the proposed MTHardNets can generate descriptors with higher discrimination than HardNet, especially for poorly textured image regions. Third, the advanced DLT-LSM can significantly improve the accuracy of corresponding points. In summary, our method is effective and stable for large baseline oblique stereo image matching.

Conclusions
In this paper, we presented a novel and effective end-to-end feature matching pipeline for large baseline oblique stereo images with complex geometric and radiometric distortions as well as repetitive patterns. The proposed pipeline introduces IHesAffNet, which can extract affine invariant features with good distribution and repeatability. The output of this network is fed to the proposed MTHardNets, which generates highly discriminative descriptors that increase the accuracy of stereo feature matching. Furthermore, the proposed approach features a DLT-LSM-based iterative step that compensates for the position errors of feature points. As a result, it can obtain a sufficient number of subpixel-level matches with uniform spatial distribution. The qualitative and quantitative comparisons on different large baseline oblique images verify that our method outperforms the state-of-the-art methods for oblique image matching. Future research may include developing deep learning strategies for multiple image primitives, such as corners, edges, and regions. Moreover, the affine invariant matching approach can be extended from planar scenes to large baseline oblique 3D scenes. Meanwhile, the proposed end-to-end CNN can be extended from grayscale images to color images for wider application.