Review of Wide-Baseline Stereo Image Matching Based on Deep Learning

Abstract: Strong geometric and radiometric distortions often exist in optical wide-baseline stereo images, and some local regions can include surface discontinuities and occlusions. Digital photogrammetry and computer vision researchers have focused on automatic matching for such images. Deep convolutional neural networks, which can express high-level features and their correlation, have received increasing attention for the task of wide-baseline image matching, and learning-based methods have the potential to surpass methods based on handcrafted features. Therefore, we focus on the evolving research on wide-baseline image matching and review the main approaches of learning-based feature detection, description, and end-to-end image matching. Moreover, we summarize the current representative research using stepwise inspection and dissection. We present the results of comprehensive experiments on actual wide-baseline stereo images, which we use to contrast and discuss the advantages and disadvantages of several state-of-the-art deep-learning algorithms. Finally, we conclude with a description of the state-of-the-art methods and forecast development trends along with unresolved challenges, providing a guide for future work.


Introduction
Wide-baseline image matching is the process of automatically extracting corresponding features from stereo images with substantial changes in viewpoint. It is the key technology for reconstructing realistic three-dimensional (3D) models [1][2][3] based on two-dimensional (2D) images [4][5][6]. Wide-baseline stereo images provide rich spectral, real texture, shape, and context information for detailed 3D reconstruction. Moreover, they have advantages with respect to spatial geometric configuration and 3D reconstruction accuracy [7]. However, because of the significant change in image viewpoint, there are complex distortions and missing content between corresponding objects in regard to scale, azimuth, surface brightness, and neighborhood information, which make image matching very challenging [8]. Hence, many scholars in the fields of digital photogrammetry and computer vision have intensely explored the deep-rooted perception mechanism [9] for wide-baseline images and have successively proposed many classic image-matching algorithms [10].
Based on the recognition mechanism, existing wide-baseline image-matching methods can be divided into two categories [11][12][13]: Handcrafted matching and deep-learning matching. Inspired by professional knowledge and intuitive experience, several researchers have proposed handcrafted matching methods that can be implemented by intuitive computational models and their empirical parameters according to the image-matching task [14][15][16][17][18]. This category of methods is also referred to as traditional matching, the classical representative of which is the scale invariant feature transform (SIFT) algorithm [14]. Traditional matching has many problems [15][16][17][18], such as low repeatability in wide-baseline feature extraction and limited reliability of the feature descriptors and matching measures. Using multi-level convolutional neural network (CNN) architectures, learning-based methods perform iterative optimization by back-propagation and model parameter learning from a large amount of annotated matching data to develop a trained image-matching CNN model [19]. A representative deep-learning model in this category is MatchNet [20]. Methods in this category offer a different approach to solving the problem of wide-baseline image matching, but they are currently limited by the number and scope of training samples, and it is difficult to learn the optimal model parameters that are suitable for practical applications [21][22][23][24][25].
Learning-based image matching is essentially a method that is driven by prior knowledge. In contrast to the traditional handcrafted methods, it can avoid the need for many manual interventions with respect to feature detection [26], feature description [27], model design [28], and network parameter assignment [29]. Moreover, it can adaptively learn the deep representation and correlation of the topographic features directly from large-scale sample data. According to the scheme used for model training, the wide-baseline matching methods can be further divided into two types [30]: (1) multi-stage step-by-step training [31] and (2) end-to-end training [32]. The former focuses on the concrete issues of each stage, such as feature detection, neighborhood direction estimation, and descriptor construction, and it can be freely integrated with handcrafted methods [33]; whereas the latter considers the multiple stages of feature extraction, description, and matching as a whole and achieves the global optimum by jointly training the various matching stages [34]. In recent years, with the growth of training datasets and the introduction of transfer learning [35], deep-learning-based image matching has been able to perform most wide-baseline image-matching tasks [36], and its performance can, in some cases, surpass that of traditional handcrafted algorithms. However, the existing methods still need to be further studied in terms of network structure [37], loss function [38], matching metric [39], and generalization ability [40], especially for typical image-matching problems such as large viewpoint changes [41], surface discontinuities [42], terrain occlusion [43], shadows [44], and repetitive patterns [45][46][47].
On the basis of a review of the image-matching process, we incrementally organize, analyze, and summarize the characteristics of proposed methods in the existing research, including the essence of the methods as well as their advantages and disadvantages. Then, the classical deep-learning models are trained and tested on numerous public datasets and wide-baseline stereo images. Furthermore, we compare and evaluate the state-of-the-art methods and determine their unsolved challenges. Finally, possible future trends in the key techniques are discussed. We hope that research into wide-baseline image matching will be stimulated by the review work of this article.
The main contributions of this article are summarized as follows. First, we conduct a complete review of learning-based matching methods, from feature detection to end-to-end matching, covering the essence, merits, and defects of each method for wide-baseline images. Second, we construct various combined methods to evaluate the representative modules fairly and uniformly by using numerous qualitative and quantitative tests. Third, we reveal the root causes of the difficulty in producing high-quality matches across wide-baseline stereo images and present some feasible solutions for future work.
In Section 2, this article reviews the most popular learning-based matching methods, including the feature detection, feature description, and end-to-end strategies. The results and discussion are presented in Section 3. A summary and outlook are given in Section 4. Finally, Section 5 draws the conclusions of this article.

Deep-Learning Image-Matching Methodologies
At present, the research on deep-learning methods for wide-baseline image matching mainly focuses on three topics: Feature detection, feature description, and end-to-end matching (see Figure 1). Therefore, this section provides a review and summary of the related work in these research topics below.


Deep-Learning-Based Feature Detection
Figure 2 summarizes the progress in deep-learning feature-detection methods. Based on the implemented learning mode, the mainstream deep-learning feature-detection algorithms can be divided into two types: Supervised learning [48] and unsupervised learning [49]. Supervised learning feature detection takes the feature points extracted by traditional methods as "anchor points" and then trains a regression neural network to predict the locations of more feature points, whereas the unsupervised learning strategy uses a neural network directly to learn candidate points and their response values and then takes the candidate points at the top or bottom of the ranking as the final feature points. The basis of wide-baseline image matching is the extraction of local invariant features, which are local features that remain stable between the stereo images under geometric or radiometric distortions, such as viewpoint change or illumination variation. In recent years, researchers have focused on exploring feature detection schemes for deep learning with enhanced networks [50]. Using the supervised learning strategy as an example, Lenc et al. first proposed a local invariant feature loss function L_cov(x) [51]:

L_cov(x) = min_q ‖φ(gx) q − g φ(x)‖_F²   (1)

where ‖·‖_F is the Frobenius norm, x is the image block to be processed, g is a random geometric transformation, gx is the result of applying g to x, φ(·) is the transformation matrix output by the neural network, and q is the complementary residual transformation of g. On this basis, the algorithm employs the Siamese neural network DetNet to learn invariant feature geometric transformations. Moreover, it uses image control points as anchor points and treats potential feature points as certain transformed forms of these anchor points. In the training phase, the images with anchor points are input to the regression neural network, and the optimal transformation is learned iteratively. Then, the weights of the regression neural network are adjusted according to the loss function and finally interpolated to obtain more feature positions, directions, and shapes. This method created a precedent for deep-learning invariant feature detection, and the detected features are equipped with good scale and rotation invariance.
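As an illustration, the covariance constraint of Equation (1) can be sketched in NumPy for the simplest case of pure translations, where the residual transformation q reduces to the identity. The function names and the toy brightest-pixel detector below are illustrative only and are not part of [51]:

```python
import numpy as np

def covariant_loss(phi, x, g, apply_g):
    """Covariance loss of Eq. (1), specialized to pure translations
    (the residual transformation q is taken as the identity).

    phi     : detector mapping a patch to a 2-D keypoint location
    x       : image patch (H x W array)
    g       : translation vector (dy, dx)
    apply_g : function warping a patch by g
    """
    lhs = phi(apply_g(x, g))            # detect on the warped patch
    rhs = np.asarray(g) + phi(x)        # warp the detection instead
    return float(np.sum((lhs - rhs) ** 2))  # squared Frobenius norm

# Toy check: a detector returning the brightest pixel's coordinates is
# exactly covariant with integer translations, so the loss is zero.
def argmax_detector(patch):
    idx = np.unravel_index(np.argmax(patch), patch.shape)
    return np.array(idx, dtype=float)

def shift(patch, g):
    return np.roll(patch, shift=(int(g[0]), int(g[1])), axis=(0, 1))

patch = np.zeros((8, 8)); patch[2, 3] = 1.0
print(covariant_loss(argmax_detector, patch, (1, 2), shift))  # 0.0
```

Training DetNet amounts to driving this loss toward zero over many random transformations g, so that the learned φ becomes covariant with the chosen transformation group.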
Zhang et al. [52] used the illumination-invariant feature TILDE [30] of deep learning as an anchor point, which solved the problem of image matching under strong illumination changes; on this basis, Doiphode et al. [53] used a triplet network [54] and introduced an affine invariant constraint to learn stable and reliable affine invariant features. The above methods give the target features a certain geometric and radiation invariance, but the geometric relationship between the image blocks must be roughly known before training the model, which implicitly increases the workload of training dataset production.
Yi et al. [55] further studied the Edge Foci (EF) [56] and SIFT [14] features to detect the locations of key points and learned the neighborhood direction of features based on a CNN; Mishkin et al. [57] used a multi-scale Hessian to detect the initial feature points and estimated the affine invariant region based on the triplet network AffNet. These methods combine traditional feature extraction algorithms with deep-learning invariant features, which substantially improves the efficiency and reliability of feature detection.
In addition to the above-mentioned supervised learning features, Savinov et al. [58] also proposed a classic feature-learning strategy based on an unsupervised idea. This method transforms the learning problem of feature detection into a learning problem of response-value sorting of image interest points. The response function of an image point is denoted by H(p|w), where p represents the image point, and H and w represent the CNN to be trained and the weight vector of the network, respectively. The image point response-value sorting model is then expressed as follows:

(H(p_i^d|w) − H(p_j^d|w)) · (H(p_i^t(d)|w) − H(p_j^t(d)|w)) > 0   (2)

where d represents one scene target in the image and p is located on d; i and j are the indexes of p, and i ≠ j; p_i^t(d) and p_j^t(d) are generated respectively by the transformation t of p_i^d and p_j^d. Therefore, all points p on target d are sorted according to the response function and Equation (2), and the image points with response values in the top or bottom ranks are retained as feature points. The key purpose of this method is to learn the invariant response function of the image point using the neural network. The feature points maintain good invariance to perspective transformations of the images; additional experiments in Reference [58] demonstrate that the proposed method may outperform the Difference of Gaussian (DoG) strategy [14] regarding feature repeatability for view-change images. However, the existing methods still have many shortcomings with respect to the repeatability and stability of feature point detection for wide-baseline images with large view changes.
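A minimal sketch of the ranking constraint in Equation (2), relaxed into a hinge loss as is common in practice; the toy intensity-based response function below stands in for the trained CNN H and is purely illustrative:

```python
import numpy as np

def ranking_loss(H, p_i, p_j, t):
    """Hinge relaxation of the ranking constraint in Eq. (2): the
    response ordering of two points on the same target must survive
    the transformation t. H is the learned response function, here
    any callable on a point."""
    d_orig = H(p_i) - H(p_j)
    d_warp = H(t(p_i)) - H(t(p_j))
    # product > 0  <=>  ranking preserved; penalize violations
    return max(0.0, 1.0 - d_orig * d_warp)

# Toy response: pixel intensity at the point; t: a one-column shift.
img = np.arange(25.0).reshape(5, 5)
H = lambda p: img[p]            # response = pixel value
t = lambda p: (p[0], p[1] + 1)  # shift one column right

# Intensity increases left-to-right, so the ordering is preserved
# and a well-separated pair incurs zero loss.
print(ranking_loss(H, (2, 3), (2, 0), t))  # 0.0
```

In the full method, this term is summed over many point pairs sampled from corresponding targets, and the top- or bottom-ranked points of the learned response map become the detected features.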
As mentioned above, most learning-based methods for feature detection are categorized as supervised learning achievements. Such mainstream methods can handily surpass the unsupervised strategies in invariant feature learning because the supervised methods may directly and separately produce the geometric covariant frames for wide-baseline images, while the unsupervised methods need to simultaneously cope with the locations of interest points and their invariance during the learning process.

Deep-Learning Feature Description
Deep-learning feature description [59] has been widely applied in professional tasks [60] such as image retrieval, 3D reconstruction, face recognition, interest point detection, and target positioning and tracking. Specific research on this topic mainly focuses on network structure construction and loss function design, as shown in Figure 3. Among them, the network structure of deep learning directly determines the discrimination and reliability of the feature descriptors, while the loss function affects the training performance of the model by controlling the iterative update frequency of the model parameters and optimizing the quantity and quality of the sample input.
The key to high-quality feature description is to consider both similarity and discrimination. "Similarity" refers to the ability of corresponding feature descriptors to maintain good invariance to signal noise, geometric distortion, and radiation distortion, thereby retaining a high degree of similarity. In contrast, "discrimination" refers to the idea that there should be a large difference between any non-matching feature descriptors. To generate high-quality descriptors, learning-based methods depart from the paradigm of traditional algorithms and build Siamese or triplet networks, which emulate the cognitive structure of human visual nerves. The Siamese network, also known as the dual-channel network, is a coupled architecture based on a binary branch network, whereas the triplet network has one more branch than the Siamese network, and thus it can be adapted to a scenario in which three samples are input simultaneously.
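The two qualities can be made concrete with a small NumPy check; the helper below is illustrative and simply contrasts the mean distance between matching descriptor pairs against the mean distance between non-matching pairs:

```python
import numpy as np

def similarity_and_discrimination(desc_a, desc_b):
    """Quantify the two descriptor qualities described above:
    'similarity' = mean distance between corresponding descriptors
    (rows of desc_a and desc_b match one-to-one), and
    'discrimination' = mean distance between non-corresponding pairs.
    A good descriptor makes the first small and the second large."""
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.linalg.norm(diff, axis=2)       # full distance matrix
    pos = np.mean(np.diag(dist))              # matching pairs
    mask = ~np.eye(len(desc_a), dtype=bool)
    neg = np.mean(dist[mask])                 # non-matching pairs
    return pos, neg

rng = np.random.default_rng(0)
base = rng.normal(size=(10, 128))
noisy = base + 0.05 * rng.normal(size=(10, 128))  # same features, mild noise
pos, neg = similarity_and_discrimination(base, noisy)
print(pos < neg)  # True: similar where it should be, discriminative elsewhere
```

Loss functions for descriptor learning, including the triplet losses discussed below, are precisely mechanisms for shrinking the first quantity while enlarging the second during training.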
Figure 4 shows the evolution of several typical feature-description networks. Among them, a representative approach is MatchNet [20], which uses the original Siamese network and is composed of two main parts: A feature coding network and a similarity measurement network. The two branches of the feature network maintain dynamic weight sharing and extract the feature patches from stereo images through a convolution layer [58], a maximum pooling layer [61], and other layers. Furthermore, it calculates the similarity between image blocks through a series connection to the top fully connected network [62] and then determines the matching blocks based on the similarity score. Subsequently, Zagoruyko et al. [63] further explored the role of the central-surround two-stream network (CSTSNet) [64] and the spatial pyramid pooling net (SPPNet) [65] in feature description. CSTSNet combines a low-resolution surround stream with a high-resolution center stream, which not only uses the multi-resolution information of the image but also emphasizes the information of the center pixels, thus substantially improving the matching performance. In contrast, SPPNet inherits the good characteristics of the Siamese network and enhances the adaptation to image blocks of different sizes by introducing a spatial pyramid pooling layer. To apply SPPNet to the description of features in satellite images, Fan et al. [66] designed a dual-channel description network based on a spatial-scale convolutional layer to improve the accuracy of satellite image matching. These descriptor measurement networks belong to the fully connected category of networks, which consume a large amount of computing resources during training and testing, and hence have low matching efficiency. To address this, Tian et al.
proposed a feature description model called L2-Net [67] with a fully convolutional network representation. This method inherits the idea of SIFT descriptors; namely, it adjusts the dimension of the network output to 128 and uses the L2-norm (Euclidean distance) measure instead of a metric network to evaluate the similarity of the feature descriptors. The basic structure of the L2-Net network is shown in Figure 5. This network consists of seven convolutional layers and a local response normalization (LRN) layer. In the figure, the term "3 × 3 Conv" in the convolutional layer refers to convolution, batch normalization, and linear activation operations in series, and "8 × 8 Conv" represents the convolution and batch normalization operations. Moreover, "32" represents a 32-dimensional convolution with a step size of 1, and "64/2" refers to a 64-dimensional convolution operation with a step size of 2. The final output LRN layer is used to generate unit descriptor vectors while accelerating network convergence and enhancing model generalization. Training and testing results on the open-source Brown [68], Oxford [10], and HPatches [69] datasets show that L2-Net has good generalization ability, and its performance is better than that of existing traditional descriptors. Moreover, L2-Net performs well with respect to image feature classification as well as wide-baseline stereo image feature description and matching, and thus many researchers regard it as a classic feature description network and have extended it with improvements in network structure. Balntas et al.
[34] found that one disadvantage of L2-Net is that it ignores the contribution of negative samples to the loss function value. Hence, they proposed the triplets and shallow CNN (TSCNN). This method simplifies the L2-Net network layers and the number of channels and then incorporates negative samples into the network training, so that the modified model can reduce the distance between matching feature descriptors while increasing the distance between non-matching feature descriptors.
However, the negative samples are input into TSCNN using a random sampling strategy; as a result, most negative samples do not sufficiently contribute to the model training, which limits the improvement in descriptor discrimination. In view of this, HardNet [70] incorporates the most difficult negative sample, namely the nearest non-matching descriptor, into the training of the model, which substantially enhances the training efficiency and matching performance. The triplet margin loss (TML) function used by this model is as follows:

L = (1/m) Σ_{i=1}^{m} max(0, 1 + d(a_i, p_i) − min(d(a_i, n_j^min), d(p_i, n_k^min)))   (3)

where m is the batch size, d(·) is the Euclidean distance between two descriptors, a_i and p_i are an arbitrary pair of matching descriptors, and n_j^min and n_k^min represent the closest non-matching descriptors to a_i and p_i, respectively. On the basis of the L2-Net network structure, the HardNet descriptor model employs the nearest-neighbor negative sampling strategy and the TML loss function, which is another important advance in descriptor network models. Inspired by HardNet, some notable deep-learning models for feature description have been further explored. For example, LogPolarDesc [71] uses a polar transform network to extract corresponding image blocks with higher similarity to improve the quality and efficiency of model training; SOSNet [72] introduces second-order similarity regularization into the loss function to prevent over-fitting of the model and substantially improve the utilization of the descriptors. To generate a descriptor with both global and local geometric invariance, some researchers have proposed making full use of the geometric or visual context information of an image. The representative approach GeoDesc [73] employs cosine similarity to measure the matching degree of descriptors. It also sets self-adaptive distance thresholds to handle different training image blocks and then introduces a geometric loss function to enhance the geometric invariance of the descriptor; the adaptive threshold is expressed by the following equation:

β = { s_patch, if s_patch ≥ 0.5;  0.5, if 0.2 ≤ s_patch < 0.5;  0.2, otherwise }   (4)

where β represents the adaptive threshold applied to the cosine similarity s_{i,i} between corresponding feature descriptors, and s_patch represents the similarity of the corresponding image blocks. On this basis, ContextDesc [74] integrates geometric and visual context perception into the process of network model construction, thus improving the utilization of image geometry and visual context information. Finally, many data tests show that ContextDesc adapts well to the geometric and radiation distortions of different scenes.
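The hardest-in-batch mining behind the TML loss of Equation (3) can be sketched compactly in NumPy; the function names are illustrative, not the authors' implementation, and the inputs are assumed to be unit-length descriptors as produced by L2-Net-style networks:

```python
import numpy as np

def l2_normalize(desc):
    """Unit-length descriptors, as output by L2-Net-style networks."""
    return desc / np.linalg.norm(desc, axis=1, keepdims=True)

def hardnet_tml(anchors, positives):
    """Triplet margin loss of Eq. (3) with hardest-in-batch negative
    mining: for each matching pair (a_i, p_i), the nearest
    non-matching descriptor in the batch serves as the negative."""
    d = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    pos = np.diag(d)                  # d(a_i, p_i)
    off = d + 1e9 * np.eye(len(d))    # mask out the matching pairs
    hard_a = off.min(axis=1)          # nearest p_j to a_i, j != i
    hard_p = off.min(axis=0)          # nearest a_j to p_i, j != i
    hardest = np.minimum(hard_a, hard_p)
    return float(np.mean(np.maximum(0.0, 1.0 + pos - hardest)))

# Orthogonal unit descriptors are separated by sqrt(2) > 1, so the
# unit margin is satisfied and the loss is zero.
a = l2_normalize(np.eye(4))
print(hardnet_tml(a, a.copy()))  # 0.0
```

During training, gradients of this loss pull matching descriptors together and push each descriptor away from its currently hardest impostor, which is what gives HardNet its strong discrimination.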
In short, feature description plays a vital role in image matching, as a high-quality descriptor can absorb local and global information from the feature neighborhoods, which provides adequate knowledge for recognizing the unique feature among extensive false candidates. Based on the aforementioned analysis, triplet networks can perform better than Siamese or single-branch models, because multi-branch networks are more efficient in learning the uniqueness of features and can make full use of context information.
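For instance, the classical nearest-neighbor matching with Lowe's ratio test [14] illustrates how descriptor discrimination translates into reliable match selection; the toy 3-D descriptors below are purely illustrative:

```python
import numpy as np

def ratio_test_match(desc1, desc2, ratio=0.8):
    """Nearest-neighbor matching with Lowe's ratio test: accept a
    candidate only if its best match is clearly better than the second
    best, which rejects ambiguous features such as repetitive patterns.
    Returns (index_in_desc1, index_in_desc2) pairs."""
    matches = []
    for i, d in enumerate(desc1):
        dist = np.linalg.norm(desc2 - d, axis=1)
        j, k = np.argsort(dist)[:2]        # best and second best
        if dist[j] < ratio * dist[k]:
            matches.append((i, int(j)))
    return matches

# The distinctive descriptor matches; the query that is equally close
# to two candidates is rejected by the ratio test.
d1 = np.array([[1.0, 0.0, 0.0], [0.0, 0.7, 0.7]])
d2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(ratio_test_match(d1, d2))  # [(0, 0)]
```

A descriptor with poor discrimination produces many near-ties in this test and therefore few surviving matches, which is exactly the failure mode the triplet losses above are designed to avoid.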

Deep-Learning End-to-End Matching
The end-to-end matching strategy integrates the three different stages of image feature extraction, description, and matching into one system for training, which is beneficial for learning the globally optimal model parameters and adaptively improves the performance of each stage [75]. Figure 6 summarizes the development of end-to-end deep-learning matching. Most end-to-end methods focus on the design of training modes and the automatic acquisition of training data [76]. The design of training modes is intended to obtain high-quality image features and descriptors in a more concise and efficient way; the aim of automatic data acquisition is to achieve fully automatic training by means of a classical feature detection algorithm and a spatial multi-scale sampling strategy.
Remote Sens. 2021, 13, x FOR PEER REVIEW 8 of 22
Yi et al. proposed the learned invariant feature transform (LIFT) network structure [77]. This network first integrated feature detection, orientation estimation, and feature description into one pipeline based on the spatial transformer (ST) [78] and the softargmax algorithm [79], with end-to-end training carried out by back propagation. The complete training and testing process of this method is shown in Figure 7. The back-propagation-based training process of LIFT can be briefly described as follows. First, the feature locations and principal orientations are extracted using the structure from motion (SFM) algorithm [80], and the feature descriptor is trained. Second, guided by the feature descriptor, the orientation estimator is trained based on the feature location and its neighborhood ST. Finally, the feature descriptor and orientation estimator are united to train the feature detector on the training dataset. After LIFT has been trained, the corresponding test process proceeds as follows. First, the feature score map of a multi-scale image is obtained by the feature detector. Second, scale-space non-maximum suppression is performed using the softargmax function, and the scale-invariant feature region is extracted. Finally, the feature region is normalized, and the description vectors are extracted by the feature descriptor.
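The softargmax localization step in the test pipeline can be sketched as follows: instead of a hard argmax, the keypoint position is a softmax-weighted average of pixel coordinates, which keeps localization differentiable. The sharpness parameter `beta` is an assumed illustrative value, not from the LIFT paper:

```python
import numpy as np

def softargmax_2d(score_map, beta=10.0):
    """Differentiable keypoint localization: softmax-weighted
    average of pixel coordinates over a 2D score map.
    Returns sub-pixel (row, col) coordinates."""
    h, w = score_map.shape
    # subtract the max before exponentiating for numerical stability
    weights = np.exp(beta * (score_map - score_map.max()))
    weights /= weights.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((weights * ys).sum()), float((weights * xs).sum())
```

With a sharp peak in the score map, the result approaches the discrete argmax while remaining usable inside back propagation.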
Although LIFT belongs to the category of end-to-end network models, it adopts a back-propagation-based multi-stage training mode, which reduces the training efficiency and practicality of the model; additionally, LIFT employs an SFM strategy and random spatial transformations to provide matching image blocks for training, which limits the discrimination of the descriptors. In view of this, DeTone et al. [81] proposed a self-supervised network model called MagicPoint to label training data instead of SFM. They then used the SuperPoint model to learn feature points and extract their descriptors for end-to-end training.
SuperPoint realizes the joint training of feature detection and description through an encoding structure [82] and a decoding structure [83]. The encoding structure is used for image feature extraction, whereas the decoding structure outputs not only the position of each feature point but also its descriptor vector. Similarly, Revaud et al. [84] proposed the Siamese decoding structure R2D2, which focuses more on the repeatability and discriminativeness of the trained features than SuperPoint.
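The decoding step described above can be sketched for the detector head. This assumes SuperPoint's published layout, in which the head emits 65 channels per 8x8 pixel cell (64 within-cell positions plus a "no keypoint" dustbin); the sketch below unfolds such an output into a full-resolution heatmap and is illustrative, not the reference implementation:

```python
import numpy as np

def decode_superpoint_heatmap(semi):
    """Decode a SuperPoint-style detector head output of shape
    (65, H/8, W/8): channel-wise softmax, drop the dustbin channel,
    and unfold the 64 cell channels into a full-resolution (H, W)
    keypoint probability heatmap."""
    c, hc, wc = semi.shape
    assert c == 65
    e = np.exp(semi - semi.max(axis=0, keepdims=True))
    prob = e / e.sum(axis=0, keepdims=True)
    nodust = prob[:-1]                        # (64, H/8, W/8)
    nodust = nodust.reshape(8, 8, hc, wc)     # (row-in-cell, col-in-cell, Hc, Wc)
    heat = nodust.transpose(2, 0, 3, 1).reshape(hc * 8, wc * 8)
    return heat
```

Non-maximum suppression and thresholding on this heatmap then yield the final keypoints, while a parallel descriptor head supplies the descriptor vectors.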
The learning-based MagicPoint method can replace the handcrafted labeling of feature points, but a small amount of manually labeled data is still required to obtain the pre-trained model. Ono et al. [85] proposed LF-Net, an end-to-end model that uses unsupervised training. This method directly uses stereo images obtained by a metric camera, an image depth map, the camera position and orientation data, and other prior information to complete the end-to-end model training, which greatly reduces the need for manual intervention and advances the automation of deep-learning matching. In addition, Dusmanu et al. proposed combining feature detection and descriptor extraction so as to make more effective use of high-level semantic information, and introduced the simplified end-to-end model D2Net [86]. The difference between this model and the traditional model is depicted in Figure 8: Figure 8a shows the traditional "detect-then-describe" model, of which SuperPoint [81] is a representative, and Figure 8b shows the D2Net "describe-and-detect" model. In contrast to a Siamese or multi-branch network structure [87], D2Net adopts a single-branch network architecture, in which the feature locations and descriptor information of the image are stored in high-dimensional feature channels, which is more conducive to obtaining stable and efficient matches. However, D2Net must extract dense descriptors when using high-level semantic information, which reduces the accuracy and efficiency of feature detection.
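The describe-and-detect idea can be sketched as follows. This is a simplified single-scale version in the spirit of D2Net's soft detection score, not the exact published formulation: the same dense feature map that supplies the descriptors also yields a per-pixel detection score, combining local spatial prominence with channel dominance:

```python
import numpy as np

def d2net_style_scores(feats):
    """Describe-and-detect sketch: derive a per-pixel detection score
    from a dense feature map of shape (C, H, W). alpha is a soft
    local max over a 3x3 spatial window; beta is the response
    relative to the strongest channel at that pixel."""
    c, h, w = feats.shape
    # alpha: exp(f) normalized over the 3x3 neighborhood, per channel
    e = np.exp(feats)
    padded = np.pad(e, ((0, 0), (1, 1), (1, 1)), mode="edge")
    neigh_sum = sum(
        padded[:, i:i + h, j:j + w] for i in range(3) for j in range(3)
    )
    alpha = e / neigh_sum
    # beta: channel dominance at each pixel
    beta = feats / (feats.max(axis=0, keepdims=True) + 1e-8)
    return (alpha * beta).max(axis=0)   # (H, W) detection score map
```

A pixel that is both locally prominent and dominated by one strong channel scores highest, so detection falls out of the descriptor tensor itself rather than from a separate detector branch.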

All in all, the end-to-end strategy is prone to training the optimal parameters for image matching. Multi-network models with complex architectures need more training samples than a single network. Considering the available scale of training data [76], the self-supervised learning mode is the best choice for current practical applications.

Representative Algorithms and Experimental Data
To evaluate the performance, advantages, and disadvantages of deep-learning stereo matching algorithms, we selected 10 well-performing algorithms in five categories for the experiments: deep-learning end-to-end matching, deep-learning feature detection and description, deep-learning feature detection with handcrafted feature description, handcrafted feature detection with deep-learning feature description, and handcrafted image matching, as shown in Table 1. In addition, the key source code of each algorithm can be obtained from the corresponding link in this table. These methods were selected for the following reasons. First, as representatives of deep-learning end-to-end matching, SuperPoint [81] and D2Net [86] were published recently and have been widely applied [88][89][90] in the fields of photogrammetry and computer vision. Second, the deep-learning feature detectors AffNet [57] and DetNet [51], and the deep-learning feature descriptors HardNet [70], SOSNet [72], and ContextDesc [74], were all proposed for wide-baseline image matching and are often used as benchmarks [12]. Third, the classical handcrafted methods are used here to verify the strength of the deep-learning methods. Finally, all selected methods performed well in previous reports, and their source code is publicly available.
Table 1. Representative algorithms and their references.

Categories | Algorithms | Code links
Deep learning end-to-end matching | SuperPoint [81] | https://github.com/rpautrat/SuperPoint
Deep learning end-to-end matching | D2Net [86] |
Deep learning feature detection and description | AffNet [57] + HardNet [70] | https://github.com/DagnyT/hardnet
Deep learning feature detection and description | AffNet [57] + SOSNet [72] | https://github.com/scape-research/SOSNet
Deep learning feature detection and description | DetNet [51] + HardNet [70] | https://github.com/lenck/ddet
Handcrafted image matching | ASIFT [18] | https://github.com/search?q=ASIFT
The datasets used to train each deep-learning algorithm are as follows: SuperPoint uses MSCOCO [91]; D2Net uses MegaDepth [92]; AffNet uses UBC Phototour [68]; both HardNet and SOSNet use HPatches [69]; ContextDesc uses GL3D [93]; and DetNet uses DTU-Robots [94]. According to the corresponding literature [68,69,91-94], the characteristics of each dataset can be summarized as follows. MSCOCO was proposed with the goal of advancing the state of the art in scene understanding and object detection. In contrast to other popular datasets, MSCOCO involves fewer common object categories but more instances per category, which is useful for learning complex scenes. MegaDepth was created to exploit multi-view internet images and produce a large amount of training data by the SFM method. It performs well for challenging environments such as offices and close-ups, but MegaDepth is biased towards outdoor scenes. UBC Phototour initially proposed patch verification as an evaluation protocol. A large number of patches is available in this dataset, which is particularly suited for deep-learning-based detectors. The images in this dataset have notable variations in illumination and viewpoint, but most of them cover only three scenes: Liberty, Notre-Dame, and Yosemite. HPatches presented a new large-scale dataset specifically for training local descriptors, aiming to eliminate the ambiguities and inconsistencies in scene understanding. It has the advantages of diverse scenes and notable viewpoint changes. GL3D was designed as a large-scale database for 3D surface reconstruction and geometry-related learning tasks. This dataset covers many different scenes, including rural areas, urban areas, and scenic spots, taken from multiple scales and viewpoints. The DTU-Robots dataset involves real images of 3D scenes shot using a robotic arm under rigorous laboratory conditions, which is suitable for certain applications but of limited size and variety.
The representative wide-baseline test data are presented in Figure 9, and the corresponding data descriptions are listed in Table 2. Algorithms ①, ②, and ⑩ can directly output the corresponding features; for the descriptors output by algorithms ③-⑨, we adopt the nearest-neighbor to second-nearest-neighbor distance ratio to obtain the matches. Finally, each algorithm employs the random sample consensus (RANSAC) strategy to eliminate outliers. The performance of the algorithms is objectively evaluated according to the number of matching points, the matching accuracy, and the matching spatial distribution indexes.
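The nearest-neighbor to second-nearest-neighbor distance ratio test mentioned above can be sketched as a brute-force NumPy routine; the ratio threshold used here is illustrative, and outlier removal with RANSAC would follow as a separate step:

```python
import numpy as np

def ratio_test_match(desc1, desc2, ratio=0.8):
    """Nearest-neighbor matching with a ratio test: accept a match
    only when the nearest descriptor in desc2 is clearly closer than
    the second nearest (distance ratio below `ratio`).
    Returns a list of (index_in_desc1, index_in_desc2) pairs."""
    # pairwise Euclidean distances, shape (N1, N2)
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(d):
        nn, snn = np.argsort(row)[:2]
        if row[nn] < ratio * row[snn]:
            matches.append((i, int(nn)))
    return matches
```

Ambiguous features, whose two best candidates are nearly equidistant, are discarded rather than risked as false matches.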

Based on the wide-baseline stereo images and the 10 representative algorithms, the experimental results are summarized as follows. Table 3 presents the number of image matches obtained by each algorithm, where the highlighted number is the maximum number of matches in each group of test data. Figure 10 shows the matching errors of each algorithm, where the matching error is calculated by the following equations [95]:

ε_H = (1/n) Σ_j ||x'_j − H x_j||
ε_F = (1/n) Σ_j |x'_j^T F x_j| / sqrt((F x_j)_1^2 + (F x_j)_2^2)        (5)

where n is the number of matches; x_j and x'_j are an arbitrary pair of matching points; and H and F are the known true perspective transformation matrix and true fundamental matrix, respectively. The matching errors of test data (a)-(f), which consist of approximately planar scenes, are evaluated by ε_H (pixel), and the matching errors of test data (g)-(m), which consist of non-planar scenes, are evaluated by ε_F (pixel).

Figure 11 exhibits the image-matching results of each algorithm. Because of the limited space, Figure 11 only exhibits the matching results of algorithms ①, ③, ④, ⑤, and ⑩ on test data (a), (f), (g), (i), (j), and (m), where the matching points are connected by yellow lines, and the most matches in each row of the figure are marked by a frame. For each algorithm, Figure 12 shows the matching distribution index Dis, which is estimated from the Delaunay triangulation of the matches [96], with n_t denoting the total number of Delaunay triangles generated by the matching points, A_i the area of the i-th triangle, max(J_i) the radian value of the maximum internal angle of the i-th triangle, and A̅ the average area of the triangles. The Dis values reveal the consistency and uniformity of the spatial distribution of the matches, and a smaller Dis value indicates that the matches have a higher spatial distribution quality.
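A minimal sketch of the two error measures in Equation (5), assuming the matched points are given as (N, 2) pixel-coordinate arrays and H and F are the ground-truth homography and fundamental matrices:

```python
import numpy as np

def homography_error(x1, x2, H):
    """Mean reprojection error eps_H: distance between x2 and the
    ground-truth homography projection of x1."""
    x1h = np.hstack([x1, np.ones((len(x1), 1))])   # homogeneous coords
    proj = (H @ x1h.T).T
    proj = proj[:, :2] / proj[:, 2:3]              # dehomogenize
    return float(np.linalg.norm(proj - x2, axis=1).mean())

def epipolar_error(x1, x2, F):
    """Mean point-to-epipolar-line distance eps_F under the
    ground-truth fundamental matrix F."""
    x1h = np.hstack([x1, np.ones((len(x1), 1))])
    x2h = np.hstack([x2, np.ones((len(x2), 1))])
    lines = (F @ x1h.T).T                          # epipolar lines in image 2
    num = np.abs(np.sum(x2h * lines, axis=1))      # |x2^T F x1|
    den = np.linalg.norm(lines[:, :2], axis=1)     # sqrt((Fx)_1^2 + (Fx)_2^2)
    return float((num / den).mean())
```

eps_H applies to the approximately planar test pairs, where a single homography explains the scene, while eps_F applies to the non-planar pairs, where only the epipolar constraint holds.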
The datasets used to train each deep-learning algorithm are as follows: SuperPoint using MSCOCO [91]; D2Net using MegaDepth [92]; AffNet using UBC Phototour [68]; both HardNet and SOSNet using HPatches [69]; ContextDesc using GL3D [93]; DetNet using DTU-Robots [94].According to their corresponding literatures [68,69,[91][92][93][94], the characters of each dataset would be discussed and summarized as follows.MSCOCO was proposed with the goal of advancing the state-of-the-art in scene understanding and object detection.In contrast to the popular datasets, MSCOCO involves fewer common object categories but more instances per category, which would be useful for learning complex scenes.MegaDepth was created to exploit multi-view internet images and produce a large amount of training data by the SFM method.It performs well for challenging environments such as offices and close-ups, but MegaDepth is biased towards outdoor scenes.UBC Phototour initially proposed patch verification as an evaluation protocol.There is large number of patches available in this dataset, which is particularity suited for deep-learning-based detectors.The images in this dataset have notable variations in illumination and view changes, but most of these images only focus on three scenes: Liberty, Notre-Dame, and Yosemite.HPatches presented a new large-scale dataset special for training local descriptors, aiming to eliminate the ambiguities and inconsistencies in scene understanding.It has the superiorities of diverse scenes and notable viewpoint changes.GL3D designed a large-scale database for 3D surface reconstruction and geometry-related learning issues.This dataset covered many different scenes, including rural area, urban, and scenic spots taken from multiple scales and viewpoints.The DTU-Robots dataset involves real images of 3D scenes, shot using a robotic arm in rigorous laboratory conditions, which is suitable for certain application but of limited size and variety in the data.
The representative wide-baseline test data are presented in Figure 9, and the corresponding data descriptions are listed in Table 2. Algorithms ①, ②, and ⑩ can directly output the corresponding features, and for the descriptors output by algorithms ③-⑨, we adopt the nearest-neighbor and second nearest-neighbor distance ratio to obtain the matches.Finally, each algorithm employs the random sample consensus (RANSAC) strategy to eliminate outliers.The performance of the algorithms is objectively evaluated according to the number of matching points, matching accuracy, and matching spatial distribution indexes.
DetNet [51] + HardNet [70] https://github.com/lenck/ddetDeep learning feature detection and handcrafted feature description the use of image context information.The test results show that it is particulare for image matching in scenes with cluttered background (data (c)) or large ic distortion (data (e))., we further discuss the strengths and weaknesses of integrating methods for ult test data.Although algorithms ③, ④, and ⑦ all adopt AffNet to extract ariant features, the test results of algorithms ③ and ④ are substantially beteculate the reason is that the deep-learning descriptors of algorithms ③ and m better than the handcrafted descriptor SIFT of algorithm 7. Figure 10 shows atching of both the deep learning and handcrafted algorithms is not able to ub-pixel accuracy.The main reason is that the two stages of feature detection re matching are relatively independent, which makes it difficult for the correpoints to be accurately aligned.The complete UAV dataset, which is larger in esolution (data (j) and (k)) was also used for testing.It should theoretically be for each algorithm to obtain more matches; however, Table 3 shows that the f matches did not increase substantially as a result.We believe that a high resill exacerbate the lack of local texture in the image, and larger images tend to more occluded regions.Specifically, data (j) contain more occluded scenes and ous textures, whereas data (k) involve a large area of water and scenes with t changes.Additionally, the ratio of the corresponding regions in the larger lower.Thus, it would become more difficult to obtain the corresponding feahe absence of prior knowledge or initial matches.For satellite wide-baseline ith various mountainous and urban areas, both the deep-learning approach nt and the handcrafted ASIFT method can obtain a significant number of ry and Outlook ide-baseline image-matching problems, this paper systematically organized, and summarized the existing deep-learning image invariant feature 
detection, n, and end-to-end matching models.In addition, the matching performances resentative algorithms were evaluated and compared through comprehensive nts on wide-baseline images.According to the above test results and discusre research and challenges can be summarized as follows.he current deep-learning invariant feature detection approach continues to reotential, and the research on invariant features and their applications has indeveloped, from the scale invariant feature learning of Reference [52] to the ariant feature training of Reference [53].Experiments have shown that learnmethods have better potential than handcrafted detection algorithms such as and pixel watershed [17].In addition, the strategy of combining handcrafted with learning-based methods [55] to extract invariant features has become a ion, but this type of method obviously depends on the accurate extraction of features by the handcrafted algorithms.In short, although the feature detecods based on deep learning tend to show abilities beyond the traditional this approach is not yet fully mature, especially for the matching problem of line images with complex scenes and various textures, and it still faces great s.Therefore, extracting invariant features with high repeatability and stability ther study.eep-learning feature description is essentially metric learning; this kind of s mainly focused on network model construction and loss function design.MatchNet Siamese network [20] to the SOSNet triplet structure [72], the model

Categories Algorithms
The datasets used to train each deep-learning algorithm are as follows: SuperPoint using MSCOCO [91]; D2Net using MegaDepth [92]; AffNet using UBC Phototour [68]; both HardNet and SOSNet using HPatches [69]; ContextDesc using GL3D [93]; DetNet using DTU-Robots [94].According to their corresponding literatures [68,69,[91][92][93][94], the characters of each dataset would be discussed and summarized as follows.MSCOCO was proposed with the goal of advancing the state-of-the-art in scene understanding and object detection.In contrast to the popular datasets, MSCOCO involves fewer common object categories but more instances per category, which would be useful for learning complex scenes.MegaDepth was created to exploit multi-view internet images and produce a large amount of training data by the SFM method.It performs well for challenging environments such as offices and close-ups, but MegaDepth is biased towards outdoor scenes.UBC Phototour initially proposed patch verification as an evaluation protocol.There is large number of patches available in this dataset, which is particularity suited for deep-learning-based detectors.The images in this dataset have notable variations in illumination and view changes, but most of these images only focus on three scenes: Liberty, Notre-Dame, and Yosemite.HPatches presented a new large-scale dataset special for training local descriptors, aiming to eliminate the ambiguities and inconsistencies in scene understanding.It has the superiorities of diverse scenes and notable viewpoint changes.GL3D designed a large-scale database for 3D surface reconstruction and geometry-related learning issues.This dataset covered many different scenes, including rural area, urban, and scenic spots taken from multiple scales and viewpoints.The DTU-Robots dataset involves real images of 3D scenes, shot using a robotic arm in rigorous laboratory conditions, which is suitable for certain application but of limited size and variety in the data.
The representative wide-baseline test data are presented in Figure 9, and the corresponding data descriptions are listed in Table 2. Algorithms ①, ②, and ⑩ can directly output the corresponding features, and for the descriptors output by algorithms ③-⑨, we adopt the nearest-neighbor and second nearest-neighbor distance ratio to obtain the matches.Finally, each algorithm employs the random sample consensus (RANSAC) strategy to eliminate outliers.The performance of the algorithms is objectively evaluated according to the number of matching points, matching accuracy, and matching spatial distribution indexes.

Categories Algorithms
The datasets used to train each deep-learning algorithm are as follows: SuperPoint using MSCOCO [91]; D2Net using MegaDepth [92]; AffNet using UBC Phototour [68]; both HardNet and SOSNet using HPatches [69]; ContextDesc using GL3D [93]; DetNet using DTU-Robots [94].According to their corresponding literatures [68,69,[91][92][93][94], the characters of each dataset would be discussed and summarized as follows.MSCOCO was proposed with the goal of advancing the state-of-the-art in scene understanding and object detection.In contrast to the popular datasets, MSCOCO involves fewer common object categories but more instances per category, which would be useful for learning complex scenes.MegaDepth was created to exploit multi-view internet images and produce a large amount of training data by the SFM method.It performs well for challenging environments such as offices and close-ups, but MegaDepth is biased towards outdoor scenes.UBC Phototour initially proposed patch verification as an evaluation protocol.There is large number of patches available in this dataset, which is particularity suited for deep-learning-based detectors.The images in this dataset have notable variations in illumination and view changes, but most of these images only focus on three scenes: Liberty, Notre-Dame, and Yosemite.HPatches presented a new large-scale dataset special for training local descriptors, aiming to eliminate the ambiguities and inconsistencies in scene understanding.It has the superiorities of diverse scenes and notable viewpoint changes.GL3D designed a large-scale database for 3D surface reconstruction and geometry-related learning issues.This dataset covered many different scenes, including rural area, urban, and scenic spots taken from multiple scales and viewpoints.The DTU-Robots dataset involves real images of 3D scenes, shot using a robotic arm in rigorous laboratory conditions, which is suitable for certain application but of limited size and variety in the data.
The representative wide-baseline test data are presented in Figure 9, and the corresponding data descriptions are listed in Table 2. Algorithms ①, ②, and ⑩ can directly output the corresponding features; for the descriptors output by algorithms ③-⑨, we adopt the nearest-neighbor to second-nearest-neighbor distance ratio to obtain the matches. Finally, each algorithm employs the random sample consensus (RANSAC) strategy to eliminate outliers. The performance of the algorithms is objectively evaluated in terms of the number of matching points, matching accuracy, and matching spatial distribution.
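The nearest-neighbor to second-nearest-neighbor distance-ratio test can be sketched as follows. This is a minimal NumPy version; the 0.8 threshold and the function name are illustrative assumptions, not part of any of the reviewed algorithms.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Match rows of desc1 to rows of desc2 with the distance-ratio test:
    accept a match only if the nearest neighbor is clearly closer than the
    second-nearest neighbor (illustrative threshold 0.8)."""
    # Pairwise Euclidean distances between all descriptor pairs.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    # Indices of the two closest descriptors in desc2 for each row of desc1.
    order = np.argsort(d, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    matches = []
    for i in range(desc1.shape[0]):
        if d[i, nearest[i]] < ratio * d[i, second[i]]:
            matches.append((i, nearest[i]))
    return matches
```

In a full pipeline the surviving matches would then be passed to RANSAC to eliminate outliers against a homography or fundamental-matrix model.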
Experimental Results
For the 13 sets of wide-baseline stereo images and 10 representative algorithms, the results are as follows. Table 3 presents the number of image matches obtained by each algorithm, where the bold number is the maximum number of matches in each group of test data. Figure 10 shows the matching errors of each algorithm, where the matching error ε is estimated by the following equations [95]:

$$\varepsilon_H = \frac{1}{N}\sum_{j=1}^{N}\left\| \mathbf{x}'_j - H\,\mathbf{x}_j \right\|,\qquad \varepsilon_F = \frac{1}{N}\sum_{j=1}^{N}\frac{\left| \mathbf{x}'^{\top}_j F\,\mathbf{x}_j \right|}{\sqrt{(F\mathbf{x}_j)^2_1+(F\mathbf{x}_j)^2_2}} \tag{5}$$

where N is the number of matches, x_j and x′_j are an arbitrary pair of matching point coordinates, and H and F are the known true perspective transformation matrix and true fundamental matrix, respectively. The matching errors of test data (a)-(f), which consist of planar or approximately planar scenes, are evaluated by ε_H (pixel), and the matching errors of test data (g)-(m), which consist of non-planar scenes, are evaluated by ε_F (pixel).
Figure 11 shows the image-matching results of each algorithm. Because of the limited space, this figure only exhibits the matching results of algorithms ①, ③, ④, ⑤, and ⑩ on test data (a), (f), (g), (i), (j), and (m), where the matching points are indicated by red dots and joined by yellow lines, and the most matches in each row of the figure are marked by a green frame. For each algorithm, Figure 12 shows the matching distribution quality Dis, which is estimated by the following equation [96]:

$$Dis = \frac{1}{n}\sum_{i=1}^{n}\max(J_i)\cdot\frac{\left| A_i-\bar{A} \right|}{\bar{A}} \tag{6}$$

where n represents the total number of Delaunay triangles generated by the matching points, A_i denotes the area of the i-th triangle, max(J_i) represents the radian value of its maximum internal angle, and Ā represents the average area of the triangles. The value of Dis reveals the consistency and uniformity of the spatial distribution of the triangle network, and a smaller Dis value indicates that the matches have a higher spatial distribution quality.
Table 1. Representative algorithms and their references.
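As an illustrative sketch, the homography reprojection error ε_H and the epipolar error ε_F used in the experiments can be computed as follows, assuming pixel coordinates and known ground-truth H and F; the function names are ours, not from the reviewed literature.

```python
import numpy as np

def to_homog(pts):
    """Append a 1 to each 2D point (N x 2 -> N x 3)."""
    return np.hstack([pts, np.ones((pts.shape[0], 1))])

def epsilon_H(x, x2, H):
    """Mean reprojection error of matches (x, x2) against a known homography H."""
    proj = (H @ to_homog(x).T).T          # project left points into the right image
    proj = proj[:, :2] / proj[:, 2:3]     # back to inhomogeneous pixel coordinates
    return np.mean(np.linalg.norm(proj - x2, axis=1))

def epsilon_F(x, x2, F):
    """Mean distance of right points x2 to the epipolar lines F @ x."""
    lines = (F @ to_homog(x).T).T         # epipolar lines in the right image
    num = np.abs(np.sum(to_homog(x2) * lines, axis=1))
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    return np.mean(num / den)
```

For planar test data the first measure applies; for non-planar data only the epipolar constraint is available, hence the second.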

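The Delaunay-based distribution-quality measure Dis used in the experiments can be sketched with SciPy. Since only its ingredients are defined in the text (triangle areas A_i, their mean Ā, and the radian value of the maximum internal angle max(J_i)), the exact weighting below is an assumption and may differ from the formula in reference [96].

```python
import numpy as np
from scipy.spatial import Delaunay

def distribution_quality(points):
    """Delaunay-based spatial-distribution measure: for each triangle, combine
    the radian value of its largest internal angle with the relative deviation
    of its area from the mean area.  Smaller values indicate a more uniform,
    well-shaped triangulation.  The weighting is an assumption, not the exact
    formula of the cited reference."""
    tri = Delaunay(points)
    p = points[tri.simplices]                     # (n, 3, 2) triangle vertices
    a, b, c = p[:, 0], p[:, 1], p[:, 2]
    # Triangle areas via the 2D cross product.
    ab, ac = b - a, c - a
    areas = 0.5 * np.abs(ab[:, 0] * ac[:, 1] - ab[:, 1] * ac[:, 0])
    mean_area = areas.mean()

    def angle(u, v):
        cosang = np.sum(u * v, axis=1) / (
            np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
        return np.arccos(np.clip(cosang, -1.0, 1.0))

    angles = np.stack([angle(b - a, c - a),
                       angle(a - b, c - b),
                       angle(a - c, b - c)], axis=1)
    max_angle = angles.max(axis=1)                # max internal angle (radians)
    return np.mean(max_angle * np.abs(areas - mean_area) / mean_area)
```

A perfectly regular triangulation yields a value near zero, while clustered matches with very unequal triangles yield larger values.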

Analysis and Discussion
First, we discuss the test results of the compared methods as a whole. The results in Table 3 and Figures 11 and 12 show that no single algorithm always obtains the best performance on stereo images across different platform types, different viewpoint changes, and various texture structures. As a typical representative of handcrafted algorithms, ASIFT can achieve affine-invariant stereo matching through a multi-level sampling strategy in 3D space; however, compared with the deep-learning algorithms, the test results of ASIFT show that its advantage lies in the number of matches on close-range images with planar scenes or on satellite images. In contrast, the deep-learning algorithms DetNet + ContextDesc, AffNet + SOSNet, and SuperPoint perform better on close-range stereo images with large rotations and scale changes, low-altitude stereo images with approximately planar scenes, and high-altitude stereo images with complex 3D scenes, respectively. This is because handcrafted algorithms tend to adopt global spatial geometric rectification or a single segmentation model, which is more suitable for simple stereo images with planar scenes, whereas deep-learning algorithms build deep convolutional or fully connected neural network models from the perspective of emulating human visual cognition and iteratively learn the optimal network parameters from a large number of training samples. In theory, such models can approximate any complex geometric or radiometric transform, and hence this type of algorithm is more suitable for matching wide-baseline images with complex scenes. For test data (a), (b), (h), and (j), ASIFT yields better matching distribution quality; DetNet + ContextDesc and AffNet + SOSNet perform well on data (c) and (d), respectively, with respect to matching distribution, whereas SuperPoint performs well on data (g) and (m). All compared algorithms consistently achieve poor matching distribution quality for data (c), (j), (k), and (l). This is mainly because the traditional problems of digital photogrammetry, such as large-scale deformation of images, lack of texture, terrain occlusion, and surface discontinuity, are still difficult for the available algorithms to handle. On this topic, we suggest that handcrafted algorithms may expand the search range of geometric transform parameters to enhance their adaptability to large-scale deformation data, whereas deep-learning algorithms may improve their matching compatibility with complex terrain by increasing the number of training samples from such areas.
Second, we discuss the CNN architectures in combination with the training datasets used. Deep-learning wide-baseline image matching is mainly limited by the structure of the neural network model and the size of the training dataset. Table 3 and Figure 11 show that the SuperPoint algorithm can obtain the most matches from complex 3D scenes (data (g)-(j) and (m)), whether UAV oblique stereo images (data (d)-(i), (j), and (k)) or satellite wide-baseline images (data (l) and (m)), but it almost fails on simple ground scenes (data (d)-(f)). Although the MSCOCO training dataset used by SuperPoint contains large-scale independent structural objects, it lacks ground-scene annotation instances with a single texture, and hence this training dataset limits the matching performance of SuperPoint on ground scenes. The AffNet + SOSNet algorithm can obtain a sufficient number of matches from wide-baseline images with ground scenes and poor texture (data (d)-(f)), and the spatial distribution of its matches is relatively uniform, as presented in Figures 11 and 12. The reason is that the UBC Phototour and HPatches datasets cover a large number of homogeneous structures, such as ground, wall, and sculpture structures, which enhances the algorithm's perception of scenes with a single texture. A comparison of the matching results of algorithms ③ and ④ shows that, even with the same training dataset, the feature description performance of SOSNet is substantially better than that of HardNet. Reviewing the structures of the two networks shows that, on the basis of HardNet, SOSNet embeds a second-order similarity regularization term in the loss function to avoid over-fitting during model training and to further improve the similarity and discriminability of the descriptors. The ContextDesc algorithm integrates visual and geometric context encoding structures into the network model to improve the use of image context information. The test results show that it is particularly suitable for image matching in scenes with cluttered backgrounds (data (c)) or large radiometric distortion (data (e)).
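As a rough sketch of this design choice (not the actual HardNet/SOSNet code; the margin, the in-batch mining scheme, and the function name are illustrative assumptions), a hard-negative triplet loss with a second-order similarity regularizer might look like:

```python
import numpy as np

def hardnet_sos_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with in-batch hardest-negative mining (HardNet-style),
    plus a second-order similarity regularizer (SOSNet-style sketch).
    anchors, positives: (B, D) descriptor batches where anchors[i]
    corresponds to positives[i]."""
    # Pairwise distance matrix between anchors and positives.
    d = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    B = d.shape[0]
    pos = np.diag(d)                                  # distances of matching pairs
    # Mask the diagonal, then mine the hardest (closest) negative per row/column.
    off = d + np.eye(B) * 1e6
    hardest_neg = np.minimum(off.min(axis=1), off.min(axis=0))
    triplet = np.maximum(0.0, margin + pos - hardest_neg).mean()
    # Second-order similarity: matching descriptors should keep similar
    # distances to all other descriptors in the batch.
    da = np.linalg.norm(anchors[:, None, :] - anchors[None, :, :], axis=2)
    dp = np.linalg.norm(positives[:, None, :] - positives[None, :, :], axis=2)
    sos = np.sqrt(((da - dp) ** 2).sum(axis=1)).mean()
    return triplet + sos
```

The second term penalizes batches in which the anchor-side and positive-side distance structures disagree, which is the regularization effect the text attributes to SOSNet.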
Third, we further discuss the strengths and weaknesses of the integrated methods on the difficult test data. Although algorithms ③, ④, and ⑦ all adopt AffNet to extract affine-invariant features, the test results of algorithms ③ and ④ are substantially better. We speculate that the reason is that the deep-learning descriptors of algorithms ③ and ④ perform better than the handcrafted SIFT descriptor of algorithm ⑦. Figure 10 shows that neither the deep-learning nor the handcrafted algorithms are able to achieve sub-pixel matching accuracy. The main reason is that the two stages of feature detection and feature matching are relatively independent, which makes it difficult for the corresponding points to be accurately aligned. The complete UAV dataset, which is larger in size and resolution (data (j) and (k)), was also used for testing. In theory, this should help each algorithm obtain more matches; however, Table 3 shows that the number of matches did not increase substantially. We believe that a high resolution exacerbates the lack of local texture in the image, and larger images tend to introduce more occluded regions. Specifically, data (j) contain more occluded scenes and homogeneous textures, whereas data (k) involve a large area of water and scenes with viewpoint changes. Additionally, the ratio of the corresponding regions in the larger images is lower. Thus, it becomes more difficult to obtain corresponding features in the absence of prior knowledge or initial matches. For satellite wide-baseline images covering various mountainous and urban areas, both the deep-learning approach SuperPoint and the handcrafted ASIFT method can obtain a significant number of matches.


Experimental Results
For the 13 sets of wide-baseline stereo images and 10 representative algorithms, the results are as follows. Table 3 presents the number of image matches obtained by each algorithm, where the bold number is the maximum number of matches in each group of test data. Figure 10 shows the matching errors of each algorithm, where the matching errors are estimated by the following equations [95]:

\varepsilon_H = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left\| \mathbf{x}'_j - H\mathbf{x}_j \right\|^2}, \qquad \varepsilon_F = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\frac{\left( \mathbf{x}'^{\top}_j F\,\mathbf{x}_j \right)^2}{(F\mathbf{x}_j)_1^2 + (F\mathbf{x}_j)_2^2}} \tag{5}

where N is the number of matches, \mathbf{x}_j and \mathbf{x}'_j are an arbitrary pair of matching point coordinates (in homogeneous form), and H and F are the known true perspective transformation matrix and true fundamental matrix, respectively. The matching errors of test data (a)-(f), which consist of planar or approximately planar scenes, are evaluated by εH (pixels), and the matching errors of test data (g)-(m), which consist of non-planar scenes, are evaluated by εF (pixels).
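Equation (5) can be evaluated directly with NumPy. The sketch below assumes (N, 2) arrays of pixel coordinates and interprets εF as the root-mean-square point-to-epipolar-line distance; it is an illustrative implementation, not the evaluation code used in the experiments.

```python
import numpy as np

def matching_errors(x1, x2, H=None, F=None):
    """RMS matching errors in the spirit of Equation (5).
    x1, x2: (N, 2) matched point coordinates (left, right image).
    eps_H: RMS distance between H*x1 and x2 (planar scenes).
    eps_F: RMS point-to-epipolar-line distance (non-planar scenes)."""
    n = len(x1)
    xh1 = np.hstack([x1, np.ones((n, 1))])   # homogeneous coordinates
    xh2 = np.hstack([x2, np.ones((n, 1))])
    out = {}
    if H is not None:
        p = (H @ xh1.T).T
        p = p[:, :2] / p[:, 2:3]             # project and dehomogenize
        out["eps_H"] = np.sqrt(np.mean(np.sum((p - x2) ** 2, axis=1)))
    if F is not None:
        l = (F @ xh1.T).T                    # epipolar lines in right image
        num = np.sum(xh2 * l, axis=1) ** 2   # (x2^T F x1)^2
        den = l[:, 0] ** 2 + l[:, 1] ** 2    # (F x1)_1^2 + (F x1)_2^2
        out["eps_F"] = np.sqrt(np.mean(num / den))
    return out
```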
Figure 11 shows the image-matching results of each algorithm. Because of limited space, this figure only exhibits the matching results of algorithms ①, ③, ④, ⑤, and ⑩ on test data (a), (f), (g), (i), (j), and (m), where the matching points are indicated by red dots and joined by yellow lines, and the most matches in each row of the figure are marked by a green frame. For each algorithm, Figure 12 shows the matching distribution quality Dis, which is estimated by the following equation [96]:

\mathrm{Dis} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left[ \left( \frac{A_i}{\bar{A}} - 1 \right)^2 + \left( \frac{\max(J_i)}{\pi/3} - 1 \right)^2 \right]}

where n represents the total number of Delaunay triangles generated by the matching points, A_i denotes the area of the i-th triangle, max(J_i) represents the radian value of its maximum internal angle, and \bar{A} represents the average area of the triangles. The value of Dis reveals the consistency and uniformity of the spatial distribution of the triangle network; a smaller Dis value indicates that the matches have a higher spatial distribution quality.
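The Dis metric can be computed from a Delaunay triangulation of the matched points. The sketch below uses `scipy.spatial.Delaunay` and combines each triangle's relative area deviation with the skew of its largest internal angle (π/3 for an equilateral triangle); the exact weighting in [96] may differ, so treat this as an illustration rather than the reference implementation.

```python
import numpy as np
from scipy.spatial import Delaunay

def distribution_quality(points):
    """Spatial distribution quality Dis of matched points (sketch).
    Smaller Dis = more uniform spatial distribution of matches."""
    tri = Delaunay(points)
    verts = points[tri.simplices]                     # (n, 3, 2)
    a = verts[:, 1] - verts[:, 0]
    b = verts[:, 2] - verts[:, 0]
    # Triangle areas via the 2D cross product.
    areas = 0.5 * np.abs(a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0])

    def angle(p, q, r):
        # Internal angle at vertex p of triangles (p, q, r).
        u, v = q - p, r - p
        c = np.sum(u * v, axis=1) / (
            np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
        return np.arccos(np.clip(c, -1.0, 1.0))

    angs = np.stack([angle(verts[:, i], verts[:, (i + 1) % 3],
                           verts[:, (i + 2) % 3]) for i in range(3)])
    max_j = angs.max(axis=0)                          # max internal angle
    n = len(areas)
    dev = (areas / areas.mean() - 1.0) ** 2 + (max_j / (np.pi / 3) - 1.0) ** 2
    return np.sqrt(dev.sum() / (n - 1))
```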


Summary and Outlook
For wide-baseline image-matching problems, this paper systematically organized, analyzed, and summarized the existing deep-learning models for image invariant feature detection, description, and end-to-end matching. In addition, the matching performance of the representative algorithms was evaluated and compared through comprehensive experiments on wide-baseline images. According to the above test results and discussion, future research directions and challenges can be summarized as follows.
(1) The current deep-learning invariant feature detection approach continues to reveal its potential, and research on invariant features and their applications has developed rapidly, from the scale-invariant feature learning of Reference [52] to the affine-invariant feature training of Reference [53]. Experiments have shown that learning-based methods have greater potential than handcrafted detection algorithms such as DoG [14] and pixel watershed [17]. In addition, the strategy of combining handcrafted methods with learning-based methods [55] to extract invariant features has become a good option, but this type of method clearly depends on the accurate extraction of image features by the handcrafted algorithms. In short, although feature detection methods based on deep learning tend to show abilities beyond those of traditional methods, the approach is not yet fully mature, especially for the matching of wide-baseline images with complex scenes and varied textures, and it still faces great challenges. Therefore, extracting invariant features with high repeatability and stability needs further study.
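Detector repeatability, as mentioned above, is typically measured by warping the detections of one image with the ground-truth transform and counting how many land near a detection in the other image. The following is a minimal sketch of that idea, not the evaluation protocol of any specific reference; the 3-pixel tolerance is an assumed value.

```python
import numpy as np

def repeatability(kp1, kp2, H, eps=3.0):
    """Fraction of keypoints from image 1 that, after warping by the
    ground-truth homography H, lie within eps pixels of some keypoint
    detected in image 2. kp1, kp2: (N, 2) keypoint arrays."""
    n = len(kp1)
    xh = np.hstack([kp1, np.ones((n, 1))])
    w = (H @ xh.T).T
    w = w[:, :2] / w[:, 2:3]                 # warped keypoints
    d = np.linalg.norm(w[:, None, :] - kp2[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1) <= eps))
```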
(2) Deep-learning feature description is essentially metric learning; this kind of method mainly focuses on network model construction and loss function design. From the MatchNet Siamese network [20] to the SOSNet triplet structure [72], the model parameters have been gradually simplified, and performance has correspondingly improved. However, most network backbones still inherit the structure of the classic L2-Net [67]. In particular, for affine-invariant feature description networks, we suggest introducing a viewpoint transform module, which could enhance the transparency, perception, and generalization capabilities of existing models for wide-baseline images. Moreover, loss function design mainly concerns the selection of reasonable training samples. Although the existing functions address traditional problems such as the selection of positive and negative samples, they do not consider the inherent characteristics of wide-baseline images. Therefore, to improve the performance of the descriptors, future work could involve the construction of novel wide-baseline network structures or the design of universal loss functions.
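The metric-learning objective described above can be illustrated with a triplet margin loss using in-batch hardest-negative mining, in the spirit of the HardNet family. This NumPy sketch is illustrative only (no gradients, not the authors' implementation); `anchors[i]` and `positives[i]` are assumed to describe the same physical point.

```python
import numpy as np

def triplet_margin_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with in-batch hardest-negative mining
    (a numpy sketch of the HardNet-style objective)."""
    # Pairwise distance matrix between anchor and positive descriptors.
    d = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    pos = np.diag(d)                          # matching-pair distances
    # Mask out the positive pairs, then take the closest negative
    # seen from either side of the batch.
    d_masked = d + np.eye(len(d)) * 1e6
    neg = np.minimum(d_masked.min(axis=1), d_masked.min(axis=0))
    return float(np.mean(np.maximum(0.0, margin + pos - neg)))
```

Minimizing this loss pulls matching descriptors together while pushing the hardest non-matching descriptor in the batch at least `margin` further away, which is the sample-selection behavior discussed above.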
(3) Recently, end-to-end learning-based methods, such as the back-propagation-trained LIFT [77] and the joint description-and-detection D2Net [86], have received increasing attention. This type of method has led to numerous innovations in terms of training mode and the automatic acquisition of training data. Research shows that end-to-end methods have a faster computation speed than other learning-based methods and can meet the performance requirements of simultaneous localization and mapping (SLAM) [68], structure from motion (SfM) [80], and other real-time vision tasks. However, for wide-baseline image-matching tasks, it is difficult for this type of method to extract sufficient feature points. Therefore, in the field of wide-baseline image matching, we should further explore the end-to-end learning of unconventional and complex distortions as well as of image features with various textures and structures.
(4) Image matching based on deep learning is data-driven: it must automatically learn the deep expression mechanism of ground surface features from a large amount of image data. Therefore, the key to deep-learning image matching is building a diverse and massive training dataset. At present, the main training datasets for deep-learning wide-baseline image matching are UBC Phototour [68] and HPatches [69]. The UBC Phototour dataset contains a large number of artificial statues, whereas the HPatches dataset mainly consists of simple facade configurations. These available training datasets differ greatly from the data captured by aerial photography or satellite remote sensing, which prevents affine-invariant network models such as AffNet [57], HardNet [70], and SOSNet [72] from achieving optimal matches in wide-baseline remote sensing images. Therefore, it is an urgent task to establish a large wide-baseline dataset of multi-source, multi-platform, and multi-spectrum data through crowdsourcing or sharing mechanisms.
(5) Existing studies have shown that the comprehensive performance of a model can be substantially improved through transfer learning, which has been widely applied in the fields of target recognition, image classification, and change detection. However, there are few published reports on transfer learning in the field of deep-learning wide-baseline image matching, specifically for feature detection, feature description, and end-to-end methods. Therefore, once a wide-baseline image dataset has been established, further work should focus on training a network for wide-baseline matching using a transfer learning strategy to achieve high-quality matching. In addition, for the original observations of matching points, positioning errors must be fully considered in the field of digital photogrammetry. However, the corresponding points across wide-baseline images cannot be registered precisely by learning-based methods. Hence, the matching accuracy could be improved by optimization strategies such as least-squares image matching or the Newton iteration method, which remains future work.
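The least-squares refinement suggested above can be sketched in one dimension: Gauss-Newton iterations on the brightness-constancy residual recover a subpixel shift between two signals. This is a toy illustration under strong assumptions (pure translation, smooth signals); practical least-squares image matching additionally estimates 2D geometric and radiometric parameters.

```python
import numpy as np

def refine_shift(f, g, s0=0.0, iters=20):
    """Least-squares matching sketch (1D): refine the subpixel shift s
    so that f(x + s) best matches g(x), via Gauss-Newton iterations on
    the intensity residual."""
    x = np.arange(len(f), dtype=float)
    s = s0
    for _ in range(iters):
        fs = np.interp(x + s, x, f)          # warped reference signal
        grad = np.gradient(fs)               # d f(x + s) / d s
        r = g - fs                           # intensity residual
        denom = np.sum(grad * grad)
        if denom < 1e-12:
            break
        # Normal equation for the single parameter s: step = (J^T r)/(J^T J)
        s += np.sum(grad * r) / denom
    return s
```

Initialized with an integer-accuracy match, a few such iterations typically converge to subpixel precision, which is exactly the post-processing step proposed above for learning-based matches.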

Conclusions
In this paper, based on a review of the image-matching stages, we organized and summarized the development status and trends of existing learning-based methods. Moreover, the matching performance, advantages, and disadvantages of typical algorithms were evaluated through comprehensive experiments on representative wide-baseline images.
The results reveal that no single algorithm can adapt to all types of wide-baseline images with various viewpoint changes and texture structures. Therefore, the currently urgent task is to enhance the generalization ability of the models by combining mixed models with more extensive training datasets. Moreover, a critical task is to construct deep network models with elastic receptive fields and self-adaptive loss functions based on wide-baseline imaging properties and the typical problems of image matching. We hope that this review will serve as a reference for future research.

Figure 1. Focus of this review: topics of deep-learning methods for wide-baseline stereo image matching.

Figure 2. Development of feature detection with deep learning.



Figure 3. Development of deep-learning feature description.


Figure 4. Evolution of representative feature description networks.

Figure 6. Development of end-to-end matching with learning-based methods.

Figure 7. Training and testing of the LIFT pipeline.


Figure 8. Difference between D2Net and the traditional model.



Figure 9. (a-m) Wide-baseline test data, carefully selected from ground close-range, UAV, and satellite platforms. They cover various terrains and exhibit significant viewpoint changes.

Test data (i)-(m):
Data (i): UAV stereo images with significant oblique view change, radiometric distortion, and complex 3D scenes; F estimated by manual work.
Data (j): 5472 × 3468 / 5472 × 3468; UAV stereo images with significant view change, surface discontinuity, object occlusion, and rare texture; F estimated by manual work.
Data (k): 4200 × 3154 / 4200 × 3154; UAV stereo images with about 90° rotation, significant oblique view change, single texture, and a large area of water; F estimated by manual work.
Data (l): 2316 × 2043 / 2316 × 2043; satellite optical stereo images with notable rotation, significant topography variation, and rare texture; F estimated by manual work.
Data (m): 2872 × 2180 / 2872 × 2180; satellite optical stereo images with significant surface discontinuity, radiometric distortion, dense 3D buildings, and single texture; F estimated by manual work.


Figure 10. Comparison of the matching error results of the ten algorithms.

Figure 11. Image-matching results of the representative algorithms on the test data.

x
and j  x are an arbitrary pair of matching point tes, and H and F are the known true perspective transformation matrix and true ntal matrix, respectively.The matching errors of test data (a)-(f),which consist of r approximately planar scenes, are evaluated by εH (pixel), and the matching erst data (g)-(m), which consist of non-planar scenes, are evaluated by εF (pixel). 1 shows the image-matching results of each algorithm.Because of the limited is figure only exhibits the matching results of algorithms ①, ③, ④, ⑤, and ⑩ test data (a), (f), (g), (i), (j), and (m), where the matching points are indicated by and joined by yellow lines, and the most matches in each row of the figure are by a green frame.

of 22 4 UAVF is estimated by manual work 54 UAV
stereo images with significant oblique view change, radiometric distortion, and complex 3D scenes F is estimated by manual work 68 UAV stereo images with significant view change, surface discontinuity, object occlusion, and rare texture stereo images with about 90 deg rotation, significant oblique view change, single texture, and large area of water F is estimated by manual work 43 Satellite optical stereo image with notable rotation, significant topography variation, and rare texture F is estimated by manual work 80 Satellite optical stereo images with significant surface discontinuity, radiometric distortion, dense 3D buildings, and single texture F is estimated by manual work imental Results

x
and j  x are an arbitrary pair of matching point tes, and H and F are the known true perspective transformation matrix and true ntal matrix, respectively.The matching errors of test data (a)-(f),which consist of approximately planar scenes, are evaluated by εH (pixel), and the matching erst data (g)-(m), which consist of non-planar scenes, are evaluated by εF (pixel).

of 22 4 UAVF is estimated by manual work 54 UAV
stereo images with significant oblique view change, radiometric distortion, and complex 3D scenes F is estimated by manual work 68 UAV stereo images with significant view change, surface discontinuity, object occlusion, and rare texture stereo images with about 90 deg rotation, significant oblique view change, single texture, and large area of water F is estimated by manual work 43 Satellite optical stereo image with notable rotation, significant topography variation, and rare texture F is estimated by manual work 80 Satellite optical stereo images with significant surface discontinuity, radiometric distortion, dense 3D buildings, and single texture F is estimated by manual work imental Results

x
and j  x are an arbitrary pair of matching point es, and H and F are the known true perspective transformation matrix and true ntal matrix, respectively.The matching errors of test data (a)-(f),which consist of approximately planar scenes, are evaluated by εH (pixel), and the matching erst data (g)-(m), which consist of non-planar scenes, are evaluated by εF (pixel).

of 22 4 UAV 8 UAVF is estimated by manual work 4 UAV
stereo images with significant oblique view change, radiometric distortion, and complex 3D scenes F is estimated by manual work stereo images with significant view change, surface discontinuity, object occlusion, and rare texture stereo images with about 90 deg rotation, significant oblique view change, single texture, and large area of water F is estimated by manual work 3 Satellite optical stereo image with notable rotation, significant topography variation, and rare texture F is estimated by manual work 0 Satellite optical stereo images with significant surface discontinuity, radiometric distortion, dense 3D buildings, and single texture F is estimated by manual work imental Results

x
and j  x are an arbitrary pair of matching point es, and H and F are the known true perspective transformation matrix and true tal matrix, respectively.The matching errors of test data (a)-(f),which consist of approximately planar scenes, are evaluated by εH (pixel), and the matching ert data (g)-(m), which consist of non-planar scenes, are evaluated by εF (pixel).

12 of 22 UAV
stereo images with significant oblique view change, radiometric distortion, and complex 3D scenes F is estimated by manual work UAV stereo images with significant view change, surface discontinuity, object occlusion, and rare texture F is estimated by manual work UAV stereo images with about 90 deg rotation, significant oblique view change, single texture, and large area of water F is estimated by manual work Satellite optical stereo image with notable rotation, significant topography variation, and rare texture F is estimated by manual work Satellite optical stereo images with significant surface discontinuity, radiometric distortion, dense 3D buildings, and single texture F is estimated by manual work ental Results

The corresponding figure shows the image-matching results of each algorithm. Because of limited space, it only exhibits the matching results of algorithms ①, ③, ④, ⑤, and ⑩ on test data (a), (f), (g), (i), (j), and (m), where the matching points are marked and joined by yellow lines, and the most matches in each row of the figure are highlighted with a green frame.

Figure 12. Comparison of the matching distribution quality of the ten algorithms.


Finally, all selected methods are effective and have performed well in previous reports, and their source codes are publicly available.

Table 1. Representative algorithms and their references.

In the distribution-quality metric Dis, the triangles are the Delaunay triangles generated by the matching points, max(J_i) represents the radian value of the maximum internal angle of the i-th triangle, and A represents the average area of the triangles. The value of Dis can reveal the consistency and uniformity of the spatial distribution of the triangle network, and a smaller Dis value indicates that the matches have a higher spatial distribution quality.
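The ingredients of the Dis metric named above (the Delaunay triangulation of the matches, each triangle's maximum internal angle in radians, and the triangle areas) can be computed as follows. The exact formula combining these terms into Dis is not reproduced in this excerpt, so only the ingredients are sketched:

```python
import numpy as np
from scipy.spatial import Delaunay

def triangle_stats(pts):
    """Delaunay-based distribution statistics for matched points.

    pts: (N, 2) array of match coordinates in one image.
    Returns (n_triangles, max_angles, areas), i.e. the quantities that
    enter the Dis spatial-distribution metric.
    """
    tri = Delaunay(pts)
    sims = pts[tri.simplices]                     # (n, 3, 2) triangle vertices
    a = np.linalg.norm(sims[:, 1] - sims[:, 2], axis=1)   # side lengths
    b = np.linalg.norm(sims[:, 0] - sims[:, 2], axis=1)
    c = np.linalg.norm(sims[:, 0] - sims[:, 1], axis=1)
    # internal angles via the law of cosines (clipped for numerical safety)
    A = np.arccos(np.clip((b**2 + c**2 - a**2) / (2 * b * c), -1, 1))
    B = np.arccos(np.clip((a**2 + c**2 - b**2) / (2 * a * c), -1, 1))
    C = np.pi - A - B
    max_angles = np.max(np.stack([A, B, C], axis=1), axis=1)   # max(J_i), radians
    s = (a + b + c) / 2
    areas = np.sqrt(np.maximum(s * (s - a) * (s - b) * (s - c), 0))  # Heron's formula
    return len(sims), max_angles, areas
```

Intuitively, well-distributed matches yield a triangulation with few sliver triangles (max internal angles far from π) and areas close to their mean.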


https://github.com/doomie/HessianFree (handcrafted matching)
Remote Sens. 2021, 13, x FOR PEER REVIEW 10 of 22

Table 1 (fragment). Representative algorithms and their references:
⑧ Hessian [16] + HardNet [70] (handcrafted feature detection and deep-learning feature description)
⑨ MSER [17] + SIFT [14] (handcrafted matching)
⑩ ASIFT [18] (handcrafted matching)
The datasets used to train each deep-learning method …


Table 2. Description of the wide-baseline test data.


Table 3. Comparison of the ten algorithms in terms of the number of matches. Bold font denotes the best results.