Efficient Discrimination and Localization of Multimodal Remote Sensing Images Using CNN-Based Prediction of Localization Uncertainty

Abstract: Detecting similarities between image patches and measuring their mutual displacement are important parts in the registration of multimodal remote sensing (RS) images. Deep learning approaches advance the discriminative power of learned similarity measures (SM). However, their ability to find the best spatial alignment of the compared patches is often ignored. We propose to unify the patch discrimination and localization problems by assuming that the more accurately two patches can be aligned, the more similar they are. The uncertainty or confidence in the localization of a patch pair serves as a similarity measure of these patches. We train a two-channel patch matching convolutional neural network (CNN), called DLSM, to solve a regression problem with uncertainty. This CNN inputs two multimodal patches, and outputs a prediction of the translation vector between the input patches as well as the uncertainty of this prediction in the form of an error covariance matrix of the translation vector. The proposed patch matching CNN predicts a normal two-dimensional distribution of the translation vector rather than a simple value of it. The determinant of the covariance matrix is used as a measure of uncertainty in the matching of patches and also as a measure of similarity between patches. For training, we used the Siamese architecture with three towers. During training, the input of two towers is the same pair of multimodal patches but shifted by a random translation; the last tower is fed by a pair of dissimilar patches. Experiments performed on a large base of real RS images show that the proposed DLSM has both a higher discriminative power and a more precise localization compared to existing hand-crafted SMs and SMs trained with conventional losses. Unlike existing SMs, DLSM correctly predicts the translation error distribution ellipse for different modalities, noise levels, and isotropic and anisotropic structures.


Introduction
A popular strategy for solving the image registration problem involves two steps: finding a set of putative correspondences (PC) between patches of the registered images and estimating geometrical transform parameters between these images on the basis of the found PCs [1,2]. The basic element of this strategy is detecting similarity between two image patches and finding the exact spatial alignment between them. Therefore, two features of an SM are of equal importance in RS: the ability to distinguish similar and dissimilar pairs of patches and the ability to accurately estimate the spatial alignment of the compared patches. In RS, subpixel localization accuracy is a prerequisite for reaching high registration accuracy.

The mixed loss transforms the patch matching problem from initially a pure binary classification problem to a classification-regression problem.
Experiments on a large set of multimodal images show that DLSM simultaneously improves the quality of discrimination and accuracy of patch localization. An important feature, which is lacking in existing SMs, is that the localization accuracy is predicted for each pair of patches, including isotropic and anisotropic textures, patches with low and high SNR, and patches with different modalities in the form of an error covariance matrix. This value can be used advantageously to set a proper PC weighting during multimodal image registration [1,2,26].
This paper is organized as follows. Section 2 reviews existing learned SMs. Section 3 introduces requirements and performance criteria for assessing SMs in remote sensing. Then, the CNN structure and loss function for learning an SM with high discriminative power and high localization accuracy are described in detail. Section 4 compares the proposed DLSM SM with existing hand-crafted SMs and learned SMs with conventional loss functions. Finally, conclusions and remarks on future work are given in Section 5.

Overview of Existing Patch Matching CNN Structure and Loss Functions
The distinction between the two general types of CNN for patch matching, i.e., CNNs with and without metric learning [27], is similar to the distinction between area-based and feature-based SMs. Similarly to area-based methods, CNNs with metric learning (Figure 1a) compare a pair of patches jointly, whereas CNNs without metric learning, or descriptor CNNs (Figure 1b), calculate a feature vector for each patch in the pair separately, as in the case of feature-based SMs.
CNNs with metric learning are typically implemented as a two-stream CNN, whereas descriptor CNNs are trained using a Siamese architecture [18]. Mixed architectures are a compromise between the two-stream and Siamese architectures, where feature extraction is done by a Siamese CNN first, and evaluation of the metric by a two-stream CNN (Figure 1c) [4,28,29].
Figure 1. (a) Two-stream structure; (b) Siamese structure; (c) Mixed structure.

CNNs with metric learning are considered to have better accuracy [27] at the expense of a higher computational complexity. In descriptor CNNs, the most computationally complex part, the calculation of the feature vector thereafter used to compare patches, is applied to each individual patch only once, whereas a metric learning CNN must be applied jointly to every patch pair being compared.
Simo-Serra et al. utilized a Siamese network architecture for training a descriptor CNN that extracts 128-D descriptors whose Euclidean distances reflect patch similarity [30]. As a step forward, Han et al. proposed a CNN denoted MatchNet representing a feature vector extraction CNN followed by a feature vector matching CNN or metric network [28]. The metric network, comprising three fully connected (FC) layers, aims to replace simpler metrics for the comparison of feature vectors, such as SSD or NCC. MatchNet mixes the Siamese and two-stream architectures for the feature extraction and metric learning stages. The TS-Net CNN proposed in [4] combines Siamese and Pseudo-Siamese towers for feature extraction with metric learning layers on top of each tower.
A typical choice for training a descriptor CNN is a combination of the Siamese architecture and the contrastive loss [4,19,31,32], also called hinge embedding loss [17,20,30,33,34]. The contrastive loss assumes that each sample (patch pair) provides either a positive or negative correspondence example. During training, the distance between the descriptors should decrease for positive examples and increase for negative ones. The main drawback of this loss is that it contains a thresholding operation, resulting in the unstable gradient problem inherent to the hinge loss. In addition, during training, the majority of negative examples do not contribute to the CNN weight updates because the distance between them exceeds the threshold. Therefore, the contrastive loss should be used with hard negative mining [30].
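The behavior of the contrastive (hinge embedding) loss described above can be sketched as follows; the function name and margin value are our own illustration, not taken from the cited works:

```python
def contrastive_loss(d, label, margin=1.0):
    """Contrastive (hinge embedding) loss for one descriptor pair (sketch).

    d      -- Euclidean distance between the two descriptors
    label  -- 1 for a positive (matching) pair, 0 for a negative pair
    margin -- distance threshold beyond which negatives stop contributing
    """
    if label == 1:
        return d ** 2                      # pull positive pairs together
    return max(margin - d, 0.0) ** 2       # push negative pairs past the margin
```

Note that negatives already farther apart than the margin yield zero loss and zero gradient, which illustrates why hard negative mining is needed.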
The contrastive loss considers only one pair of positive and one pair of negative patches at a time. The correspondence contrastive loss [35] simultaneously uses many positive and negative correspondences between two images. A similar idea underlies the N-tuple loss recently proposed in [36]. The N-tuple loss takes into account the N-combination of descriptors for a single scene in a many-to-many manner.
Y. Tian et al. proposed the descriptor CNN called L2-Net [27] and used a new error term as a similarity measure between descriptors. It works at the batch level and requires positive pairs in the batch to be closer to their matching counterparts in the Euclidean space than negative pairs. This loss does not contain thresholding operations and thus avoids the unstable gradient problem inherent to the hinge loss. E. Hoffer and N. Ailon proposed to use a Siamese triplet network with three branches: one for each of the anchor, similar, and dissimilar patches [37]. Combinations of the anchor with similar and dissimilar patches form positive and negative examples, respectively. The triplet network is trained with the ratio loss function representing the mean square error (MSE) on the soft-max result, with respect to the vector (0, 1) of the class labels (0 for negative examples and 1 for positive examples). Balntas et al. [34] extended the SoftMax ratio loss using an additional negative example between similar and different patches. Extending this idea even further, Aguilera et al. proposed in [6] a quadruplet CNN for matching patches from the visible (VIS) and Near-Infrared (NIR) spectra. During training, the quadruplet CNN takes two matching VIS-NIR pairs. The loss function takes into account two positive examples (VIS-NIR) and four negative ones (two VIS-NIR, VIS-VIS, and NIR-NIR). In these extensions of the triplet ratio loss, the loss function remains the same, but the score for the positive example is replaced by the maximum value of the positive scores, and the score for the negative example is replaced by the minimum value of all negative scores. Khoury et al. utilized a Siamese triplet network and the margin ranking loss to learn features for matching 3D point clouds [38]. The same loss was utilized in [39] for the matching problem of omnidirectional images. The margin ranking loss can be viewed as an adaptation of the hinge loss to the triplet network structure [40].
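The SoftMax ratio loss of the triplet network can be sketched as follows (a minimal illustration assuming Euclidean distances as inputs; the function name is ours):

```python
import math

def triplet_ratio_loss(d_pos, d_neg):
    """SoftMax ratio loss sketch in the spirit of Hoffer and Ailon.

    d_pos -- distance between anchor and similar patch descriptors
    d_neg -- distance between anchor and dissimilar patch descriptors
    The soft-max of the two distances is compared with the target (0, 1):
    a good embedding puts (almost) all soft-max mass on the negative distance.
    """
    e_pos = math.exp(d_pos)
    e_neg = math.exp(d_neg)
    s_pos = e_pos / (e_pos + e_neg)
    s_neg = e_neg / (e_pos + e_neg)
    return s_pos ** 2 + (s_neg - 1.0) ** 2   # MSE against the label vector (0, 1)
```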
The loss functions mentioned above operate on distances between descriptors of positive and negative samples. Here, a positive sample corresponds to a pair of patches that correspond exactly to each other and a negative example is a pair of patches that have no spatial link (representing different scenes or different locations that are very distant from each other). Therefore, training a patch matching CNN is viewed as training a binary classifier with one class representing similar pairs and another one representing dissimilar pairs. Metric learning CNNs directly provide a scalar similarity/distance between two patches and omit learning the patch descriptors. In the latter case, their analogy with a binary classifier is even more obvious. The metric learning CNNs proposed in the literature use classical losses adapted to binary classifiers for training: the hinge [5,18,29,41], square [42], and binary cross-entropy [4,28,29] losses.
Other approaches can be considered to complement the previous set of losses to further boost the patch matching performance. For example, the authors of [18] proposed to make use of the central-surround input field with two different spatial scales. Kumar et al. in [43] mixed the triplet loss with a global loss. The latter minimizes the overall classification error in the training set by pushing distributions of positive and negative examples away from each other. For an overview of loss functions useful for the patch matching problem and other additional references, we refer an interested reader to [43].
Both metric learning and descriptor CNNs can be a part of a more complex "correspondence network" [35] that furthermore detects image keypoints, learns regions of interest, and learns geometrical transforms (e.g., scaling factor and rotation) [17,32,35]. In this case, the patch matching CNN can be learned separately and frozen during training of the correspondence network (for example, Altwaijry et al. in [23] used a pretrained MatchNet CNN within a hybrid architecture for predicting likely region matches of ultra-wide baseline aerial images) or learned in an end-to-end fashion [32]. Yang et al. demonstrated in [44] that the general-purpose VGG classification CNN [45] pretrained on the ImageNet dataset [46] can provide robust descriptors for multi-temporal RS image registration. In correspondence networks, the loss for measuring patch similarity is generally combined with a detector loss to form a multi-term loss [23,32].
In [38], M. Khoury et al. considered the localization accuracy of features learned for unstructured point clouds. The features are obtained with a CNN trained with the triplet margin loss. To obtain a high matching accuracy, positive examples are generated from a neighborhood of the anchor with radius τ, and negative examples from a more challenging neighborhood with radius from τ to 2τ. However, under these settings, the triplet loss does not force the CNN to discriminate positive examples within the τ radius. This discrimination is necessary for precise localization. Another drawback is that the threshold τ is application-specific and no recommendations for its optimal setting have been provided.
Another approach for an improved patch localization accuracy was proposed in [24] for registering optical images to Synthetic Aperture Radar (SAR) images. In this work, each translation vector value is considered as a separate class, and patch matching is transformed from a binary to a multiclass classification problem. The corresponding CNN is trained with the cross-entropy loss and it predicts the probability of true matching between a 201 × 201 optical image and a 221 × 221 SAR image on a 21 × 21 pixel grid. The ground truth distribution of the translation vector is a Gaussian function with σ = 1 centered around the ground truth location.
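The fixed-shape ground-truth distribution used in this multiclass formulation can be sketched as follows (a hypothetical helper, assuming a normalized Gaussian placed on an odd-sized prediction grid):

```python
import numpy as np

def gaussian_gt(grid=21, sigma=1.0, center=(0.0, 0.0)):
    """Ground-truth translation distribution for the multiclass formulation:
    a fixed-shape Gaussian with the given sigma, centered at the true offset
    (in pixels relative to the grid center) on a grid x grid lattice,
    normalized so that the class probabilities sum to one."""
    half = grid // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2)
               / (2.0 * sigma ** 2))
    return g / g.sum()
```

Because the Gaussian shape is fixed, the same target is used for every patch pair, which is exactly the limitation discussed next.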
The same approach was used in [47] for the image stereo matching problem. The shortcoming of this approach is that a Gaussian function with a fixed shape is used for CNN training irrespective of the content of the registered images. However, a fixed Gaussian shape cannot describe both isotropic and anisotropic textures, or image pairs with a different degree of similarity.

Discrimination and Localization Ability of Existing Patch Matching CNNs
Patch matching CNNs are trained to be, to some extent, invariant to spatial transformations (translation, rotation, perspective distortion, and viewpoint change) of sensed objects [48]. The binary nature of these CNNs leads to the following consequence: a pair of patches is recognized as similar even when a small spatial transformation separates the two patches, and as different when a large spatial transformation is involved. The value of the similarity between the patches, or the distance between the corresponding descriptors, seen as a function of a spatial parameter (here, for example, the translation magnitude), changes slowly within the neighborhood of the zero value. At the limit value that exceeds the robustness of a particular CNN, the similarity drops sharply (Figure 2a,b). We will refer to this kind of SM profile as "step-like". On the other hand, for the multiclass loss (Figure 2c), triplet losses, or the MIND SM (Figure 2d), the SM profile changes gradually with the translation value. This SM shape will later be referred to as "smooth". Both shapes have their own pros and cons. The first one does not support the ability to find the exact position of the spatial correspondence between the two patches. A limited localization accuracy of learned descriptors has been reported in the literature, for example, in [20]: "Surprisingly, LIFT produces the largest reprojection error and relatively short tracks for all datasets, indicating inferior keypoint localization performance as compared to the hand-crafted DoG method." However, the step-like SM profile simplifies finding the true PC position, as SM values can be calculated on a coarser translation grid, thus reducing computational complexity. The second profile favors localization accuracy, but complicates finding the true PC position (SM values should be calculated on a finer grid to detect the PC position).
The proposed patch matching CNN has the advantages of both SM profiles and is free from their disadvantages: it combines a step-like SM profile with good localization accuracy.

Training Convolutional Neural Network for Measuring Similarity between Multimodal Images with Enhanced Localization Accuracy
This section begins by introducing the criteria that can be considered to assess multimodal SMs. We will next reformulate the concept of similarity between image patches so as to make it encompass an ability to accurately align them. Such an SM is implemented as a two-channel CNN that predicts the translation vector between two patches as well as its prediction error covariance matrix. For training the CNN with the desired properties, we have selected a Siamese CNN architecture and the joint regression-classification loss function. Finally, we discuss different PC localization approaches on the basis of DLSM.

Requirements to Complexity of Geometrical Transform Between Patches in RS
In the RS field, images acquired with a sensing platform are initially georeferenced using the platform orbital parameters [49]. After initial registration, the geometric transform between a pair of RS images can be locally approximated by a pure translation [1,50]. Therefore, structural changes between different modalities are the main source of difference between the compared patches in RS. Taking this into account, we focus next on a pure translation geometric transform model. However, the proposed approach could certainly be extended to more complex transforms, e.g., rotation-scaling-translation transform.

SM Performance Criteria
An SM takes two image patches, one from a reference image (RI) and the other one from a template image (TI), as input and outputs a scalar value SM_v that measures the similarity between these patches. A binary decision is then made to decide whether these patches are similar or dissimilar by comparing the SM value with a threshold SM_th:

label = 1 if SM_v ≥ SM_th, and label = −1 otherwise,

where labels "1" and "−1" correspond to similar and dissimilar patches, respectively. Here, we assume that a higher SM value corresponds to higher similarity. The opposite case, when a lower SM value corresponds to higher similarity, can be reduced to the first one by changing the sign of the SM value. Two well-known criteria, namely, the Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC), are generally used to evaluate the quality of such a binary classifier [51].
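The binary decision and the AUC criterion can be sketched as follows (illustrative helpers; the AUC here is computed via the rank statistic rather than an explicit ROC curve):

```python
def classify(sm_value, sm_threshold):
    """Binary similarity decision: label 1 (similar) if the SM value reaches
    the threshold, -1 (dissimilar) otherwise.  Assumes higher SM value means
    higher similarity; negate the SM value for the opposite convention."""
    return 1 if sm_value >= sm_threshold else -1

def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic:
    the probability that a random similar pair scores higher than a random
    dissimilar pair, with ties counting one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(scores_pos) * len(scores_neg))
```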
In image registration, an SM is typically used to localize a PC between registered images [1,2]. A PC can be found as a local maximum of the SM value with respect to a mutual translation vector (or a more complex geometrical transform) between the compared image patches:

t_PC = (x_PC, y_PC) = argmax_{t ∈ Ω} SM_v(t),

where t_PC = (x_PC, y_PC) are the coordinates of the putative correspondence, and Ω is a search zone within which the true correspondence is expected to be found. For example, a correspondence between images can be found by calculating the NCC absolute value in a sliding-window manner and finding the local maximum of the obtained correlation map. SM values are calculated on a regular grid with unit pixel spacing by gradually shifting the reference patch (RP) with respect to the template patch (TP). Subpixel accuracy can be reached by approximating SM_v in the neighborhood of the maximum (x_PC, y_PC), typically by a quadratic function, and taking the coordinates of the maximum of the approximating second-order polynomial as the refined PC position. The localization accuracy is characterized by its bias ∆_t and its standard deviation (SD) σ_t with respect to the components of the translation vector. We will also use the robust SD σ_t.MAD calculated as 1.48 · MAD, where MAD is the Median Absolute Deviation [52]. Notice that a smaller σ_t or σ_t.MAD value corresponds to a better localization accuracy.
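The per-axis quadratic refinement and the robust SD described above can be sketched as follows (function names are ours; the three-point parabola fit is a standard formulation consistent with the description):

```python
import statistics

def subpixel_peak(s_m1, s_0, s_p1):
    """Refine a discrete SM maximum along one axis by fitting a parabola
    through the peak sample s_0 and its two neighbours s_m1 (at -1 pixel)
    and s_p1 (at +1 pixel); returns the fractional offset of the maximum
    of the fitted second-order polynomial.  Applied independently per axis."""
    denom = s_m1 - 2.0 * s_0 + s_p1
    if denom == 0.0:
        return 0.0                 # flat neighbourhood: keep the integer peak
    return 0.5 * (s_m1 - s_p1) / denom

def robust_sd(errors):
    """Robust SD estimate sigma_MAD = 1.48 * MAD, where MAD is the median
    absolute deviation of the localization errors from their median."""
    med = statistics.median(errors)
    return 1.48 * statistics.median(abs(e - med) for e in errors)
```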

Patch Matching as Deep Regression with Uncertainty
For an accurate localization, an SM is required to discriminate not only between similar and dissimilar patch pairs, but also between slightly shifted versions of similar patches. To better illustrate this idea, let us analyze the approximation approach of the SM value in more detail. For a PC with coordinates (x_PC, y_PC), the SM value takes a maximum value at (x_PC, y_PC) and decreases in its neighborhood of width ∆_max pixels. Therefore, in the local neighborhood, an SM value can be factorized into two terms as SM_v(x_PC, y_PC) · g(∆_x, ∆_y), where g(∆_x, ∆_y) ≤ 1 and g(0, 0) = 1. The first term, SM_v(x_PC, y_PC), is responsible for similarity discrimination and is insensitive to spatial misalignment within a local neighborhood of the true correspondence. In turn, the second term, g(∆_x, ∆_y), is sensitive only to spatial misalignment of the compared patches. Loss functions leading to a step-like SM profile do not consider the g(∆_x, ∆_y) term during training and focus on SM_v(x_PC, y_PC) only. Loss functions leading to a smooth SM profile estimate the SM value without factorizing it into SM_v(x_PC, y_PC) and g(∆_x, ∆_y).
Let us represent SM_v · g(∆_x, ∆_y) as a two-dimensional normal distribution with density (1/(2π√|C|)) · exp(−tᵀC⁻¹t/2). We propose to formulate similarity measuring as a regression problem with uncertainty. The patch matching CNN estimates the translation vector t = (∆_x, ∆_y) and the uncertainty of this estimate in the form of a translation vector estimation error covariance matrix

C = [σ_x², k_xy·σ_x·σ_y; k_xy·σ_x·σ_y, σ_y²],

where σ_x and σ_y are the SDs of the horizontal and vertical errors in estimating the components of the translation vector, respectively, and k_xy denotes the correlation coefficient between these components. As an SM value, we propose to use SM_v = SM_det = √|C| = σ_x·σ_y·√(1 − k_xy²). With this formulation, patch localization and patch discrimination become unified.
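A minimal numpy sketch of this similarity value (the function name is ours; we take the square root of the covariance determinant, so that with k_xy = 0 the value reduces to σ_x·σ_y and has pixel² units, consistent with the constant D_neg = 7 pixels² used for negative examples):

```python
import numpy as np

def sm_det(sigma_x, sigma_y, k_xy):
    """Similarity value from the predicted translation-error covariance
    matrix C = [[sx^2, k*sx*sy], [k*sx*sy, sy^2]]: the square root of its
    determinant.  A smaller value means lower localization uncertainty,
    i.e., higher similarity between the patches."""
    C = np.array([[sigma_x ** 2, k_xy * sigma_x * sigma_y],
                  [k_xy * sigma_x * sigma_y, sigma_y ** 2]])
    return float(np.sqrt(np.linalg.det(C)))
```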
Our CNN is trained with the following loss function:

L(θ) = [ (∆_x − ∆_x0)²/σ_x² − 2·k_xy·(∆_x − ∆_x0)(∆_y − ∆_y0)/(σ_x·σ_y) + (∆_y − ∆_y0)²/σ_y² ] / (2(1 − k_xy²)) + ln σ_x + ln σ_y + 0.5·ln(1 − k_xy²),   (5)

where θ = (∆_x, ∆_y, σ_x, σ_y, k_xy) is the parameter vector predicted by the DLSM and (∆_x0, ∆_y0) denotes the ground truth shift between RP and TP. The last three terms force the determinant |C| of the covariance matrix of the translation vector estimation error to decrease, thus reducing the uncertainty of the estimates. The first term requires the translation estimation error to agree with the predicted covariance matrix. The loss function (5) corresponds to the maximization of the likelihood function for a two-dimensional normal distribution. It can be seen as a two-dimensional version of the loss function for learning regression with uncertainty utilized in [53,54].
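The loss can be sketched as the negative log-likelihood of a two-dimensional normal distribution (our reconstruction; the constant ln 2π is dropped and the function name is ours):

```python
import math

def dlsm_loss(dx, dy, dx0, dy0, sx, sy, k):
    """Negative log-likelihood of a 2-D normal distribution (sketch of the
    DLSM loss): a Mahalanobis data term that makes the predicted covariance
    agree with the actual translation error, plus log-determinant terms
    (log sx + log sy + 0.5*log(1-k^2)) that shrink the uncertainty."""
    ex, ey = dx - dx0, dy - dy0
    q = 1.0 - k * k
    mahalanobis = ((ex / sx) ** 2
                   - 2.0 * k * (ex / sx) * (ey / sy)
                   + (ey / sy) ** 2) / (2.0 * q)
    return mahalanobis + math.log(sx) + math.log(sy) + 0.5 * math.log(q)
```

With a zero error, the loss decreases as the predicted SDs shrink; with a large error, a small predicted SD is heavily penalized by the data term, which is exactly the trade-off described above.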

Siamese ConvNet Structure and Training Process Settings
The base structure of the two-stream CNN for measuring multimodal patch similarity is shown in Figure 3a. It consists of three groups of convolutional and pooling layers followed by two fully-connected layers. We selected an input patch size of 32 by 32 pixels. It was noted in [27] that a larger patch size does not provide performance improvement. Both RP and TP are normalized to zero mean and unit variance. The number of features for the first block of convolutional layers is N_features = 48 and increases twofold for each following block (96 and 192, respectively). The image size after the first, second, and third pooling layers is 15 × 15 × 48, 6 × 6 × 96, and 2 × 2 × 192, respectively. The feature vector size after the flattening layer is therefore equal to 768. The CNN outputs two elements of the translation vector (∆_x, ∆_y) and three elements describing the covariance matrix estimate (σ_x, σ_y, and k_xy). To enforce the usual properties of covariance matrix elements, the following constraints should be satisfied: σ_x > 0, σ_y > 0, and |k_xy| < 1. Consequently, ReLU activation is applied to σ_x and σ_y, and tanh to k_xy.
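The spatial sizes quoted above can be checked with a small sketch, assuming "valid" 3 × 3 convolutions followed by 2 × 2 max pooling (floor division); these layer conventions are our assumption, chosen because they reproduce the sizes stated in the text:

```python
def block(size, kernel=3, pool=2):
    """Spatial size after one conv+pool block: a 'valid' kernel x kernel
    convolution (size - kernel + 1) followed by pool x pool max pooling
    with floor division."""
    return (size - kernel + 1) // pool

s = 32                        # input patch size, 32 x 32 pixels
channels = [48, 96, 192]      # N_features doubling per block
for c in channels:
    s = block(s)              # 32 -> 15 -> 6 -> 2
flat = s * s * channels[-1]   # flattened feature vector size
```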
The structure of the two-stream CNN was optimized with respect to two parameters: N_features and the kernel size of the first convolutional layer for extracting features. We found that N_features values above 48 do not improve patch matching accuracy but increase inference time. We thus selected N_features = 48. Changing the kernel size from (3,3) to (5,5) does not affect the CNN performance. Therefore, it was set to its minimum value (3,3).

Averaging loss (5) over the distribution of ∆_x − ∆_x0 and ∆_y − ∆_y0 yields the mean loss value (6). For a negative example, the DLSM output minimizing the mean value of loss (6) can be directly calculated: k_xy = 0, σ_x² = σ_y² = D_neg. Therefore, for negative examples, the SM value is forced towards the positive constant D_neg. On the contrary, for positive examples, the SM value is expected to decrease towards zero. For training, we set the random shift amplitude a = 1.5...2.5 pixels and D_neg = 7 pixels².
For the DLSM training, we decided, similarly to triplet losses, to consider several patch pairs simultaneously within a Siamese training architecture, as shown in Figure 3b. It comprises three branches. For all branches, the same TP is used. For the first two branches, random positive examples are generated by randomly shifting RP, with a = 1.5 for the first branch and a = 2.5 for the second branch. For the last branch, a negative example is generated. The DLSM is trained with a joint loss (7) comprising the uncertainty-with-regression loss and the triplet ratio loss, where (∆_x0i, ∆_y0i) is the ground truth translation vector for the ith branch, i = 1...3. Loss (7) combines the proposed loss (5) for each branch and the triplet ratio loss for two pairs of positive and negative examples. By combining two different losses, we seek to ensure that the CNN learns both to discriminate multimodal patches and to correctly estimate the translation between them. The main difference of DLSM lies in the extended number of outputs (translation vector and covariance matrix elements instead of just one SM value) and the use of a joint loss function. Applied to one patch pair, the DLSM has the same computational complexity as CNNs with other losses: on an NVIDIA GeForce GTX 960M GPU, the inference time is ~0.5 ms (2000 pairs per second). DLSM does not depend on a particular two-stream CNN structure: any two-stream CNN suitable for hinge, triplet, or other existing losses can be transformed into DLSM.

Patch Pair Alignment with Subpixel Accuracy
Apart from the similarity value, DLSM predicts the translation vector between the compared patches. Similar to other SMs, DLSM can be applied to a spatial neighborhood of the given patch pair by shifting TP relative to RP by an integer number of pixels (u, v). For each (u, v), DLSM predicts a slightly different position of the true correspondence (Figure 4 shows an example of the (∆_x, ∆_y) field). Having these multiple measurements, different strategies for the true correspondence localization are possible with DLSM. Let us discuss them in order of increased accuracy, as established experimentally.
The first strategy is to detect the minimum value of SM det with coordinates (u min , v min ) and estimate the true correspondence as (u min + ∆ x (u min , v min ), v min + ∆ y (u min , v min )).
By design, translation vectors predicted by the DLSM are reliable only in a small neighborhood of the true translation, about two to three pixels in radius. Taking this into account, the second strategy is to apply the first one twice: localize the SM_det global minimum, obtain the true correspondence position predicted by DLSM, and use the DLSM output at the refined position as the second and final refinement.
Both strategies were found to have similar localization accuracy, comparable to that obtained with triplet losses. Their main drawback is that information from neighboring DLSM outputs is not used. Therefore, the third strategy is to localize the true correspondence as a consensus between DLSM predictions in the 3 by 3 neighborhood. For this, the translation vector values are averaged by a box filter of width 3 pixels. The smoothed translation vector (∆_x.avg, ∆_y.avg) is close to zero for those patch pairs where the DLSM prediction for neighboring pairs (shifted by −1, 0, or 1 pixel) points to this position. The integer position of the true correspondence is localized as the minimum of SM_det · (∆_x.avg² + ∆_y.avg² + C), where C is a constant experimentally set to 0.5. Subpixel coordinates are obtained by applying the corresponding subpixel shift (∆_x.avg, ∆_y.avg).
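The third strategy can be sketched as follows (our illustration; edge padding for the 3 × 3 box filter and the function name are assumptions):

```python
import numpy as np

def consensus_localize(dx, dy, sm, C=0.5):
    """Consensus localization sketch: average the DLSM-predicted translation
    fields dx, dy with a 3x3 box filter, pick the grid node minimizing
    sm * (dx_avg^2 + dy_avg^2 + C), and return that node plus the smoothed
    subpixel shift there.  dx, dy, sm are 2-D arrays over the (u, v) grid."""
    def box3(a):
        p = np.pad(a, 1, mode="edge")          # edge padding at the borders
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0
    dxa, dya = box3(dx), box3(dy)
    score = sm * (dxa ** 2 + dya ** 2 + C)
    v, u = np.unravel_index(np.argmin(score), score.shape)
    return u + dxa[v, u], v + dya[v, u]
```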
The fourth strategy extends the previous one even further and takes into account the distribution ellipse of each DLSM prediction: the true correspondence is localized as the position (x, y) minimizing the sum of L(u − x, v − y) over the neighboring grid nodes (u, v), where L(u − x, v − y) is the log-likelihood function (5) value calculated using the DLSM parameters predicted for the patch pair shifted by (u, v) pixels. We found the latter strategy to have the best localization accuracy. All results in the experimental part of the paper are obtained with this localization strategy.

Experimental Part
In this section, both the two-stream CNN-based SMs trained with different loss functions (referred to as DSM, Deep SM) and the Siamese CNN-based DLSM are compared to five existing multimodal SMs: (1) an SM which includes two terms, the Mutual Information and a gradient term, which highlights the large gradients with orientations in both modalities (GMI, Gradient with Mutual Information) [55]; (2) SIFT-OCT [15]; (3) MIND [16]; (4) HOPC [13]; and (5) the L2-Net descriptor CNN [27]. The proposed loss function is compared to the hinge, L2, binary cross-entropy (bce), triplet ratio (tr), triplet margin (tm), and multiclass losses. For all variants, we use only a single-scale input, leaving experiments with the central-surround input [18] for future work.

Multimodal Image Dataset
For training all patch matching CNNs considered in this study, 18 pairs of multimodal images were collected covering the visible-to-infrared, optical-to-radar, optical-to-DEM, and radar-to-DEM cases. We use the term "optical" for both visual and infrared modalities. In the following, we group all these cases as the general case. Data for the optical modality come from the Sentinel 2, Landsat 8, and Hyperion platforms, for the radar modality from the SIR-C and Sentinel 1 platforms, and DEM data from the ASTER Global DEM 2 and ALOS World 30 m global DEMs. Each image pair was registered in advance using the previously developed RAE registration method [1]. One example of an optical-radar pair is shown in Figure 5.
In total, 2,700,000 32 × 32 patch pairs were collected from the above-mentioned registration cases in the following proportions: 75% for visible-to-infrared, 9% for optical-to-radar, 8% for optical-to-DEM, and 8% for radar-to-DEM. These pairs were randomly split between training (75%) and validation (25%) sets. The proposed CNNs are trained with the Adam optimizer [56], an initial learning rate of 2 · 10⁻⁴, and a decay of 10⁻⁵. Training takes 800 epochs, with each epoch comprising 5000 steps (mini-batches). The batch size is set to 32. The parameters of all existing hand-crafted SMs are chosen according to the recommendations given in their respective papers.
Figure 5. (a) Reference radar image; (b) Template optical image.

Test data are collected from another set of 16 registered multimodal pairs covering the same registration cases. From this extra set of pairs, 100,000 patch pairs (50% similar and 50% dissimilar) were uniformly collected among the considered registration cases.

Discriminative Power Analysis
For the general case, the ROC curves of the SMs selected for comparison are illustrated in Figure 6. The ROC curves for the DLSM and the DSM trained with the six different losses are close to each other. To avoid figure cluttering, only the ROCs for the DLSM and the DSM with the triplet ratio loss are shown. Numerical results for all compared SMs and each registration case are given in Table 1.

Among the considered hand-crafted SMs, MIND has the highest AUC in the general case (72.32%). It also shows the best performance in the optical-to-DEM and visible-to-infrared cases. However, in the optical-to-radar and radar-to-DEM cases, SIFT-OCT achieves the best performance among hand-crafted SMs. L2-Net performs poorly in the multimodal setting and has a low AUC in all cases. The proposed DLSM shows the best performance among the compared SMs, with an AUC higher by 0.5% than the second-best result, obtained with the triplet ratio loss. The DLSM provides a significant advantage over the considered hand-crafted SMs in the general case (gain: ~11.7%) and in each particular case (the gain is ~11.8% in the optical-to-DEM case, 5% in the visible-to-infrared case, 13% in the optical-to-radar case, and 17.2% in the radar-to-DEM case). Apart from the general case, the DLSM has the highest AUC in the optical-to-radar and radar-to-DEM cases. However, in the optical-to-DEM case, the best AUC is obtained with the triplet ratio loss, and in the optical-to-optical case, by the multiclass loss.

Patch Matching Uncertainty Analysis
Unlike the majority of prior hand-crafted and learning-based SMs, the proposed DLSM is able to predict the distribution of the estimated translation vector between the compared patches. To the best of our knowledge, the only SM with this ability is the logLR (log-likelihood ratio) published by the authors in [12]. However, the logLR can only be applied to isotropic textures, its discriminative power is lower than that of MIND, and its computational complexity is impractically high for a patch size of 32 × 32 pixels. Therefore, we decided not to include it in the comparison.
For each patch pair, the DLSM estimates the covariance matrix C of the translation estimation error. Let us define the eigenvalues of C as λ_max and λ_min, where λ_max > λ_min. The eigenvector corresponding to λ_max is defined as v_max = (cos(α_cov), sin(α_cov)), where −90° < α_cov ≤ 90°. The values λ_max and λ_min characterize the semi-axes of the estimation error distribution ellipse, the angle α_cov characterizes its orientation, and the ratio r_λ = λ_max/λ_min ≥ 1 characterizes its elongation.
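These ellipse parameters follow from a standard eigendecomposition of the predicted covariance matrix. A minimal NumPy sketch, assuming an illustrative hand-picked matrix C rather than one predicted by the DLSM:

```python
import numpy as np

def ellipse_params(C):
    """Decompose a 2x2 error covariance matrix into the parameters
    of the translation-error distribution ellipse."""
    vals, vecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    lam_min, lam_max = vals
    v_max = vecs[:, 1]                    # eigenvector of lam_max
    # Orientation alpha_cov in (-90, 90] degrees, as defined in the text
    alpha = np.degrees(np.arctan2(v_max[1], v_max[0]))
    if alpha <= -90.0:
        alpha += 180.0
    elif alpha > 90.0:
        alpha -= 180.0
    r = lam_max / lam_min                 # elongation ratio r_lambda >= 1
    return lam_max, lam_min, alpha, r

# Anisotropic example: error correlated along the 45-degree direction
C = np.array([[2.5, 2.0],
              [2.0, 2.5]])
lam_max, lam_min, alpha, r = ellipse_params(C)
```

For this C, the eigenvalues are 4.5 and 0.5, so the ellipse is oriented at α_cov = 45° with elongation r_λ = 9.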
Large values of the ratio r_λ indicate that the pair of matched patches has a dominant direction, for example, representing an anisotropic texture. Let us analyze what the patches that lead to high values of r_λ predicted by the DLSM look like. We selected patches with λ_max < 1 pixel and r_λ > 3. These patches are collected into four groups according to the value of α_cov: (1) −15° < α_cov < 15°; (2) −60° < α_cov < −30°; (3) 75° < α_cov < 90° or −90° < α_cov < −75°; (4) 30° < α_cov < 60°. For each group, Figure 7 shows the translation estimation error scatterplot and displays the two pairs of patches with the highest ratio r_λ. The estimate of the translation error distribution ellipse obtained by the DLSM is overlaid on the corresponding patch. From Figure 7, it is seen that the DLSM correctly detects patches with correlated translation error components: the error for patches grouped according to the value of α_cov has a distribution with a pronounced orientation aligned with α_cov. The patches with the highest r_λ have a strongly anisotropic, linearly oriented structure. Interestingly, the number of patches for the two cases α_cov = ±45° is significantly lower than for the cases α_cov = 0°, 90°. This effect could be related to the content of the test images; another possibility is that the DLSM estimate of the error distribution ellipse is biased.
The analysis above is essentially qualitative. To quantitatively characterize the DLSM translation estimation error ε = (Δ_x, Δ_y) − (Δ_x0, Δ_y0), let us normalize the translation vector error as ε_norm = L⁻¹ · ε, where C = L·L^T is the Cholesky decomposition of the covariance matrix. Two particular elements of ε_norm, denoted as ε_norm.major and ε_norm.minor, correspond to the errors along the semi-major and semi-minor axes of the distribution ellipse, respectively. If the estimated covariance matrix C is correct, the normalized error should possess the following properties: ε_norm.major and ε_norm.minor should follow a standard normal distribution N(0, 1) and be uncorrelated.
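This whitening step can be checked numerically. A NumPy sketch with a hypothetical covariance matrix and simulated (not measured) translation errors: if C matches the true error covariance, the normalized errors have identity covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance matrix for one patch pair (illustrative values)
C = np.array([[0.09, 0.05],
              [0.05, 0.16]])

# Simulate translation errors eps = (dx, dy) - (dx0, dy0) with covariance C
eps = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=200_000)

# Normalize with the Cholesky factor: eps_norm = L^{-1} eps, where C = L L^T
L = np.linalg.cholesky(C)
eps_norm = np.linalg.solve(L, eps.T).T

# If C is correct, each component is ~ N(0, 1) and the components are
# uncorrelated, i.e., the sample covariance is close to the identity
cov_norm = np.cov(eps_norm.T)
```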
The experimental distributions of ε_norm.major and ε_norm.minor are shown in Figure 8 in comparison with the normal distribution. The shape of both pdfs is close to normal, but the parameters differ from the standard ones: both ε_norm.major and ε_norm.minor are slightly biased and have an SD of ~1.3 instead of 1. This deviation can be, at least partially, explained by imperfect registration of the training and test data. Let us assume that the registration error of the test data is normal with SD σ_test and bias b_test. For a translation error with SD σ_true, the normalized error will have SD √(1 + (σ_test/σ_true)²) and bias b_test/σ_true. For the considered set of test images, we verified that the observed bias of the normalized translation error can be caused by b_test = 0.1...0.15 and the observed SD by σ_test = 0.15 pixels. This level of registration error is quite typical for multimodal image registration. Given these characteristics of the test data, the DLSM prediction of the covariance matrix of translation vector estimation errors for multimodal patches is very accurate.
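The stated inflation of the normalized error can be reproduced numerically. A Monte-Carlo sketch under assumed illustrative values σ_true = 0.18, σ_test = 0.15, and b_test = 0.12 pixels (not values measured in the paper); for these, √(1 + (σ_test/σ_true)²) ≈ 1.30, matching the observed SD of ~1.3:

```python
import numpy as np

rng = np.random.default_rng(3)

sigma_true, sigma_test, b_test = 0.18, 0.15, 0.12   # pixels, illustrative
n = 1_000_000

# Observed error = true localization error + test-data registration error;
# normalization divides by the SD of the true error alone
e = rng.normal(0.0, sigma_true, n) + rng.normal(b_test, sigma_test, n)
e_norm = e / sigma_true

# Predicted parameters of the normalized error
sd_expected = np.sqrt(1.0 + (sigma_test / sigma_true) ** 2)   # ~1.30
bias_expected = b_test / sigma_true
```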

Localization Accuracy Analysis
In the analysis of the localization accuracy of SMs, we pursue two goals: comparing the localization accuracy of different SMs on the same set of patches, and studying the dependence of the localization accuracy on the SM value. For the first experiment, we selected the hand-crafted SM with the highest AUC, MIND, as the reference SM and ordered all patches by decreasing similarity established by MIND. For each value of the MIND descriptor, the corresponding False Positive Rate (FPR) value is calculated. The FPR range from 10⁻⁶ to 10⁻² is split into 30 intervals using a logarithmic scale. All patches with a MIND value falling into the same FPR interval are grouped together. The translation vector error SD and robust SD are calculated for each interval and for each SM.
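The grouping procedure can be sketched as follows (NumPy, with synthetic FPR values and errors standing in for the measured data; only the bin edges and the 30-interval logarithmic split mirror the setup described above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: an FPR per patch (from the reference SM) and one
# translation error component per patch (for the SM under study)
fpr = 10.0 ** rng.uniform(-6, -2, size=20_000)   # FPR in [1e-6, 1e-2]
err = rng.normal(0.0, 0.3, size=20_000)          # error, pixels

# Split the FPR range into 30 logarithmically spaced intervals
edges = np.logspace(-6, -2, 31)
bins = np.digitize(fpr, edges) - 1               # bin index 0..29

# Per-interval SD and robust SD (1.4826 * MAD, resistant to outliers)
sd = np.array([err[bins == b].std() for b in range(30)])
robust_sd = np.array([
    1.4826 * np.median(np.abs(err[bins == b] - np.median(err[bins == b])))
    for b in range(30)
])
```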
To calculate the translation error for each patch, a random subpixel shift in the range −3...3 pixels is first applied in both directions with respect to the TP, yielding a modified TP. This shift represents the ground truth value. For each SM, its value is calculated by translating the modified TP in the interval from −5 to 5 pixels in the horizontal and vertical directions. The coordinates of the main extremum of the SM are found and then refined to subpixel values according to (4). For the proposed DLSM, the translation vector is estimated according to the fourth strategy described in Section 4.3. The translation estimation error is calculated as the difference between the estimated and ground truth translation vectors.
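Equation (4) is not reproduced in this section; as a stand-in, the sketch below refines the integer-precision extremum with the common three-point parabolic interpolation along each axis, which recovers the peak of a quadratic SM surface exactly:

```python
import numpy as np

def refine_peak(sm):
    """Refine the integer-grid maximum of a similarity surface to
    subpixel coordinates via 1-D parabola fits along each axis.
    This is the standard scheme; the paper's Equation (4) may differ."""
    iy, ix = np.unravel_index(np.argmax(sm), sm.shape)
    dy = dx = 0.0
    if 0 < iy < sm.shape[0] - 1:
        c1, c2, c3 = sm[iy - 1, ix], sm[iy, ix], sm[iy + 1, ix]
        dy = 0.5 * (c1 - c3) / (c1 - 2.0 * c2 + c3)   # parabola vertex
    if 0 < ix < sm.shape[1] - 1:
        c1, c2, c3 = sm[iy, ix - 1], sm[iy, ix], sm[iy, ix + 1]
        dx = 0.5 * (c1 - c3) / (c1 - 2.0 * c2 + c3)
    return iy + dy, ix + dx

# Sample a quadratic SM surface peaked at subpixel position (5.3, 4.8)
yy, xx = np.mgrid[0:11, 0:11]
sm = -(yy - 5.3) ** 2 - (xx - 4.8) ** 2
y, x = refine_peak(sm)
```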
The SD and robust SD of the localization error as functions of the MIND FPR are shown in Figure 9a,b, respectively, for the compared set of SMs. If the localization of a PC is implemented with integer precision, the best reachable error SD corresponds to the SD of a uniform distribution in the interval [−0.5, 0.5] pixels, equal to 0.2887. This value is marked for reference as the thick black line in Figure 9a,b; below, we refer to it as the subpixel accuracy level. For all SMs, the SD of the localization error decreases when the FPR decreases, that is, for more similar patches. This is a natural observation, as very similar patches produce a delta-function-like SM shape. SIFT-OCT has the worst localization accuracy, with an SD exceeding 1.5 pixels even for the most similar patches. MIND is characterized by the best localization accuracy among the hand-crafted SMs. However, the MIND SD never reaches the subpixel accuracy level; this is caused by outlying estimates. The SD calculated in a robust manner (Figure 9b) falls below the subpixel accuracy level for FPR less than 2 · 10⁻⁴.
Note that, among the considered learning-based SMs, four have a step-like SM profile: the DSMs with the hinge, L2, and binary cross-entropy losses, and the proposed DLSM. Three SMs have a smooth profile: the DSMs with the triplet ratio, triplet margin, and multiclass losses. The DSMs with the stepwise profile have worse localization accuracy than MIND; the DSMs with the smooth profile have better localization accuracy than MIND. The DLSM possesses the advantages of both groups: the stepwise SM profile (see Figure 4) and the best localization accuracy among the compared SMs. On average, the DLSM improves the SD by about 1.86 times compared to MIND, by 1.28 times compared to the multiclass loss, and by 1.36 times compared to the triplet loss. Subpixel accuracy is reached at FPR ≈ 5 · 10⁻⁵ for the multiclass and triplet losses and at FPR ≈ 7 · 10⁻⁵ for the DLSM. For the robust SD, the DLSM also has the highest accuracy, but the gain is smaller: about 1.63 times compared to MIND, 1.15 times compared to the multiclass loss, and 1.20 times compared to the triplet loss.
As discussed above, the DLSM is able to predict its own localization accuracy. This prediction is shown in Figure 9a,b. For the robust SD, the predicted accuracy closely follows the measured values. For the non-robust SD, the measured accuracy is lower due to the influence of outliers.
In the next experiment, we compare the absolute number of reliable PCs provided by each SM. For this, the compared SMs are applied to all corresponding test patches (50,000 in total). For each SM, the patches are sorted in decreasing order of similarity. The SD and robust SD are calculated for successive groups of 500 patches. For the first 10,000 pairs, the dependencies of the (robust) SD on the increasing number of patches are shown in Figure 10 for MIND, the triplet ratio loss, and the DLSM. In contrast to the previous experiment, both the discriminative power and the localization accuracy of an SM are important here. According to the SD measure, the DLSM detects ~2400 patches with subpixel localization accuracy, whereas the triplet ratio loss, the multiclass loss, and MIND detect none. For the first group of 500 patches, the DLSM results in an SD of 0.195 pixels, the multiclass loss of 0.35, the triplet loss of 0.4, and MIND of 0.8 pixels. According to the robust SD measure, the DLSM detects 3500 patches with subpixel localization accuracy, the triplet loss 1840, the multiclass loss 3090, and MIND 590 patches. For the first group of 500 patches, the DLSM results in an SD of 0.156 pixels, the multiclass loss of 0.165, the triplet loss of 0.2, and MIND of 0.244 pixels. In this case, the DLSM value also accurately describes the observed translation vector estimation error of the DLSM.
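The counting procedure of this experiment can be sketched as follows (NumPy, entirely on synthetic similarity values and errors in which more similar pairs localize better; the real experiment uses measured SM values and translation errors):

```python
import numpy as np

SUBPIXEL_SD = 0.2887   # SD of a uniform distribution on [-0.5, 0.5] px

rng = np.random.default_rng(2)

# Synthetic test set: one similarity value and one localization error
# per patch pair; the error spread shrinks as similarity grows
sim = rng.uniform(0.0, 1.0, size=50_000)
err = rng.normal(0.0, 0.15 + 0.5 * (1.0 - sim))

# Sort by decreasing similarity; SD over successive groups of 500 patches
order = np.argsort(-sim)
groups = err[order][:10_000].reshape(-1, 500)
sd = groups.std(axis=1)

# Number of leading patches whose group SD is at the subpixel level
n_subpixel = 500 * int(np.sum(sd < SUBPIXEL_SD))
```

With this synthetic model, all 20 leading groups stay below the subpixel level, and the group SD grows as less similar patches are included.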
In practice, multimodal registration requires few, but accurate and reliable matching points [24]. For this, the ability of DLSM to detect and localize high quality correspondences is important.

Conclusions
In this paper, we have proposed a new CNN structure for training a multimodal similarity measure that satisfies two properties: a high discriminative power and accurate localization of the compared patches.
Analysis of the existing patch matching CNNs and the loss functions commonly used for their training revealed that accurate localization is not a property they explicitly consider. As a result, subpixel localization accuracy can only be obtained with losses such as the triplet ratio, triplet margin, and multiclass cross-entropy losses, and only for a limited number of patches.
We have chosen to consider the discrimination and localization of two patches not as different problems but as two sides of the same problem. We assume that the more similar a pair of patches is, the easier the patches are to align; conversely, the localization uncertainty of a patch pair can serve as a measure of their similarity. The proposed CNN, called DLSM, solves a regression task with uncertainty taken into account and predicts the translation vector between two patches as well as the covariance matrix of the prediction error of this translation vector. The determinant of the predicted covariance matrix is a measure of localization uncertainty, and we use it as a similarity value. The proposed CNN is trained with a specific joint regression-classification loss.
The experiments performed on 16 multimodal image pairs representing the visible-to-infrared, optical-to-radar, optical-to-DEM, and radar-to-DEM cases have shown that the DLSM achieves both superior discriminative power and superior localization accuracy. The DLSM has the desired step-like SM profile, but with localization accuracy better than that of the SMs with a smooth profile. Thanks to the stepwise profile, a PC between the reference and template images can be found by calculating DLSM values on a coarse translation grid. At the same time, unlike with the hinge, L2, and binary cross-entropy losses, the PC can be accurately localized.
In addition to a high discriminative power and high localization accuracy, another aspect of the DLSM is important in practice. Unlike existing SMs, the DLSM is able to predict the covariance matrix of the translation vector prediction error. We found that the DLSM correctly predicts the covariance matrix of localization errors for different modalities, for isotropic and anisotropic patches, and for different noise levels. This property is essential for the selection and proper weighting of putative correspondences in advanced image registration methods.
Author Contributions: M.U. conceived of the paper, designed the experiments, generated the dataset, wrote the source code, performed the experiments, and wrote the paper. B.V. performed the experiments and revised the manuscript. V.L. and K.C. provided detailed advice during the writing process and revised the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.