Matching RGB and Infrared Remote Sensing Images with Densely-Connected Convolutional Neural Networks

: We develop a deep learning-based method for matching an RGB (red, green and blue) image and an infrared image captured by satellite sensors. The method comprises a convolutional neural network (CNN) that compares an RGB and infrared image pair and a template searching strategy that, for a given point in the reference image, searches for the corresponding point within a search window in the target image. A densely-connected CNN is developed to extract common features from different spectral bands. The network consists of a series of densely-connected convolutions to make full use of low-level features and an augmented cross entropy loss to avoid model overfitting. The network takes band-wise concatenated RGB and infrared images as input and outputs a similarity score for the pair. For a given reference point, the similarity scores within the search window are calculated pixel-by-pixel, and the pixel with the highest score becomes the matching candidate. Experiments on a satellite RGB and infrared image dataset demonstrated that our method improved the matching rate (the ratio of successfully matched points to all reference points) by more than 75% over conventional methods such as SURF, RIFT, and PSO-SIFT, and by more than 10% over the most recent CNN-based structures. Our experiments also demonstrated the high performance and generalization ability of our method when applied to multitemporal remote sensing images and close-range images.


Introduction
Image matching, as an important topic in computer vision and image processing, has been widely applied in image registration, fusion, geo-localization, and 3D reconstruction. With the development of sensors, multimodal remote sensing data captured over the same scene commonly needs to be matched so that complementary information can be exploited. Specifically, visible and infrared sensors are two commonly and simultaneously used sensors in different kinds of tasks, such as face detection, land cover classification, object tracking, and driverless cars. Panchromatic or RGB imaging is the closest to human vision but is seriously affected by lighting and atmospheric conditions; the near-infrared image has relatively lower contrast but is more robust against weather conditions. To make full use of both, image matching and registration are the first step. However, there are great radiometric and geometric differences between visible and infrared images, as they collect spectral reflectance from different wavelengths with different imaging mechanisms. These visual differences can prevent successful application of conventional matching methods that rely heavily on intensity and gradient.
Feature-based methods search for matches among local features that have been extracted from the reference and the search image respectively, by comparing their similarity. The patch-based methods have been comprehensively studied and applied to photogrammetry [2][3][4]. The feature-based methods are currently more widely used for their robustness against geometric transformations (e.g., rotation, scale). For example, SIFT is a widely used local feature that is invariant to 2D similarity transformations and stable under noise [5]. Many local features have been developed from the SIFT, such as the Affine-SIFT [6] and the UC (Uniform …).

The third variation, called a channel-stacked network (Figure 1c), takes the channel-wise (band-wise) concatenation of the image pair as the monocular input of the CNN [29] and requires only a single branch of convolution layers. Aguilera et al. [31] used a 2-ch network to compare the similarity between close-range images of different spectra. Suarez et al. [24] reduced the number of convolutional layers in [31] to lower the hardware cost of landscape matching of cross-spectral images. Saxena et al. [32] applied a channel-stacked network to gallery images and probe images from different sensors for face recognition. Alba et al. [33] used a channel-stacked network to match visible and depth images to obtain 3D correspondent points. Perol et al. [34] used a channel-stacked network to discriminate seismic noise from earthquake signals.
The Siamese networks have been applied to close-range image matching between visible and infrared images. The TS-Net is a combination of the Siamese network and the pseudo-Siamese network for matching optical and infrared landscape images [35]. The Hybrid-Siamese network utilized the Siamese structure along with an auxiliary loss function to improve the matching performance between optical and long wavelength infrared (LWIR) close-range images [36]. Developed from a triplet network [37] for image retrieval, the Q-net took a positive pair and a negative pair consisting of optical and infrared photos as inputs to minimize the distance between positive samples and maximize the distance between negative samples [38]. Aguilera et al. [31] showed that the channel-stacked network (Figure 1c) is superior to the Siamese networks in close-range visible and infrared image matching. The experiments in [36] also showed that the channel-stacked networks are better than other variations of Siamese networks.
However, three problems exist in recent CNN-based optical image matching and, specifically, visible and infrared image matching. The first problem is the limitation of the applied convolutional building blocks, which affects the learning ability of a CNN. A series of plain convolution layers has been widely accepted for feature extraction, both in dense image matching such as the MC-CNN [39] and in sparse matching such as the 2-ch network [31]. However, conventional methods and empirical expertise indicate that image matching depends highly on the low-level features of an image. For example, the classic cross correlation is calculated on pixel values, and the gradient is used in the SIFT. In contrast, the performance of a stack of plain convolutional layers is mainly conditioned on the last layer, which carries high-level semantic features. Making full use of the low-level features in a CNN may complement deep learning with this human expertise.
The second problem is the lack of searching ability in current networks for visible and infrared image matching. These networks [31,36] for close-range images only calculate the similarity score of a pair of input images. They are incapable of searching for corresponding matches pixel-by-pixel or feature-by-feature, whereas correspondence searching is the key characteristic of image matching. Strictly speaking, these methods should be classified as image retrieval instead of image matching. Wang et al. [40] proposed a network for matching optical and infrared remote sensing images that includes the searching process. They used CNN features to replace the SIFT descriptors in a feature matching scheme. However, this strategy has a critical problem: insufficient SIFT correspondences can be retrieved due to the enormous dissimilarity between the RGB and infrared images, which may cause the algorithm to fail.
The third problem is overfitting caused by the cost functions widely used in image matching. Popular cost functions for CNN-based image matching are the cross entropy loss and the hinge loss [27,35]. They obey the single rule of maximizing the disparity between negative samples and minimizing the distance between positive samples, which may lead to model overfitting or overconfidence [41].
To tackle the three problems above, a channel-stacked network with densely-connected convolutional building blocks and an augmented loss function is developed for visible and infrared satellite image matching. The main work and contributions are summarized as follows. First, we develop an innovative densely-connected CNN structure to enhance the matching capability between RGB and infrared images. The dense connections across several preceding convolutional layers ensure that information from lower-level features is passed directly to the higher layers, which can significantly improve the performance of a series of convolutional blocks. Second, a complete CNN-based template matching framework for optical and infrared images is introduced, in contrast to the recent CNN structures that are only designed to compare visible and infrared images [31,36]. By replacing the feature-based matching scheme [40] with a template matching scheme, a large number of correspondences can be found. Third, we apply an augmented cross entropy loss function to enhance the learning ability and stability of the network across various data sets. Lastly, our method can be directly extended to other matching tasks such as multitemporal image matching and close-range image matching. We show that our method is effective and outperforms all the other conventional and CNN-based methods on various satellite images with different geometric distortions, as well as on close-range images. Source code is available at http://study.rsgis.whu.edu.cn/pages/download/.

Network
We developed a channel-stacked CNN structure for RGB and infrared image matching with concatenated images as input. The channel-stacked structure has proven to be better than the Siamese networks in previous studies [31,36]; the difference is that we introduced densely-connected convolutional building blocks to enhance the learning ability of the CNN, as the lower-level features learned in earlier layers are also critical for an image matching problem.
Our structure is shown in Figure 2. The input consists of concatenated channels from the red, green, blue, and infrared bands to be matched. The network consists of seven convolution layers, two max-pooling layers, and two fully connected (FC) layers. Every convolutional layer is activated by the rectified linear unit (ReLU) function. The densely-connected structure, i.e., a current layer taking the concatenation of the outputs of all previous layers as input, is applied from the first convolutional layer to the fifth one. For example, the input of the fifth layer is X5 = concat(X1, X2, X3, X4), where Xi represents the output of the i-th layer and concat() is the channel-wise concatenation operation. As the feature maps from different layers are reused, parameters and computational burden are reduced. More importantly, the features from multilevel layers provide later layers with more information, which ultimately improves the image matching performance.

Figure 2. Our network takes four concatenated bands as input and predicts a matching score through a series of densely-connected convolutional layers and max-pooling layers. 64@3×3 means the convolution operation is performed with 64 convolutional kernels and each kernel is a 3 × 3 matrix.
Each kernel matrix is initialized with the normalized initialization [42] and updated during training. Each convolution outputs a 64-channel map representing various features of the current layer. Finally, 65,536 and 256 are the lengths of the two fully connected (FC) layers, respectively.
The first four convolutional layers each have 64 channels and a kernel size of 3 × 3. At the fifth layer, a 1 × 1 kernel is used to fuse the multilevel features into feature maps with 256 channels. The last two convolution layers further adjust the interdependencies of these multilevel semantics. An FC layer compresses the 2D feature maps into a 1D vector, and the last FC layer translates the vector into a scalar (similarity score) through the sigmoid function. The number of channels, the kernel size of each layer, and the length of the FC layers are also shown in Figure 2.
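As a concrete illustration of the dense connectivity described above, the sketch below builds the concatenated layer inputs with NumPy. The toy conv1x1_relu stand-in (a per-pixel linear map with random weights) is ours for brevity; it is not the paper's actual 3 × 3 Keras convolution, but the concatenation pattern is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1_relu(x, out_ch):
    """Toy stand-in for a convolution: a per-pixel linear map + ReLU.
    (The paper uses 3x3 kernels; 1x1 keeps the sketch short.)"""
    w = rng.standard_normal((x.shape[-1], out_ch)) * 0.1
    return np.maximum(x @ w, 0.0)

def densely_connected_stack(x, n_layers=4, growth=64):
    """Each layer takes the channel-wise concatenation of ALL previous
    outputs as its input, mirroring X5 = concat(X1, X2, X3, X4)."""
    outputs = [conv1x1_relu(x, growth)]
    for _ in range(n_layers - 1):
        stacked = np.concatenate(outputs, axis=-1)
        outputs.append(conv1x1_relu(stacked, growth))
    return np.concatenate(outputs, axis=-1)  # input to the 1x1 fusion layer

# A 64x64 patch with 4 stacked bands (R, G, B, NIR)
patch = rng.standard_normal((64, 64, 4))
fused_input = densely_connected_stack(patch)
print(fused_input.shape)  # (64, 64, 256): 4 layers x 64 channels each
```

The 256-channel result corresponds to the input of the 1 × 1 fusion layer described above.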

Augmented Loss Function
The widely used binary cross entropy attempts to maximize the distance between a negative pair and minimize the distance between a positive pair, which potentially makes the model overly confident and leads to overfitting [41]. In image matching this problem is more severe. We observed that almost all similarity scores learned from the cross entropy are very close to either 0 or 1, which is unrealistic as the similarity curve should be much smoother.
To improve the inlier rate and matching accuracy, we introduced an augmented loss function [43], which combines the original cross entropy with a uniform distribution to make the network more general:

L(t′, p) = (1 − ε) L(t, p) + ε L(u, p).  (1)

In Equation (1), L(t′, p) is the final cross entropy loss, where t′ is the regularized distribution of the label and p is the distribution of the network's prediction; L(t, p) represents the conventional cross entropy loss, where t is the original distribution of the label. The second loss L(u, p), i.e., the smoothing term, measures the deviation between the uniform distribution u and the prediction distribution p. By weighting the two losses with a smoothing parameter ε, the final loss softens the model to be less confident and makes the distribution of the similarity scores smoother.
In our binary classification (i.e., matched and non-matched), we used the uniform distribution u(k) = 1/k where the category number k is 2.
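A minimal sketch of this loss, assuming the standard label-smoothing form (1 − ε)·L(t, p) + ε·L(u, p) with u = 1/2; the value ε = 0.1 is an illustrative assumption, as the text does not fix ε here.

```python
import numpy as np

def cross_entropy(t, p, eps=1e-12):
    """Binary cross entropy between target distribution t and prediction p."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def augmented_loss(label, p, smooth=0.1):
    """Label-smoothed cross entropy: a weighted sum of the usual loss
    L(t, p) and the deviation L(u, p) from the uniform distribution
    u = 1/2. `smooth` plays the role of the smoothing parameter epsilon."""
    return (1 - smooth) * cross_entropy(label, p) \
        + smooth * cross_entropy(0.5, p)

# A nearly saturated prediction is now penalized more than a moderately
# confident one, discouraging the near-0/near-1 score saturation
# described above (with smooth=0.1 the optimum sits at p = 0.95).
print(augmented_loss(1.0, 0.95), augmented_loss(1.0, 0.9999))
```

Note that the minimizer of the smoothed loss for a positive label is strictly inside (0, 1), which is exactly the "less confident" behavior the paper aims for.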

Template Matching
We applied the template matching strategy to search for candidates in the target image given a reference point in the source image. First, the reference points are determined by a feature extractor or at regular pixel intervals, for example, picking one point every 100 pixels row-wise and column-wise. Second, a search window in the target image is estimated according to the initial registration accuracy of the two images. Third, within the search window, a sliding window of the same size as the reference patch is used to calculate the matching score against the reference, pixel-by-pixel. The pixel with the maximum score in the search window is the candidate.
In the template matching process (Figure 3), a reference patch (centered at the reference point) is given in the RGB image, the yellow rectangle in the NIR (near infrared) image represents the search scope, and the orange rectangle is the sliding window. Every sliding window is concatenated with the reference window and fed into the CNN to produce a matching score. We also considered whether geometric distortions exist between the RGB and infrared images due to different initial registration accuracy. The NIR image with and without distortions is separately matched with the RGB image.
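The pixel-by-pixel search above can be sketched as a brute-force sliding window; the toy score function (negative mean absolute difference) stands in for the trained CNN, which is what the paper actually uses to score each candidate.

```python
import numpy as np

def template_match(score_fn, ref_patch, search_image, patch=64):
    """Slide a window over the search area, score every candidate patch
    against the reference patch, and keep the center of the best one."""
    h, w = search_image.shape[:2]
    best_score, best_center = -np.inf, None
    for r in range(h - patch + 1):
        for c in range(w - patch + 1):
            cand = search_image[r:r + patch, c:c + patch]
            s = score_fn(ref_patch, cand)
            if s > best_score:
                best_score = s
                best_center = (r + patch // 2, c + patch // 2)
    return best_center, best_score

# Toy similarity: negative mean absolute difference (the paper scores
# each candidate with the trained CNN instead).
score = lambda a, b: -np.abs(a - b).mean()
rng = np.random.default_rng(1)
search = rng.standard_normal((94, 94))  # 31 x 31 candidate positions
ref = search[20:84, 10:74].copy()       # true center at (52, 42)
best_center, best_score = template_match(score, ref, search)
print(best_center)  # (52, 42)
```

In practice, batching all candidate patches through the CNN in one forward pass replaces the Python loop.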
Without the template matching strategy, recent CNN-based methods either lack searching ability and reduce to image comparison or retrieval [31], or adopt an alternative feature-based matching strategy [40]; both result in poor performance owing to the difficulty of extracting corresponding SIFT or other features simultaneously from two multimodal images.

Figure 3. The process of template matching. Given a reference patch, the corresponding patch should be found within a given search window. We consider the cases with and without geometric distortions between the reference image and the search image.

Data and Experimental Design
The experimental data consist of five RGB and infrared image pairs captured from Landsat 8 images at tile index 29 North 113 East, each with a size of 1024 × 679 pixels. To evaluate the performance of the trained model in different situations, the five image pairs covered different acquisition seasons and times (Table 1). To demonstrate the generalization ability of our method, only pair 1 and pair 2 contained training samples; that is, predictions and accuracy assessments on all five pairs used models pretrained either on pair 1 or on pair 2, with 496 training samples and 124 validation samples. The training sample set was generated by evenly cropping corresponding image patches from pair 1 and pair 2: in total, 620 patches of 64 × 64 pixels were cropped at 30 × 30-pixel intervals. For each positive sample, a negative sample was generated by randomly shifting the patch location to simulate a false parallax. For the test data, a 50 × 50-pixel interval was used and 228 test patches were cropped out. To compare with feature-based methods, we additionally produced a test set based on feature extraction, where the location of each feature was the center point of a 64 × 64 patch pair.
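The grid cropping and negative-sample generation can be sketched as follows; the shift bound `max_shift` and the minimum shift of 3 pixels are our illustrative assumptions, since the paper only specifies a "random" shift.

```python
import numpy as np

def _shift(v, d, lo, hi, rng):
    """Shift coordinate v by +/- d, keeping the result inside [lo, hi]."""
    cand = [x for x in (v + d, v - d) if lo <= x <= hi]
    return int(rng.choice(cand))

def make_samples(rgb, nir, size=64, step=30, max_shift=15, seed=0):
    """Crop aligned RGB/NIR patch pairs on a regular grid (positives)
    and, for each positive, a randomly shifted NIR patch (negative).
    `max_shift` is an assumed bound; the paper only says 'random'."""
    rng = np.random.default_rng(seed)
    pos, neg = [], []
    h, w = nir.shape[:2]
    for r in range(0, h - size + 1, step):
        for c in range(0, w - size + 1, step):
            ref = rgb[r:r + size, c:c + size]
            pos.append((ref, nir[r:r + size, c:c + size]))
            rr = _shift(r, int(rng.integers(3, max_shift + 1)), 0, h - size, rng)
            cc = _shift(c, int(rng.integers(3, max_shift + 1)), 0, w - size, rng)
            neg.append((ref, nir[rr:rr + size, cc:cc + size]))
    return pos, neg

# Small dummy scene: a 4 x 4 grid of positions yields 16 pairs each.
rgb = np.zeros((154, 154, 3))
nir = np.arange(154 * 154, dtype=float).reshape(154, 154)
pos, neg = make_samples(rgb, nir)
print(len(pos), len(neg))
```

Because the shift is at least 3 pixels and kept inside the image, a negative patch never coincides with its positive counterpart.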
The RGB and infrared images were accurately registered. To simulate images from other sensors with less accurate registration, we manually added 2D similarity distortions: the translation parameter was randomly sampled from [−10, 10] pixels with an interval of 1 pixel, the rotation parameter was randomly sampled from [−5°, +5°] with an interval of 1°, and the scale parameter was randomly sampled from the range [0.9, 1.1].
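Sampling one such distortion and assembling it into a 2 × 3 affine matrix can be sketched as below; the matrix layout (rotation-scale block plus translation column) is the standard similarity-transform form, not something the paper prescribes.

```python
import numpy as np

def random_similarity(rng):
    """Sample the 2D similarity distortion used to simulate less accurate
    registration: translation in [-10, 10] px (1 px steps), rotation in
    [-5, 5] degrees (1 degree steps), scale in [0.9, 1.1]."""
    tx, ty = rng.integers(-10, 11, size=2)
    theta = np.deg2rad(rng.integers(-5, 6))
    s = rng.uniform(0.9, 1.1)
    c, si = np.cos(theta), np.sin(theta)
    # 2x3 affine matrix: rotation+scale block and translation column
    return np.array([[s * c, -s * si, tx],
                     [s * si,  s * c, ty]])

rng = np.random.default_rng(42)
A = random_similarity(rng)
print(A)
```

The matrix can then be applied with any image warping routine (e.g., an affine warp) to resample the infrared patch.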
We trained all the networks for 30 epochs using a batch size of 128 and the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.001 and a momentum of 0.9. The learning rate was reduced by a factor of 0.1 every 10 epochs. The weights of all the networks were initialized via the Xavier uniform distribution [42]. The pixel values of the training and test images were normalized to [0, 1] before they were fed into the networks. A Windows PC with an Intel i5-8400 CPU and a GeForce GTX 1080 Ti GPU was used for all the experiments, and all the code was implemented in the Keras environment.
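The step-decay schedule described above can be written as a small function (for example, one usable with a Keras learning-rate scheduler callback); treating epochs as 0-indexed is our convention here.

```python
def learning_rate(epoch, base=1e-3, drop=0.1, every=10):
    """Step decay: start at 1e-3 and multiply by 0.1 every 10 epochs
    (epochs 0-indexed), matching the schedule described above."""
    return base * drop ** (epoch // every)

# Epochs 0-9 run at 1e-3, 10-19 at 1e-4, 20-29 at 1e-5.
print([learning_rate(e) for e in (0, 9, 10, 19, 20, 29)])
```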
In our experiments, two measures were used to evaluate our method: the root-mean-square error (RMSE) between the locations of the matched points and the ground truth, and the matching rate, i.e., the ratio between the number of correct matches and the number of all reference points.
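Both measures can be computed in a few lines; restricting the RMSE to the correctly matched points is our assumption, as the text does not spell out which points enter the RMSE.

```python
import numpy as np

def matching_metrics(pred, truth, thresh=2.0):
    """Matching rate (fraction of reference points whose match falls
    within `thresh` pixels of ground truth) and RMSE over those correct
    matches. RMSE-over-correct-matches is an assumption on our part."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    err = np.linalg.norm(pred - truth, axis=1)   # per-point pixel error
    ok = err <= thresh
    rmse = float(np.sqrt((err[ok] ** 2).mean())) if ok.any() else float("nan")
    return float(ok.mean()), rmse

# Three reference points: two matched within 2 px, one gross mismatch.
rate, rmse = matching_metrics([[0, 0], [3, 4], [10, 10]],
                              [[0, 1], [3, 4], [0, 0]])
print(rate, rmse)
```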

Results
In the first experiment, our method was compared to the 2-ch network [31] for image-wise comparison without the searching step. In addition, the advantage of channel-stacked structures was tested. Table 2 lists the matching rates of our method and of our method without the augmented loss function. The threshold was set to 0.5 by default: if the matching score predicted by an algorithm was above the threshold, the RGB-infrared patch pair was considered "matched" and was validated against the ground truth. The accuracy obtained by our method was clearly better than that of the 2-ch network. After replacing the plain convolution blocks with densely-connected blocks, the matching rate improved by 6.1% on average, and an additional 2.5% improvement was obtained when the augmented loss function was introduced.

Table 2. The matching rates of the SFcNet [44], the pseudo-Siamese network, the 2-ch network [31], our network, and our network with the augmented loss function on all pairs without distortions. All the models were trained only on pair 1. The "wo/" is short for without. AMR: Average Matching Rate. The bolded number indicates the best result.

We also show the results of a pseudo-Siamese network without weight sharing, as the RGB and infrared images are composites of different numbers of bands. Except for the difference in inputs, the pseudo-Siamese network and the 2-ch network share the same structure. From Table 2 it is observed that the structure with channel-concatenated images as input is about 5% better than the Siamese network structure, clearly demonstrating the advantage of channel-stacked structures.

A very recent work named SFcNet [44] used the Siamese structure with shared weights for multimodal image matching. Due to its ill-suited structure, it performed the worst, 22.6% lower than ours in matching rate. In addition, the network was hard to train.
The average runtime (in milliseconds) of each method is recorded in the last row of Table 2. For each RGB/NIR patch pair, the processing time of our proposed method is nearly 1 ms, i.e., it will take 1 s to compare 1000 points.
In the second experiment, we compared the performance of different matching methods with pixel-wise searching. We applied the same searching strategy to [31]. Tables 3 and 4 show the matching rate and RMSE of the different methods on all the data sets with distortions. Distortions simulated from random parameters were added to the 64 × 64 infrared patches. In Table 3 the model was pre-trained on pair 1, and in Table 4 the model was pre-trained on pair 2. The search window (including the half width of the sliding window) was set to 62 × 62 pixels to cover the scope of potential candidates (30 × 30 pixels). The maximum of the 900 similarity scores predicted by the algorithm corresponded to the matching candidate. If the distance between the candidate and the ground truth was within a given threshold (1 or 2 pixels), the candidate was treated as a correctly matched point.

Table 3 shows that our methods were comprehensively better than the 2-ch network; the variant with the augmented loss outperformed the 2-ch network by 14.11% and 5.35% at 1-pixel-error and 2-pixel-error, respectively. The corresponding improvements were 16.40% and 10.88% when the model was pre-trained on pair 2 (Table 4). The RMSE of our method is also slightly smaller than that of the 2-ch network. Our method improved more at 1-pixel-error than at 2-pixel-error, which may be due to the additional use of low-level features such as color, edges, and gradients; these usually exhibit finer structures that benefit geometric localization more than pure high-level semantic features.

Figure 4 shows the matching results of the five image pairs using our method pre-trained on pair 1. It can be observed that the RGB and the infrared images differ greatly in appearance. Nevertheless, our method found more than 95% matches (crosses in the right image) of the reference points (crosses in the left image) on average. A total of 86.4% of matches were found in pair 1 (Figure 4a).
When the same model was directly applied to pairs 2-5 (Figure 4b-e), which have a one-year acquisition interval and significant appearance changes with respect to pair 1, the matching rates were 94%, 100%, 100%, and 99%, respectively. These matching rates are very high even compared with common optical stereo matching, demonstrating that our CNN-based method has not only excellent performance but also a powerful generalization ability in matching RGB and infrared image pairs, in spite of the temporal mismatch and appearance disparity.

In the third experiment, we compared our network with conventional methods including SIFT [5], SURF [45], Affine-SIFT [6], PSO-SIFT [8], RIFT [10], LSD (Line Segment Detector) [46], and a feature-based method utilizing CNN features [40]. In our method and in [40], SIFT points (keeping only one at position-repeated locations) on the RGB image were used as the center points of the reference patches. We used pair 1 to train the models and pair 4 for the test. Pair 4 was resampled with two sets of simulated distortion parameters: (1) a rotation of 0.3° and a scale factor of 0.98, and (2) a rotation of 3° and a scale factor of 0.95.
In Table 5, the six conventional methods performed extremely poorly: thousands of features were extracted but few of them matched. This demonstrates the incompetence of conventional methods in matching visible and infrared images, where huge appearance and spectral differences exist.

For the feature-based CNN [40], we failed to reproduce the work, as the network did not converge. However, we can still demonstrate that our method is superior to the feature-based searching strategy. First, only 2658 correspondences out of the 7948 and 12,008 SIFT points were within the 2-pixel-error threshold. Second, in the SIFT matching only a dozen points were matched. Finally, we used our network to compute the similarity scores between the 7948 and 12,008 points and matched 1423 points, i.e., our network found half of the point pairs among all the SIFT correspondences. In contrast, our method found 7399 and 6973 matches (93.1% and 88.0% in matching rate) at 2-pixel-error in the slightly distorted and the largely distorted image pairs, respectively.

Figure 5 shows the matched points on pair 4 from the SIFT, SURF, Affine-SIFT, PSO-SIFT, RIFT, LSD, and our method, respectively. Note that for the SIFT, SURF, Affine-SIFT, PSO-SIFT, and RIFT matching, we converted the RGB image to a grayscale image, as all these features are designed for grayscale images. Very few correspondent points could be matched by the SIFT and SURF. The Affine-SIFT, PSO-SIFT, and RIFT matched more points, but the matching rates were still very low with respect to the number of extracted feature points. The LSD detected some line features on the input image pairs, but none of the correspondences were found. The lack of linear or blocky structures in our test sets made it hard to apply line-based matching methods.

In contrast, our CNN-based structure, through its series of densely-connected convolution layers, learned identical feature representations between the RGB and the infrared images. This capability allows a large number of correspondences (more than 80%) to be accurately identified.

Discussion
The experimental results demonstrated the effectiveness and superiority of our method compared to the conventional and the recent CNN-based methods. In this section, we focus on extensions of our method. First, we discuss whether our method can be applied to multitemporal image matching. Second, we evaluate our method on close-range images and compare it with [31], which was developed for matching close-range visible and infrared images. In addition, the performance of conventional template matching methods on matching RGB and infrared images is examined.

Table 6 lists two sets of multitemporal images to be matched. Images acquired at different times could introduce false changes due to disparities in season, illumination, atmospheric correction, etc. It is difficult (and even meaningless) to train a "multitemporal model" with multitemporal images, as the sample space of multitemporal pairs is almost infinite. Instead, we trained the model on images acquired at the same time, namely pair 1 of Table 1, and checked whether the model is robust and has enough transfer ability on multitemporal images. Table 7 shows that our methods outperformed the 2-ch network [31] by 10.3% and 9.6% on average at 1-pixel-error and 2-pixel-error, respectively. As more than 50% of the reference points could be matched at 2-pixel-error, our model has very good generalization ability and can be directly and effectively transferred to multitemporal images. Figure 6 shows two examples of the image matching results of our method and the 2-ch network at 2-pixel-error; our method successfully matched more points in both images.
Figure 6. The matching results between the RGB images ((a,b) in the first row) and the infrared images (the second and the third rows) at the 2-pixel-error threshold using the 2-ch network ((c,d) in the second row) and our method ((e,f) in the third row). The RGB images and infrared images in pair 6 (the first column) and pair 7 (the second column) were acquired at different times.
To evaluate the performance of our method on close-range images, the VIS-NIR dataset [47] was selected for the test. The settings were the same as described in [31]: 80% of the images from the "country" category were used to train a model, and the model was then applied to the remaining 20% of country images and to all the other eight categories. The threshold was set to 0.5, i.e., if the prediction probability of a visible-infrared image pair was above 0.5, it was regarded as matched. From Table 8, it is observed that our method was marginally better than [31]. Considering that the 2-ch network was specially developed for close-range images and tuned on the same dataset [47], while our method was directly applied to the dataset without any structure or parameter tuning, it can be confirmed that our method is superior in matching both satellite and close-range images.

Table 8. The results of the 2-ch network and our method on discovering similar visible-infrared images from a close-range dataset [47]. The models were respectively pre-trained on the 80% of images labelled "country".

Figure 7 shows the matching results on different scenes. Both the 2-ch network and our method can distinguish most of the positive samples (green crosses). However, the 2-ch network made more mistakes, both in terms of false negatives, i.e., positive samples (red points without connections) predicted as "non-matched", and false positives, i.e., negative samples (red points connected with lines) predicted as "matched".

Figure 7. The matching results of the 2-ch network (a,c,e) and our method (b,d,f) on forest, indoor, and mountain scenes, respectively, from the close-range dataset [47]. The green crosses represent the correctly matched points, the individual red circles are positive samples wrongly classified as "non-matched", and the connected red circles represent negative samples wrongly classified as "matched".
We tested the performance of two classic similarity measures widely used in template image matching, NCC (Normalized Cross Correlation) [48] and SSIM (Structural Similarity Index) [49], on pairs 1 and 7. For the NCC we used 0.5 as the threshold; for the SSIM we used 0.3. The matching rates in Table 9 show that the NCC performed extremely poorly on both sets, where only about 1% of the points could be matched. Although the SSIM performed a little better on pair 1 (RGB and infrared images captured at the same time), it performed extremely poorly on pair 7, where the RGB and infrared images were captured at different times. The few matched points are shown in Figure 8. Compared to the results of our method, where more than 90% and 50% of the reference points could be correctly matched, respectively, these conventional template matching methods, as well as the feature-based methods, obtained far less satisfactory results in matching RGB and infrared images.
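For reference, the two baseline measures can be sketched in a few lines of NumPy. This is a simplified single-window form; the exact window sizes used in our experiments are not reproduced here, and the SSIM constants follow the common defaults for 8-bit images:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation between two equally-sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def ssim(a, b, data_range=255.0):
    """Global (single-window) structural similarity index."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2)) /
                 ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))

# A patch compared with itself scores 1 under both measures
patch = np.random.default_rng(0).random((32, 32)) * 255
print(ncc(patch, patch), ssim(patch, patch))  # both ≈ 1.0
```

With the thresholds used above, a candidate pair would be declared matched when ncc exceeds 0.5 or ssim exceeds 0.3.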
The parameter ε in Equation (1) weights the binary cross entropy loss against the smoothing term. Empirically, our network obtained optimal performance when the smoothing parameter ε was set to 0.05, which was determined by observing the accuracy curves (Figure 9). We plotted the accuracy on the test dataset with the smoothing parameter ε varying from 0 to 0.1 at an interval of 0.01. The matching rate on all the test sets reached its optimum at 0.05, indicating that a smoothing parameter of 0.05 is a suitable choice for the proposed network.
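Equation (1) is not reproduced here; the sketch below uses the standard label-smoothing form, in which each binary target y is softened to y(1 − ε) + ε/2 before the cross entropy is taken, as one common way to realize such an augmented loss:

```python
import numpy as np

def smoothed_bce(p, y, eps=0.05):
    """Binary cross entropy with label smoothing.
    p: predicted match probability, y: hard label (0 or 1).
    The hard target is softened toward 0.5 by the factor eps."""
    y_soft = y * (1.0 - eps) + eps / 2.0
    p = np.clip(p, 1e-7, 1 - 1e-7)  # numerical safety
    return float(-(y_soft * np.log(p) + (1 - y_soft) * np.log(1 - p)))

# With eps = 0, this reduces to the plain binary cross entropy
print(round(smoothed_bce(0.9, 1, eps=0.0), 4))   # 0.1054
print(round(smoothed_bce(0.9, 1, eps=0.05), 4))  # 0.1603
```

Because a softened target never reaches exactly 0 or 1, the loss penalizes over-confident predictions, which is the intended regularizing effect against overfitting.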
Future studies will improve the efficiency of our algorithm. In our study, the stage of template searching is outside the network. We will consider whether the process can be incorporated into the end-to-end learning process of the network, which may speed up the training and testing processes.
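The current search stage, which runs outside the network, can be sketched as an exhaustive loop over the search window; score_fn stands in for the trained CNN's similarity score, and all names are illustrative:

```python
import numpy as np

def search_match(score_fn, ref_patch, target_image, center, radius, patch=32):
    """Exhaustive template search: compare the reference patch against
    every candidate window within `radius` pixels of `center` in the
    target image, and keep the pixel with the highest similarity score."""
    best, best_xy = -np.inf, None
    cy, cx = center
    h = patch // 2
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            cand = target_image[y - h:y + h, x - h:x + h]
            s = score_fn(ref_patch, cand)
            if s > best:
                best, best_xy = s, (y, x)
    return best_xy, best

# Demo with a negative-SSD score as a stand-in for the CNN:
rng = np.random.default_rng(1)
img = rng.random((64, 64))
ref = img[34 - 8:34 + 8, 31 - 8:31 + 8]  # true location (34, 31)
ssd = lambda a, b: -float(np.sum((a - b) ** 2))
print(search_match(ssd, ref, img, center=(32, 32), radius=5, patch=16)[0])  # (34, 31)
```

Scoring every candidate pixel requires one forward pass per position, which is why folding this loop into the network itself is a promising direction for efficiency.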
Figure 9. The changes of test accuracy using different ε values. The matching rate on all the test sets reaches its optimum at 0.05.

Conclusions
In this study, we developed a CNN-based method for matching RGB and infrared images. The method features the use of band-wise concatenated input, densely-connected convolutional layers, an augmented loss function, and a template searching framework. The experiments on various RGB and infrared image sets demonstrated that our method is considerably superior to a CNN-based method for image-wise comparison between RGB and infrared images, and dramatically better than a feature-based CNN method. In particular, the densely-connected layers improved the performance of traditional building blocks by more than 10% on satellite image matching. The utilization of lower-level features from early convolutional layers proved effective not only experimentally, but is also consistent with empirical expertise in image matching.
It was also shown that conventional feature-based matching methods and template matching methods fail to obtain satisfactory results due to the large appearance differences between RGB and infrared images. In contrast, we showed that extracting common semantic features from the different appearances with a CNN can address the problem.
Our method was also shown to have high generalization ability, being effectively applicable to multitemporal images and close-range images, which accounts for its superior performance compared to other recent CNN-based methods.