Article

PM-Net: A Multi-Level Keypoints Detector and Patch Feature Learning Network for Optical and SAR Image Matching

Faculty of Land and Resources Engineering, Kunming University of Science and Technology, Kunming 650031, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(12), 5989; https://doi.org/10.3390/app12125989
Submission received: 13 May 2022 / Revised: 4 June 2022 / Accepted: 11 June 2022 / Published: 12 June 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Due to the differences in the radiometric and geometric characteristics of optical and synthetic aperture radar (SAR) images, accurate matching remains a major challenge. In this paper, we propose a patch-matching network (PM-Net) to improve the matching performance of optical and SAR images. First, a multi-level keypoints detector (MKD) that fuses high-level and low-level features is presented to extract more robust keypoints from optical and SAR images. Second, we use a two-channel network structure to improve the image patch matching performance. Benefiting from this design, the proposed method can directly learn the similarity between optical and SAR image patches without manually designed features and descriptors. Finally, the MKD and the two-channel network are trained separately on the GL3D and QXS-SAROPT data sets, and PM-Net is tested on multiple pairs of optical and SAR images. The experimental results demonstrate that the proposed method outperforms four advanced image matching networks in both qualitative and quantitative assessments. In the quantitative experiments, the number of correct matching points obtained by our method is increased by a factor of more than 1.15, the F1-measure is raised by an average of 7.4% and the root mean square error (RMSE) is reduced by more than 15.3%. The advantages of the MKD and the two-channel network are also verified through ablation experiments.

1. Introduction

With the rapid development of remote sensing technology, different platforms and sensors, such as Ikonos, Quickbird, TerraSAR-X, Cosmo Skymed, and WorldView [1], provide multiple means for Earth observation. The passive optical sensor and the active SAR sensor are two important remote sensing technologies, and their images reflect different feature characteristics. The combined application of optical and SAR images provides complementary information for diverse applications, such as image fusion [2,3] and change detection [4,5]. This makes optical and SAR image matching a foundation for further applications and analysis.
Due to different imaging principles, there are nonlinear radiometric and geometric differences between optical and SAR images. Furthermore, speckle noise on SAR images seriously affects matching performance, which makes it very challenging to find the conjugate features for matching of optical and SAR images [6]. For optical and SAR image matching, the difficulty lies in constructing robust features from heterogeneous image pairs.
To deal with this problem, many studies have been conducted using the traditional handcrafted features for matching [7,8,9]. However, considering that handcrafted feature descriptors are usually constructed by image features under a small local image region, such as the intensity or gradient value, it is not effective to deal with the large disparity between the local geometric structures in optical and SAR images [6]. Therefore, some recent studies turned to deep learning with the development of artificial intelligence techniques. Deep learning methods are applied to optical and SAR image matching, with the aim of learning high-level semantic features for robust matching.
For optical and SAR image matching, the existing deep learning methods mainly adopt a two-branch framework in which the convolutional neural network (CNN) acts as both the feature extractor and the feature descriptor [10]. The two-branch networks can be divided into Siamese networks and pseudo-Siamese networks. Merkle et al. [11] used a Siamese network to learn homogeneous features from optical and SAR images, employing dilated convolution to obtain a larger receptive field, and generated image patch matching similarities from the deep features by dot product computation. A similar architecture was applied in [12], the difference being that it connects the output feature pairs using a convolution operation. However, Siamese networks, which share parameters to reduce the training cost, cannot easily describe the differences between the modalities in optical and SAR image matching tasks [10]. As a result, pseudo-Siamese networks are increasingly being used for cross-modality remote sensing image matching [13,14]. Although pseudo-Siamese networks can help capture significant differences between cross-modality images, there still exists similar semantic information, such as structure and shape, between optical and SAR images that pseudo-Siamese networks cannot extract effectively. This local information also has an impact on optical and SAR image matching performance.
In this study, we focus on learning high-level semantic information while preserving low-level local features to achieve accurate patch matching of optical and SAR images. An end-to-end matching network architecture of optical and SAR images is proposed. It mainly contains multi-level keypoints detection and patch matching. The contribution of this paper can be summarized as follows:
  • We present a multi-level keypoints detector, called MKD, which detects and fuses low-level and high-level feature maps by joint peak values to obtain richer image information for more robust keypoints.
  • We propose a two-channel patch matching network for optical and SAR images to improve the patch matching performance. This trained Siamese-type network can directly determine the similarity between optical and SAR image patches without manually designed features and descriptors. By joint processing of image patch training samples, it can effectively reduce training costs.
  • We propose a PM-Net to solve the difficulty in acquiring robust keypoints and completing accurate patch matching between optical and SAR images.
The remainder of this paper is organized as follows. Related work is discussed in Section 2. Section 3 describes the proposed method in detail. Section 4 gives comparative experimental results in conjunction with detailed analysis. In Section 5, a discussion is provided. We conclude in Section 6.

2. Related Work

Generally, image matching methods can be roughly divided into three categories, namely, area-based methods, feature-based methods and learning-based methods [15].

2.1. Area-Based Methods

Area-based methods search for the geometric transformation by calculating the similarity between images. Mutual information (MI), Kullback-Leibler divergence and normalized cross correlation (NCC) have been widely used [16,17,18]. However, these measures are not applicable to cross-modality image matching tasks such as optical and SAR image matching, because they are sensitive to remote sensing image noise and illumination changes, which makes it difficult to search for the transformation parameters through similarity metrics [19].

2.2. Feature-Based Methods

Feature-based matching methods can, to a certain extent, overcome the matching problems caused by noise and other factors. Owing to its efficiency and invariance to scale and rotation, the scale-invariant feature transform (SIFT) has become the most representative method [20]. Subsequently, a variety of improved SIFT methods have been reported. An improved version of SIFT was proposed to obtain better matching features [7]; it uses a spatially consistent matching strategy to obtain more reliable feature point pairs for optical and SAR image matching. To match high-resolution optical and SAR images, OS-SIFT was introduced [8]. It uses a multi-scale ratio of exponentially weighted averages (ROEWA) operator and a multi-scale Sobel operator to compute consistent gradient magnitudes in SAR and optical images, respectively. Additionally, some papers focus on integrating the advantages of area-based and feature-based methods. Gong et al. [9] introduced a two-stage method for image registration, which acquires coarse results by SIFT and then achieves precise registration based on mutual information. An iterative multi-level strategy that adjusts parameters to re-extract and match features has been proposed [21] to improve the matching performance of optical and SAR images. However, the conventional matching methods reviewed above require expertise to design and may disregard useful patterns hidden in the data [22].

2.3. Learning-Based Methods

Learning-based methods have also been gaining attention in the field of image matching, mainly because they are data-driven and can learn features automatically. Specifically, several non-linear data mapping functions can be stacked to form a deep network; such a deep neural network can approximate arbitrarily complex functions and learns abstract representations from a large number of training samples [23]. The Contextdesc network [24] adopts a CNN and introduces context awareness to augment local features, aiming to solve cross-modality image matching problems. D2-Net [25] uses a single CNN as both a dense feature descriptor and a feature detector, and presents a novel approach for local feature extraction. R2D2 [26] combines the detector and descriptor to improve the repeatability and reliability of keypoints. Luo et al. [27] proposed ASLFeat, which jointly learns the local feature detector and descriptor for accurate shape and localization of keypoints.
Considering the ability of deep learning for feature extraction, learning-based methods have been gradually applied to remote sensing image matching and have shown superiority. A CNN-feature-based multitemporal remote sensing image registration method has been proposed [28]; it uses a CNN to learn multi-level image features to achieve multi-temporal image matching. To compensate for the fact that SIFT only uses local low-level image information, Ye et al. [29] used fused CNN and SIFT features to achieve multi-sensor image matching. However, these methods only fine-tune a pre-trained CNN model and do not train on remote sensing data. Similarly, Ma et al. introduced a two-stage framework [30] that combines deep features and handcrafted features and uses a random sample consensus (RANSAC) [31] method to refine feature matching and transformation estimation; however, this increases the computational complexity. In recent years, Siamese-type network architectures that learn similarity functions between image patches have been receiving attention in the field of image matching [32,33]. To solve the matching problem of remote sensing images with complex backgrounds, a Siamese convolutional neural network (SCNN) [34] was constructed. However, this matching method is not suitable for remote sensing images with weak textures. Wang et al. [19] used a network self-learning approach to learn image patch similarity functions and proposed a deep learning framework for remote sensing image registration. However, the image patches were obtained by SIFT, without considering high-level feature representation.
In addition, Zagoruyko and Komodakis [35] evaluated the image patch matching performance of several Siamese-type network structures such as a two-channel network structure, a Siamese network structure and a pseudo-Siamese network structure. The results demonstrated that two-channel network structure using image patch feature matching had significant advantages.

3. Method

3.1. Overview

The proposed PM-Net is mainly divided into two stages: multi-level keypoints detector (MKD) and image patch feature learning (Figure 1). The specific steps are as follows.
  • Multi-level keypoints detector
First, in order to obtain multi-level feature maps, a pair of optical and SAR images is input to the MKD network. Then, the peak values of the feature maps' local spatial responses and channel-wise responses are selected as the detection measurement. A score map is obtained by up-sampling the per-level scores to the input resolution and taking their weighted sum, from which keypoints are extracted. Finally, each input image is cropped into multiple image patches centered on the keypoints, and each pair of image patches to be compared is combined into a two-channel image by joint processing. Details are given in Section 3.2.
  • Image patch feature learning
The fused image patches are used as the input of the two-channel network, which learns to determine whether the image patches form a matching pair. The corresponding matching labels are then output. Details are given in Section 3.3.

3.2. Multi-Level Keypoints Detector

Considering the differences between optical and SAR images, extracting only local or only high-level features may cause matching problems. We present MKD, which fuses the detection results of higher-level feature maps while preserving lower-level features through the pyramidal feature hierarchy of a convolutional neural network, so that the detections from multiple feature levels are combined.
MKD first scores keypoints with respect to both spatial and channel-wise responses. To further enhance the correlation with the actual distribution of all responses along the channel [27], the peak value is used as the basis of the final keypoint measurement. An illustration of the peak value can be seen in Figure 2.
Let $F$ denote the feature map output by the network in the feature space $H \times W \times C$. For each location $(i, j)$ in $F^{c}$ $(c = 1, 2, \ldots, C)$, the channel-wise peak value is obtained by:
$\alpha_{ij}^{c} = \mathrm{softmax}\left(F_{ij}^{c} - \frac{1}{C}\sum_{t} F_{ij}^{t}\right)$ (1)
where softmax maps the peak value to a positive value. The corresponding local spatial peak value is obtained by:
$\beta_{ij}^{c} = \mathrm{softmax}\left(F_{ij}^{c} - \frac{1}{\left|N(i,j)\right|}\sum_{(I,J) \in N(i,j)} F_{IJ}^{c}\right)$ (2)
where $N(i, j)$ denotes the neighboring pixels within a $3 \times 3$ window around $(i, j)$. The final keypoint detection score combines the two terms as:
$s_{ij} = \max_{t}\left(\alpha_{ij}^{t}\, \beta_{ij}^{t}\right)$ (3)
The detection score is used as the weighting term of the hardest-contrastive loss function (Section 3.4.1).
To better preserve low-level features, such as corner points and edges, MKD combines the detections from multiple feature levels to achieve multi-level integration. Specifically, the network convolutions produce feature maps at different levels $\{F^{(1)}, F^{(2)}, \ldots, F^{(l)}\}$; we apply the aforementioned detection method to each feature map separately and obtain the corresponding score maps $\{s^{(1)}, s^{(2)}, \ldots, s^{(l)}\}$. Next, each score map is up-sampled to the same spatial resolution as the input image. Finally, the weighted combination is computed as:
$\hat{s} = \frac{1}{\sum_{l} w_{l}}\sum_{l} w_{l}\, s^{(l)}$ (4)
where $w_{l} = 1, 2, 3$ balances the information from low-level and abstracted feature maps [27], and $s^{(l)}$ are the per-level score maps.
The lightweight L2-Net [37] is chosen as the MKD backbone network to reduce the amount of computation. Meanwhile, to improve the ability to adapt to geometric changes, the last three convolutional layers of L2-Net (conv7, conv8 and conv9) are replaced with deformable convolutions [38,39]. To extract information from different feature layers, three layers, conv2, conv4, and deform_conv9, are selected to perform the above detection and the weighted combination.
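For illustration, a minimal PyTorch sketch of the multi-level scoring in Equations (1)–(4) is given below. The "softmax" activation in Equations (1) and (2) is interpreted here as a softplus-style mapping to positive values, following ASLFeat [27]; the tensor shapes, feature levels and weights are illustrative assumptions rather than the exact PM-Net implementation.

```python
# Sketch of the peakiness-based keypoint scoring (Eqs. (1)-(4)), assuming
# softplus as the positive-valued activation and hypothetical feature maps.
import torch
import torch.nn.functional as F


def peakiness_score(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) feature map of one level; returns a (B, 1, H, W) score map."""
    # Channel-wise peak value, Eq. (1): response minus the channel mean.
    alpha = F.softplus(feat - feat.mean(dim=1, keepdim=True))
    # Local spatial peak value, Eq. (2): response minus the 3x3 neighborhood mean.
    local_mean = F.avg_pool2d(feat, kernel_size=3, stride=1, padding=1)
    beta = F.softplus(feat - local_mean)
    # Combined detection score, Eq. (3): maximum joint peakiness over channels.
    return (alpha * beta).max(dim=1, keepdim=True).values


def multi_level_score(feats, weights, out_size):
    """Weighted combination of per-level score maps, Eq. (4).

    feats:    list of (B, C_l, H_l, W_l) feature maps from the selected layers
    weights:  per-level weights w_l (e.g. [1, 2, 3])
    out_size: (H, W) of the input image, to which all score maps are up-sampled
    """
    fused = 0.0
    for feat, w in zip(feats, weights):
        s = peakiness_score(feat)
        s = F.interpolate(s, size=out_size, mode="bilinear", align_corners=False)
        fused = fused + w * s
    return fused / sum(weights)


if __name__ == "__main__":
    # Three hypothetical feature levels for a 256 x 256 input image.
    feats = [torch.randn(1, 32, 256, 256),
             torch.randn(1, 64, 128, 128),
             torch.randn(1, 128, 64, 64)]
    score_map = multi_level_score(feats, weights=[1, 2, 3], out_size=(256, 256))
    print(score_map.shape)  # torch.Size([1, 1, 256, 256])
```

Keypoints would then be taken as local maxima of the fused score map, and image patches centered on them are passed to the matching stage described next.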

3.3. Image Patch Feature Learning

Our main idea is to find corresponding patches after MKD extracts keypoints from the input images. To achieve high patch matching accuracy, the similarity of patch pairs is learned from labeled training examples of matching and non-matching optical and SAR image patches. Suppose the input images are denoted as $I^{a}$ and $I^{b}$, respectively. If MKD detects $n$ keypoints on $I^{a}$, the set of its image patches is $P^{a} = \{p_{1}^{a}, p_{2}^{a}, \ldots, p_{n}^{a}\}$. Similarly, if $m$ keypoints are detected on $I^{b}$, the other set of image patches is $P^{b} = \{p_{1}^{b}, p_{2}^{b}, \ldots, p_{m}^{b}\}$. We then obtain the candidate image patch pairs $\{(p_{i}^{a}, p_{j}^{b})\}$, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$, by combining the patches in images $I^{a}$ and $I^{b}$ centered on the MKD keypoints. Based on the significant advantages of jointly processing image patches in a two-channel network [35], we perform joint processing by channel fusion. Specifically, each input image patch pair is first fused along the channel dimension and then used as the input of the matching network. The matching label of the image patch pair is obtained after feature learning. If the network predicts the label $\hat{y}_{i} = 1$, i.e., the image patch pair is a correct matching pair, the center points of the two patches are used as matching points. In this way, we convert the image patch matching problem into a binary classification problem that predicts the relationship between input image patches and matching labels.
Due to the imaging mechanisms of optical and SAR images and the size of the image patches, several similar image patches may be found in images with weak texture. This one-to-many matching situation may affect matching accuracy. Based on the work in [19], we apply a global constraint, i.e., RANSAC, to remove false matching points.
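A minimal sketch of the two-channel patch matching idea described above is given below: the optical and SAR patches of a candidate pair are fused along the channel dimension and classified by a single CNN. The layer configuration and input normalization are illustrative assumptions, not the exact PM-Net architecture.

```python
# Sketch of a two-channel patch matching network: a 32x32 optical patch and a
# 32x32 SAR patch are stacked into a 2-channel input and mapped to a matching
# probability. The layer sizes below are assumptions for illustration.
import torch
import torch.nn as nn


class TwoChannelMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # 32 -> 16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # 16 -> 8
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, 1)                   # one logit: match / non-match

    def forward(self, optical_patch, sar_patch):
        # Channel fusion: (B, 1, 32, 32) + (B, 1, 32, 32) -> (B, 2, 32, 32)
        x = torch.cat([optical_patch, sar_patch], dim=1)
        x = self.features(x).flatten(1)
        return torch.sigmoid(self.classifier(x)).squeeze(1)   # predicted label in (0, 1)


if __name__ == "__main__":
    net = TwoChannelMatcher()
    opt = torch.rand(4, 1, 32, 32)   # hypothetical optical patches
    sar = torch.rand(4, 1, 32, 32)   # hypothetical SAR patches
    print(net(opt, sar).shape)       # torch.Size([4])
```

Patch pairs predicted as matches yield candidate correspondences at their patch centers, which would then be filtered with RANSAC as described above.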

3.4. Loss Design

The loss function design of PM-Net is divided into two parts: keypoint detection and patch feature learning. With regard to MKD, robust keypoints are separated from optical and SAR images with large radiometric variations and geometric differences using the hardest-contrastive loss [40]. As for the two-channel matching network, it uses cross-entropy loss to predict the relationship between image patch pairs and matching labels.

3.4.1. Hardest-Contrastive Loss

Based on D2-Net's strategy of searching for the maximum spatial and channel-wise responses in high-dimensional feature maps, the loss function of MKD follows the form used in D2-Net:
$Loss(I^{a}, I^{b}) = \frac{1}{|\mathcal{C}|}\sum_{c \in \mathcal{C}} \frac{s_{c}^{(a)}\, s_{c}^{(b)}}{\sum_{q \in \mathcal{C}} s_{q}^{(a)}\, s_{q}^{(b)}}\, M\left(f_{c}^{(a)}, f_{c}^{(b)}\right)$ (5)
where $\mathcal{C}$ denotes the set of correspondences between $I^{a}$ and $I^{b}$, $s_{c}^{(a)}$ and $s_{c}^{(b)}$ are the combined detection scores from Equation (4), and $M(\cdot, \cdot)$ is the hardest-contrastive loss, which maximizes the inter-class distances to separate robust keypoints from optical and SAR images. The specific formula is as follows:
$M\left(f_{c}^{(a)}, f_{c}^{(b)}\right) = \left[D\left(f_{c}^{(a)}, f_{c}^{(b)}\right) - m_{p}\right]_{+} + \left[m_{n} - \min\left(\min_{k \neq c} D\left(f_{c}^{(a)}, f_{k}^{(b)}\right),\ \min_{k \neq c} D\left(f_{k}^{(a)}, f_{c}^{(b)}\right)\right)\right]_{+}$ (6)
where $D(\cdot, \cdot)$ denotes the Euclidean distance between the features $f_{c}^{(a)}$ and $f_{c}^{(b)}$, and $m_{p}$ and $m_{n}$ are set to 0.2 and 1.0, respectively [40]. Specifically, if $(a, b)$ is a positive matching point pair, their inter-feature distance should approach 0; to prevent overfitting, it is only required that $D(f_{c}^{(a)}, f_{c}^{(b)}) < m_{p}$. If $(a, b)$ is a false matching point pair, their inter-feature distance should satisfy $D(f_{c}^{(a)}, f_{c}^{(b)}) > m_{n}$.
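As a concrete illustration, the margin term $M(\cdot, \cdot)$ in Equation (6) can be sketched as follows, with $m_{p} = 0.2$ and $m_{n} = 1.0$ as stated above. The assumption that row $i$ of the two descriptor matrices corresponds, and the random inputs, are for illustration only; the full loss in Equation (5) would additionally weight each term by the detection scores.

```python
# Sketch of the hardest-contrastive margin term M (Eq. (6)) over a batch of
# corresponding descriptors; the hardest negative is searched in both directions.
import torch


def hardest_contrastive(f_a: torch.Tensor, f_b: torch.Tensor,
                        m_p: float = 0.2, m_n: float = 1.0) -> torch.Tensor:
    """f_a, f_b: (N, D) descriptors of N corresponding keypoints; returns (N,) losses."""
    dist = torch.cdist(f_a, f_b)                               # D(f_a[i], f_b[j]) for all i, j
    pos = dist.diagonal()                                      # D(f_c^(a), f_c^(b))
    n = dist.size(0)
    off_diag = dist + torch.eye(n, device=dist.device) * 1e6   # mask out the positives
    hardest_neg = torch.minimum(off_diag.min(dim=1).values,    # min_{k != c} D(f_c^(a), f_k^(b))
                                off_diag.min(dim=0).values)    # min_{k != c} D(f_k^(a), f_c^(b))
    # [D(positive) - m_p]_+ + [m_n - hardest negative]_+
    return torch.clamp(pos - m_p, min=0) + torch.clamp(m_n - hardest_neg, min=0)


if __name__ == "__main__":
    f_a = torch.nn.functional.normalize(torch.randn(128, 64), dim=1)
    f_b = torch.nn.functional.normalize(torch.randn(128, 64), dim=1)
    print(hardest_contrastive(f_a, f_b).mean())
```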

3.4.2. Cross-Entropy Loss

The image patch matching network based on the two-channel structure converts the image patch matching problem into a binary classification. Its training goal is to make the predicted matching label $\hat{y}_{i}$ consistent with the actual matching label $y_{i}$. Therefore, we use the cross-entropy loss [41], which is commonly used in classification tasks; its formula for binary classification is:
$Loss = \frac{1}{N}\sum_{i=1}^{N}\left(-y_{i}\log\left(\hat{y}_{i}\right) - \left(1 - y_{i}\right)\log\left(1 - \hat{y}_{i}\right)\right)$ (7)
where $y_{i}$ is the actual matching label of the image patch pair, and $\hat{y}_{i} \in (0, 1)$ denotes the predicted output.
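Equation (7) is the standard binary cross-entropy. The short check below compares a direct implementation of Equation (7) with PyTorch's built-in binary cross-entropy; the labels and predictions are random placeholders, not outputs of the trained network.

```python
# Manual binary cross-entropy (Eq. (7)) versus the PyTorch built-in.
import torch
import torch.nn.functional as F

y_true = torch.randint(0, 2, (8,)).float()      # actual matching labels y_i
y_pred = torch.rand(8).clamp(1e-6, 1 - 1e-6)    # predicted labels in (0, 1)

manual = (-y_true * torch.log(y_pred) - (1 - y_true) * torch.log(1 - y_pred)).mean()
builtin = F.binary_cross_entropy(y_pred, y_true)
print(torch.allclose(manual, builtin))          # True
```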

4. Experiments

4.1. Implementation Details

4.1.1. Training Dataset

In order to enable MKD to extract robust keypoints from optical and SAR images, the GL3D dataset [42] was chosen as the training data; part of it is shown in Figure 3. GL3D is a large dataset for 3D reconstruction and image geometry learning and contains 378 different scenes. Each scene contains 50 to 1000 images with large geometric overlap, including images with weak texture such as deserts and rivers. Meanwhile, the images in the dataset are closely linked and cover a wide range of scenes with rich geometric background information. The dataset provides accurate geometric annotations, such as camera pose parameters and geographic 3D information, from which pixel-level correspondences between image pairs can be established. Therefore, we used the GL3D dataset to train MKD.
The training data for the two-channel matching network were based on the QXS-SAROPT dataset [43], which consists of 20,000 pairs of optical and SAR remote sensing images acquired from Google Earth imagery and by the Gaofen-3 SAR satellite [44]. To enable the network to learn more detailed image patch information, we randomly selected and cropped this dataset to obtain 80,000 pairs of optical and SAR image patches of size 32 × 32 as the training data. Some of them are shown in Figure 4.
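A minimal sketch of how co-registered 32 × 32 patch pairs could be cropped is given below; the image sizes, sampling scheme and the way non-matching pairs are formed are assumptions for illustration, not the exact procedure used to build the training set.

```python
# Sketch of cropping aligned 32x32 optical/SAR patch pairs from a co-registered
# image pair; negative (non-matching) pairs could be formed by re-pairing
# patches from different locations.
import numpy as np


def crop_patch_pairs(optical: np.ndarray, sar: np.ndarray,
                     n_pairs: int, size: int = 32, seed: int = 0):
    """optical, sar: co-registered (H, W) images; returns two (n_pairs, size, size) arrays."""
    rng = np.random.default_rng(seed)
    h, w = optical.shape
    ys = rng.integers(0, h - size, n_pairs)
    xs = rng.integers(0, w - size, n_pairs)
    opt_patches = np.stack([optical[y:y + size, x:x + size] for y, x in zip(ys, xs)])
    sar_patches = np.stack([sar[y:y + size, x:x + size] for y, x in zip(ys, xs)])
    return opt_patches, sar_patches


if __name__ == "__main__":
    opt = np.random.rand(256, 256)   # placeholder optical image
    sar = np.random.rand(256, 256)   # placeholder SAR image
    p_opt, p_sar = crop_patch_pairs(opt, sar, n_pairs=4)
    print(p_opt.shape, p_sar.shape)  # (4, 32, 32) (4, 32, 32)
```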

4.1.2. Training Details

MKD was trained from scratch on the GL3D dataset using 800,000 image pairs. Before being input to the network, each image was zero-centered and normalized, and augmented with brightness, contrast, and blur changes. Stochastic gradient descent (SGD) [45] was chosen as the optimization method, with the momentum set to 0.9 and the initial learning rate set to 0.1.
The two-channel matching network was trained with the adaptive moment estimation (Adam) [46] optimizer, with the initial learning rate set to 10^−5 and the number of iterations set to 1000.
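The optimizer settings above can be summarized in code as follows; the placeholder modules merely stand in for MKD and the two-channel matching network.

```python
# Optimizer configuration as stated in the training details (sketch only).
import torch
import torch.nn as nn

mkd = nn.Conv2d(1, 32, 3)       # placeholder module standing in for MKD
matcher = nn.Linear(128, 1)     # placeholder module standing in for the two-channel network

mkd_optimizer = torch.optim.SGD(mkd.parameters(), lr=0.1, momentum=0.9)
matcher_optimizer = torch.optim.Adam(matcher.parameters(), lr=1e-5)
# The matching network would be trained for 1000 iterations with matcher_optimizer.
```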

4.1.3. Test Dataset

The dataset used for the matching experiments was the multi-modality image matching dataset [15], which covers computer vision, medical, and remote sensing images. Its remote sensing part is a multimodal remote sensing image dataset with optical and SAR images, optical and infrared images, and other remote sensing data. We used the optical and SAR image data from this dataset (hereafter referred to as OS data) for the matching comparison experiments. The image coverage scenarios in the OS data include urban areas, mountainous areas, rivers and plains, and their texture information and nonlinear radiation vary greatly. The OS data are shown in Figure 5.

4.2. Comparison Experiments

In the comparison experiments, PM-Net was compared with the following methods: Contextdesc [24], R2D2 [26], D2-Net [25] and ASLFeat [27]. In addition, corresponding ablation experiments were set up to verify the advantages of MKD and the image patch matching network.

4.2.1. Qualitative Experiments

Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11 show the corresponding points matched by the proposed method and the other four matching networks on the OS data. A yellow line represents a pair of correctly matched points, while a red line represents a false match. The matching effect can be visualized by the number and distribution of yellow and red lines. For images with strong texture information, such as image pairs A and B, Contextdesc can extract geometric and regional information to enhance the descriptors obtained by the network. However, it has difficulty extracting higher-level semantic information from the other, weakly textured images, resulting in many red false matching lines. Compared to Contextdesc, R2D2 slightly improves the performance on the more difficult experimental data, such as image pairs D and F, and the number of yellow lines indicating correct matches increases. D2-Net and ASLFeat extract higher-level semantic information from the images, so they perform better than the above two methods and obtain more correct matching lines.
From the results, it can be seen that PM-Net not only obtains more correct matching lines but also a more uniform distribution of matched points. In addition, the number of red false matching lines for PM-Net is very low. This is because MKD takes into account both low-level and high-level image information, which provides basic information for the learning of the two-channel matching network. Therefore, both the number and the distribution of its matched points are better than those of D2-Net and ASLFeat.

4.2.2. Quantitative Experiments

The number of correct matching points (NCM), the F1-measure and the root mean square error (RMSE) of the matching points were introduced as quantitative measurements for the five matching networks: Contextdesc, R2D2, D2-Net, ASLFeat and PM-Net. The threshold $\varepsilon$ for NCM and RMSE in this paper was chosen as 3 pixels. The experimental results are shown in Table 1, Table 2 and Table 3 and Figure 12. The experimental results with the best performance are marked in bold.
  1. The NCM is determined by checking each output matched point $(x, y)$: when its error is less than the threshold $\varepsilon$, it is considered a correct match.
  2. The F1-measure captures the suitability of matched points through joint Recall and Precision, and is calculated as follows [47]:
$F1\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (8)
Precision in the above equation is the ratio of correct matches to the total number of matches, and measures the ability of a matching method to exclude false matching points. Recall is the ratio of correct matches to the number of corresponding keypoints [48], and evaluates the ability to return correct matching points.
  3. The root mean square error of the matching points reflects the accuracy of the points derived by the matching algorithm, and the RMSE is calculated as:
$RMSE = \sqrt{\frac{1}{NCM}\sum_{i=1}^{NCM}\left\| H\left(x_{i}, y_{i}\right) - \left(\dot{x}_{i}, \dot{y}_{i}\right)\right\|^{2}}$ (9)
To calculate the RMSE, 20 checkpoints are selected manually for each set of test images, from which we compute the geometric affine transformation parameters $H$ and the theoretical coordinates $(\dot{x}_{i}, \dot{y}_{i})$.
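For reference, the three metrics can be sketched as follows, assuming the matches are given as point pairs and $H$ is the 2 × 3 affine matrix estimated from the manually selected checkpoints; this is an illustrative approximation rather than the exact evaluation code.

```python
# Sketch of NCM, F1-measure and RMSE for a set of matched points, given a
# ground-truth affine transform H and a pixel threshold eps.
import numpy as np


def apply_affine(H: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """H: (2, 3) affine matrix; pts: (N, 2) points; returns transformed (N, 2) points."""
    return pts @ H[:, :2].T + H[:, 2]


def evaluate(src_pts, dst_pts, H, n_correspondences, eps=3.0):
    """src_pts, dst_pts: (N, 2) matched points; n_correspondences: corresponding keypoints."""
    residuals = np.linalg.norm(apply_affine(H, src_pts) - dst_pts, axis=1)
    correct = residuals < eps
    ncm = int(correct.sum())                                  # number of correct matches
    precision = ncm / len(src_pts)                            # correct / total matches
    recall = ncm / n_correspondences                          # correct / corresponding keypoints
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    rmse = float(np.sqrt(np.mean(residuals[correct] ** 2))) if ncm else float("nan")
    return ncm, f1, rmse


if __name__ == "__main__":
    H = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, -3.0]])         # hypothetical affine transform
    src = np.random.rand(50, 2) * 256
    dst = apply_affine(H, src) + np.random.randn(50, 2)       # matches with ~1 px noise
    print(evaluate(src, dst, H, n_correspondences=60))
```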
By comparing and analyzing the experimental results, it can be seen that PM-Net gives a substantial number of correct matching points for all the experimental image pairs, and its RMSEs are within three pixels. In contrast, Contextdesc, which is designed specifically for multimodal images, only completes matching on image pairs A and B with rich texture information and fails on the other four pairs. This is because Contextdesc does not include keypoint detection, but rather extracts the regional and geometric information of images to enhance the feature description of the matching network. Therefore, it produces the worst results for optical and SAR images with large differences. R2D2 considers the repeatability of keypoints and improves the reliability of matching, so it is slightly improved compared with Contextdesc. ASLFeat can extract richer keypoints in the dense feature extraction process, and it uses the inherent characteristics of the network hierarchy to preserve spatial resolution and local details for more accurate keypoint localization. For image pairs E and F with weak texture, the overall performance of ASLFeat is better than that of Contextdesc. However, for pairs C and D with noise and large image differences, ASLFeat matching fails.
For the experimental image pairs of A and B, D2-Net outperforms PM-Net in terms of F1-measure values in Table 2. Because D2-Net uses a ‘Describe-and-Detect’ strategy, i.e., it extracts descriptors and keypoints simultaneously, D2-Net generates feature maps from the original image and then performs feature detection on low-level feature maps. Therefore, for the two image pairs A and B with strong texture, D2-Net can extract richer information from the original images, resulting in slightly better F1-measure values than PM-Net.
As can be seen from Table 1, Table 2 and Table 3, PM-Net obtains the greatest number of correct matching points on the six pairs of test images, and it is also clear from the qualitative experiments that its points are more evenly distributed. In the F1-measure comparisons, PM-Net obtains better results overall. This shows that PM-Net is able to extract robust matching points from optical and SAR images. Apart from this, the RMSEs of PM-Net's matching points are all less than three pixels. PM-Net learns the mapping from optical and SAR image patch pairs to matching labels through its end-to-end network structure. It can complete the matching task without descriptors, which makes it more adaptable to optical and SAR image matching. Therefore, PM-Net is able to obtain better matching accuracy for optical and SAR image pairs with noise interference and large feature differences.

4.2.3. Ablation Experiments

In this section, the ablation experiments used to verify the advantages of MKD and the two-channel matching network are described. Ten pairs of images were randomly selected from the QXS-SAROPT dataset as test data. The specific experimental setups were as follows, and the ablation experiment results can be seen in Figure 13.
  • Performance of MKD
We compared MKD with the traditional keypoint detection method SIFT and the advanced network-based method D2-Net in the ablation experiments. At the same time, experiments were also set up to verify the superiority of MKD's strategies: (1) not using peak values as the keypoint measurement, i.e., MKD (w/o peak values); (2) not implementing the weighted combination, i.e., MKD (w/o weighted combination).
The different keypoint detection strategies were used with the same image patch matching network (Matching Module, MM). Repeatability was used as the evaluation criterion [48]: the higher the repetition rate, the higher the probability that the extracted keypoints are correct matching points. The results are shown in Table 4, and the experimental results with the best performance are marked in bold.
  • Performance of the two-channel matching network
To verify the superiority of the two-channel network structure (2chN), we conducted ablation experiments with two other commonly used image patch matching network structures, namely the pseudo-Siamese network (PSN) and the Siamese network (SN). They were trained on the same dataset, drawn from part of the QXS-SAROPT dataset. Precision was used as the evaluation criterion. The results are shown in Table 5, and the experimental results with the best performance are marked in bold.
When peak values were not used as the keypoint detection measurement, or when the weighted combination over feature layers was not applied, the experimental results of MKD were unsatisfactory. Compared with D2-Net, which uses the maximum response values as the detection condition, MKD achieved better results overall.
As for the two-channel network structure, we performed patch matching comparison tests against two other Siamese-type network structures (i.e., the pseudo-Siamese network and the Siamese network). They used the same loss function and were trained on the same dataset. From the experimental results, the two-channel network achieved higher matching precision overall on the optical and SAR test images.
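For reference, the repeatability criterion used in the MKD ablation above can be approximated as follows; the exact definition follows [48], and the affine ground truth and keypoint arrays are placeholders.

```python
# Sketch of the repeatability criterion: the fraction of keypoints detected in
# one image that have a keypoint detected within eps pixels of the
# corresponding location in the other image.
import numpy as np


def repeatability(kps_a: np.ndarray, kps_b: np.ndarray, H: np.ndarray, eps: float = 3.0) -> float:
    """kps_a: (N, 2), kps_b: (M, 2) keypoints; H: (2, 3) affine mapping image a to image b."""
    warped = kps_a @ H[:, :2].T + H[:, 2]                                    # keypoints of a in b's frame
    dists = np.linalg.norm(warped[:, None, :] - kps_b[None, :, :], axis=2)   # (N, M) distances
    repeated = (dists.min(axis=1) < eps).sum()
    return repeated / max(len(kps_a), 1)


if __name__ == "__main__":
    H = np.array([[1.0, 0.0, 10.0], [0.0, 1.0, -5.0]])
    kps_a = np.random.rand(100, 2) * 256
    kps_b = np.vstack([kps_a @ H[:, :2].T + H[:, 2], np.random.rand(50, 2) * 256])
    print(round(repeatability(kps_a, kps_b, H), 2))   # close to 1.0
```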

5. Discussion

The comparative experimental results corroborate the accuracy and robustness of our method. Compared with four advanced matching networks, our method achieves accurate and stable matching under different ground scenes, and its advantage is especially obvious in weak-texture areas. The advantages are as follows. First, the keypoints detected by MKD are richer, because they combine low-level and high-level semantic information. Second, the patch matching network based on the two-channel structure constructs the similarity relationship between the optical image and the SAR image, from which the image patch matching results are obtained.
Meanwhile, the ablation experimental results confirm the performance of the MKD and two-channel network structure used by PM-Net for optical and SAR image patch matching. Compared to the keypoints detection strategy that does not implement the weighted combination, i.e., MKD (w/o weighted combination), MKD’s repeatability improves by more than 24.4%. Similarly, when using peak values as the basis of keypoints measurement, the repeatability of MKD improves by at least 5.4%. The matching precision of the two-channel network structure increased by more than 7.5%.

6. Conclusions

In this paper, PM-Net was proposed to solve the optical and SAR image patch matching problem by combining MKD and patch feature learning. First, MKD combines multi-level feature maps to obtain richer image information for keypoint detection. Second, a two-channel network was presented for learning the mapping from image patch pairs to matching labels directly; it completes image matching by judging the matching labels of image patch pairs. In comparison experiments with Contextdesc, R2D2, D2-Net and ASLFeat, PM-Net showed obvious advantages in terms of NCM, F1-measure and RMSE on six pairs of optical and SAR images. Furthermore, the structural advantages of PM-Net were verified in ablation experiments on ten pairs of test data. However, the time consumption of PM-Net is large due to its brute-force matching strategy; improving its time efficiency through parallel processing will be our future work.

Author Contributions

Conceptualization, Z.L.; methodology, Z.F.; writing—review and editing, H.N.; data curation, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grant no. 41961053. It was also supported by the Yunnan Fundamental Research Projects under grant nos. 202101AT070102 and 202101BE070001-037.

Data Availability Statement

The main code and all data in the paper are open source on https://github.com/Liziqian-1012/PM-net (accessed on 10 June 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Suri, S.; Reinartz, P. Mutual-information-based registration of TerraSAR-X and Ikonos imagery in urban areas. IEEE Trans. Geosci. Remote Sens. 2009, 48, 939–949. [Google Scholar] [CrossRef]
  2. Li, G.; Lin, Y.; Qu, X. An infrared and visible image fusion method based on multi-scale transformation and norm optimization. Inf. Fusion 2021, 71, 109–129. [Google Scholar] [CrossRef]
  3. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Trans. Instrum. Meas. 2021, 70, 1–13. [Google Scholar] [CrossRef]
  4. Sahin, G.; Cabuk, S.N.; Cetin, M. The change detection in coastal settlements using image processing techniques: A case study of Korfez. Environ. Sci. Pollut. Res. 2022, 29, 15172–15187. [Google Scholar] [CrossRef] [PubMed]
  5. Hou, B.; Liu, Q.; Wang, H.; Wang, Y. From W-Net to CDGAN: Bitemporal change detection via deep learning techniques. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1790–1802. [Google Scholar] [CrossRef]
  6. Zhang, H.; Lei, L.; Ni, W.; Tang, T.; Wu, J.; Xiang, D.; Kuang, G. Explore Better Network Framework for High Resolution Optical and SAR Image Matching. IEEE Trans. Geosci. Remote Sens. 2021, 60. [Google Scholar] [CrossRef]
  7. Fan, B.; Huo, C.; Pan, C.; Kong, Q. Registration of optical and SAR satellite images by exploring the spatial relationship of the improved SIFT. IEEE Geosci. Remote Sens. Lett. 2012, 10, 657–661. [Google Scholar] [CrossRef]
  8. Xiang, Y.; Wang, F.; You, H. OS-SIFT: A robust SIFT-like algorithm for high-resolution optical-to-SAR image registration in suburban areas. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3078–3090. [Google Scholar] [CrossRef]
  9. Gong, M.; Zhao, S.; Jiao, L.; Tian, D.; Wang, S. A novel coarse-to-fine scheme for automatic image registration based on SIFT and mutual information. IEEE Trans. Geosci. Remote Sens. 2013, 52, 4328–4338. [Google Scholar] [CrossRef]
  10. Cui, S.; Ma, A.; Wan, Y.; Zhong, Y.; Luo, B.; Xu, M. Cross-Modality Image Matching Network With Modality-Invariant Feature Representation for Airborne-Ground Thermal Infrared and Visible Datasets. IEEE Trans. Geosci. Remote Sens. 2021, 60. [Google Scholar] [CrossRef]
  11. Merkle, N.; Luo, W.; Auer, S.; Müller, R.; Urtasun, R. Exploiting deep matching and SAR data for the geo-localization accuracy improvement of optical satellite images. Remote Sens. 2017, 9, 586. [Google Scholar] [CrossRef]
  12. Zhang, H.; Ni, W.; Yan, W.; Xiang, D.; Wu, J.; Yang, X.; Bian, H. Registration of multimodal remote sensing image based on deep fully convolutional neural network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3028–3042. [Google Scholar] [CrossRef]
  13. Hughes, L.H.; Schmitt, M.; Mou, L.; Wang, Y.; Zhu, X.X. Identifying corresponding patches in SAR and optical images with a pseudo-siamese CNN. IEEE Geosci. Remote Sens. Lett. 2018, 15, 784–788. [Google Scholar] [CrossRef]
  14. Zhu, H.; Jiao, L.; Ma, W.; Liu, F.; Zhao, W. A novel neural network for remote sensing image matching. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2853–2865. [Google Scholar] [CrossRef]
  15. Jiang, X.; Ma, J.; Xiao, G.; Shao, Z.; Guo, X. A review of multimodal image matching: Methods and applications. Inf. Fusion 2021, 73, 22–71. [Google Scholar] [CrossRef]
  16. Parmehr, E.G.; Zhang, C.; Fraser, C.S. Automatic registration of multi-source data using mutual information. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, 7, 301–308. [Google Scholar] [CrossRef]
  17. Liang, J.; Liu, X.; Huang, K.; Li, X.; Wang, D.; Wang, X. Automatic registration of multisensor images using an integrated spatial and mutual information (SMI) metric. IEEE Trans. Geosci. Remote Sens. 2013, 52, 603–615. [Google Scholar] [CrossRef]
  18. Xu, X.; Li, X.; Liu, X.; Shen, H.; Shi, Q. Multimodal registration of remotely sensed images based on Jeffrey’s divergence. ISPRS J. Photogramm. Remote Sens. 2016, 122, 97–115. [Google Scholar] [CrossRef]
  19. Wang, S.; Quan, D.; Liang, X.; Ning, M.; Guo, Y.; Jiao, L. A deep learning framework for remote sensing image registration. ISPRS J. Photogramm. Remote Sens. 2018, 145, 148–164. [Google Scholar] [CrossRef]
  20. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  21. Xu, C.; Sui, H.; Li, D.; Sun, K.; Liu, J. An automatic optical and sar image registration method using iterative multi-level and refinement model. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci 2016, 7, 593–600. [Google Scholar] [CrossRef]
  22. Ma, J.; Jiang, X.; Fan, A.; Jiang, J.; Yan, J. Image matching from handcrafted to deep features: A survey. Int. J. Comput. Vis. 2021, 129, 23–79. [Google Scholar] [CrossRef]
  23. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  24. Luo, Z.; Shen, T.; Zhou, L.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. Contextdesc: Local descriptor augmentation with cross-modality context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2527–2536. [Google Scholar]
  25. Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A trainable CNN for joint detection and description of local features. arXiv 2019, arXiv:1905.03561. [Google Scholar]
  26. Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2d2: Reliable and repeatable detector and descriptor. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  27. Luo, Z.; Zhou, L.; Bai, X.; Chen, H.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. ASLFeat: Learning local features of accurate shape and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6589–6598. [Google Scholar]
  28. Yang, Z.; Dan, T.; Yang, Y. Multi-temporal remote sensing image registration using deep convolutional features. IEEE Access 2018, 6, 38544–38555. [Google Scholar] [CrossRef]
  29. Ye, F.; Su, Y.; Xiao, H.; Zhao, X.; Min, W. Remote sensing image registration using convolutional neural network features. IEEE Geosci. Remote Sens. Lett. 2018, 15, 232–236. [Google Scholar] [CrossRef]
  30. Ma, W.; Zhang, J.; Wu, Y.; Jiao, L.; Zhu, H.; Zhao, W. A novel two-step registration method for remote sensing images based on deep and local features. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4834–4843. [Google Scholar] [CrossRef]
  31. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  32. Simo-Serra, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 7–13 December 2015; pp. 118–126. [Google Scholar]
  33. Ahmed, E.; Jones, M.; Marks, T.K. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3908–3916. [Google Scholar]
  34. He, H.; Chen, M.; Chen, T.; Li, D. Matching of remote sensing images with complex background variations via Siamese convolutional neural network. Remote Sens. 2018, 10, 355. [Google Scholar] [CrossRef]
  35. Zagoruyko, S.; Komodakis, N. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4353–4361. [Google Scholar]
  36. Zhang, L.; Rusinkiewicz, S. Learning to detect features in texture images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6325–6333. [Google Scholar]
  37. Tian, Y.; Fan, B.; Wu, F. L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 661–669. [Google Scholar]
  38. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  39. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
  40. Choy, C.; Park, J.; Koltun, V. Fully convolutional geometric features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8958–8966. [Google Scholar]
  41. Ruby, U.; Yendapalli, V. Binary cross entropy with deep learning technique for image classification. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9. [Google Scholar]
  42. Shen, T.; Luo, Z.; Zhou, L.; Zhang, R.; Zhu, S.; Fang, T.; Quan, L. Matchable image retrieval by learning from surface reconstruction. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; pp. 415–431. [Google Scholar]
  43. Huang, M.; Xu, Y.; Qian, L.; Shi, W.; Zhang, Y.; Bao, W.; Wang, N.; Liu, X.; Xiang, X. The QXS-SAROPT dataset for deep learning in SAR-optical data fusion. arXiv 2021, arXiv:2103.08259. [Google Scholar]
  44. Zhao, L.; Zhang, Q.; Li, Y.; Qi, Y.; Yuan, X.; Liu, J.; Li, H. China's Gaofen-3 Satellite System and Its Application and Prospect. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11019–11028. [Google Scholar] [CrossRef]
  45. Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 421–436. [Google Scholar]
  46. Dogo, E.; Afolabi, O.; Nwulu, N.; Twala, B.; Aigbavboa, C. A comparative analysis of gradient descent-based optimization algorithms on convolutional neural networks. In Proceedings of the 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), Belgaum, India, 21–22 December 2018; pp. 92–99. [Google Scholar]
  47. Nunes, C.F.; Pádua, F.L. A local feature descriptor based on log-Gabor filters for keypoint matching in multispectral images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1850–1854. [Google Scholar] [CrossRef]
  48. Ye, Y.; Shan, J.; Hao, S.; Bruzzone, L.; Qin, Y. A local phase based invariant feature for remote sensing image matching. ISPRS J. Photogramm. Remote Sens. 2018, 142, 205–221. [Google Scholar] [CrossRef]
Figure 1. The pipeline of PM-Net. The proposed method has two stages: multi-level keypoints detector and patch feature learning.
Figure 2. Illustration of peak value, reprinted from Ref. [36]. 2018, Linguang Zhang. (Left): different response curves can lead to same ranking. (Right): peak value of the response curve can be evaluated as the area above the most-peaked response curve.
Figure 3. Sample image pairs of multi-level keypoints detector training data.
Figure 4. Sample image pairs of two-channel matching network training data.
Figure 5. Six test image pairs of optical and SAR (From left to right: pairs A, B, C, D, E and F).
Figure 6. Matching results of pair A. (a) Contextdesc; (b) R2D2; (c) D2-Net; (d) ASLFeat; (e) PM-Net. Yellow lines represent correct matches and red lines represent false matches.
Figure 7. Matching results of pair B. (a) Contextdesc; (b) R2D2; (c) D2-Net; (d) ASLFeat; (e) PM-Net.
Figure 8. Matching results of pair C. (a) Contextdesc; (b) R2D2; (c) D2-Net; (d) ASLFeat; (e) PM-Net.
Figure 9. Matching results of pair D. (a) Contextdesc; (b) R2D2; (c) D2-Net; (d) ASLFeat; (e) PM-Net.
Figure 10. Matching results of pair E. (a) Contextdesc; (b) R2D2; (c) D2-Net; (d) ASLFeat; (e) PM-Net.
Figure 11. Matching results of pair F. (a) Contextdesc; (b) R2D2; (c) D2-Net; (d) ASLFeat; (e) PM-Net.
Figure 12. Quantitative experimental results.
Figure 13. Ablation experiment results.
Table 1. Number of correct matching points.

Image Pair    Contextdesc    R2D2    D2-Net    ASLFeat    PM-Net
A             140            129     265       243        305
B             105            45      155       140        259
C             7              34      81        101        238
D             3              12      113       16         216
E             6              5       124       94         298
F             2              62      165       116        276
Table 2. Values of F1-measure.

Image Pair    Contextdesc    R2D2     D2-Net    ASLFeat    PM-Net
A             0.054          0.116    0.487     0.213      0.359
B             0.042          0.049    0.342     0.183      0.208
C             0.074          0.023    0.155     0.313      0.388
D             0.041          0.051    0.184     0.082      0.297
E             0.025          0.055    0.163     0.057      0.219
F             0.058          0.103    0.370     0.078      0.434
Table 3. Root mean square error of matching points.

Image Pair    Contextdesc    R2D2     D2-Net    ASLFeat    PM-Net
A             3.084          1.358    2.465     1.557      1.043
B             2.691          3.480    2.151     2.377      1.101
C             3.671          3.062    2.590     3.372      1.166
D             /              3.237    3.149     3.671      1.797
E             3.196          3.839    3.226     3.504      1.989
F             /              3.592    3.319     3.746      2.810
‘/’ means matching failed.
Table 4. Ablation experiment result of keypoints detection. The metric is repeatability (%Rep.).

Test Number       SIFT + MM    D2-Net + MM    MKD (w/o Peak Values) + MM    MKD (w/o Weighted Combination) + MM    MKD + MM
Test 1            44.18        62.70          26.00                         30.00                                  61.06
Test 2            49.56        47.27          28.24                         42.86                                  53.33
Test 3            21.13        31.46          45.28                         33.33                                  48.05
Test 4            41.57        44.72          26.09                         39.13                                  55.07
Test 5            38.51        63.56          35.91                         39.89                                  59.87
Average (1–10)    38.70        46.64          30.91                         38.18                                  53.19
Table 5. Ablation experiment result of patch matching. The metric is precision (%).

Test Number       MKD + PSN    MKD + SN    MKD + 2chN
Test 1            38.72        40.89       57.05
Test 2            31.93        61.69       66.36
Test 3            36.84        51.49       45.38
Test 4            34.13        22.98       38.89
Test 5            41.19        35.64       49.87
Average (1–10)    38.01        40.69       48.28
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
