In this section, we detail the generation of the training data, the implementation of the network, and the qualitative and quantitative experiments.
4.3. Qualitative Experiments
To validate the effectiveness of the proposed method, two pairs of images (Sentinel-2 and Landsat-8) are used for testing. One pair mainly covers a mountain scene, and its two images show large gray-level differences. The other mainly contains man-made buildings, and part of its content is obscured by cloud. For each pair of images, we adopted two different strategies to obtain feature points. The first uniformly selects grid points from the sensed image (Sentinel-2), keeping a minimum distance of 60 pixels from the image boundary so that boundary points without corresponding points in the reference image are avoided. The second extracts 400 feature points from the image with the SIFT algorithm. Then, for each extracted feature point, we used the proposed method to match it against the reference image; the matching results are shown in Figure 5.
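As a concrete illustration of the two strategies, the sketch below shows how the candidate points could be obtained with OpenCV. The 60-pixel margin and the 400-point cap come from the text; the grid step is our assumption, since the paper does not state the grid density.

```python
import cv2
import numpy as np

def uniform_grid_points(image, margin=60, step=60):
    """Uniformly spaced grid points kept at least `margin` px from the border.

    The step size is an assumed value; only the margin is given in the text.
    """
    h, w = image.shape[:2]
    xs = np.arange(margin, w - margin, step)
    ys = np.arange(margin, h - margin, step)
    return np.array([(x, y) for y in ys for x in xs], dtype=np.float32)

def sift_points(image, max_points=400):
    """Locations of the strongest SIFT (DoG) keypoints."""
    sift = cv2.SIFT_create(nfeatures=max_points)
    keypoints = sift.detect(image, None)
    return np.array([kp.pt for kp in keypoints], dtype=np.float32)
```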
For each pair of images, we calculate the re-projection error of each matched point pair based on the true geometric transformation matrix H between them, with the error threshold set to 2 pixels. If the re-projection error of a feature point is below this threshold, the match is considered successful; otherwise, it is considered a failure. The matching success rate of each image pair, defined as the ratio of the number of correctly matched points to the total number of feature points, is displayed in the top left corner of the image. To better visualize the matching quality, each pair of matched points is connected by a line whose color encodes the re-projection error: the color changes from green to red as the error grows, with green representing a very small re-projection error and red representing an error exceeding twice the threshold.
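A minimal sketch of this scoring and visualization step is given below, assuming matched point arrays and precomputed per-match errors. The linear green-to-red ramp is our reading of the description, and `canvas` / `offset_x` refer to a hypothetical side-by-side composite of the two images.

```python
import cv2
import numpy as np

THRESHOLD = 2.0  # error threshold in pixels, as stated above

def success_rate(errors):
    """Ratio of correctly matched points (error below threshold) to all points."""
    return float((np.asarray(errors) < THRESHOLD).mean())

def draw_match_lines(canvas, pts_sensed, pts_ref, errors, offset_x):
    """Draw one line per match; color runs green -> red, saturating at 2x threshold."""
    for (x0, y0), (x1, y1), e in zip(pts_sensed, pts_ref, errors):
        t = min(float(e) / (2.0 * THRESHOLD), 1.0)       # 0 -> green, 1 -> red
        color = (0, int(255 * (1.0 - t)), int(255 * t))  # OpenCV uses BGR order
        cv2.line(canvas, (int(x0), int(y0)),
                 (int(x1) + offset_x, int(y1)), color, 1)
    return canvas
```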
In terms of the matching success rate, the proposed method performs well on both pairs, especially for the feature points extracted by SIFT. The uniformly selected feature points are also matched with high accuracy, which demonstrates the effectiveness of the proposed method. At the same time, looking at the distribution of the matched points, the correctly matched points are spread almost uniformly over the image, while the failures and large-error matches mostly lie near the image edges or in occluded regions. This well-spread distribution of correct matches is also very conducive to the final geometric correction of the image. Together, the high matching success rate and the uniform distribution of matched points illustrate the effectiveness of the proposed method.
4.4. Quantitative Experiments
To quantitatively evaluate the registration accuracy and performance of the proposed method, a total of 340 image pairs are used as test data. These images have different acquisition times, large gray-level differences, and diverse content, including mountains, rivers, roads, man-made buildings, farmland, etc., and some image content is partially obscured by thin cloud. Six samples of these image pairs are shown in Figure 6.
To quantitatively evaluate the performance of the proposed method, two measurements are adopted: the mean re-projection error ($MRE$) and the mean success ratio ($MSR$). The re-projection error of a pair of matched feature points $p_i$ and $q_i$ is calculated as follows:

$$ RE_i = \left\| H p_i - q_i \right\|_2 , $$

where $p_i$ is a feature point from the sensed image, $q_i$ is a feature point from the reference image, $H$ is the true geometric transformation matrix between them, and $H p_i$ is the re-projection of $p_i$ into the reference image; the $MRE$ of an image pair is the average of $RE_i$ over its matched points. The success ratio $SR$ is calculated as the ratio of the number of successfully matched feature points $N_s$ to the total number of feature points used for matching, $N_t$:

$$ SR = \frac{N_s}{N_t} . $$
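Under these definitions, both metrics for one image pair can be computed in a few lines. The sketch below assumes $H$ is a 3 × 3 homography and that the $MRE$ is averaged over the successfully matched points only; the extracted text does not state the averaging set explicitly.

```python
import numpy as np

def pair_metrics(pts_sensed, pts_ref, H, threshold):
    """SR and MRE for one image pair, following the formulas above."""
    pts_h = np.hstack([pts_sensed, np.ones((len(pts_sensed), 1))])  # homogeneous
    proj = (H @ pts_h.T).T
    proj = proj[:, :2] / proj[:, 2:3]            # back to Cartesian coordinates
    re = np.linalg.norm(proj - pts_ref, axis=1)  # RE_i = ||H p_i - q_i||_2
    success = re < threshold
    sr = float(success.mean())                   # SR = N_s / N_t
    # assumption: MRE averages only the successful matches
    mre = float(re[success].mean()) if success.any() else float("inf")
    return sr, mre
```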
The criterion for a successful feature point match is whether the re-projection error falls below a specified threshold: a pair of feature points is considered a successful match when their re-projection error is less than the threshold. For the $MRE$ metric, a smaller value indicates better performance, while for the $MSR$ metric, a larger value indicates better performance.
At the same time, three methods are used for comparison. One is the most popular traditional method, SIFT [6], and the other two are the deep-learning-based methods SuperGlue [46] and LoFTR [37]. Although the SIFT algorithm can generate a descriptor for a given feature point, the actual SIFT matching process also requires the scale and orientation of that feature point, and a corresponding image block is cropped based on this information to generate the descriptor. The two deep-learning-based algorithms are even more closely tied to feature extraction because they are trained end to end. Specifically, the SIFT algorithm extracts feature points with the Difference of Gaussians (DoG) operator and performs matching based on the ratio of the nearest-neighbor distance to the second-nearest-neighbor distance. The SuperGlue method, on the other hand, utilizes the SuperPoint network [30] to extract feature points and then employs the SuperGlue network for feature matching. As for the LoFTR method [37], its feature points and matches are obtained through joint optimization. The feature matching process of these three methods is therefore essentially tied to feature point extraction. In contrast, our proposed method works completely independently of feature extraction: given the position of a point in the image to be registered, it finds the corresponding feature point in the reference image. The weights of SuperPoint, SuperGlue, and LoFTR are provided by their respective authors, and their hyper-parameters are set to the default values provided by the authors. The implementations of SuperPoint and SuperGlue are from their authors, and the implementation of LoFTR is from the kornia library [47].
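For concreteness, the two baseline matching rules described above can be invoked roughly as follows. The 0.8 ratio threshold is a conventional choice rather than a value from the paper, and the LoFTR call uses kornia's pretrained "outdoor" weights, which may differ from the exact configuration used here.

```python
import cv2
import torch
import kornia.feature as KF

def sift_ratio_matches(img0, img1, ratio=0.8):
    """SIFT matching via the nearest / second-nearest distance ratio test."""
    sift = cv2.SIFT_create()
    kp0, des0 = sift.detectAndCompute(img0, None)
    kp1, des1 = sift.detectAndCompute(img1, None)
    knn = cv2.BFMatcher().knnMatch(des0, des1, k=2)
    # keep a match only if the best distance is clearly below the second best
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    return [(kp0[m.queryIdx].pt, kp1[m.trainIdx].pt) for m in good]

def loftr_matches(img0, img1):
    """img0 / img1: (1, 1, H, W) grayscale tensors scaled to [0, 1]."""
    matcher = KF.LoFTR(pretrained="outdoor").eval()
    with torch.no_grad():
        out = matcher({"image0": img0, "image1": img1})
    return out["keypoints0"], out["keypoints1"]
```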
However, for the same pair of images, the number, distribution, and location of the feature points extracted by these three algorithms differ. To avoid potential performance degradation of the three matching algorithms caused by differences in the extracted feature points, when comparing with each algorithm individually, the proposed method takes the feature points extracted by that algorithm as its set of candidate matching points and then performs feature point matching. The comparison results of the mean success ratio and mean re-projection error of the proposed method and each algorithm under threshold settings of 1, 3, 5, 7, and 9 pixels are shown in Table 1. Furthermore, boxplots comparing the mean success ratio and mean re-projection error of the proposed method with SIFT, SuperGlue, and LoFTR under different thresholds are shown in Figure 7.
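A sketch of this comparison protocol is given below: each baseline and the proposed method are scored on the same image pairs, and the per-pair metrics are averaged for each threshold in Table 1. Here `match_fn` is a hypothetical wrapper around either matcher, `pair_metrics` is the helper sketched earlier, and for the baselines the candidate points are simply the points they detect themselves.

```python
import numpy as np

THRESHOLDS = (1, 3, 5, 7, 9)  # pixels, as in Table 1

def evaluate(dataset, match_fn):
    """dataset yields (sensed, reference, candidate_pts, H_true) tuples;
    match_fn returns matched (sensed, reference) point arrays."""
    table = {t: {"sr": [], "mre": []} for t in THRESHOLDS}
    for sensed, reference, candidates, H in dataset:
        pts_s, pts_r = match_fn(sensed, reference, candidates)
        for t in THRESHOLDS:
            sr, mre = pair_metrics(pts_s, pts_r, H, threshold=t)
            table[t]["sr"].append(sr)
            table[t]["mre"].append(mre)
    # average over all image pairs; pairs with no successful match (mre = inf)
    # would need to be excluded or handled before averaging the MRE column
    return {t: (np.mean(v["sr"]), np.mean(v["mre"])) for t, v in table.items()}
```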
In the comparison with the SIFT algorithm, the average $MSR$ of the proposed method is consistently higher than that of SIFT and increases further as the threshold grows. As expected, the $MRE$ of the proposed method is larger than that of SIFT, mainly because under the same threshold setting the proposed method correctly matches far more feature points than SIFT. When the threshold is set to one pixel, the $MSR$ of the proposed method is much higher than that of SIFT, while its $MRE$ is only slightly higher than SIFT's. This fully demonstrates that the performance of the proposed method is significantly better than that of the traditional SIFT algorithm; that is, the descriptor obtained by comprehensively considering both the local and global information of a feature point is more discriminative than the descriptor obtained by simply cropping a small region based on the feature point's scale and orientation. Compared with SuperGlue, a deep-learning-based approach, the proposed method has a significantly higher mean success ratio ($MSR$): at the same threshold setting, the proposed method's average $MSR$ can be up to 5 times higher than SuperGlue's. Furthermore, the proposed method has a smaller mean re-projection error ($MRE$) than SuperGlue, which fully demonstrates that the proposed method comprehensively outperforms SuperGlue. The poor performance of SuperGlue is mainly due to its input being two sets of detected feature points, which restricts its attention to the neighborhoods of those detected points.
Compared with the most recent deep-learning method, LoFTR, the proposed method has a higher $MSR$. Regarding the $MRE$ metric, when the threshold is larger than 3, the proposed method's $MRE$ is higher than LoFTR's because, as expected, the proposed method has a higher average $MSR$ and therefore retains more matched points. However, when the threshold is less than or equal to 3, the $MRE$ values of the two methods are very close, which, together with the higher $MSR$, demonstrates that the proposed method is better than LoFTR.
Based on the boxplots of the mean success ratio from the various comparative experiments, it can be observed that, regardless of the method or threshold setting, there are cases of complete matching failure in the test dataset of 340 image pairs. This indicates that the chosen dataset is quite challenging. Upon examining the test dataset, we found that the complete failures are mainly due to extensive cloud coverage or to significant content variation caused by the temporal difference between the images of a pair. Although efforts were made to select images with minimal cloud coverage, the subsequent local cropping inevitably produced cases in which a significant portion of an image is obscured by clouds.
Furthermore, comparing the proposed method with SIFT, although the proposed method exhibits a larger interquartile range, its 25th, 50th, and 75th percentiles are significantly higher than those of SIFT. A similar trend is observed in the comparison with SuperGlue. In the comparison between LoFTR and the proposed method, their interquartile ranges are similar, but the 25th, 50th, and 75th percentiles of the proposed method are higher than those of LoFTR. Therefore, based on the boxplots of the mean success ratio, it can be concluded that the proposed method's mean success ratio is superior to those of the other three methods.
Analyzing the boxplots of the mean re-projection error from the various comparative experiments: compared with SIFT, the proposed method exhibits a slightly larger mean re-projection error, which aligns with our expectations, since the proposed method produces more matched points under the same threshold. Comparing the proposed method with SuperGlue, not only does the proposed method have lower 25th, 50th, and 75th percentiles across the different thresholds, but it also has significantly fewer outliers beyond the whisker maximum, at most half as many as SuperGlue. In the comparison between LoFTR and the proposed method, for thresholds greater than 3 the 25th, 50th, and 75th percentiles of the proposed method are higher than those of LoFTR; however, the proposed method has slightly more outliers beyond the whisker maximum than LoFTR. Overall, the boxplots of the mean re-projection error are consistent with the mean re-projection error values presented in Table 1.
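The boxplot comparisons discussed here can be reproduced, assuming per-image-pair metric lists are kept from the evaluation sketched above, with a short matplotlib snippet such as the following; the grouping and labels are illustrative.

```python
import matplotlib.pyplot as plt

def boxplot_msr(per_pair_msr_ours, per_pair_msr_base, thresholds):
    """per_pair_msr_*: dict mapping threshold -> list of per-image-pair MSR values."""
    data, labels = [], []
    for t in thresholds:
        data += [per_pair_msr_ours[t], per_pair_msr_base[t]]
        labels += [f"ours@{t}px", f"baseline@{t}px"]
    fig, ax = plt.subplots()
    ax.boxplot(data, labels=labels, showfliers=True)  # quartiles, whiskers, outliers
    ax.set_ylabel("mean success ratio per image pair")
    fig.tight_layout()
    return fig
```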
Besides, analyzing the three separate comparison experiments as a whole, we find that the proposed method's mean $MSR$ and $MRE$ values differ across the experiments because, for the same test dataset, the number and location of the selected feature points differ: in each experiment, the proposed method matches the feature points detected by the compared method. To investigate this further, we conducted an additional experiment in which 900 points are uniformly sampled from each image as feature points and then matched. The mean matching success ratio and mean re-projection error of the proposed method are shown in Table 2, and the corresponding boxplots are shown in Figure 8.
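If the 900 points are laid out as a square grid, which the text does not state explicitly, they could be produced as follows; the margin and layout are assumptions.

```python
import numpy as np

def grid_900(image, margin=60):
    """Hypothetical 30 x 30 = 900 uniformly spaced points per image; the actual
    sampling layout and margin used in the paper are not specified."""
    h, w = image.shape[:2]
    xs = np.linspace(margin, w - margin, 30)
    ys = np.linspace(margin, h - margin, 30)
    return np.array([(x, y) for y in ys for x in xs], dtype=np.float32)
```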
Comparing the proposed method's matching success rate under the different ways of obtaining feature points, we find that its mean $MSR$ is lowest when the feature points are uniformly selected and highest when they are extracted by SuperPoint. Further analysis shows that uniformly selected feature points are not of high quality: depending on the image content, for example under partial cloud coverage, uniform selection can pick low-quality feature points, resulting in the lowest mean matching success rate. From the $MRE$ metric, we can likewise conclude that the proposed method's $MRE$ is lowest when using the feature points extracted by SuperPoint, which indicates that SuperPoint's feature points and the proposed method form the best combination.