1. Introduction
Multimodal remote sensing images can be applied to compensate the deficiencies of the single image source by increasing the amount of the image information. The generic and automatic registration of multimodal remote sensing image is the necessary step of point cloud coloring, image fusion, image stitching and mosaic, target recognition and change detection (e.g., change detection based on two heterogeneous images acquired by optical sensors and radars on different dates [
1]). And it is of fundamental importance for numerous emerging geospatial environmental and engineering applications (e.g., geometric correction of SAR image using the optical image, band-to-band image registration developed for the High-Precision Telescope of microsatellite remote sensing [
2], the spatial registration of point/line-scan hyperspectral sensor measurements in-water hyperspectral imaging [
3]). Therefore, multimodal image registration technology is the major obstacle impeding the improvement of the accuracy and effectiveness of various problems [
4]. Due to the various imaging mechanisms of different sensors and the disparate time, angle and environment of image acquisition, there are still many challenges in the field of multimodal images registration. The realization of high-precision and automatic registration technology is even more difficult, especially in some cases where there are significant differences between remote sensing data category (e.g., optical and SAR images, point cloud depth map and panoramic images) and band (e.g., visible light and medium wave infrared image, near infrared and multi-spectral images).
Traditional remote sensing image registration methods adopt the manual selection of control points to solve geometric transformation relationships to achieve pixel-by-pixel alignment between images, which are mainly constrained in terms of two aspects, including the heavy workload and low efficiency and automation. At present, the automatic registration technology for multimodal remote sensing images has been extensively studied and is one of the research hotspots in the field of image processing.
For multimodal remote sensing images, the grayscale characteristic is no longer a linear relationship. Neither is it even a non-function change generally with statistical correlations and geometric similarities in the gray relations between images. Meanwhile, the non-uniform deformation would occur during acquisition of multimodal images. As what is shown in
Figure 1, errors are in pixels and have been calculated based on manually denoted homonymy points. It sometimes would be ignored that numerous deformation properties of multimodal images are non-rigid and non-linear.
Image registration has been comprehensively studied through many approaches being proposed. Due to the weak grayscale correlation of multimodal images, simple, gray-based registration methods (e.g., the correlation function [
5]) would lead to inaccurate registration. Feature descriptors, although it has great advantages in homologous image registration, do not perform with the same robustness in multimodal remote sensing datasets [
6] while these methods (e.g., contour feature [
7,
8], SUSAN [
9], SIFT [
10] phase congruency based feature [
11,
12], extended SURF [
13]) are faced with the challenges of repetitive feature detection and modal-invariant feature description. Most of them are still exploratory or are only suitable for specific occasions or certain types of multimodal images.
Due to the statistical correlation of grayscale relationships and the similarity of geometric structures between multimodal remote sensing images, the regional registration method based on statistical dependence or image structure consistency is suitable for multimodal image registration. The following is a summary of the current research status of this type of registration method.
The regional registration methods based on statistical dependency outperforms the previous methods. The well-known mutual information [
14,
15,
16] mainly uses the statistical correlation of gray features between images, which is employed for medical image registration first and the multimodal image registration afterwards. The normalized mutual information [
17] can be applied to overcome the shortcoming of mutual information sensitivity to image overlap. High-dimensional mutual information [
18], regional mutual information [
19], feature mutual information [
20] have high dimension, complex calculation and other defects. Cross-Cumulative Residual Entropy (CCRE) [
21] introduces cumulative residual entropy into mutual information with strong anti-noise ability and stability. Besides, wavelet transform [
22] is applied to registration methods based on mutual information to realize the registration of infrared and optical images from coarse to fine. It is difficult for these methods to achieve ideal results in terms of time consumption or registration accuracy.
Generally, the basic structures of ground objects can be preserved in the multimodal image. These structures share certain similarities and invariances. Image structure information can generally be represented by gradients, edges and self-similarity. Yan et al. [
23] proposed the gradient consistency operator based on the norm-weighted angle between gradient vectors and applied it to the registration between medium-infrared and visible images. Heinrich et al. [
24] proposed Modality Independent Neighborhood Descriptor (MIND), which enjoys obvious advantages compared with the anti-noise characteristics and stability. However, it involves massive calculation. Ye et al. [
25] construct a shape similarity measurement using local self-similarity and normalized correlation coefficients to for multispectral remote sensing image registration. These methods are only suitable for certain multimodal images or have low computational efficiency.
Another type of registration methods is to combines image structure information and mutual information. Among them, the most popular research is the combination of gradient feature information and mutual information for heterogeneous image registration [
26,
27,
28,
29], which is widely applied to medical image registration research. But in the field of remote sensing image registration and application, these methods are still immature and further research is needed.
In order to register multimodal remote sensing images, various registration methods have been proposed through these decades. A generic and automated registration framework based on Markov Random Field (MRF) [
30,
31] has been successfully utilized for multimodal remote sensing image registration. The discrete optimization setting along with the introduced data-specific energy terms form a modular approach with respect to the similarity criterion, which endows the fully exploitation of the spectral properties of multimodal remote sensing datasets.
However, the registration accuracy of the registration framework based on Markov Random Field (MRF) cannot reach the subpixel level. Inspired by the aforementioned work, we propose a common, automatic registration framework of the multimodal image to satisfy the standards of the desired registration accuracy. The proposed approach performs non-rigid registration, formulates a Markov Random Field (MRF) model while grayscale and gradient statistical information of multimodal image is employed for the evaluation of the similarity and the spatial weighting function is optimized simultaneously. Deformable registration satisfies differential homeomorphism, which can better maintain the topology of the image [
32]. The value space is discretized to improve the convergence speed. In terms of spatial accuracy, the average registration error is less than 1 pixel while the maximum registration error is less than 1 pixel. The major contributions of this paper are listed below:
- (1)
We propose a generic and automatic registration framework for multiple multimodal remote sensing images which significantly improves registration accuracy.
- (2)
Compared with other methods, the proposed approach considers both the grayscale and gradient statistical information of images, which is used for similarity measures of data-specific energy term, hence to possess stronger robustness for the registration of multiple heterogeneous remote sensing images and target scenes.
- (3)
It can be noted that the contribution of the spatial weight function to the registration energy should be considered. We employed the Gaussian function method, which can overcome the discontinuity of the spatial weight function and hence to further improve registration accuracy.
The rest of the paper is organized as follows: In
Section 2, we describe the overall process and details of the proposed generic and automatic MRF-based registration for multimodal remote sensing image using grayscale and gradient Information.
Section 3 presents the experiments and results of four multimodal datasets to demonstrate the effectiveness both qualitatively and quantitatively. In
Section 4, a discussion about the proposed method with the accuracy improvement and the influence of parameter analysis is conducted.
Section 5 presents the conclusions and suggestions for the future work.
2. Materials and Methods
The multimodal image registration framework considering the gray and gradient information is the association of Markov Random Field (MRF) and B-spline interpolation. The basic framework of the framework is the Markov Random Field (MRF) model, which can be used to estimate the optimal transformation by minimizing the energy function. Firstly, the image is meshed to form a uniform control grid. Considering the gray and gradient information of the image for cost calculation, the optimal solution (Fast PD algorithm [
33,
34,
35], used drop library) is used to obtain the label of MRF (the displacement of the control grid node). Then the B-spline interpolation strategy is used for the displacement calculation of each pixel point to control image deformation and perform registration.
In the specific case of image registration, the MRF model is defined as follows [
30]: the node (the spatial position
xp) corresponds to the control point in the uniform B-spline grid; For each node, there is a set of discrete labels
L; The discrete labels set
L corresponds to the quantization of the solution space, representing the allowable discrete displacement. The number of labels set is 4
n + 1 while the uniform sampling number along the
x,
y and
diagonal lines is
n. The random variables would correspond with the displacement of the control points. The optimized energy function consists of two terms: the data term measures the data (the source image and the gradient image) likelihood of applying all allowed displacements to each random variable through the use of unary potentials
D and the regular term penalizes non-desirable interactions between the random variables through the use of pairwise potentials
P and introduces the prior knowledge of the smoothing constraints of the deformation field. Besides,
λ is a scalar value used to evaluate the influence of the regular term. The goal of image registration is to assign an optimal label to every control grid node, so that the following energy is minimized:
At the same time, the framework was implemented in this paper coupled with the pyramidal representation of the images and a multi-scale approach for the deformation model. The Gaussian pyramids could reduce the computational cost while the multi-scale approach for the deformation model can increase the resolution by halving the interval of the control points. Therefore, we can gradually improve the registration results and restore the larger displacement. For every grid resolution level under every pyramid level, an iterative scheme was used in order to enhance the efficiency of discrete labels. It is necessary to keep a reasonable label set space and optimize the labels at each iteration. Meanwhile, the displacements of grid nodes are used to improve the registration image to capture smaller displacements. The registration process is summarized in
Figure 2.
2.1. Uniform Grid Division of Control Point and Cost Calculation
The registration model we proposed in this paper is a nonparametric registration model based on MRF. The image was meshed to form a uniform control point grid
G:
M ×
N (
M and
N are significantly smaller than the image size; the interval between the grid nodes is
δ). The cost calculation of regular grid codes based on the MRF model is composed by two parts, namely the grayscale data term and the gradient date term, which are defined by Equation (2). We employed a block-matching similarity criterion [
30] to evaluate the maximum likelihood of applying all permissible displacements to each random variable, where the similarity measure is performed with the block image and its gradient image centered on each control grid node:
where
IS,
GS denote the original image and its corresponding gradient image. Accordingly,
IT,
GT refer to the target image and its gradient image.
ω represents the gradient statistic weight, which decreases corresponding with the increase of the number of the iterations.
η represents the block weighting function around the control point
p, ranging from 0 to 1. The closer the space to the central control point is, the greater the impact is
denotes Euclidean norm. Normal distribution is obviously a desirable weight distribution model. The center point
p is taken as the origin while the rest of the points are measured based on their positions on the normal curve. The density function of normal distribution is represented by Gaussian function:
where
b is the width parameter of the Gaussian function, which can be used to control the radial range.
In this paper, the gradient information of the multimodal image is extracted by the morphological gradient. The basic operation (e.g., morphological corrosion, expansion, open operation and closed operation) is used for image processing. The morphological gradient operator is formed correspondingly. Common morphological gradient operators can be calculated as follows:
The morphological gradient is the difference between the expansion graph and the corrosion graph. It can detect the edge of the image and extract the feature information and hence can obviously reduce the calculation load of the similarity measure and also improve the local extreme-value problem.
The similarity measure
ρ can use mutual information that is insensitive to changes in light. The normalized mutual information of two images can be used for similarity measure calculation based on block matching strategy, which can reflect the degree of mutual information between them through their entropy and joint entropy:
where
can be calculated from the joint probability distribution of the corresponding gray scale [
36]. The number of corresponding pixels is
n. Parzen estimates [
37] is used with the convolution of 2D Gaussian (represented by
):
Accordingly, the calculation of
can be similar to
:
For the regularization term, the priori constraint of the smoothing deformation field is introduced. This indicates that the displacement field of the control grid nodes is assumed to be smooth while the spatially close variables
p and
q should be assigned to have similar labels. We employ a simple strategy that is based on the vector differences between candidate labels normalized by grid distance
δ:
It is worth to note that, the cost calculation needs to use the grayscale and gradient information of the multimodal image based on normalization mutual information (NMI). At the same time, the block matching strategy is adopted for multimodal remote sensing datasets, which allows the local difference evaluation between the images to be registered while the computational efficiency can be improved. Cost calculation visualization flow chart shown in
Figure 3.
2.2. B-Spline Elastic Deformation Model
By optimizing the MRF model, we can solve the labels of controls grid nodes. The elastic deformation model of B-spline based on the grid can provide one-to-one and reversible conversion. The basic idea of the deformation model is to deform the underlying image through the control of the nodes of the grid and calculate their influence in the rest of the image domain via an interpretation strategy:
where
T(
x) is the target image;
d is the displacement of the control point
ij;
η corresponds to the interpolation or weighting function that determines the effect of the control point
ij on the image point
x. The closer the image point is, the greater the effect of the control point would be.
A uniform cubic B-spline function [
38] is used by the interpolation strategy. The (
m + 3) × (
n + 3) control grid is used to define the 2
D uniform cubic B-spline function. The function
F2 consists of
m ×
n 2
D patches, each of which is determined by 4 × 4 control points. Without loss of generality, a 2D uniform cubic B-spline function is represented by a patch
f2, which is defined by:
where,
,
B0,
B1,
B2 and
B3 is the basic function for the uniform cubic B-spline,
denotes to the displacement of the control point.
4. Discussion
4.1. Accuracy Improvements from Gradient Information
In order to verify the performance of the framework and accuracy improvements from gradient information, these three comparative methods have been studied based on four multimodal datasets, including (1) MRF-based registration framework only using grayscale information; (2) MRF-based registration framework only using gradient information; (3) MRF-based registration framework using grayscale and gradient information.
The evaluation of three comparative registration methods was performed both qualitatively and quantitatively, the results of which were further evaluated by the local characterization of the interest region on the checkerboard visualizations for the qualitatively evaluation. After the registration, the multimodal remote sensing images were superimposed together and displayed alternately in checkerboard pattern. As can be seen in
Figure 9, we selected two regions of interest on the checkerboards after three different registration methods to local zoom for the comparative analysis, which correspond to scenes with abundant textures and lack of textures, respectively. Through the observation of the checkerboard lattice edge of the multimodal image overlay, the following conclusions of the qualitative evaluation are made, which are presented below:
(1) For mid-infrared image and visible orthomosaic image registration, Worldview-3 near-infrared and multi-spectral image registration, the checkerboard visualization of the proposed framework is roughly equivalent with that of all other methods.
(2) For Terra SAR image and optical image registration, monolithic cloud point depth image and visible image registration, MRF-based registration framework only using grayscale information cannot achieve successful registration. MRF-based registration framework using grayscale and gradient information has better performance than that only using gradient information. There are still unaligned block edges on the checkerboard visualization of MRF-based registration framework only using gradient information, while checkerboard block edges of MRF-based registration framework using grayscale and gradient information are neatly coincident, so the registration effect of the proposed framework is much better than all other methods.
(3) For some scenes with abundant texture, the edge information is relatively abundant, MRF-based registration framework using grayscale and gradient information outperforms that only using grayscale information; for some scenes with lack of texture, the edge information is relatively deficient, MRF-based registration framework using grayscale and gradient information outperforms that only using gradient information. There are a large number of such areas with lack of textures in photogrammetry and remote sensing applications, so it is necessary to combine grayscale and gradient information.
In summary, the proposed framework has the best visualization results.
For quantitative evaluation, errors are in pixel level and have been calculated based on the manually-denoted homonymy points, which are all shown in
Table 5. The following conclusions can be drawn:
(1) Regarding the registration accuracy, MRF-based registration framework using grayscale and gradient information has better performance than other methods. The average target registration errors of four remote sensing datasets are less than 1 pixel, while the maximum displacement errors are less than 1 pixel.
(2) The proposed framework has improved the registration accuracy of four multimodal datasets, especially the registration of Terra SAR image and optical image and the registration of monolithic cloud point depth image and visible image. The proposed framework can achieve the sub-pixel-level registration precision even in the case where other methods cannot achieve successful registration or have poor registration accuracy.
As shown in
Figure 10, the registration error has been significantly decreased while the registration accuracy has been dramatically improved after the joint use of grayscale and gradient information. Above all, the proposed registration framework outperforms all other methods and meets the subpixel-level registration accuracy need.
4.2. Accuracy Improvements from the Spatial Weighting Function η
The common spatial weight functions are listed as follows, including (1) distance threshold method; (2) distance inverse method; (3) Gaussian function method. Although the distance threshold method is simple, it is constrained by the disadvantages that the function is not continuous. Therefore, it should not be used in the registration framework. In order to investigate the accuracy improvement of the proposed framework in relation to the spatial weighting function η, we set up two experiments correspondingly: other parameters will be fixed as optimal parameter settings in each dataset while the spatial weighting function η was chosen from distance inverse method or Gaussian function method for four multimodal datasets.
As shown in
Table 6,
Figure 11, the use of the Gaussian function method in the proposed framework can dramatically decrease the registration error and hence to increase the registration accuracy. Above all, the proposed registration framework using Gaussian function as spatial weighting function is superior to that using distance inverse method.
4.3. Influence of the Weight of the Regular Term λ
The weight of the regular term λ is a scalar value used to evaluate the influence of the regular term. In order to test the sensitivity of the proposed framework in relation to the weight of the regular term, the λ is varied from 200 to 2400.
As shown in
Figure 12, the registration error decreases at first before a gradual decrease. The bottom of registration error reflects the most suitable weight of the regular term
λ. There are four multimodal images having different suitable weights of the regular term
λ. When
λ is less than the most suitable weight of the regular term
λ, the registration error would greatly increase while the registration results would be seemingly distorted. When
λ is gradually greater than the most suitable weight of the regular term, wrong registration results would be obtained.