Heterogeneous Image Matching via a Novel Feature Describing Model

: Computer vision has been developed greatly in the past several years, and many useful and interesting technologies have been presented and widely applied. Image matching is an important technology based on similarity measurement. In this paper, we propose a novel feature describing model based on scale space and local principle component analysis for heterogeneous image matching. The traditional uniform eight-direction statistics is updated by a task-related k -direction statistics based on prior information of the keypoints. In addition, the k directions are determined by an approximately solution of a Min-Max problem. The principle component analysis is introduced to compute the main directions of local patches based on the gradient ﬁeld. In addition, the describing vector is formed by then implementing PCA on each sub-patch of a 4 × 4 mesh. Experimental results show the accuracy and efﬁciency of proposed method.


Introduction
Computer vision techniques have been widely applied in many areas such as object tracking, medical image analysis, and pattern recognition. Image matching can help to find the similar content in different images by analyzing the pixel values and their potential features [1][2][3][4][5][6][7][8][9][10][11][12]. Traditional image matching techniques are developed based on different principles and information using skills. Some of them directly use the original pixel values in a certain region, and it is easy to operate and work well in some simple scenes [13][14][15][16][17]. It often requires more computation costs and is sensitive to the geometrical differences. Some others process the matching computing based on features extraction instead of directly on the pixel values [2,3,6,8,14,18]. Invariance can be achieved or partly achieved in some procedures such as rotation, resizing, translation, and so on. After the features are detected, matching computing will be processed on a certain similarity.
Some local feature detection methods have been presented in the past years, and it was put forward firstly by Moravec [13] and Harris [1]. Moravec and Harris introduced auto-correlation function and auto-correlation matrix independently to find feature points (simple corners).
However, for the extracted feature points, the former is not adaptive to rotation and noise, the latter is adaptive to rotation and illumination, does not have scale invariance. Shi and Tomasi improved the Harris, and put forward the Shi-Tomasi [19], which makes the distribution of extracted feature points more uniform and reasonable. It is computationally expensive and the feature points are not scale invariant. SUSAN (smallest univalue segment assimilating nucleus) is proposed to determine corners by the number of pixels in the preset region. The algorithm is small in computation and does not have scale invariance [4]. Then, a machine learning method is used to compute FAST (features from accelerated segment test) [20]. It can calculate faster, but when there are noise in the image, there will be more error feature points. Based on FAST, Mair put forward AGAST (adaptive and generic corner detection based on the accelerated segment test), which effectively improves the speed of feature point extraction [21]. However, none of them have scale invariance. KAZE descriptor used a nonlinear diffusion filter to construct a stable nonlinear scale space. It takes a long time to construct the nonlinear scale space, and more octaves can reduce the efficiency of KAZE [22]. With the rapid development of machine learning, some new methods, including sparse coding and convolutional neural network (CNN), are applied to feature extraction [23][24][25][26][27][28][29][30]. Sparse coding is a linear combination that reconstructs the input data into a set of hypercomplete basis vectors. The new features of input data are composed of the coefficients of base vectors. Sparse coding is very slow in practice. The eigenvalue is obtained by a convex optimization method in the test phase. From the structure perspective, the network can use the increased nonlinearity to get the approximate structure of the objective function, and at the same time get a better representation of the characteristics by increasing the depth of CNN. However, it also increases the overall complexity of the network, making it difficult to optimize the network and easy to over fit.
Scale-Invariant Feature Transform (SIFT) is proposed to improve the features invariance of previous methods [3,6]. Image pyramid and local gradient information statistics are introduced to compute the 128-dimensional features. It can work well even if there exist some changes in scale, brightness, viewing angle, etc. However, the computation and the matching error rate are still considerable. Some techniques based on SIFT have been developed over the past years. Ke and Sukthankar introduce principal component analysis to replace the weighted histograms and reduce the dimension of features to 20 [7]. Speeded up robust features (SURF) are presented based on Hessian matrix and image convolutions to faster the computing [18]. There are some other techniques that have also been introduced to achieve better performance [9,[31][32][33][34].
However, heterogeneous image matching is much different from natural image matching [35][36][37][38][39]. Some vision characteristics may change seriously by the imaging condition. The related features describing will also change greatly. It can be found that many traditional methods do not work in heterogeneous image matching. This paper aims to present a novel local features describing model more advantageously for heterogeneous image matching. (1) The prior information of some keypoints is used to determine k adaptive directions instead of the traditional uniform eight directions. A gradient histogram is computed on these new directions and forms the first part of the describing vector. (2) The main direction of a local area is computed by carrying out PCA on the gradient fields instead of searching the extreme of a gradient histogram. (3) PCA is independently carried out on 16 sub-patches of the local area and the first components are integrated to the second part of the describing vector. After the final k + 32 dimensional descriptor formed, better performances can be achieved on some matching tasks by the proposed method with respect to some traditional methods.
The rest of this paper is organized as follows: in Section 2, the related fundamentals of scale space and principle component analysis are prepared. A novel feature descriptor based on PCA is proposed in Section 3. In Section 4, several experiments are implemented to verify the accuracy and efficiency of proposed algorithm.

Scale Space and Difference Space and Principal Component Analysis
Scale space theory [40] has been introduced in many models and it can help to distinguish the features in different levels. After a scale parameter added to the related model, a multi-scale representation sequence can be computed following the changing of the parameter. Many tasks can be achieved by analyzing or processing the sequence such as extracting features.
The classical linear scale space can be generated by a diffusion equation as shown in Equation (1), and it can be regarded as the observation of the image at different distance/scale: where t denotes the scale parameter and c denotes the diffusion constant. I(x, t) means the a scale space and Ω ⊂ R 2 is the image domain. This partial differential equation can be solved by some classical techniques such as Fourier transform and many related models have been presented for different image processing problems [41][42][43]: It means that the scale space can be denoted as the convolution of image function I(x) and a Gaussian kernel function g √ 2t with different t, so-called Gaussian scale space. Then, the discrete difference of Gaussian space can be denoted as d I(x, t) = I(x, t + dt) − I(x, t). ( Here, dt can be set according to the image data. Figure 1 shows two Gaussian images and a difference image of Lena. Principal Component Analysis (PCA) [44] is a common technique for dimensionality reduction and has been widely applied to many computer vision problems such as feature selection, object recognition, and so on. The main ideas of PCA can be summarized as follows. Assume n × d matrix Y is the normalized observation data of random variables Y 1 , Y 2 , ·, Y d . Then, the d-dimensional data can be mapped into an orthogonal space for a maximal variation in each new dimension. The orthogonal space can be computed by applying matrix eigen value theory on the covariance of Y: The components can be denoted as V i = Yv i with the assumption λ 1 > λ 2 > · · · > λ d and v i means the direction with maximal projection variation of original data. Then, the first k components can be used for dimension reduction of the original data, and k can be determined by specific needs.
Though PCA suffers from some restriction and shortcomings, it is still popular due to its simplicity. Fergus et al. introduce PCA to unsupervised scale-invariant learning for object recognition [45]. This technique is also applied to represent keypoint patches and improves SIFT's matching performance [7], named PCA-SIFT.

SIFT and PCA-SIFT
Based on the scale space and the difference space, the original Scale-Invariant Feature Transform (SIFT) can be described as four main steps.
The extreme points in the DoG (difference of Gaussian) space are detected by comparing the pixels in a 3 × 3 × 3 neighborhood, named non-maximum suppression. Let D(x) and U r (x) denote the DoG function and r-neighborhood of x; then, 2. Keypoint localization.
Taylor expansion and principle curvature are both applied to check the detected extreme points again and then keypoints can be located. Let H denote the Hessian of DoG function D(x); then, Generally, set r = 10.
To achieve rotation invariance, eight-direction statistics is applied on the gradient field of A(x), the local patch centered at x. Then, the orientation can be computed by searching the maximal direction of the histogram.
Here, hist8(θ k , ∇A) means the value in [θ k − π 8 , θ k + π 8 ) when the histogram was carried out on a local gradient patch ∇A(x). θ k is one of the eight uniform directions.

Keypoint descriptor.
Then, eight direction statistics is applied on each subarea after the local gradient field has been rotated by θ 0 and divided into 4 × 4 sub-areas. Let B ( i, j) denote the subareas of rotated A(x); then, a 128-dimensional descriptor can be obtained by arranging them as follows: The PCA-SIFT is similar to the SIFT in many details and they only differ in Step 4. It can be summarized as follows.
1. An eigenspace of the image gradient fields is pre-computed.
The gradient fields on a 41 × 41 patch centered at the keypoint can be rearranged to a 2 × 39 × 39 = 3042-dimensional vector. In addition, it is normalized to unit magnitude for reducing the unexpected impact of variations in illumination. Such about 21,000 patches will be collected and then principle component analysis is processed. The first 20 components will be selected to generate the projection matrix for any other observation. Let C denote the normalized version of collected patches; then, the projection matrix for any path can be computed as Here,

A simplified descriptor is used on the eigenspace.
A 3042-dimensional vector z on a given image patch centered at x will be computed and normalized, and then projected into the pre-computed feature space as denoted by Equation (10). A 20-dimensional vector can be used to describe the feature point. Figure 2 shows some keypoints and patches of the Lena image. Figure 2b shows the gradient filed on the yellow patch. Figure 2c shows the pre-computed feature space to be dimension reduced: After then, the Euclidean distance can be used to determine whether the two vectors mean the same keypoint in different images. In more detail, a proper threshold ρ 0 is presented for the ratio between the best and second-best match as in the SIFT. Let I a and I b be the two images, x ∈ I a and x ∈ I b are keypoints, and then the matching between x and y can be determined based on PCA-SIFT features F PCA−SIFT (x) and F PCA−SIFT (y) Here, It is obvious that a smaller ρ 0 means fewer false positives and more false negatives. On the contrary, a bigger ρ 0 leads to more false positives and fewer false negatives. A proper threshold ρ 0 is advantageous to an appropriate trade-off between false positives and false negatives.
Different from the features represented by histogram of local gradient fields in traditional SIFT, PCA-SIFT represents the features in the eigenspace. The descriptor can be computed more efficiently after the eigenspace pre-computed based on sufficient prior keypoints. Speeded up robust features (SURF) detect the extremes in the difference of Hessian (DoH) instead of DoG. Then, the downsampling in the Gaussian pyramid is replaced by a box-filter with some fixed size, and the eight-direction statistics is replaced by computing a Harr wavelet response. However, the strong dependence on local gradient and discrete scale in DoH should still be addressed in some cases. KAZA constructs a nonlinear scale space via a nonlinear diffusion filter for better matching performance in complex cases. In addition, the additive operator splitting (AOS) scheme requires more computation.

Proposed Method
The original PCA-SIFT describes the feature via the first 20 components of collected patches centered known keypoints. It is advantageous for the linear structure, but some information of nonlinear structure will be ignored. In addition, pre-computing should be carried out on considerable classical images/patches for good representativeness. Furthermore, 20 components may be not sufficient in some scenes for the information lost.
In our opinion, PCA is more advantageous to achieve local representativeness than the global, and it is robust to some image changes such as noise, rotation, and so on. Thus, we introduce it to describe local information and then form a novel feature describing model.
Moreover, the traditional uniform eight directions in [0, 2π) are easy to be used to describe the local gradient distribution centered at a keypoint. However, various distributions can be found in different tasks indeed. Task-related directions will be more beneficial to achieve accurate features and correct matches. The prior information of keypoints can be used to determine the directions/intervals which were used to implement gradient statistics.
Heterogeneous images are often captured by difference imaging devices/methods on a same scene/object. Figure 3 shows two types of heterogeneous images and their local gradient fields. The traditional image changes such as noise, rotation, and scaling can be regarded as specific heterogeneity. Heterogeneous features or keypoints can be explained in a similar way. It can be found that the local gradient fields of heterogeneous images are very different from the original. Traditional methods based on gradient statistics often fail to achieve matching tasks on such heterogeneous images. This paper aims to present a novel feature describing model that is advantageous to address some heterogeneous images matching tasks.

Main Steps
Based on previous sections, a novel frame for heterogeneous image matching can be formulated as follows.

Gradient histogram and adaptive intervals.
According to the matching task, some prior information of keypoints can be helpful to determine proper candidate intervals for computing gradient histogram. Such adaptive intervals are more advantageous to describe the keypoints accurately and distinguish different keypoints than traditional uniform eight-intervals.
The local gradient information of some known keypoints can be integrated as ∇I(x), x ∈ Ω key . Ω key means the set of keypoints and ∇I(x) is the gradient at a keypoint x in the rotated local patch. Then, refined interval statistics is carried out on it, and a weighted histogram can be denoted as Here, {c i } m i=1 means the centers set of candidate intervals. θ(x) denotes the direction angle of ∇I(x). Figure 4 shows the traditional histogram and the weighted version of Lena image as m = 36.
Then, an optimization model (14) is introduced to determine a set of adaptive intervals for local gradient statistics: Here, E j (j = 1, 2, · · · , k) means the adaptive intervals and C j (j = 1, 2, · · · , k) denotes the centers, k is the number of selected intervals. It can be approximately solved by a dynamic programming algorithm after k preset.

SIFT keypoints localization.
In this step, the keypoints are detected as the SIFT. The extreme points in the DoG space can be detected by non-maximum suppression. Let D(x) and U r (x) denote the DoG function and r-neighborhood of x; then, the keypoints can be located as Equations (5) and (6).

Main direction computing based on PCA of local patch.
The main direction will be computed on the local gradient field of patch A(x) centered the keypoint x. To recognize heterogeneous image features, we normalize the keypoint a maximal point in the difference space. The local gradient field extracted from the pre-computed global gradient field will be treated in the same way. Then, PCA will be carried out on the normalized rearranged 2 × n matrix L, and the first component can used to determine the main direction: Here, v 1 = PCA(1, ∇A) means the first component of PCA carried out on ∇A; it can be denoted in more detail as Then, the patch will be rotated by the main angle before computing the feature vector.

Feature vector computing based on PCA and the adaptive intervals.
To enhance the ability recognizing heterogeneous image features, we define the feature vector by two parts. Gradient statistics of the rotated local patch B will be taken on the k selected intervals, and the first part of the feature vector is generated. Then, the local patch B will be divided into 4 × 4 sub-patches, denoted by B s,t (s, t = 1, 2, 3, 4). Different amounts of sub-patches can be set for different performances on accuracy or computation. A bigger amount means more computation and higher sensitivity. PCA will be carried out out independently on each small patch and determine the main gradient. Figure 5 shows the 16 divided sub-patches and the main gradient computed by PCA on a local gradient field. These 16 main gradients generate the second part of the feature vector. The final 32 + k dimensional descriptor can be obtained by arranging them as where F 1 = hist(E j , ∇A) k , F 2 = [PCA(1, ∇B s,t )] 4×4×2 , hist(E j , ∇A) means the value in interval E j when the histogram was implemented on local patch A(x). E j is one of the selected intervals computed in Step 1.

Parameters Setting
The adaptive intervals are determined by the optimization model (14) based on m candidate intervals. Too big of an m is not necessary for the limited discrete information of the keypoints, and we set m = 360 in the experiments.
k adaptive intervals are used for gradients statistics and embedded in the feature vector. It can be found that a smaller k is advantageous to save computation but not beneficial to precision. A smaller k means less information from the keypoint collected in F 1 , the first part of the feature vector. It will reduce the accuracy in some sense. A bigger k required more computation, and it is beneficial to precision. More information for the keypoint can be collected in the F 1 . However, it also leads to higher sensitiveness. For a trade-off between computation and precision, it is advantageous to set k = 36 in our experiments.
Then, the matching between two keypoints x, y can be determined based on features F(x) and F(y) by Equation (11) and Equation (12). The threshold ρ 0 is set in [1,3]. A smaller ρ 0 prefers to find more matches, but the precision may be low. A bigger ρ 0 is advantageous to capture accurate matches, but the amount may be more limited. In our experiments, ρ 0 = 1.6 is set for an expected precision.

Experiments
In real scenes, images are often affected by changes of viewpoint, scale, illumination, and so on. We ran two main types of experiments to explore the difference between the traditional methods and the proposed method. The first types of experiments explore the robustness to effects caused by rotation and the addition of noise. The second types of experiments verify the efficiency of the proposed method to heterogeneous image matching. We collected several classical natural images and heterogeneous image pairs. Some transforms are applied to them: (1) additional Gaussian noise (SNR = 20 db); (2) rotation of 7π/36 followed by additional Gaussian noise; and (3) scaled 50% and rotated π/6. The experiments are completed under Windows 7 with Matlab R2017b. To evaluate the performance, the proposed algorithm is compared with Harris, SIFT, and PCA-SIFT. For more credible and stable results, the first two experiments are repeated for 10 times, and the results are computed on average. In addition, according to the widely known work for features matching performance evaluation [5,46], correct number, cost time per match (ms), precision, and recall are applied to evaluate the performance. The related definitions are stated as follows. The correspondence between two images can be computed as illustrated in [47]: Noised images matching test. The four images (Pavilion, Monkey, Fruits, Elaine) are all added Gaussian noise (SNR=20db) for their polluted versions. Then, the proposed method and the compared methods are all carried out on each pair (original and polluted).
Matching results of one implementation are shown in Figure 6. The rows from top to bottom mean different image pairs, Pavilion, Monkey, Fruits and Elaine, corresponding to the horizontal axis in Figure 7. The columns from left to right mean different methods: Harris, SIFT, PCA-SIFT, and the proposed method. Harris finds the least matching number while SIFT and PCA-SIFT achieve better performance. More matching can be found by the proposed method than others in this test. It can be found that 91 matches in the Pavilion image pair while 100 matches in the Monkey image pair by the proposed method. In addition, 52 matches and 98 matches can be found independently in the Fruits image pair and Elaine image pair. More details of an accurate matching number can be found in Figure 7a.
The cost time per match is shown in Figure 7b, and the proposed method achieves better performance than SIFT and PCA-SIFT. The bar value means the total cost time and the green part means the localization time. Precision and recall of each image pair are shown in Figure 7c,d. It can be found PCA-SIFT loses on precision but gains on recall compared to SIFT. The proposed method can achieve higher precision and recall than SIFT and PCA-SIFT, but roughly equivalent to Harris. Rotated and noised images matching test. The four images (Pavilion, Monkey, Fruits, Elaine) are all added Gaussian noise (SNR = 20 db) after rotation by π 6 . Then, the proposed method and the compared methods are all carried out on each pair. Matching results are shown in Figures 8 and 9. The columns mean different method results. Harris achieves accuracy of about 95% on the Pavilion image pair, and it finds only one match in the Fruits image pair. Our method has an advantage on the correct matching number. The details of an accurate matching number can be found in Figure 9a.
As shown in Figure 9a, Harris finds the least matching number in each image pair. The proposed method finds the most matches in three of four image pairs. It can be found that there are 42 matches in the Pavilion image pair while 60 matches in the Monkey image pair. In addition, 23 matches and 41 matches can be found independently in the Fruits image pair and Elaine image pair. More details of accurate matching numbers can be found in Figure 9.   The cost time per match is shown in Figure 9b, and the proposed method achieves more stable performance. Precision and recall are shown in Figure 9c,d. It can be found that the proposed method achieves better precision and recall.  Scaled and rotated images matching test. The four images (Pavilion, Monkey, Fruits, Elaine) are all scaled 50% after rotation by π/6. After applying the proposed method and the compared methods on each image pair, the matching results are shown in Figure 10. The proposed method finds more matches than other methods, and Harris finds no match. PCA-SIFT finds less matches than SIFT, and it means that the components do not work so well with the mixed changes. Other results can be found in Figure 11a.
As shown in Figure 11b, the proposed method achieves the best performance on cost time than SIFT and PCA-SIFT (Harris fails to find any match, and the cost time can't be computed). Precision and recall are shown in Figure 11c,d. PCA-SIFT achieves four 100% on Precision and the proposed method achieves three. PCA-SIFT achieves the best Recall, and our method performs better than SIFT and Harris.
Heterogeneous image matching test. The original four images are all brushed to simulate the heterogeneous version. After applying the proposed method and the compared methods on each image pair, the matching results are shown in Figure 12. Harris only can find very few matches in the second pair and the fourth pair but could find no matches in other pairs. SIFT and PCA-SIFT can find roughly the same matches in each pair. Our method can find more matches than the other three methods in each pair.
To be more exact, the proposed method finds 46 matches in the Pavilion image pair while 40 matches in the Monkey image pair by the proposed method. Thirty matches and 54 matches can be found independently in the Fruits image pair and Elaine image pair. SIFT and PCA-SIFT only catch about half of the proposed method. Harris fails to find any match in Pavilion and Fruits. Other results can be found in Figure 13a.
As shown in Figure 13b, the proposed method achieves the best performance on cost time compared to SIFT and PCA-SIFT (Harris fails to find any matches in two pairs, and the cost time can't be computed). Precision and recall are shown in Figure 13c,d. PCA-SIFT achieves a little worse performance than SIFT on Precision. SIFT achieves two 100%, and our method achieves one, but our method achieves obviously better performance in Monkey and Fruits. The proposed method achieves the best performance on recall in three pairs and only a little lower than PCA-SIFT in the Elaine image pair.  Based on the four experiments above, we use three performance levels (Low, Moderate, High) to comprehensively evaluate the methods as shown in Table 1. It can be found the proposed method is advantageous in mixed changed image matching and heterogeneous image matching. For more comprehensive results from the proposed method, we carry out these methods on the Oxford database in the last experiment. Matching test on Oxford database. Eight image pairs are selected from the well-known feature matching test database Oxford provided by Mikolajczyk et al. [11]. All the images are scaled to one-third of the original size, and the right one of each pair is transformed into a heterogeneous version as shown in Figure 14. After applying the proposed method and the compared methods on each image pair, the matching results can be obtained as shown in Figure 14. Harris finds few matches and the proposed method finds the most matches in each pair. SIFT and PCA-SIFT can find moderate matches. Our method can find more matches than the other three methods in each pair as shown in Figure 15a and Table 2.   Harris  1  9  6  3  0  2  4  0  SIFT  6  74  5  15  25  12  19  1  PCA-SIFT  0  73  3  3  26  4  17  0  Proposed  19  98  25  28  35  24  26  16 As shown in Figure 15b, the proposed method achieves better performance on cost time than Harris and SIFT but worse than PCA-SIFT. Though Harris finds the least matches, it still achieves the best performance on Precision excluding Leuven and Wall. The proposed method achieves the best performance as shown in Figure 15c. PCA-SIFT achieves four 100%, and SIFT achieves seven 100%. The recall performances are shown in Figure 15d, and it can be found that the proposed method achieves better performance than the other compared methods.

Conclusions
In this paper, a novel features describing method based on scale space and local principle component analysis is presented for heterogeneous image matching. The keypoints are located in the DoG space and similar to traditional SIFT. Then, k directions are optimized for gradient statistics on some prior local patches centered keypoints and the first part of the feature describing vector is formed. Then, PCA is introduced to compute the main direction of each local patch instead of using the traditional eight-direction statistics. In addition, the second part of the feature describing vector is then achieved by applying PCA on each sub-patch of the rotated local patch. Several experiments' results verified the efficiency and accuracy of the proposed method.