Feature Point Matching Method Based on Consistent Edge Structures for Infrared and Visible Images

Abstract: Infrared and visible image match is an important research topic in the field of multi-modality image processing. Due to the difference of image contents like pixel intensities and gradients caused by disparate spectrums, infrared and visible image match is a great challenge in terms of detection repeatability and matching accuracy. To improve the matching performance, a feature detection and description method based on consistent edge structures of images (DDCE) is proposed in this paper. First, consistent edge structures are detected to obtain similar contents of infrared and visible images. Second, common feature points of infrared and visible images are extracted based on the consistent edge structures. Third, feature descriptions are established according to edge structure attributes including edge length and edge orientation. Lastly, feature correspondences are calculated according to the distance of feature descriptions. Due to the utilization of consistent edge structures of infrared and visible images, the proposed DDCE method can improve the detection repeatability and the matching accuracy. DDCE is evaluated on two public datasets and compared with several state-of-the-art methods. Experimental results demonstrate that DDCE achieves superior performance against other methods for infrared and visible image match.


Introduction
Infrared and visible image match aims to establish the correspondence of feature points between images formed by different spectrums. Visible images can obtain the fine details of scenes, and infrared images can obtain coarse structures of scenes even under conditions of limited light like nighttime and fog [1]. Infrared and visible image match can provide complementary information captured by multi-modality images. Abbas et al. [2] utilize infrared thermal images to measure the temperature of neonates. Beauvisage et al. [3] utilize infrared images to carry out night-time navigation. These methods use infrared images to accomplish tasks that are impossible for visible images. The infrared and visible image match has been widely applied in unmanned aerial vehicles [3,4], remote sensing satellites [5,6], and security monitoring platforms [7,8].
The infrared and visible image match is still a difficult problem even though single-modality image matching methods have been extensively studied [9]. Visible images capture reflected light with a spectrum of 0.4~0.7 µm and infrared images capture thermal radiation with a spectrum of 0.75~15 µm [1]. Due to different imaging mechanisms, there are significant differences in image contents between infrared and visible images. The differences of image contents include nonlinear intensity variations between the two modalities.
Multi-oriented Sobel spatial filters are introduced in EOH (Edge Oriented Histogram) [27], PCEHD (Phase Congruency Edge Histogram Descriptor) [28], HoDMs (Histogram of Directional Maps) [29], and HOSM (Histogram of Oriented Structure Maps) [30]. EOH calculates edge orientation responses by multi-oriented Sobel spatial filters and uses the orientation corresponding to the maximum response as the edge orientation. The edge orientation histogram is established as the feature description. PCEHD extracts edges and feature points of infrared and visible images by using the phase consistency method and calculates the edge orientation histogram for feature points. Based on EOH, HoDMs uses edge strength responses to describe image textures and HOSM enhances image edges by the guided filter. Multi-scale and multi-oriented Log-Gabor filters are introduced in LGHD (Log-Gabor Histogram Descriptor) [31], MFD (Multispectral Feature Descriptor) [32], and RIDLG (Rotation Invariant Feature Descriptor based on Log-Gabor Filters) [33] to compute the edge orientation. These methods establish the edge orientation histogram by the orientation of the maximum Log-Gabor response. On this basis, RIFT (Radiation Invariant Feature Transform) [34] and MSPC (Maximally Stable Phase Congruency) [35] extract image features by phase congruency. To increase the number of feature points in multi-modal images, RIFT further adds edge points as feature points. Log-Gabor filters can obtain richer feature descriptions than Sobel filters.
However, Log-Gabor filters suffer from a high computation burden. Because edge-based methods leverage image structures rather than pixel information, these methods are more suitable for infrared and visible image match than gradient-based methods [31].
However, infrared and visible image matching methods still face the following problems. Gradient-based and edge-based methods need to extract common feature points to prevent non-common feature points from participating in image matching. Due to the consistency of edge structures of infrared and visible images, edge properties including orientation and length are consistent as well.
In the process of establishing feature descriptions by the edge orientation, the edge length should be used to further improve the ability to describe the global structure of infrared and visible images.
In order to overcome these deficiencies of edge-based methods, a feature detection and description method based on consistent edge structures of images (DDCE) is proposed. The main contributions are listed as follows.
(1) Consistent edges are extracted by selecting long edges to present global structures, which are similar in both infrared and visible images.
(2) Common feature points are detected according to the constraints of consistent edge structures.
(3) By using the edge properties including edge length and edge orientation, the edge length weighted edge orientation histogram is computed to build feature descriptions.
The remainder of this paper is organized as follows. In Section 2, the proposed matching method is described, including consistent edge extraction, common feature detection, feature description, and feature matching. In Section 3, experimental results and corresponding analyses are presented. In Section 4, this paper is concluded.

Proposed Methods
In this section, DDCE is presented in detail. First, consistent edges of infrared and visible images are extracted to capture the global structures of images. Second, common feature points are detected based on consistent edges. Third, the edge length weighted edge orientation histogram is calculated to build feature descriptions. Lastly, feature correspondences are established according to the obtained feature descriptions.

Consistent Edge Extraction
Consistent edges are usually the long edges of global structures, which are similar in infrared and visible images [36]. Due to different imaging mechanisms, the consistent edges in infrared and visible images are mainly the long edges that are formed by global structures of images, while the inconsistent edges are mainly the short edges that are formed by local details like textures [12]. Figure 1 illustrates a pair of visible and infrared images and their corresponding edge images. As shown in Figure 1, the long edges of global structures like the building, the crown, and the tent remain consistent in infrared and visible images. Since each side of these edges has different reflection and radiation characteristics, these edges can usually be captured in infrared and visible images. The short edges of local details like road textures and leaves are different, as shown in Figure 1. These short edges may be missing in one modality or have different position responses in infrared and visible images.
Consistent edges of infrared and visible images are extracted by selecting the long edges of images. First, due to the intensity difference between infrared and visible images, a white balance method is utilized to handle infrared and visible images so that they all occupy the maximal possible range [0,255] [37]. Second, histogram equalization is used to enhance the edges of infrared and visible images [38]. Third, the Canny method [39] is used to extract image edges. The key idea of Canny is to use two different thresholds to determine which points belong to an edge: a low threshold T1 and a high threshold T2. Points with a gradient greater than T2 are edge points. Points with a gradient greater than T1 and less than T2 are edge points if there is a continuous path linking those points to points with a gradient greater than T2. In this paper, the modified Canny method [40] is adopted, in which the high and low thresholds are set automatically. The detection procedure of the modified Canny method follows the same steps as the original Canny method except that T2 and T1 are set locally. Thresholds (T2, T1) are determined within a moving window centered on the current pixel. Following the setup used in Reference [40], the window size is chosen to be 20 × 20. T2 is set to the gradient magnitude ranked at the top 30% in the window, and T1 is set to 40% of T2 [40]. The set of extracted edges is denoted as E = {e1, e2, …, eM}, where M is the number of edges. Lastly, the edge length is calculated by the contour search algorithm [38]. As a result, an edge is denoted as ei = {p1, p2, …, pli}, where p is an edge point and li is the edge length. Consistent edges are generated by selecting edges longer than α. The set of consistent edges is denoted as le = {ei | |ei| > α, ei ∈ E}.
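The local threshold selection and edge length filtering described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names are ours, and the per-pixel loop is written for clarity rather than speed.

```python
import numpy as np

def local_thresholds(grad, win=20, top_frac=0.30, ratio=0.40):
    """Per-pixel Canny thresholds: T2 is the gradient magnitude ranked in the
    top 30% inside a win x win window centered on the pixel; T1 = 0.4 * T2."""
    h, w = grad.shape
    half = win // 2
    T2 = np.zeros_like(grad, dtype=float)
    for y in range(h):
        for x in range(w):
            patch = grad[max(0, y - half):y + half, max(0, x - half):x + half]
            T2[y, x] = np.quantile(patch, 1.0 - top_frac)  # top-30% rank
    return ratio * T2, T2  # (T1, T2)

def consistent_edges(edges, alpha=20):
    """Keep only edges longer than alpha: le = {e in E : |e| > alpha}."""
    return [e for e in edges if len(e) > alpha]
```

With α = 20 (the value used for both datasets in Section 3), short texture edges are discarded and only structural edges remain.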


Common Feature Point Detection
Non-common feature points reduce the detection repeatability of feature points. Because local details of infrared and visible images differ due to mutually missing content and disparate position responses, the feature points extracted on local details are non-common feature points. The performance of feature point matching methods depends on the repeatability of detected feature points [10,17]. Matching non-common feature points can only bring in wrong matches, which degrades the matching performance.
The proposed common feature point detection method utilizes the consistent edges of infrared and visible images to detect feature points. Due to the similarity of consistent edge structures, the percentage of common feature points in the detected feature points can be increased. The common feature point detection method contains two steps. First, candidate common feature points and consistent edges of infrared and visible images are detected. Second, consistent edges are leveraged to check whether the candidate feature points are common feature points or not.
Candidate common feature points are detected by the Harris response [41] of images. Saleem et al. [17] show experimentally that Harris corners have better repeatability than other feature point detection methods for infrared and visible images. The Harris response R of a point p is computed by the equations below.
M(p) = Σ_{(x,y)∈N(p)} w(x, y) [ I_x²  I_xI_y ; I_xI_y  I_y² ]   (1)

R = det(M) − k·(trace(M))²   (2)

where w is the Gaussian weight function, I_x and I_y are image derivatives, N(p) is the 3 × 3 neighborhood of point p, and k is a constant in the interval [0.04, 0.06] [39]. Points with response R > 0 are marked as candidate feature points [39]. The set of candidate common feature points P′ is obtained by local maximum suppression of the candidate feature points. Common feature points of infrared and visible images are obtained by filtering the feature point set P′ according to the constraints of the consistent edge set le. The set of common feature points, denoted as P, is defined by the equation below.
P = { p ∈ P′ | ∃ q ∈ N(p), q lies on an edge in le }   (3)

where N(p) is also the 3 × 3 neighborhood. According to Equation (3), feature points located on or near a consistent edge are selected as common feature points. Because a consistent edge reflects the global structure, which is consistent in infrared and visible images, feature points constrained by consistent edges can be considered common feature points. An example of common feature point detection is shown in Figure 2. As shown in Figure 2a, the feature points generated by the text on the visible image have no corresponding points on the infrared image. As a result, these feature points are non-common feature points. Matching these feature points can only produce wrong matches. The feature points extracted by the common feature point detection method are shown in Figure 2b. The number of feature points extracted on the text of the visible image is reduced significantly compared to Figure 2a. The common feature point detection method can improve the matching performance by removing non-common feature points.
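The detection step can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: the Gaussian weighting w is approximated by a 3 × 3 box smoothing for brevity, and all function names are ours.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris response R = det(M) - k * trace(M)^2 from the structure tensor M,
    built from image derivatives Ix, Iy aggregated over a local neighborhood."""
    Iy, Ix = np.gradient(img.astype(float))

    def smooth(a):
        # 3x3 box smoothing as a stand-in for the Gaussian weight function w
        p = np.pad(a, 1, mode='edge')
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0

    Sxx, Syy, Sxy = smooth(Ix * Ix), smooth(Iy * Iy), smooth(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2
    tr = Sxx + Syy
    return det - k * tr ** 2

def filter_by_edges(points, edge_mask):
    """Keep candidates whose 3x3 neighborhood touches a consistent edge,
    mirroring the constraint of Equation (3)."""
    h, w = edge_mask.shape
    keep = []
    for (y, x) in points:
        y0, y1 = max(0, y - 1), min(h, y + 2)
        x0, x1 = max(0, x - 1), min(w, x + 2)
        if edge_mask[y0:y1, x0:x1].any():
            keep.append((y, x))
    return keep
```

Local maximum suppression of the Harris response (not shown) would then yield the candidate set P′ before the edge constraint is applied.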

Feature Description Establishment
The proposed feature description method leverages the edge length weighted edge orientation histogram to establish feature descriptions based on the global structure of images. Because edges of the global structure are similar in infrared and visible images, feature descriptions need to depict the edge information in the local neighborhood of feature points. The edge length weighted edge orientation histogram can depict the statistics of edges in the local neighborhood of feature points. The steps of establishing feature descriptions based on edge structures include feature point neighborhood partition, edge orientation computation, edge orientation histogram establishment, and histogram normalization.
(1) The local neighborhood of feature points is divided into multiple sub-regions to describe the spatial distribution of edges in the neighborhoods. The size of the neighborhood is 80 × 80 [27]. Two-layer concentric circles are used to partition the neighborhood, as in HOG (Histogram of Oriented Gradients) [42]. The outer circle has a radius of r and the inner circle has a radius of r/2, where r is 40. Each circle is equally divided by π/9 radians into 18 sectors. Adjacent sectors of the same circle are expanded so that they overlap by π/36. Because only edge information, which is not as rich as gray levels and gradients, is available, a neighborhood partition with overlapping areas improves the description ability for feature points [43].
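The partition in step (1) can be sketched as follows. This is illustrative Python; `subregions` is a hypothetical helper that reports which (ring, sector) sub-regions contain a given offset from the feature point.

```python
import numpy as np

def subregions(dy, dx, r=40, n_sectors=18, overlap=np.pi / 36):
    """Sub-regions (ring, sector) containing the offset (dy, dx) relative to a
    feature point. Ring 0 is the inner circle of radius r/2, ring 1 the outer
    annulus up to r. The 18 sectors span pi/9 rad each and are widened so that
    adjacent sectors overlap by pi/36."""
    rho = np.hypot(dy, dx)
    if rho > r:
        return []                          # outside the 80 x 80 neighborhood
    ring = 0 if rho <= r / 2 else 1
    theta = np.arctan2(dy, dx) % (2 * np.pi)
    width = 2 * np.pi / n_sectors          # pi/9
    regions = []
    for s in range(n_sectors):
        lo = s * width - overlap / 2       # widened lower bound
        span = width + overlap             # widened sector span
        if (theta - lo) % (2 * np.pi) <= span:
            regions.append((ring, s))
    return regions
```

A point lying inside an overlap band is assigned to both adjacent sectors, which is what makes the description more tolerant of small orientation shifts.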
(2) The edge orientation is used to describe the edge information in the neighborhood. During the calculation of the edge orientation, the inconsistent short edges formed by local details are ignored and the consistent long edges formed by global structures are considered. The calculation of the edge orientation is performed by 0°, 45°, 90°, 135°, and non-directional (n.o.) Sobel filters, which are shown in Figure 3 [28]. The edge orientation is calculated for each point p of consistent edges in the neighborhood. The edge orientation of the point p is the orientation of the Sobel filter corresponding to the maximum response value. Let symbols fi, i = 1, 2, 3, 4, 5 denote 0°, 45°, 90°, 135°, and n.o. Sobel filters. The orientation b of the point p can be formulated by the equation below.
b = argmax_{i=1,…,5} | f_i ∗ p |

where ∗ stands for the convolution operation and |·| stands for the absolute value operation.
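Step (2) can be sketched as follows. The 2 × 2 kernels below follow the MPEG-7 edge histogram descriptor style commonly used with EOH-like methods; the paper's exact kernels are given in Figure 3 and may differ, so treat these values as illustrative.

```python
import numpy as np

# Illustrative 2x2 directional filters (MPEG-7 edge histogram style);
# the exact kernels of Figure 3 may differ.
FILTERS = {
    0:   np.array([[1.0, 1.0], [-1.0, -1.0]]),            # 0 deg
    45:  np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),# 45 deg
    90:  np.array([[1.0, -1.0], [1.0, -1.0]]),             # 90 deg
    135: np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),# 135 deg
    -1:  np.array([[2.0, -2.0], [-2.0, 2.0]]),             # non-directional
}

def edge_orientation(patch):
    """Orientation whose filter gives the maximum absolute response:
    b = argmax_i |f_i * patch|, evaluated at one edge point's patch."""
    responses = {b: abs(float((f * patch).sum())) for b, f in FILTERS.items()}
    return max(responses, key=responses.get)
```

For a horizontal step edge the 0° filter dominates; for a vertical step edge the 90° filter dominates, matching the argmax rule above.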
(3) The edge length weighted edge orientation histogram is used to present the distribution of edges in the neighborhood of feature points. After the edge orientation of each edge pixel in the neighborhood is obtained, the histogram is calculated in each sub-region. Since long edges have strong discrimination for infrared and visible images, the edge length is utilized in the histogram establishment. In order to enhance feature discrimination, the gradient orientation histogram is weighted by the gradient magnitude in the generation of SIFT and HOG. In a similar manner, the length of the edge where a point is located is used as the weight when the edge orientation histogram is calculated. The weight of point p on edge e_i is calculated by the equation below.

where l_i is the length of e_i and δ(·) equals 1 if its argument is true and 0 otherwise. Parameter l_m prevents long edges from suppressing the other edges and dominating the histogram. Parameter l_m is 1/2·max{H, W}, where H and W are the image height and width, respectively. Parameter α reduces the effect of inconsistent short edges by only using consistent long edges. The histogram of each sub-region can be formulated by the equation below.
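Step (3) can be sketched as follows. The weight formula in this sketch is an assumption consistent with the roles described for δ(·), l_m, and α (the paper's exact equation is not reproduced in this text): points on edges shorter than α contribute nothing, and longer edges contribute their length capped at l_m.

```python
import numpy as np

def length_weighted_histogram(orientations, lengths, alpha=20, l_m=100, n_bins=5):
    """Edge-length weighted orientation histogram for one sub-region.
    orientations: orientation bin index (0..4) of each edge point.
    lengths: length l_i of the edge each point belongs to.
    Assumed weight: w = min(l_i, l_m) if l_i > alpha else 0."""
    hist = np.zeros(n_bins)
    for b, l in zip(orientations, lengths):
        w = min(l, l_m) if l > alpha else 0.0   # delta term plus length cap
        hist[b] += w
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist          # normalized histogram
```

Concatenating the normalized histograms of all sub-regions would then yield the final feature description.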



Feature Matching
The bidirectional matching method is utilized to obtain a stable matching result [38]. According to the bidirectional matching method, a match is considered valid only if the same pair of feature points is obtained in both directions. Because visible images usually contain more content than infrared images, visible images can have more feature points. When searching the infrared image for features similar to those of the visible image, the feature points generated by the extra content of the visible image will only produce wrong matches. To avoid this situation, the bidirectional matching method first searches the visible image for features similar to those of the infrared image. Then, features similar to those of the matched feature points of the visible image are searched in the infrared image.
When determining feature correspondences, the nearest neighbor ratio method is used to select matching points for feature points. The nearest neighbor ratio method is introduced in Reference [13] to improve the matching robustness. When the ratio of the distance of the nearest neighbor over the distance of the second nearest neighbor is less than the specified threshold, the nearest neighbor is regarded as the correct match. The nearest neighbor ratio method is defined by the equation below.
‖f_a − f_b‖ / ‖f_a − f_c‖ < r

where f_b and f_c are the nearest and second nearest neighbors of f_a, respectively. Parameter r is set to 0.8 [13]. The distance between two feature descriptions is calculated by the equation below.
where h i is the histogram bin of the feature description f.
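The bidirectional nearest-neighbor ratio matching can be sketched as follows. This is illustrative Python using Euclidean distances between descriptors; the paper's exact descriptor distance may differ.

```python
import numpy as np

def nn_ratio(desc_a, desc_b, r=0.8):
    """Nearest-neighbor ratio matches from descriptor set a to set b:
    accept when d(nearest) / d(second nearest) < r."""
    matches = {}
    for i, fa in enumerate(desc_a):
        d = np.linalg.norm(desc_b - fa, axis=1)
        order = np.argsort(d)
        if len(order) >= 2 and d[order[0]] < r * d[order[1]]:
            matches[i] = int(order[0])
    return matches

def bidirectional_match(desc_ir, desc_vis, r=0.8):
    """Search from infrared to visible, then verify in the reverse direction;
    keep a pair only when both directions agree."""
    fwd = nn_ratio(desc_ir, desc_vis, r)
    bwd = nn_ratio(desc_vis, desc_ir, r)
    return [(i, j) for i, j in fwd.items() if bwd.get(j) == i]
```

Starting the search from the infrared image keeps the extra visible-only feature points from forcing spurious correspondences, as described above.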

Algorithm Procedure
The procedure of DDCE is shown in Table 1. Given infrared image I ir and visible image I vis , feature correspondence C of I ir and I vis is output by DDCE. Consistent edge extraction, common feature detection, feature description establishment, and feature matching are described in Section 2.1, Section 2.2, Section 2.3, and Section 2.4, respectively. Note that steps (1), (2), and (3) are identical for infrared image I ir and visible image I vis .

Experimental Results
In this section, the matching performance of DDCE is evaluated using visible and infrared images. First, the datasets and evaluation criteria are introduced. Second, matching result images are presented to illustrate the performance of DDCE on visible and infrared images. Third, matching performance analyses of DDCE are presented quantitatively in comparison with state-of-the-art methods. Lastly, the running time of DDCE is reported.

Datasets and Evaluation Criteria
Two public datasets, known as the CVC (Computer Vision Center of Universitat Autònoma de Barcelona) dataset [27] and LWIR (Long Wave Infrared Images) dataset [31], are utilized to validate the performance of DDCE. These two datasets are composed of visible and long wave infrared images. Homography transformations between images are provided by the datasets. Parameter α is utilized in feature detection and description of DDCE. For the CVC dataset and the LWIR dataset, α is set as 20.
Feature detection repeatability, feature matching accuracy, and RANSAC (RANdom SAmple Consensus) [44] estimation results are used to evaluate the matching performance of DDCE. Feature detection repeatability depicts the percentage of repeatable features detected in infrared and visible images. The re-projection error ε of two features with positions p_i and p_j can be expressed by the equation below.
ε = ‖p_i − H·p_j‖

where H is the homography transformation. The repeatability of two features is computed by the formula below.
where the threshold of the re-projection error ε is 2 pixels [17]. Repeatability is derived from the ratio between the number of repeatable features and the minimum number of features in the two images:

Repeatability = N_a / min(#I_ir, #I_vis)

where N_a is the number of all repeatable features and #I is the number of features detected in image I. Precision, recall, and the F1 score are used to assess feature matching accuracy. The correspondence of two features can be formulated by the equation below.
A pair of points p_i and p_j can be identified as a correct match if they satisfy Equation (7) and Equation (10) simultaneously. Precision is computed by the formula below.

Precision = N_c / (N_c + N_w)

where N_c is the number of correct matches and N_w is the number of wrong matches.
Recall is computed by the formula below.
The F1 score is computed by the equation below.

F1 = 2 × Precision × Recall / (Precision + Recall)
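The evaluation criteria above can be sketched as follows. This is illustrative Python: candidate feature pairs for repeatability are assumed to be given, and the recall denominator used here (the number of repeatable features) is one common convention, not necessarily the paper's exact definition.

```python
import numpy as np

def repeatability(pts_i, pts_j, H, n_i, n_j, eps=2.0):
    """Fraction of candidate pairs with re-projection error ||p_i - H p_j|| < eps
    over the minimum number of features detected in the two images."""
    n_rep = 0
    for pi, pj in zip(pts_i, pts_j):
        q = H @ np.append(pj, 1.0)        # apply homography in homogeneous coords
        q = q[:2] / q[2]                  # back to Euclidean coordinates
        if np.linalg.norm(np.asarray(pi) - q) < eps:
            n_rep += 1
    return n_rep / min(n_i, n_j)

def precision_recall_f1(n_correct, n_wrong, n_repeatable):
    """Precision = Nc/(Nc+Nw); recall against the repeatable features;
    F1 is the harmonic mean of precision and recall."""
    precision = n_correct / (n_correct + n_wrong)
    recall = n_correct / n_repeatable
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```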
RANSAC is used to estimate the homography transformation and identify the correct matches because wrong matches exist. As shown in Reference [28], the matching image formed by the identified correct matches is used to display the result of RANSAC estimation.

Comparison with Single-Modality Image Matching Methods
In this section, a group of matching results on infrared and visible images is presented to show the performance comparison of DDCE against single-modality image matching methods. The single-modality image matching methods used for comparison include SIFT, ORB, LIFT, and SuperPoint. Introductions of these methods are listed in Table 2. This section presents the visual results and performance analyses of DDCE and the matching methods listed in Table 2. Figure 4 shows the matching results of SIFT, ORB, LIFT, SuperPoint, and DDCE on a pair of infrared and visible images successively. In Figure 4, red lines indicate wrong matches and green lines indicate correct matches. To make the matching results easier to interpret, the number and the percentage of correct matches of each method shown in Figure 4 are listed in Table 3.

Table 3. The number and percentage of correct matches of single-modality matching methods.

As shown in Table 3, DDCE achieves the largest number and the highest percentage of correct matches. Due to the difference of pixel intensities and gradients between infrared and visible images, SIFT and ORB obtain only 1 and 0 correct matches, respectively. SIFT and ORB build feature descriptions by gradient orientation and pixel comparison, respectively. Because of the nonlinear intensity variation, SIFT and ORB descriptions for infrared and visible images are dissimilar. LIFT obtains only one correct match. Because the training data of LIFT is obtained from image 3D reconstruction, LIFT aims to tackle the matching of wide-baseline visible images. As a result, LIFT cannot handle nonlinear intensity variations between infrared and visible images either. SuperPoint achieves better matching performance than the other single-modality image matching methods. SuperPoint is trained for matching objects with regular shapes like buildings [16]. Because consistent edge structures reduce the effect of nonlinear intensity variations between infrared and visible images, DDCE achieves the best performance.

Comparison with Multi-Modality Image Matching Methods
In this section, a group of matching results on infrared and visible images is presented to show the performance comparison of DDCE against multi-modality image matching methods. The multi-modality image matching methods used for comparison include PIIFD, MMSURF, PCEHD, LGHD, RIFT, and MSCB. Introductions of these methods are listed in Table 4. This section presents the visual results and performance analyses of DDCE and the matching methods listed in Table 4.

Table 4. Multi-modality image matching methods.

Figure 5 shows the matching results of PIIFD, MMSURF, PCEHD, LGHD, RIFT, MSCB, and DDCE on a pair of infrared and visible images successively. Similarly, red lines indicate wrong matches and green lines indicate correct matches. In a similar manner to Section 3.2.1, the number and the percentage of correct matches of each method in Figure 5 are shown in Table 5.

Table 5. The number and percentage of correct matches of multi-modality matching methods.

As shown in Table 5, DDCE achieves the second largest number and the third highest percentage of correct matches. Based on the consistency of edge structures, DDCE achieves superior performance on the number and the percentage of correct matches. According to Table 5 and Figure 5, two conclusions are drawn as follows.
(1) There is a significant difference of gradient information between infrared and visible images. The matching performance of gradient-based methods, including PIIFD, MMSURF, and MSCB, is worse than that of edge-based methods. Even though gradient orientation reversal and gradient magnitude modification are adopted to establish feature descriptions for infrared and visible images, local detail discrepancies can still cause dissimilarity of feature descriptions. These results indicate that it is difficult to establish similar descriptions through gradient information.
(2) Structure information maintains a certain degree of similarity in infrared and visible images. By using image structure information, edge-based methods including PCEHD, LGHD, and DDCE achieve better matching performance than other methods. Although LGHD and PCEHD achieve a higher percentage of correct matches, they obtain fewer correct matches than the other methods. RIFT obtains the largest number of correct matches. RIFT extracts dense features of images by detecting corners and edge points. However, the percentage of correct matches of RIFT is lower than that of the other edge-based methods.

Feature Detection Performance Analysis
Repeatability is used to evaluate the feature detection performance of DDCE and the feature detection methods listed in Table 2. Because multi-modality image matching methods focus on feature descriptions, only single-modality image matching methods listed in Table 2 are adopted for the performance evaluation of feature detection.
The quantitative comparison of the feature detection repeatability of each method is shown in Table 6. DDCE achieves the second highest feature detection repeatability among these methods. Because DDCE extracts common feature points based on consistent edge structures, it achieves superior feature detection repeatability. SIFT leverages DoG (Difference of Gaussians) to detect image blobs, while ORB and DDCE leverage FAST and Harris, respectively, to detect image corners. The repeatability of SIFT is lower than that of ORB and DDCE for long-wave infrared and visible images, which is consistent with the conclusions presented in reference [17]. The repeatability of LIFT is lower than that of the other methods: LIFT essentially extracts image patches and therefore cannot locate feature points accurately, and some LIFT features fall in smooth regions such as the chimney in Figure 4. SuperPoint is trained on endpoints of simple artificial geometric shapes during the initialization phase and is extended to real images afterward, so it is well suited to detecting objects with regular geometric shapes such as buildings and roads. Consequently, SuperPoint achieves excellent detection repeatability on the CVC and LWIR datasets, in which the infrared and visible images contain buildings and roads.
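A common simplified definition of the repeatability scored in Table 6 is the fraction of keypoints in one image whose projection under the ground-truth transform lands within a small tolerance of some keypoint in the other image. The sketch below illustrates that definition only; it is not the paper's evaluation code, and the function name and tolerance are assumptions.

```python
import numpy as np

def repeatability(kps_ir, kps_vis, H, tol=3.0):
    """Fraction of infrared keypoints whose projection under the
    ground-truth homography H lies within `tol` pixels of some
    visible-image keypoint (simplified, hypothetical definition)."""
    if len(kps_ir) == 0:
        return 0.0
    kps_vis = np.asarray(kps_vis, dtype=float)
    repeated = 0
    for x, y in kps_ir:
        p = H @ np.array([x, y, 1.0])        # project IR point
        p = p[:2] / p[2]
        # does any visible keypoint fall within the tolerance?
        if np.min(np.linalg.norm(kps_vis - p, axis=1)) <= tol:
            repeated += 1
    return repeated / len(kps_ir)
```

For example, with an identity homography, two infrared keypoints, and only one of them having a visible counterpart within the tolerance, the score is 0.5.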

Feature Matching Performance Analysis
Precision, recall, and F1 score are used to evaluate the feature matching performance of DDCE and the methods listed in Tables 2 and 4. Both single-modality and multi-modality image matching methods are adopted for the performance evaluation of feature matching.
The quantitative comparison of the feature matching accuracy of each method is shown in Table 7. It can be found that multi-modality image matching methods achieve better performance than single-modality image matching methods. Due to the differences in pixel intensities and gradients between infrared and visible images, it is difficult for single-modality image matching methods to match infrared and visible images. To match infrared and visible images, hand-crafted methods need to be modified according to the characteristics of infrared and visible images, and deep learning methods need to be trained with massive amounts of training data. Among multi-modality image matching methods, edge-based methods achieve better performance than gradient-based methods. DDCE leverages consistent edge structures to establish feature descriptions and achieves the second highest F1 score among these methods. Although the similarity of image structures decreases as the spectral difference increases, image structures still maintain better consistency than image pixel intensities and gradients.
Precision and recall of multi-modality image matching methods on infrared and visible images are lower than those of single-modality image matching methods on visible images. Due to the large difference between infrared and visible images, the similarity of descriptions of common feature points is low. Even though common feature points can be detected, the number of common feature points that can be correctly matched is still small. As a result, the match recall of feature points is low. Due to the existence of non-common feature points and the low similarity of common feature points, the match precision of feature points is low as well.
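The precision, recall, and F1 metrics discussed above combine in the standard way: precision is taken over the returned matches, recall over the common (repeatable) feature points, and F1 is their harmonic mean. The short sketch below states these formulas explicitly; the function name is a hypothetical illustration, not code from the paper.

```python
def precision_recall_f1(num_correct, num_matches, num_common):
    """Matching metrics: precision over returned matches, recall over
    common feature points, and F1 as their harmonic mean."""
    precision = num_correct / num_matches if num_matches else 0.0
    recall = num_correct / num_common if num_common else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, 40 correct matches out of 50 returned, against 100 common feature points, gives precision 0.8, recall 0.4, and F1 of about 0.533; this illustrates how a low recall over common points drags F1 down even when precision is high.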

RANSAC Estimation Performance Analysis
RANSAC estimation is used to evaluate the matching performance of LGHD and DDCE. The performance of RANSAC estimation depends on both the percentage and the number of correct matches [44]. A group of experimental results of LGHD and DDCE is illustrated in Figure 6. It can be found that LGHD achieves a small number of correct matches, as presented in Table 5. As shown in Figure 6a, the experimental result of LGHD contains wrong matches indicated by non-horizontal lines. In addition, the building that is the main scene of the infrared and visible images is missing. As shown in Figure 6b, however, DDCE successfully identifies the building, as indicated by horizontal lines. Although LGHD achieves a slightly higher percentage of correct matches than DDCE, DDCE obtains a significantly larger number of correct matches than LGHD, which leads to a better RANSAC estimation performance by DDCE.
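The dependence of RANSAC on both the inlier percentage and the absolute inlier count can be seen even in a toy setting. The sketch below is a deliberately minimal RANSAC fitting a 2-D translation from putative matches (a 1-point minimal sample); the paper estimates a full transform between image pairs, so this is an illustrative assumption, not the actual estimation used in the experiments.

```python
import random
import numpy as np

def ransac_translation(pts_a, pts_b, iters=200, tol=2.0, seed=0):
    """Minimal RANSAC estimating a 2-D translation from putative
    matches. More correct matches mean more all-inlier samples, which
    is why a larger number of correct matches improves estimation."""
    rng = random.Random(seed)
    best_t, best_inliers = None, -1
    for _ in range(iters):
        k = rng.randrange(len(pts_a))            # 1-point minimal sample
        t = np.asarray(pts_b[k]) - np.asarray(pts_a[k])
        residuals = np.linalg.norm(
            np.asarray(pts_b) - (np.asarray(pts_a) + t), axis=1)
        inliers = int(np.sum(residuals <= tol))  # consensus set size
        if inliers > best_inliers:
            best_t, best_inliers = t, inliers
    return best_t, best_inliers
```

With three correct matches sharing the translation (3, 4) and one outlier, the estimator recovers (3, 4) with three inliers; drop the correct matches to one and the true model can no longer outvote a model fitted to the outlier.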

Running Time Performance
Table 8 presents the Average Running Time (ART), in seconds, of the multi-modality image matching methods listed in Table 4. All experiments are performed on a desktop with a 64-bit Windows 7 OS, a 3.30 GHz quad-core Intel i5 CPU, and 8 GB of memory, using Visual Studio 2010 and OpenCV 2.4. DDCE is implemented in C++, and all methods used for comparison are provided by their authors.
As shown in Table 8, DDCE achieves the second-best performance on running time. MMSURF is inherited from SURF and achieves the best running-time performance. The methods PCEHD, LGHD, and RIFT are time-consuming because phase congruency needs to be computed.


Conclusions
In this paper, a feature detection and description method based on consistent edge structures, known as DDCE, was proposed for infrared and visible image match. First, consistent edge structures were detected to address nonlinear intensity variations and local detail discrepancies of infrared and visible images. Second, common feature points of infrared and visible images were extracted based on the consistent edge structures to improve the repeatability of feature points. Lastly, feature descriptions were established according to edge attributes, including length and direction, to enhance the description ability. In order to validate the performance of DDCE, two public datasets, CVC and LWIR, were employed for matching tests, and several state-of-the-art methods were used for comparison. Experimental results showed that DDCE could achieve superior matching performance compared with PIIFD, MMSURF, MSCB, PCEHD, and RIFT. Although LGHD achieved the highest percentage of correct matches, DDCE could obtain better RANSAC estimation performance than LGHD.
In the future, more infrared and visible images including different kinds of targets and scenes will be acquired under a variety of meteorological conditions. Specific matching strategies for different targets and scenes will be designed to improve the matching reliability of DDCE. DDCE will be modified to be invariant to rotation and scale, and we will attempt to apply DDCE on practical platforms such as unmanned aerial vehicles and remote sensing satellites.