Improved WαSH Feature Matching Based on 2D-DWT for Stereo Remote Sensing Images

Image matching is an outstanding issue because of the existence of geometric and radiometric distortion in stereo remote sensing images. Weighted α-shape (WαSH) local invariant features are tolerant to image rotation, scale change, affine deformation, illumination change, and blurring. However, since the number of WαSH features is small, it is difficult to obtain enough matches to estimate a satisfactory homography matrix or fundamental matrix. In addition, the WαSH detector is extremely sensitive to image noise because it is built on sampled edges. Considering these shortcomings of the WαSH detector, this paper improves the WαSH feature matching method based on the 2D discrete wavelet transform (2D-DWT). The method first performs 2D-DWT on the image and then detects WαSH features on the transformed images. According to the methods of descriptor construction for WαSH features, three matching methods, based on wavelet transform WαSH features (WWF), improved wavelet transform WαSH features (IWWF), and layered IWWF (LIWWF), are distinguished with respect to the character of the sub-images. The experimental results on a dataset containing images with affine distortion, scale distortion, illumination change, and noise showed that the proposed methods acquired more matches and better stability than WαSH. Experiments on remote sensing images with less affine distortion and slight noise showed that the proposed methods obtained correct matching rates greater than 90%. For images containing severe distortion, KAZE obtained a 35.71% correct matching rate, which is unacceptable for calculating the homography matrix, while IWWF achieved a 71.42% correct matching rate. IWWF was the only method that achieved a correct matching rate of no less than 50% for all four test stereo remote sensing image pairs and was the most stable compared to MSER, DWT-MSER, WαSH, DWT-WαSH, KAZE, WWF, and LIWWF.


Introduction
Image feature matching, which aims to acquire reliable homonymous feature matches between images, is one of the most basic and important processes of remote sensing image processing [1,2]. The accuracy of the matching results directly affects the reliability of image registration [3] and fusion, automatic moving object detection [4], and aerial triangulation and 3D reconstruction [5].
Feature matching includes three steps: feature detecting, feature describing, and similarity matching. At present, the most widely used feature matching algorithm is the Scale Invariant Feature Transform (SIFT).
The proposed method first performs a 2D-DWT decomposition on the original image and obtains four sub-images: LL, LH, HL, and HH. We discard the HH layer, in which the noise of the original image is concentrated, and transform the remaining sub-images to the original image size. The WαSH detector is then applied to the transformed images. Depending on the descriptor construction methods for the WαSH features, three matching methods, based on the wavelet transform WαSH feature (WWF), the improved wavelet transform WαSH feature (IWWF), and the layered IWWF (LIWWF), respectively, are distinguished. The matches are finally determined by using similarity matching. The procedure of the improved WαSH feature matching based on 2D-DWT is illustrated in Figure 1.

2D-DWT
DWT of an image transforms the image data from the spatial domain to the frequency domain by using hierarchical decomposing functions [25]. A 2D decomposition is achieved by sequentially applying the 1D-DWT along the horizontal and vertical directions, respectively [26]. In a 1D decomposition, both a low-pass (L) and a high-pass (H) filter are applied along each direction, which leads to one "L" and one "H" sub-band, respectively. The "L" sub-band corresponds to low-frequency components in the wavelet domain, while the "H" sub-band corresponds to high-frequency ones. After the first-level 2D decomposition, there are four sub-bands in the wavelet domain, labeled LL, LH, HL, and HH, respectively.
The LL sub-band is generated by convolving the low-pass wavelet filter along both the horizontal and vertical directions of the image. It is an approximate representation of the original image. The LH sub-band is generated by convolving a low-pass wavelet filter in the horizontal direction and then a high-pass wavelet filter in the vertical direction. It represents the characteristics in the vertical direction of the original image. The HL sub-band is generated by convolving a high-pass wavelet filter in the horizontal direction and a low-pass wavelet filter in the vertical direction. It represents the horizontal characteristics of the original image. The HH sub-band consists of the wavelet coefficients generated by convolving high-pass wavelet filters in both directions. It represents the diagonal edge characteristics of the image. Therefore, the LL sub-band contains most features of the original image, while the noise in the original image is concentrated in the HH sub-band.
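As a concrete illustration, one level of this decomposition can be written in a few lines of NumPy. This is a minimal sketch using the orthonormal Haar filters (the wavelet basis used later in this paper); sub-band names follow the convention above, with the first letter denoting the horizontal filter.

```python
import numpy as np

def haar_dwt2(img):
    """One level of 2D Haar DWT: filter along rows (horizontal),
    then along columns (vertical).

    Returns the LL, LH, HL, HH sub-bands, each half the input size.
    Sketch only: `img` must be 2D with even side lengths.
    """
    h, w = img.shape
    assert h % 2 == 0 and w % 2 == 0, "even dimensions required"
    s = 1.0 / np.sqrt(2.0)
    # 1D low-/high-pass along the horizontal direction
    lo_h = (img[:, 0::2] + img[:, 1::2]) * s
    hi_h = (img[:, 0::2] - img[:, 1::2]) * s
    # 1D low-/high-pass along the vertical direction
    LL = (lo_h[0::2, :] + lo_h[1::2, :]) * s  # low horiz, low vert
    LH = (lo_h[0::2, :] - lo_h[1::2, :]) * s  # low horiz, high vert
    HL = (hi_h[0::2, :] + hi_h[1::2, :]) * s  # high horiz, low vert
    HH = (hi_h[0::2, :] - hi_h[1::2, :]) * s  # high horiz, high vert
    return LL, LH, HL, HH
```

On a constant image, only LL is non-zero (the approximation); vertical stripes (intensity varying horizontally) appear only in the horizontally high-passed HL band.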


WαSH Feature Detector
The WαSH detector exploits stable and distinctive dominant structural features in an image on the basis of image edges and the α-shape. It can be divided into four parts: edge detecting, edge sampling, WαSH constructing, and feature extracting. The processes are as follows: (1) Edge detecting: g is the normalized gradient image of the input image, with values in [0, 1]. The binary edge image e is acquired by applying the Canny detector to g.
(2) Edge sampling: With a fixed sampling interval s, e is sampled uniformly along the edges to obtain a discrete set of edge points P ⊆ R². For each point p ∈ P, the weight w(p) is defined to be a multiple of its gradient strength, w(p) = s·g(p), (1) where g(p) ∈ [0, 1] is the value of g at p.
(3) WαSH construction: A regular triangulation of P is calculated. The line segments and triangles of the triangulation are added into a collection K and are ordered by descending size, followed by the weighted α-shape construction. For more details, please see [19]. (4) Feature extracting: the neighbors of each triangle σ_T ∈ K are its three edges, while the neighbors of each line segment σ_E ∈ K are its two adjacent triangles in the triangulation. Since the size of an edge is not larger than that of its two adjacent triangles, and α is decreasing, the intuition is that this edge can keep the two triangles disconnected until it is processed itself. Based on this neighborhood system, each connected weighted α-complex is called a component. Taking each element (line segment or triangle) of K in descending order of size as an independent component, the components are joined with their neighbors that have already been processed. As the value of α decreases continuously, the strength of a component k_U is calculated following Equation (2) when it is joined to another component through its boundary element: strength(k_U) = a(k_U)/ρ_T, (2) where a(k_U) is the total area of k_U (the area of a line segment is 0) and ρ_T is the size of the boundary element. If the strength is greater than a fixed threshold τ, the component is determined to be a WαSH feature, which is consequently fitted to an ellipse. Assuming the image coordinates of the ellipse center are (x, y) and the ellipse parameters are (a, b, c), the ellipse equation is a(u − x)² + 2b(u − x)(v − y) + c(v − y)² = 1. The WαSH feature is denoted as x = (x, y, a, b, c)^T in this paper.
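Steps (1) and (2) can be sketched as follows. This is a simplified illustration: a plain gradient-magnitude threshold stands in for the Canny detector, and the weight is taken as s·g(p), our reading of the "multiple of gradient strength" above.

```python
import numpy as np

def sample_edges(gray, s=5, edge_thresh=0.2):
    """Sketch of the edge detecting and edge sampling steps.

    `gray` is a 2D float image. Returns the sampled edge points P
    and their weights w(p) = s * g(p).
    """
    gy, gx = np.gradient(gray.astype(float))
    g = np.hypot(gx, gy)
    if g.max() > 0:
        g = g / g.max()                       # normalized gradient image in [0, 1]
    ys, xs = np.nonzero(g > edge_thresh)      # stand-in for binary edge image e
    pts = list(zip(ys, xs))[::s]              # uniform sampling at interval s
    weights = [s * g[y, x] for (y, x) in pts]
    return pts, weights
```

The weights are positive and bounded by s, since g(p) lies in [0, 1] on edge pixels.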

Feature Set Construction
Wavelet transform (the Haar wavelet basis is used in this paper) is firstly performed on the reference image and the image to be matched to obtain four sub-bands: LL, LH, HL, and HH. Since the sub-bands are two-dimensional, we consider them as the images of the LL, LH, HL, and HH layers. The LL layer image retains the content information of the original image, while the LH layer and the HL layer maintain the vertical and horizontal information of the original image, respectively. They are favorable for detecting edge information and WαSH local features. The noise in the original image is concentrated in the HH layer; therefore, we discard the HH layer. In order to avoid inconsistent coordinates, we transform the sizes of the three remaining layer images to be equal to the original image size. We then perform WαSH detection on the images of the LL layer, LH layer, and HL layer, and obtain the WαSH feature sets S_LL = {x_i^LL | i = 0, 1, ..., n_LL}, S_LH = {x_i^LH | i = 0, 1, ..., n_LH}, and S_HL = {x_i^HL | i = 0, 1, ..., n_HL}, respectively, as displayed in Figure 2. According to the normalization theory in [16], a WαSH feature, an elliptical region, can be normalized into a circular one, on which the description of the feature is more robust to affine distortion.
A variety of feature description operators were compared in [27], reaching the conclusion that SIFT-based descriptors are more robust than the others. Therefore, we use the SIFT algorithm to calculate the descriptors for the WαSH features.
Three feature sets, the wavelet transform WαSH feature (WWF) set, the improved wavelet transform WαSH feature (IWWF) set, and the layered IWWF (LIWWF) set, are constructed in this paper based on the above-mentioned WαSH local invariant features detected on the images of the LL layer, LH layer, and HL layer. As each feature consists of its coordinates x on the image and a description vector d, we denote a feature by f = (x, d). In the following, we use a superscript to distinguish the different feature set construction methods and a subscript to distinguish the originating image for the feature sets M containing descriptors.

WWF Set Constructing
The sets S_LL, S_LH, and S_HL are combined into the set S_W = {x_i^W | i = 0, 1, ..., n_W}, where n_W = n_LL + n_LH + n_HL is the number of features in the WWF set. The description vector of each feature x_i^W is calculated by applying the SIFT algorithm on the original image. The set of WWF is then acquired and denoted as M^W = {f_i^W = (x_i^W, d_i^W) | i = 0, 1, ..., n_W}. The process is demonstrated in Figure 3a.

LIWWF Set Constructing
Similar to the IWWF set construction method, the description vectors of the features in S_LL are calculated on the image of the LL layer, while the descriptor vectors of the features in S_LH and S_HL are calculated on the original image. The difference from the IWWF set construction method is that a variable F is introduced in LIWWF to distinguish the layer to which each feature belongs. Therefore, we obtain the corresponding feature sets for each layer.
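The layer bookkeeping described above can be sketched schematically. Here `describe` is a hypothetical stand-in for the SIFT descriptor computation, and the dictionary fields mirror the (x, d, F) representation: features from the LL layer are described on the LL image, while features from the LH and HL layers are described on the original image.

```python
import numpy as np

def build_liwwf(features_by_layer, ll_image, original_image, describe):
    """Sketch of LIWWF set construction with a per-feature layer flag F.

    `features_by_layer` maps "LL"/"LH"/"HL" to lists of detected WaSH
    feature parameters; `describe(image, x)` is a placeholder descriptor.
    """
    liwwf = []
    for layer, feats in features_by_layer.items():
        # choose the descriptor source image according to the layer
        src = ll_image if layer == "LL" else original_image
        for x in feats:
            d = describe(src, x)
            liwwf.append({"x": x, "d": d, "F": layer})
    return liwwf
```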

Similarity Matching
The matches are determined by using the nearest neighbor distance ratio (NNDR). The threshold value is set to 0.8.
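The NNDR criterion used throughout the experiments (with a threshold of 0.8) can be sketched as:

```python
import numpy as np

def nndr_match(desc_a, desc_b, ratio=0.8):
    """Match two descriptor sets by the nearest neighbor distance ratio.

    A feature in `desc_a` is matched to its nearest neighbor in `desc_b`
    only if the nearest distance is less than `ratio` times the
    second-nearest distance; `desc_b` needs at least two descriptors.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches
```

An ambiguous feature, whose two nearest neighbors are almost equally distant, is rejected by this test.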

Performance Compared to WαSH for Different Distortion
Six representative image sequences from Krystian Mikolajczyk's personal homepage [28], with affine distortion, scale distortion, and illumination change, and a group of UAV image pairs with different levels of noise are employed to test the performance of WαSH, WWF, IWWF, and LIWWF. The test images are shown in Figure 4, in which each sequence contains six images.

The number of total matches (N_t), the number of correct matches (N_c), and the correct matching rate (C_r) are employed to evaluate the performances of the four methods. The matching results are shown in Figure 5, in which the values 1, 2, 3, 4, and 5 on the horizontal axis denote the image pairs 2 to 1, 3 to 1, 4 to 1, 5 to 1, and 6 to 1, respectively. C_r is calculated as the percentage ratio of N_c to N_t. According to [16], a pair of matches was determined to be correct when the overlap error is less than 50%.
According to the above experimental results, the analysis is as follows. From Figure 5, we can find that similar orders and changes occurred in the plots of the number of total matches and the number of correct matches. WWF, IWWF, and LIWWF offered higher N_t and C_r than WαSH for most image pairs. That is because the improved methods use features on the images of the LL layer, LH layer, and HL layer, which contain obvious image structure characteristics. By combining the features on all three sub-images, WWF, IWWF, and LIWWF achieved more matches than the original WαSH method.
Take the 5th image pair of Leuven in Figure 5 as an example. The results of WαSH, WWF, IWWF, and LIWWF for this image pair are shown in Figure 6, and the repeatability and correct matching rate are displayed in Table 1. The repeatability measures how large a proportion of the detected features lie in the corresponding scene region. The correct matching rate assesses the distinctiveness of the detected features.
Table 1 shows that the WαSH features on the LL and LH layers had a relatively high repeatability. The reason is that the images of the LL and LH layers contain obvious image structure characteristics. In addition, the matching accuracy of the features described on the LL layer was higher than that of the features described on the original image. That means the image context of the LL layer is more distinctive than the image context of the original image in this case. Therefore, the C_r scores of IWWF and LIWWF are higher than those of WαSH and WWF.

Application for Stereo Remote Sensing Images Matching
In order to perform a comprehensive analysis of the proposed methods, we selected four pairs of stereo remote sensing images obtained by UAV for the experimental analysis. As seen in Figure 7, these UAV images contain different distortions, including affine distortion, rotation, blur, and noise. The MSER + SIFT matching algorithm (abbreviated as MSER), DWT-MSER, which applies DWT to smooth the images before MSER detection, the WαSH + SIFT matching algorithm (abbreviated as WαSH), DWT-WαSH, and KAZE [24] are compared in this experiment. KAZE is a multiscale feature detection and description algorithm in nonlinear scale spaces [24]. It is similar to SIFT, but more robust than SIFT. In the MSER/WαSH method, the MSER/WαSH detector is applied to extract local features, which are subsequently described by the SIFT describing algorithm. All of the comparison methods determine matches by using NNDR. The threshold value is set to 0.8, which is equal to that of the proposed methods. The matches of the proposed methods are depicted in Figure 7.
The matching results are presented in Table 2. N_t is the number of total matches; N_c is the number of correct matches, which have an overlap error lower than 50%; C_r is the correct matching rate, which is equal to the percentage ratio of N_c to N_t. Based on the results displayed in Table 2, the analysis is as follows. Comparing the N_t values of MSER and DWT-MSER, we can find that the number of total matches decreased when DWT was paired with MSER. Nevertheless, when DWT was paired with WαSH, it increased. The reason is that MSER detected much fewer features on the smoothed images than on the original images, while WαSH detected a similar number due to its edge detecting process. The N_t and N_c values of WWF, IWWF, and LIWWF were improved relative to WαSH, which is consistent with the results in Section 5.1. This is because WWF, IWWF, and LIWWF detect WαSH features on three images, which increases the number of features. Among the WαSH-based methods (i.e., WαSH, DWT-WαSH, WWF, IWWF, and LIWWF), LIWWF achieved the largest number of total matches and correct matches, followed by IWWF. The C_r values of IWWF and LIWWF were relatively stable. For the images in Figure 7b, which contain simple texture and large rotation distortion, the correct matching rates of MSER, DWT-MSER, WαSH, DWT-WαSH, and WWF were less than 50% with a low N_t, which makes it difficult to estimate the homography matrix between the image pairs, while IWWF and LIWWF obtained relatively reliable matches. For the images in Figure 7c, containing affine distortion and slight noise, the C_r values of KAZE and the proposed methods were all above 90%, and the correct matching rate of WWF reached 100%. KAZE obtained a larger N_t than the other methods. However, KAZE obtained a 35.71% correct matching rate, which is extremely low, in Figure 7a, which may cause failure in calculating the homography matrix between the image pairs. IWWF was the only method that achieved a correct matching rate of no less than 50% for all of the test images.
Therefore, IWWF is the most stable to different kinds of distortion among these comparative methods.
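The evaluation above can be sketched as follows. Note that the paper judges correctness by ellipse overlap error (< 50%); this simplified sketch uses reprojection distance under a ground-truth homography as a stand-in proxy, with a hypothetical pixel tolerance `max_err`.

```python
import numpy as np

def correct_matching_rate(pts_a, pts_b, H, max_err=3.0):
    """Compute Cr = 100 * Nc / Nt for matched point pairs.

    A match (pts_a[i], pts_b[i]) is counted as correct when pts_a[i],
    mapped by the ground-truth homography H, lands within `max_err`
    pixels of pts_b[i].
    """
    pts_a = np.asarray(pts_a, float)
    pts_b = np.asarray(pts_b, float)
    ones = np.ones((len(pts_a), 1))
    proj = (H @ np.hstack([pts_a, ones]).T).T
    proj = proj[:, :2] / proj[:, 2:3]          # dehomogenize
    err = np.linalg.norm(proj - pts_b, axis=1)
    n_t = len(pts_a)
    n_c = int(np.sum(err < max_err))
    return 100.0 * n_c / n_t, n_t, n_c
```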


Conclusions
The WαSH local invariant feature has advantages in distinctiveness and robustness. However, image matching based on WαSH features suffers from a small number of matches and sensitivity to noise. Focusing on addressing these problems, this paper improved WαSH based on 2D-DWT and proposed three methods, WWF, IWWF, and LIWWF, for image matching. Experiments on sequences of images with affine distortion, scale distortion, and illumination change, and on a sequence of simulated noise images, showed that WWF, IWWF, and LIWWF acquired more matches and better performance than WαSH. For remote sensing images with less affine distortion and slight noise, the correct matching rates of KAZE and the proposed methods were higher than those of MSER and WαSH, and the correct matching rate of WWF reached 100%. Nevertheless, for images containing severe distortion, KAZE obtained an extremely low correct matching rate, which is unacceptable for calculating the homography matrix, while IWWF achieved a 71.42% correct matching rate. IWWF was the only method that achieved a correct matching rate of no less than 50% for all of the test images and was the most stable compared with MSER, DWT-MSER, WαSH, DWT-WαSH, KAZE, WWF, and LIWWF.
The proposed methods are 2-3 times slower than WαSH because they detect WαSH features on three sub-images. In addition, the normalizing and resampling processes moderately increase the computational cost of the proposed methods. The proposed methods improve the number of matches and the stability at the expense of computational efficiency. We will focus on speeding up the proposed methods in future work.