A Spatial-Spectral Feature Descriptor for Hyperspectral Image Matching

Hyperspectral images (HSIs), which simultaneously contain the spatial and spectral features of objects, have been utilized in many fields. Hyperspectral image matching is a fundamental and critical problem in a wide range of HSI applications. Feature descriptors for grayscale image matching are well studied, but few descriptors are specifically designed for HSI matching. Descriptors that make good use of the spectral feature are essential in HSI matching tasks. Therefore, this paper presents a descriptor for HSI matching, called HOSG-SIFT, which combines the spectral features with the spatial features of objects. First, we obtain a grayscale image from the HSI by dimensionality reduction and use it to extract keypoints and spatial descriptors. Second, the spectral descriptors are designed based on the histogram of the spectral gradient (HOSG), which effectively preserves the physical significance of the spectral profile. Third, we concatenate the spatial descriptors and spectral descriptors with equal weights into a new descriptor and apply it to HSI matching. Experimental results demonstrate that the proposed HOSG-SIFT outperforms traditional feature descriptors.


Introduction
Hyperspectral remote sensing has made significant progress in the past few years [1] and has shown competitive performance in a wide range of fields from remote sensing to biomedicine [2][3][4][5][6][7][8]. Hyperspectral image (HSI) matching is essential for hyperspectral applications but has received little attention. Therefore, constructing an HSI descriptor with good discriminative power and matching performance is of significant importance to a large number of hyperspectral vision tasks [9,10]. In contrast, feature matching algorithms for grayscale images are abundant and are applied in many vision tasks, e.g., image mosaicking [11][12][13], image registration and fusion [14,15], structure-from-motion [16][17][18] and image-based localization [19,20].
Different from a grayscale image, an HSI is represented as a three-dimensional (x, y, λ) data cube, where x and y represent the two spatial dimensions of the scene, and λ represents the spectral dimension (comprising a range of wavelengths) [21,22]. In other words, an HSI contains a sequence of scalar images, each of which represents a narrow wavelength range of the spectrum [23], providing much spectral information. Such spectral information reflects distinct material properties of objects, offering the potential to improve the overall performance of the initial matching. Although spatial descriptors are structurally distinctive and invariant to rotation and certain geometric transformations, thanks to the difference of Gaussian (DOG) function and the histogram of gradient (HOG), they are not designed for HSI matching. When applied to HSI, the spatial descriptors usually extract features on a single band or on a grayscale image produced by reducing the dimension of the HSI. In this way, the spectral information, which is significant for distinguishing objects with similar appearances but different materials, is ignored. Therefore, spatial descriptors do [...] In recent years, learned descriptors have been proposed for grayscale image matching, performing better than the hand-crafted descriptors in terms of discriminative ability [35][36][37][38].
However, the aforementioned spatial descriptors are specifically designed for grayscale or RGB images, on which they achieve notable performance. When applied to HSI matching, spatial features are usually extracted from a single band, without exploiting the spectral information that could improve the performance of feature descriptors.

Multidimensional Descriptors
A multimodal image, which can be regarded as an n-dimensional (N-D) data cube, is structurally similar to a hyperspectral image in some aspects [39]. Multidimensional descriptors, which are designed for multimodal image matching and perform well in specific applications, can be used for HSI matching directly. Methods developed from SIFT utilize the multidimensional information to represent multimodal image features. In particular, the n-dimensional scale invariant feature transform (N-SIFT) [40] uses hyperspherical coordinates for gradients and multidimensional histograms to create the feature vector from multimodal medical images. 3D-SIFT [41] was proposed for action recognition based on extracting repeatable keypoint features from video data. This work shows the feasibility of using 3D SIFT for spatial-temporal data. In addition, the work in [42] presents a method for volumetric image registration that modifies the orientation assignment and gradient histograms of 3D SIFT.
Although an HSI is similar to a multidimensional image and can be regarded as an N-D data cube, it has a different physical significance from other types of N-D data. The spectral dimension of an HSI corresponds to continuous reflectance change across wavelengths, while the counterpart dimension indexes time or spatial location in videos or medical images. In other words, each spectral band of an HSI has a similar spatial structure, while the spatial features of videos and medical images change considerably from slice to slice. Such differences between HSI and other N-D images should be considered when adapting 3D SIFT to HSI to construct the spectral descriptors.

Hyperspectral Descriptors
Few efforts have been made to explore HSI matching. Dorado-Muñoz [43] proposed a vector SIFT detector for HSI, improving the edge performance by taking the vectorial nature of the HSI into account. The multiscale representation of the HSI is generated by vector nonlinear diffusion. Additionally, the spectral-spatial scale invariant feature transform (SS-SIFT) [28] is designed for HSI matching. It adopts a 3D Gaussian filter and 3D-DOG to detect keypoints in the spectral and spatial domains simultaneously. After that, two descriptors are proposed for each keypoint by exploring the distribution of spectral-spatial gradient magnitude in its local 3D neighborhood. However, using 3D-DOG breaks the continuity of the spectral signature, which reflects the features of objects.
In general, more work is needed to advance HSI matching. The performance of spatial descriptors is limited since they are designed for grayscale images and do not use spectral information. Multidimensional descriptors, on the other hand, fail to consider the differences in physical significance between HSI and other N-D images. Moreover, the existing HSI descriptors break the continuity of the reflectance along the spectral profile, which reduces their effectiveness.

Method
HSI augments grayscale or RGB images with spectral information, yet this spectral information is usually ignored by state-of-the-art spatial descriptors. Meanwhile, existing spectral descriptors break the continuity of the spectral profile. The spectral gradient [44] provides a high-performance material descriptor that is invariant to geometry and incident illumination. Thus, this paper proposes a method to construct HSI descriptors based on HOSG.
In this section, we introduce the overall framework of HOSG-SIFT construction. To illustrate our approach, we use Unmanned Aerial Vehicle (UAV) HSIs to display the key steps in this section. More details of the UAV HSIs are given in Section 4.1. Figure 1 depicts the overall structure of the proposed method. The main steps are as follows:
(1) Spatial descriptor construction. There are two steps to construct the spatial descriptors. The first step is to extract spatial interest points from an HSI in a new space produced by dimensionality reduction. The second step is to generate descriptors that describe the spatial features of the image.
(2) Spectral descriptor construction. After obtaining the spatial interest points, we construct the spectral descriptors from the spectral gradient of the surrounding neighbors. HOSG is used to construct the spectral descriptors. Additionally, normalization is used to eliminate the influence of changes in incident illumination.
(3) Combination of the spatial and spectral features. We obtain a spatial-spectral descriptor of 256 elements by concatenating the spatial descriptor of 128 elements and the spectral descriptor of 128 elements. The spatial and spectral descriptors have the same weight.
Figure 1. Overall framework of HOSG-SIFT construction. The spatial-spectral descriptor of 256 elements is constructed by concatenating the spatial and spectral descriptors. The weights of the spatial and spectral descriptors are w1 and w2, respectively. In this paper, we use the same w1 and w2.

Spatial Descriptor

Dimensional Reduction by PCA
An HSI contains rich information and yields better discriminative performance in many applications. However, the strong correlation between its many narrow bands results in massive redundancy, which makes the HSI more challenging to process. Additionally, the redundancy in HSI may have a negative impact on spectral feature extraction. To resolve this problem, we transform the original features into a new space using the PCA algorithm. Then, the feature keypoints and descriptors are generated in the new space.
PCA [45] is the process of computing the principal components and using them to perform a change of basis on the data, typically keeping only the first few principal components and ignoring the rest. In effect, PCA projects the data along the eigenvectors of the covariance matrix that correspond to the largest eigenvalues, where these eigenvectors point in the directions with the highest amount of data variation. Figure 2 shows the results of applying PCA to the UAV HSI dataset.
From Figure 2 and Table 1, it can be observed that using the first principal component of the HSI produces more spatial interest points than a single-band image. More detected keypoints allow a higher number of matches and inliers (correct matches). The number of matches and inliers has a great impact on downstream applications, such as image mosaicking and 3D reconstruction.
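To make the dimensionality-reduction step concrete, the following is a minimal NumPy sketch of projecting an HSI cube onto its first principal component to obtain a grayscale image. The function name `first_principal_component`, the (height, width, bands) array layout, and the final rescaling to [0, 1] are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def first_principal_component(cube):
    """Project an (H, W, B) hyperspectral cube onto its first principal component.

    Hypothetical helper: the array layout and the [0, 1] rescaling are assumptions.
    """
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)
    pixels = pixels - pixels.mean(axis=0)      # centre every band
    cov = np.cov(pixels, rowvar=False)         # (B, B) covariance between bands
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    pc1 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    scores = pixels @ pc1                      # one scalar per pixel
    lo, hi = scores.min(), scores.max()
    # Rescale so that the result can be treated as a grayscale image.
    gray = (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)
    return gray.reshape(h, w)
```

Note that the sign of an eigenvector is arbitrary, so the resulting image may be intensity-inverted between runs or libraries; a keypoint detector based on DOG extrema is not affected by this, but any visualisation is.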

Spatial Descriptor Construction
Grayscale images have been obtained by reducing the dimension of the HSI in the last section. The next step is to construct the local spatial features based on the grayscale image. Local spatial feature extraction typically involves three distinct steps: keypoint detection, orientation estimation and spatial descriptor extraction. SIFT is used to extract spatial features in this work. The main steps of the SIFT algorithm are as follows:
(1) Local extremum detection. Local extremum points are identified by constructing a Gaussian pyramid and searching for local peaks in a series of DOG images. A Taylor expansion is also applied to obtain an interpolated estimate of a more accurate local extremum. Moreover, candidate interest points are eliminated if found to be unstable.
(2) Dominant orientation assignment. To achieve invariance to image rotation, a dominant orientation is assigned to each keypoint based on local image properties. An orientation histogram is formed from the gradient orientations of sample points within a region around the keypoint.
(3) Keypoint descriptors. A keypoint descriptor that should be highly distinctive and invariant to some environmental variations is then created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location. Figure 3 illustrates an example of constructing a spatial descriptor.
Figure 3. Spatial descriptor construction using the histogram of gradient (panels: spatial gradient, spatial gradient histogram, spatial descriptor of 128 elements). The image sample points are accumulated into 8-bin orientation histograms summarizing the contents over 4 × 4 sub-regions. Therefore, a spatial feature vector of 8 × 4 × 4 elements is constructed for each keypoint.
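As a rough illustration of how the 8 × 4 × 4 layout arises, the sketch below builds a 128-element orientation histogram from a 16 × 16 grayscale patch. It is a simplification under stated assumptions: the patch is taken to be already rotated to the dominant orientation, and the Gaussian weighting and trilinear interpolation of full SIFT are omitted. The function name `spatial_descriptor` is hypothetical.

```python
import numpy as np


def spatial_descriptor(patch):
    """Simplified 4 x 4 x 8 gradient-orientation histogram of a 16 x 16 patch."""
    patch = np.asarray(patch, dtype=np.float64)
    if patch.shape != (16, 16):
        raise ValueError("expected a 16 x 16 patch")
    gy, gx = np.gradient(patch)                    # derivatives along rows and columns
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.arctan2(gy, gx) % (2 * np.pi)         # orientation in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)
    desc = np.zeros((4, 4, 8))
    for r in range(16):
        for c in range(16):
            # Each 4 x 4 cell accumulates magnitudes into 8 orientation bins.
            desc[r // 4, c // 4, bins[r, c]] += mag[r, c]
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    if norm > 0:
        desc = np.minimum(desc / norm, 0.2)        # damp large gradient magnitudes
        desc = desc / np.linalg.norm(desc)         # renormalize to unit length
    return desc
```

In practice an existing SIFT implementation would be used for this step; the sketch is only meant to show where the 128 elements come from.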

Spectral Descriptor
Taking advantage of the spectral structure to build feature descriptors is beneficial to HSI matching. Generally, 3D HOG is used to construct the spectral descriptors, yet it is time-consuming. In addition, 3D HOG ignores the physical significance of HSI that the spectral profile in each pixel represents the unique components of objects.
An appropriate description of the spectral feature helps to distinguish objects with similar appearance but different spectral features. The spectral gradient, which is commonly used in HSI applications, describes the change along the spectral profile rather than the original profile itself, with the advantage of reducing the magnitude offsets caused by illumination changes and other factors. The difference between the original spectral profile and the spectral gradient profile is shown in Figure 4. Meanwhile, using the spectral gradient avoids destroying the spectral profile and saves computing time for the spectral features.
Different from 3D HOG, we build the spectral descriptors using the HOSG. As shown in Figure 5, the main steps to generate the spectral descriptors are as follows:
(1) Keypoint assignment. As in SIFT, the potential keypoints, which should be invariant to scale and orientation, are identified by using a DOG function. Thereby, the coordinates and orientation of each spatial keypoint are determined and used to construct the spectral descriptors.
(2) Sub-region division. We first designate a 16 × 16 × n patch (n is the number of spectral bands) surrounding the centre of each keypoint and rotate it to align with the orientation assigned in the previous step. Then, the patch is split regularly into 16 sub-regions with a size of 4 × 4 × n.
(3) Vertices division. In this paper, the spectral gradient is evenly divided into eight vertices (histogram bins), and its magnitude primarily ranges from −0.04 to 0.04. However, a few spectral gradient values are larger than 0.04 or smaller than −0.04. According to our statistics, most of those values are abnormal values caused by the unstable imaging state of the hyperspectral sensor. Thus, we modify them to moderate their adverse effects. Specifically, values smaller than −0.04 are increased to −0.04, while values larger than 0.04 are decreased to 0.04. Moreover, most HSI datasets have a similar range of spectral gradient magnitude (−0.04 to 0.04) after normalization. We therefore believe the range (−0.04 to 0.04) can be applied to most HSI datasets.
(4) Extracting the HOSG of the sub-region. The main steps of extracting the HOSG of a sub-region are depicted in Figure 6. Specifically, the spectral gradient profile is first calculated for each pixel in the sub-region. Second, we obtain a gradient histogram for the sub-region by accumulating the spectral gradient magnitude into the eight vertices, which summarizes the contents of the sub-region. Finally, a vector with eight elements is constructed to represent the HOSG of the sub-region.
(5) Spectral feature vector construction and normalization. Based on the previous steps, we obtain 16 vectors from one patch, which respectively represent the 16 sub-regions. Consequently, a spectral feature vector of 8 × 16 elements is constructed for each keypoint by concatenating the vectors of the sub-regions. In that way, our method preserves the physical significance of the HSI. In addition, the spectral descriptor should be normalized to reduce the negative impacts resulting from changes in incident illumination. Specifically, the spectral feature vector is first normalized to the range of [−1,1].
A change in the spectral profile in which each pixel value is multiplied by a constant will multiply the gradients by the same constant, so this change will be canceled by vector normalization. Therefore, the descriptor is invariant to such multiplicative changes in the spectral profile. After that, we limit the large gradient magnitudes by thresholding the values in the feature vector to each be no larger than 0.2 and no smaller than −0.2, and then renormalizing to the range of [−1,1]. This means that matching the magnitudes of large gradients is no longer as important, and that the distribution of orientations has greater emphasis. Note that the threshold values of 0.2 and −0.2 are commonly used for feature vector normalization in many works, such as SIFT [30], 3D-SIFT [24], and SS-SIFT [28], to reduce the negative impacts resulting from changes in incident illumination.
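The five steps and the normalization above can be sketched in NumPy as follows. This is an illustration under explicit assumptions rather than the authors' implementation: the spectral gradient is approximated by the difference between adjacent bands, the signed gradient values are accumulated into the bins, "normalized to the range of [−1,1]" is interpreted as scaling to unit Euclidean length, and the patch is assumed to be already rotated to the keypoint orientation. The function name `hosg_descriptor` and its parameters are hypothetical.

```python
import numpy as np


def hosg_descriptor(patch, n_bins=8, g_max=0.04, clip=0.2):
    """Sketch of a 16 x n_bins spectral descriptor for one keypoint.

    patch: (16, 16, n) spectra around the keypoint, assumed already rotated.
    """
    patch = np.asarray(patch, dtype=np.float64)
    if patch.shape[:2] != (16, 16) or patch.shape[2] < 2:
        raise ValueError("expected a 16 x 16 x n patch with n >= 2")
    # Assumption: spectral gradient = difference between adjacent bands.
    grad = np.diff(patch, axis=2)
    # Saturate abnormal values to the range [-g_max, g_max].
    grad = np.clip(grad, -g_max, g_max)
    edges = np.linspace(-g_max, g_max, n_bins + 1)
    hists = []
    for r in range(0, 16, 4):
        for c in range(0, 16, 4):
            sub = grad[r:r + 4, c:c + 4, :].ravel()
            idx = np.clip(np.digitize(sub, edges) - 1, 0, n_bins - 1)
            # Assumption: signed gradient values are summed per bin.
            hists.append(np.bincount(idx, weights=sub, minlength=n_bins))
    desc = np.concatenate(hists)                   # 16 sub-regions x n_bins
    norm = np.linalg.norm(desc)
    if norm == 0:
        return desc                                # flat spectra: nothing to describe
    desc = np.clip(desc / norm, -clip, clip)       # threshold at +/- 0.2
    return desc / np.linalg.norm(desc)             # renormalize
```

With the default parameters the result has 8 × 16 = 128 elements, matching the length of the spatial descriptor.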

Spatial-Spectral Descriptor
We obtain a spatial descriptor of 128 elements by SIFT and a spectral descriptor of 128 elements by HOSG, as described in detail in Sections 3.1 and 3.2, respectively. In this section, we introduce the construction of the spatial-spectral descriptor, which is effective for HSI matching.

Spatial-Spectral Descriptor Construction
Based on the previous steps, we construct the spatial-spectral descriptor of 256 elements by concatenating the spatial and spectral descriptors. Considering that the weights of the two feature vectors may influence the performance of the spatial-spectral descriptor, a comparison test of descriptor weights is conducted. Specifically, we denote the weights of the spatial and spectral descriptors by w1 and w2, respectively, with w1 + w2 = 1. The ratio of w1 to w2 is set to 1/9, 2/8, 3/7, 4/6, 5/5, 6/4, 7/3, 8/2, and 9/1 for the comparison test. The performance of the spatial-spectral descriptor with different w1 and w2 is shown in Figure 7a.
As shown in Figure 7a, the spatial-spectral descriptor performs best with the weight ratio of 5/5. This indicates that the spatial and spectral descriptors are equally important for HSI feature representation. Consequently, in our method, we concatenate the two vectors (the spatial and the spectral descriptor) of 128 elements with the same weights to obtain the spatial-spectral descriptor of 256 elements.
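The weighted concatenation can be sketched as below. The text does not state how the weights enter the descriptor, so this example assumes that each part is simply scaled by its weight before concatenation; the function name `combine_descriptors` is hypothetical.

```python
import numpy as np


def combine_descriptors(spatial, spectral, w1=0.5, w2=0.5):
    """Concatenate two descriptors after scaling them by w1 and w2 (assumed scheme)."""
    spatial = np.asarray(spatial, dtype=np.float64)
    spectral = np.asarray(spectral, dtype=np.float64)
    if not np.isclose(w1 + w2, 1.0):
        raise ValueError("weights are expected to sum to one")
    return np.concatenate([w1 * spatial, w2 * spectral])
```

Under this assumption, the squared Euclidean distance between two combined descriptors equals w1² times the squared spatial distance plus w2² times the squared spectral distance, so the weight ratio directly controls how much each part contributes to matching.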

Evaluation Metrics
In our experiments, some popular metrics are used to evaluate the performance of HOSG-SIFT on a per-image pair basis, including recall, precision, putative match ratio, matching score and F1-score.
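Recall measures the ability of the descriptor to identify the possible correct matches.

$$\text{Recall} = \frac{\text{Correct Matches}}{\text{Correspondences}} \tag{1}$$

Precision defines the inlier ratio of putative matches, as determined by geometric verification.

$$\text{Precision} = \frac{\text{Correct Matches}}{\text{Putative Matches}} \tag{2}$$

The putative match ratio represents the selectivity of the descriptor and indicates what fraction of the detected features will be initially identified as a match.

$$\text{Putative Match Ratio} = \frac{\text{Putative Matches}}{\text{Features}} \tag{3}$$

Matching score describes the fraction of initial features that will result in correct matches.

$$\text{Matching Score} = \frac{\text{Correct Matches}}{\text{Features}} \tag{4}$$

F1-score represents the harmonic mean of precision and recall. It is used as a statistical measure to rate performance. The higher the F1-score, the better the performance.

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$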
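For reference, the five metrics can be computed from four per-image-pair counts, as in the following sketch. The function name and the dictionary keys are illustrative, and the zero-denominator handling (returning 0.0) is an assumption, since the text does not discuss that case.

```python
def matching_metrics(n_features, n_correspondences, n_putative, n_correct):
    """Per-image-pair matching metrics computed from four counts."""

    def ratio(num, den):
        # Assumption: a metric with an empty denominator is reported as 0.
        return num / den if den else 0.0

    recall = ratio(n_correct, n_correspondences)
    precision = ratio(n_correct, n_putative)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "putative_match_ratio": ratio(n_putative, n_features),
        "matching_score": ratio(n_correct, n_features),
        "f1": f1,
    }


# The match and inlier counts are those reported for ROOT-SIFT on the UAV pair;
# the feature and correspondence counts are made up for the example.
example = matching_metrics(n_features=2000, n_correspondences=1000,
                           n_putative=780, n_correct=614)
```

With 780 putative matches and 614 correct matches, the precision is 614/780, i.e., about 78.7%.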

Experiment Settings
In this section, experiments are performed to evaluate the performance of the proposed method. HSIs with various spectral and spatial resolutions are collected via different hyperspectral cameras carried on unmanned aerial vehicles (UAVs) and ground platforms in different imaging environments. The datasets are described as follows.
(1) UAV HSIs. The UAV dataset, containing 18 sequential images, was collected via a UAV-borne hyperspectral sensor carried on a DJI M600 UAV provided by Sichuan Dualix Spectral Imaging Technology Company, Ltd., Chengdu, China. The aerial images have 176 spectral bands ranging from 400 nm to 1000 nm, with a spectral resolution of 3 nm, an image size of 1057 × 960 pixels, and a ground sample distance (GSD) of 10 cm. These images may exhibit projective distortion or nonrigid transformation due to the unstable imaging conditions. The data were collected in a botanic garden containing various objects, such as vegetation, artifacts, soil and many other categories. The transformation matrices of the dataset are calculated in advance.
(2) Ground platform HSIs. The ground platform HSIs are provided by a public dataset, the "BGU ICVL Hyperspectral Image Dataset" [46] (http://icvl.cs.bgu.ac.il/hyperspectral, accessed on 12 September 2021). Images are collected at 1392 × 1300 spatial resolution over 31 spectral bands ranging from 400 nm to 700 nm. The data exhibit large changes in illumination, imaging condition and viewpoint. Images with overlapping regions are selected to perform matching experiments.
To demonstrate the feasibility of the proposed method, we evaluate it against some well-known spatial descriptors and multimodal descriptors, including SIFT [30], SURF [31], Root-SIFT, 3D-SIFT [24] and SS-SIFT [28], which have shown competitive performance in many vision applications [47,48]. With the experimental environment shown in Table 2, all descriptors are evaluated using the same parameters and the following steps.
(1) Descriptor construction. The spatial descriptors are extracted from the first principal component of the HSI produced by the PCA algorithm, while the multimodal descriptors are extracted from the whole HSI cube. (2) Descriptor matching. The Euclidean distance between descriptors is used to measure the similarity between the descriptors of images I1 and I2. A nearest-neighbor match is accepted if the minimum Euclidean distance between the descriptor of a point in I1 and its nearest neighbour in I2 is less than 0.7. (3) Matching metrics. We evaluate the raw matching performance on a per-image-pair basis using the evaluation metrics described in Section 3.4. In addition, we also examine the downstream performance of the descriptors by evaluating the matching results obtained from RANSAC [49]. RANSAC is one of the most popular algorithms for outlier removal and, according to existing studies, achieves superior precision in complex scenes.
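The descriptor-matching step can be sketched with a brute-force nearest-neighbour search, as below. The 0.7 distance threshold is the one stated above; the function name `match_descriptors`, the returned tuple layout, and the use of a dense distance matrix (which assumes both descriptor sets are non-empty and fit in memory) are illustrative choices.

```python
import numpy as np


def match_descriptors(desc1, desc2, max_dist=0.7):
    """Match each descriptor in desc1 to its nearest neighbour in desc2.

    Returns a list of (index_in_desc1, index_in_desc2, distance) tuples for
    pairs whose Euclidean distance is below max_dist.
    """
    d1 = np.asarray(desc1, dtype=np.float64)
    d2 = np.asarray(desc2, dtype=np.float64)
    # Dense (N1, N2) matrix of pairwise Euclidean distances.
    diff = d1[:, None, :] - d2[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nn = dist.argmin(axis=1)                       # nearest neighbour in desc2
    nn_dist = dist[np.arange(len(d1)), nn]
    keep = np.flatnonzero(nn_dist < max_dist)
    return [(int(i), int(nn[i]), float(nn_dist[i])) for i in keep]
```

For large keypoint sets, a k-d tree or an approximate nearest-neighbour index would usually replace the dense distance matrix.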

Parameters Initialization
Here, we discuss parameters initialization before applying our scheme. Regarding the method of spectral descriptor construction, the number of sub-regions and vertices can be used to vary the complexity and performance of the descriptor. We evaluate our spatial-spectral descriptor with different parameters using F1-score.
Specifically, to test the influence of the number of sub-regions, we evaluate the performance of the descriptor with 1, 4, 9, 16, 25, and 36 sub-regions. As shown in Figure 7b, the performance of the descriptor improves as the number of sub-regions grows; however, the performance decreases when the number of sub-regions is larger than four. This shows that, within a certain range, the information of surrounding points is beneficial for enhancing the robustness and discriminative ability of the descriptor.
In addition, we verify the influence of the number of vertices on descriptor performance by setting it to 4, 8, and 16. Figure 7b indicates that the spectral descriptors constructed with 8 and 16 vertices perform well over the same range of sub-regions. However, considering the computational complexity of the descriptor, we use 8 vertices to construct the HOSG. Consequently, based on these empirical results, we designate 16 sub-regions for each patch surrounding the centre of each keypoint and divide the spectral gradient into eight vertices to construct the spectral descriptors.

Matching Results in UAV Dataset
To demonstrate the effectiveness of the proposed method, we analyze the putative matching results and the downstream performance of different descriptors. The matching results obtained from RANSAC are utilized to evaluate the downstream performance of descriptors.

Detected Feature Points of Putative Matching Results
The putative matching results of SIFT, SURF, ROOT-SIFT, 3D-SIFT, SS-SIFT and HOSG-SIFT on UAV images are shown in Figure 8. We also count the number of detected feature points, putative matches and inliers, as listed in Table 3, to compare the putative matching results of the different methods clearly. The statistical results in Table 3 are consistent with the matching results in Figure 8. Among the spatial descriptors, ROOT-SIFT, an improved version of SIFT, performs best with 780 matches and 614 inliers. However, keypoints with similar spatial structures are falsely regarded as matching point pairs since the spatial descriptors are designed for grayscale images without considering the spectral features of objects.
3D-SIFT, pictured in Figure 8d, generates the smallest number of putative matches (364) and inliers (269), since it was proposed for medical image registration and performs relatively poorly when applied to an HSI cube. By exploring both spectral and spatial dimensions simultaneously, SS-SIFT produces the largest number of putative matches but only a small number of inliers (431), as shown in Figure 8e.
Compared with the methods mentioned above, HOSG-SIFT, pictured in Figure 8f, obtains a relatively high number of matches and the largest number of inliers. Different from SS-SIFT, we use the HOSG instead of 3D HOG to combine the spatial and spectral features. Using the spectral feature increases the number of putative matches. On the other hand, the spectral profile is completely preserved by using the HOSG, and thus the number of outliers (wrong matches) is decreased. Consequently, our method generates more high-quality putative matches with fewer outliers, demonstrating the effectiveness of the spectral feature in HSI matching tasks.

Evaluation Metrics of Putative Matching Results
The quantitative evaluation metrics are summarized in Figure 9 as cumulative distributions of precision, recall, matching ratio and matching score. Moreover, the average value of each metric is given in Table 4 for a more straightforward and comprehensive comparison.
Figure 9. Cumulative distribution of precision, recall, matching ratio, matching score and F1-score. Panels (a-e) show the cumulative distribution of precision, recall, matching ratio, matching score, and F1-score, respectively. A point on the curve with coordinate (x,y) denotes that (100 × x) percent of the image pairs have a performance value no more than y.
Regarding the spatial descriptors, it is hard to distinguish false matches with similar spatial features due to the lack of spectral information. Among them, ROOT-SIFT achieves the highest precision and the best recall, owing to the benefit of using a square root kernel instead of the standard Euclidean distance to measure similarity. SURF is designed to accelerate SIFT and improve matching efficiency, but it is sensitive to viewpoint and illumination. Therefore, for UAV HSIs, where the viewpoint and illumination change frequently, SURF obtains inferior matching results compared with SIFT and ROOT-SIFT.
In contrast, the evaluation results of the 3D descriptors, including 3D-SIFT and SS-SIFT, are relatively poor, in agreement with the results in Figure 8 and Table 3. 3D-SIFT is usually used for medical image processing, where it shows outstanding performance. A medical image, whose pixels in different slices represent different spatial locations, is structurally different from an HSI. Thus, 3D-SIFT is limited in HSI matching. Additionally, SS-SIFT takes advantage of spatial and spectral features through 3D HOG. In this way, the number of falsely extracted matches increases along with the number of putative matches, resulting in the deficient performance of SS-SIFT.
By comparison, the proposed method obtains a considerable precision of 79.54%, a recall of 59.87%, and the best F1-score of 67.77%. Our descriptor outperforms the spatial descriptors since we simultaneously explore the spatial and spectral information, which is essential for distinguishing objects with similar spatial features but different spectral features. On the other hand, our approach also surpasses the performance of the 3D descriptors, indicating that extracting the HOSG to describe spectral features is effective and robust.

Matching Results from RANSAC
As shown in Figure 8, the putative matches inevitably contain many outliers [50,51]. Outlier filtering algorithms are commonly used to improve the performance of image matching. Here, we use RANSAC to remove outliers in the experiments and evaluate the descriptors in matching tasks.
The matching results after RANSAC filtering are shown in Figure 10. Most outliers are removed by the RANSAC algorithm. However, the number and ratio of outliers in the putative matches significantly impact the performance of RANSAC. Although RANSAC improves the SS-SIFT results to a large extent, the performance of SS-SIFT is still limited due to the numerous outliers in its putative matches. Fewer matching pairs are preserved, and they are unevenly distributed (see the first and second figures in Figure 10e). Regarding 3D-SIFT, some outliers are removed from the putative matching results, so even fewer matching pairs are preserved after outlier filtering.
By comparison, our method produces dense correct matches that are evenly distributed across the image after outlier filtering, exceeding the performance of the other conventional methods pictured in Figure 10. The results generated by RANSAC also demonstrate the feasibility of the proposed method.
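RANSAC-based outlier filtering of the kind used above can be sketched as follows. The sketch assumes a homography as the geometric model, estimated with an unnormalized direct linear transform; the 3-pixel inlier threshold, the iteration count, and all function names are illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np


def fit_homography(src, dst):
    """Direct linear transform estimate of H (dst ~ H @ src) from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    return vt[-1].reshape(3, 3)                    # right singular vector of the smallest singular value


def project(h, pts):
    """Apply homography h to an (N, 2) array of points."""
    hom = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    with np.errstate(divide="ignore", invalid="ignore"):
        return hom[:, :2] / hom[:, 2:3]


def ransac_homography(src, dst, thresh=3.0, iters=500, seed=0):
    """Return (H, inlier_mask) for putative matches src[i] <-> dst[i]."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=4, replace=False)   # minimal sample
        h = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(h, src) - dst, axis=1)
        inliers = err < thresh                               # NaN errors count as outliers
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 4:
        raise ValueError("not enough inliers to fit a homography")
    # Refit on all inliers of the best hypothesis.
    return fit_homography(src[best], dst[best]), best
```

The matches flagged in the returned mask are the ones kept after filtering; as noted above, a high outlier ratio in the putative matches lowers the chance that a random minimal sample is outlier-free, which is why descriptors with cleaner putative matches benefit more from RANSAC.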

Matching Results on ICVL Dataset
To verify the robustness of our method, we also compare matching ability on the ICVL dataset collected by the ground platform. The ground platform HSIs are usually used for spectral reconstruction, thus there is no ground truth of the dataset for matching evaluation. Considering such a situation, we apply the RANSAC algorithm to estimate the homography and regard the results as the transformation matrix of images. The putative matching results of ground platform HSIs are shown in Figure 11. The quantitative metrics are listed in Table 5.
Similar to the matching results on the UAV HSIs, ROOT-SIFT obtains a precision of 47.12%, which is higher than that of SIFT and SURF but lower than that of our method, whose precision is 49.08%. The matching results of these descriptors are degraded because they do not exploit the benefits of spectral features. SURF produces a large number of putative matches and achieves the best recall of 78.93%, which is higher than that of our method by 18.25%.
Due to the limits of the algorithms, matching errors are prevalent. Although 3D-SIFT, pictured in Figure 11d, eliminates some false matches by taking advantage of multiband information, its number of putative matches decreases along with the errors. Similarly, SS-SIFT shows inferior performance on the evaluation metrics since it requires outlier filtering for enhancement.
By contrast, our method outperforms the other methods with more correct matches while preserving the precision, as depicted in Figure 11f. The matching results on the ICVL dataset also reveal the robustness and effectiveness of the proposed method.

Running Time Comparison
We compare the running time of the different methods, as shown in Table 6. Since they do not consider the spectral feature, spatial descriptors are more efficient than 3D descriptors. In other words, constructing a descriptor based on features from multiple dimensions is much more time-consuming. SURF runs fastest since it simplifies the extraction process by using the Hessian matrix instead of DOG to detect keypoints. In addition, ROOT-SIFT yields a relatively good performance without increasing computational costs.
3D descriptors, including 3D-SIFT and SS-SIFT, spend much time on keypoint detection and descriptor construction, as they use a 3D Gaussian convolution kernel and 3D-DOG to generate spatial-spectral keypoints. Although our descriptor is less efficient than spatial descriptors, it achieves higher precision and recall. We also observe that considering both the spatial and spectral features increases the computing time. On the other hand, compared with existing 3D descriptors, our approach, which constructs a HOSG profile from the surrounding neighbors, is much more time-saving and performs better.

Discussion
The proposed method to construct HSI descriptors has three main steps: spatial descriptor extraction from grayscale images, spectral descriptor generation using HOSG and spatial-spectral descriptor construction. Although HSI descriptors perform well in HSI matching, limitations still exist.
On the one hand, the spatial and spectral features have the same weights in our spatial-spectral descriptor. We concatenate the spatial descriptor of 128 elements and the spectral descriptor of 128 elements to obtain a spatial-spectral descriptor of 256 elements. Generally, the weights of the spatial and spectral features are expected to vary across scenarios. Although we verify the performance of the descriptor with fixed weight ratios, we have yet to investigate adaptive weight ratios.
On the other hand, although our descriptor outperforms the common spatial descriptors in precision for HSI matching, our method is limited in efficiency since the spectral information increases the computational complexity. Nevertheless, compared with 3D descriptors such as 3D-SIFT and SS-SIFT, our descriptor is superior because we apply the HOSG to represent spectral features. Thus, we believe our method is a suitable alternative for cases that do not require very high efficiency.

Conclusions
This paper presents an HSI descriptor constructed using SIFT and HOSG, which combines spatial and spectral features. The proposed HSI descriptor improves the performance of HSI matching, with a precision of 79.54% on the UAV dataset and 49.80% on the ground platform dataset. In terms of overall performance, the proposed method outperforms other popular descriptors. Moreover, compared to spatial descriptors, the proposed method can better distinguish objects with similar structures but different materials, providing a reference for descriptor construction in HSI. Considering the limitations of our work, we will enrich the diversity of hyperspectral image datasets in the future. Meanwhile, we will develop better feature extractors and matchers with fewer errors using spectral information.