Improved Class-Specific Codebook with Two-Step Classification for Scene-Level Classification of High Resolution Remote Sensing Images

Abstract: With the rapid advances in the sensors of remote sensing satellites, a large number of high-resolution images (HRIs) can be accessed every day. Land-use classification using high-resolution images has become increasingly important, as it can help to address problems such as haphazard development, deteriorating environmental quality, loss of prime agricultural land, and destruction of important wetlands. Recently, local features with the bag-of-words (BOW) representation have been successfully applied to land-use scene classification with HRIs. However, the BOW representation ignores information from scene labels, which is critical for scene-level land-use classification. Several algorithms have incorporated information from scene labels into BOW by calculating a class-specific codebook from the universal codebook and coding a testing image with a number of histograms. These methods map the BOW feature to some inaccurate class-specific codebooks, which may increase the classification error. To effectively solve this problem, we propose an improved class-specific codebook using kernel collaborative representation based classification (KCRC) combined with the SPM approach and an SVM classifier to classify the testing image in two steps. This model is robust for categories with similar backgrounds. On the standard Land Use and Land Cover image dataset, the improved class-specific codebook achieves an average classification accuracy of 93% and demonstrates superiority over other state-of-the-art scene-level classification methods.


Introduction
With the development of remote sensing sensors, satellite image sensors can offer images with a spatial resolution at the decimeter level. We call these images high-resolution remote sensing images (HRIs). HRIs are fundamental in land-use classification, since they can provide detailed ground information and complex spatial structural information [1]. However, due to the complex arrangements of ground objects and the multiple types of land cover [2,3], scene-level land-use classification of HRIs is a challenging task [4].
In order to recognize and analyze scenes from HRIs, various scene classification methods have been proposed over the years. As mentioned in [5], scene classification methods can be divided into three kinds, namely: methods using low-level visual features, methods relying on mid-level visual representations, and methods based on high-level vision information.
Low-level methods describe an image with a feature vector built from low-level visual attributes such as the Scale Invariant Feature Transform (SIFT) [6], Local Binary Pattern (LBP) [7], Color Histogram (CH) [8] and GIST [9]. Low-level methods deliver good performance on images with uniform structures and spatial arrangements, but they have difficulty recognizing images with high diversity and non-homogeneous spatial distributions.
Mid-level approaches attempt to develop a global scene representation through the statistical analysis of the extracted local visual attributes. One of the most popular mid-level approaches is the bag-of-words (BOW) [10] model. This method simply counts the occurrences of local features in an image without considering their spatial relationships.
In addition, to encode higher-order spatial information between low-level local visual words for scene modeling, topic models have been developed that take into account the semantic relationships among the visual words. These methods include Latent Dirichlet Allocation (LDA) [11] and probabilistic Latent Semantic Analysis (PLSA) [12]. One main difficulty of such methods, when used without modification, is that they may lack the flexibility and adaptability needed for different scenes [3].
High-level methods are usually based on the popular deep learning paradigm. In general, deep learning methods [13,14] use a multi-stage global feature learning architecture to adaptively learn image features and often cast scene classification as an end-to-end problem. Existing available pre-trained deep Convolutional Neural Network (DCNN) architectures include Overfeat [15], CaffeNet [16] and GoogLeNet [17]. However, these traditional unmodified DCNN architectures may need a large number of annotated samples to train a large-scale neural network.
Recently, the BOW model, originally from the field of text analysis, has been successfully applied to scene-level land-use classification using HRIs [18,19]. However, the traditional BOW model has shown the following drawbacks in the remote sensing domain: (1) Some extracted keypoints are unhelpful for land-use classification, which may have a negative effect on computational efficiency and image representation [20]. (2) The traditional BOW model uses a universal codebook for all categories without incorporating information from specific scene labels [21], which may result in misclassification of categories with similar backgrounds. (3) Existing methods incorporating label information into the BOW model code a testing image with a number of class-specific image representations, one per category, rather than just one specific representation [21], leading to a large error in mapping the universal BOW representation to some inaccurate categories.
In order to solve the first problem, we can use keypoint selection to remove redundant keypoints. Figure 1 shows the original extracted SIFT keypoints in (a) and the keypoints selected with the modified keypoint selection method presented in this paper in (b). As we can see, the keypoints in (a) are redundant and some of them occur in many other images. After keypoint selection, we obtain condensed keypoints that are more helpful for land-use classification.
In the field of keypoint selection or descriptor selection, many experiments have been carried out to enhance classification performance. Dorko and Schmid [22] introduced a novel method where local descriptors are first divided into several groups using a Gaussian Mixture Model (GMM); then, a Support Vector Machine (SVM) classifier is trained for each group to determine the most discriminative groups. Vidal-Naquet and Ullman [23] showed that using a linear classifier to select informative keypoints delivers better performance. Agarwal and Roth [24] extract informative parts from images, and images can then be represented by those parts. Chin et al. [25] proposed the SAMME algorithm to extend the popular AdaBoost algorithm [26] to multiclass problems in order to learn and select the most representative descriptors. However, the methods above have not focused on the BOW scenario. Lin [27] proposed a two-step iterative keypoint selection method designed for the bag-of-words feature, but the choice of the initial seed keypoint affects the later keypoint selection results. Therefore, we replace the choice of a single seed point with a filter based on the response values of keypoints.
The traditional BOW model uses a universal codebook for all categories, which may result in misclassification of similar categories, as shown in Figure 2. As we can see, the images in (a), (b) and (c) have similar backgrounds, so their image vocabularies and representations are similar and difficult to distinguish. Several studies have concentrated on incorporating scene label information into the codebook to improve classification performance. Perronnin [21] creates specifically tuned vocabularies for each image category using the maximum a posteriori (MAP) criterion. Umit [28] proposes a method based on a class-specific codebook derived from Self-Organizing Maps (SOM). Li [29] proposed a method for generating a codebook for each class using piecewise vector quantized approximation (PVQA), considering the differences between categories. However, these methods, which deliver good performance in Computer Vision, are not suitable for land-use classification of HRIs, since HRIs exhibit more complex appearance and spatial arrangements, and scene categories in HRIs are largely affected and determined by human and social activities. Therefore, we need to capture the characteristics of each category, and we propose a class-specific codebook based on Mutual Information (MI) to evaluate the importance of each vocabulary for each category. The category with the highest MI value for one vocabulary will be assigned that vocabulary in its class-specific codebook.
Existing methods incorporating category information code the testing image with a group of histograms. Each histogram of the testing image can be classified by the SVM classifier, and the class label with the largest number of predictions from the SVM will be the final output. The error terms of existing methods are illustrated in the left part of Figure 3. Mapping error here means the error in mapping universal histograms to the class-specific codebooks of inaccurate categories, namely the difference between the class-specific histogram of a category and the class-specific histogram of the true label. Mapping error may lead to inaccurately representing the information of scene labels, which may cause misclassification. Therefore, we predict relatively accurate results for a testing image using a kernel collaborative representation based classification (KCRC) method [30] instead of blind mapping, as shown on the right-hand side of Figure 3. However, the predictions of KCRC may still be unreliable, since the information in HRIs is more detailed in surface features and scene complexity than that in Computer Vision images. Therefore, we perform classification in two steps to increase classification performance. Firstly, we use KCRC combined with Spatial Pyramid Matching (SPM) to predict the two most likely labels of the testing image instead of just one. Then we map universal histograms to these two class-specific codebooks. These two class-specific histograms are respectively put into a Support Vector Machine (SVM) [31] to output a confidence for each label. Finally, the label with the largest sum of confidences is the final classification result.
Inspired by the aforementioned work, we incorporate the proposed class-specific codebook into a BOW model for scene-level land-use classification. The main contributions of this paper are summarized below: (1) We modify an iterative keypoint selection algorithm with a filter based on keypoints' response values, which enables us to reduce the computational complexity by filtering out indiscriminative keypoints and to select representative keypoints for better image representation. (2) We propose a class-specific codebook designed for HRIs, based on feature selection using MI, to allocate vocabularies of the universal codebook to each category in order to expand the differences between the locality-constrained linear coding (LLC) codes of different land-use categories. (3) In the testing period, we classify the testing image in two steps. We introduce the KCRC algorithm to obtain two comparatively accurate candidate labels for the testing sample, so that the testing sample is mapped to its own class-specific codebook, and we decrease the prediction error by putting the two class-specific histograms respectively into SVM classifiers to output a confidence for each label. The testing image is assigned to the label with the largest sum of confidences.
The rest of the paper is organized as follows: In Section 2, we describe the overall process and details of the proposed approach. In Section 3, several experiments and results are presented to demonstrate the effectiveness and superiority of the proposed algorithms. In Section 4, a discussion of the proposed method is conducted. Conclusions and suggestions for future work are summarized in Section 5.


Materials and Methods
In this section, we present a scene classification method for HRIs based on an improved class-specific codebook, as shown in Figure 4, which can be divided into four main steps.
In the first step, dense Scale-Invariant Feature Transform (SIFT) descriptors [18] are extracted from the training images in each local patch of Spatial Pyramid Matching (SPM) [32] and are selected using a modified iterative keypoint selection method to remove keypoints that are unhelpful for classification.
In the second step, we use the selected keypoints to generate a universal codebook with "k-means" and get BOW representations in each local patch of SPM by LLC [33].
In the third step, for each category in the training set, we calculate an MI value between each vocabulary and each category to obtain a matrix of MI values. If the MI value of one category exceeds those of the other categories for a particular vocabulary, we add this vocabulary to the class-specific codebook of that category. Repeating the above procedure, we finally get a unique class-specific codebook for each category.
Finally, we perform classification in two steps on the testing dataset. Firstly, we use KCRC combined with the SPM method to predict the two most likely labels of the testing sample, and the testing sample is mapped to these two class-specific codebooks. Then we represent the testing image with two class-specific histograms. These two class-specific histograms are respectively put into a Support Vector Machine (SVM) to compute a confidence for each label. Finally, the label with the largest sum of confidences is the final classification result.
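The two-step decision described above can be sketched as follows (a minimal Python illustration with made-up numbers; `svm_confidence` stands in for a trained SVM that returns per-class confidences for the image coded with a given label's class-specific codebook):

```python
import numpy as np

def two_step_classify(kcrc_scores, svm_confidence):
    """Two-step decision: take the two labels ranked best by KCRC, then sum
    the SVM confidence vectors obtained from the two corresponding
    class-specific histograms and return the overall winner."""
    # Step 1: two candidate labels with the highest KCRC scores.
    top2 = np.argsort(kcrc_scores)[-2:]
    # Step 2: sum the per-class SVM confidences over the two candidates.
    total = sum(np.asarray(svm_confidence(label)) for label in top2)
    # The final label is the one with the largest summed confidence.
    return int(np.argmax(total))

# Toy run with 4 classes: KCRC favours classes 2 and 3; the SVM confidences
# (stand-ins for real classifier outputs) then decide between them.
kcrc_scores = np.array([0.1, 0.2, 0.9, 0.8])
conf_vectors = {2: np.array([0.05, 0.05, 0.80, 0.10]),
                3: np.array([0.10, 0.10, 0.50, 0.30])}
print(two_step_classify(kcrc_scores, conf_vectors.__getitem__))  # 2
```

Summing confidences over both candidates, rather than trusting a single mapped histogram, is what makes the decision robust to a wrong first-ranked KCRC candidate.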
Details of these specific principles and implementation processes are provided in the subsequent subsections.


Iterative Keypoint Selection with the Filter by Keypoints' Response Values
The central idea of the iterative keypoint selection method [27] is illustrated in Figure 5. In each iteration, we identify the discriminative descriptors and filter out unrepresentative ones with a distance measure, since keypoints with similar descriptors lie close to each other. The iteration repeats until no unrepresentative descriptors are filtered out. The key problem of iterative keypoint selection is how to choose the discriminative keypoints. Lin randomly chose one keypoint as the initial keypoint; thus, the location of the initial keypoint affects the selection results.
In order to solve the problem of initial keypoint selection, we use the response value of a keypoint relative to its neighborhood to reflect its saliency. We remove keypoints whose contrast is lower than a threshold θ according to Equation (1):

D(X) = D + (1/2) (∂D/∂X)^T X,  (1)
where D is the value of the Difference of Gaussians (DoG) function [6] at the location of the keypoint and X = (x, y, σ) is the offset of the keypoint. Filtering by response value can not only avoid the problem of initial keypoint selection but also remove some unreliable keypoints that are not distinct from their neighbors, since we need only the critical keypoints for image representation. This step helps to provide discriminative results for the later iterative keypoint selection.
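As a concrete illustration of this filter, the sketch below (function and variable names are ours, not from the paper) evaluates the interpolated DoG response of Equation (1) and keeps only keypoints whose absolute response reaches the threshold θ; 0.03 is the default commonly used in SIFT implementations for images normalized to [0, 1]:

```python
import numpy as np

def contrast_filter(dog_values, gradients, offsets, theta=0.03):
    """Response-value filter sketch: keep only keypoints whose interpolated
    DoG contrast |D + 0.5 * (dD/dX)^T X| is at least theta (Equation (1))."""
    dog_values = np.asarray(dog_values, dtype=float)   # D at each keypoint
    gradients = np.asarray(gradients, dtype=float)     # dD/dX, shape (n, 3)
    offsets = np.asarray(offsets, dtype=float)         # X = (x, y, sigma)
    response = dog_values + 0.5 * np.sum(gradients * offsets, axis=1)
    return np.abs(response) >= theta                   # True = keep

# Two keypoints with zero offset: the first is low-contrast and is removed.
keep = contrast_filter([0.01, 0.08], [[0.0, 0.0, 0.0]] * 2, [[0.0] * 3] * 2)
print(keep)  # [False  True]
```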
Then the filtered keypoints are clustered using k-means in the SIFT feature space. The keypoint closest to each cluster center is regarded as a representative keypoint. Keypoints whose Euclidean distance in the SIFT feature space is within a threshold T of a representative keypoint are removed. This is the first iteration of selection.
The selection results of the first iteration are used as the initial keypoints of the second iteration, and the procedure is the same as in the first iteration. The iteration repeats until no keypoints are filtered out or too few keypoints remain to be clustered.
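The whole iterative loop can be sketched as follows, using scikit-learn's KMeans as a convenient stand-in for the clustering step (real descriptors would be 128-dimensional SIFT vectors; the 2-D toy data, k, T and the iteration cap here are all illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def iterative_select(descriptors, k=2, T=0.5, max_iter=10):
    """Iterative selection sketch: cluster descriptors with k-means, keep the
    descriptor nearest to each centre as representative, drop every other
    descriptor within distance T of a representative, and repeat until
    nothing is removed or too few descriptors remain to cluster."""
    desc = np.asarray(descriptors, dtype=float)
    for _ in range(max_iter):
        if len(desc) <= k:
            break  # not enough keypoints left to cluster
        centers = KMeans(n_clusters=k, n_init=10,
                         random_state=0).fit(desc).cluster_centers_
        # Representative keypoints: nearest descriptor to each centre.
        d2c = np.linalg.norm(desc[:, None] - centers[None], axis=2)
        rep_idx = np.argmin(d2c, axis=0)
        reps = desc[rep_idx]
        # Remove non-representative descriptors within T of a representative.
        d_rep = np.linalg.norm(desc[:, None] - reps[None], axis=2).min(axis=1)
        is_rep = np.zeros(len(desc), dtype=bool)
        is_rep[rep_idx] = True
        keep = is_rep | (d_rep > T)
        if keep.all():
            break  # nothing was filtered out: converged
        desc = desc[keep]
    return desc

# Two tight clusters of toy "descriptors": one representative survives each.
pts = [[0.0, 0.0], [0.1, 0.0], [0.05, 0.05], [5.0, 5.0], [5.0, 5.1]]
print(len(iterative_select(pts)))  # 2
```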
The proposed keypoint selection can not only improve computational efficiency but also remove keypoints that are unhelpful for classification, enhancing classification performance in HRI land-use classification.

LLC Coding
Traditional SPM solves the following constrained least squares fitting problem using Vector Quantization (VQ) coding [32], as in Equation (2):

min_C Σ_{i=1}^{N} ||x_i − B c_i||²,  s.t. ||c_i||_{l0} = 1, ||c_i||_{l1} = 1, c_i ≥ 0,  (2)

where an image is represented by a set of extracted dense SIFT descriptors X = [x_1, x_2, ..., x_N], B = [b_1, b_2, ..., b_M] is the codebook and C = [c_1, c_2, ..., c_N] are the codes. However, VQ coding may lead to a large vector quantization error due to the hard-assignment strategy. In order to solve this problem, the restrictive cardinality constraint ||c_i||_{l0} = 1 in Equation (2) can be relaxed by using a sparsity regularization term, as in ScSPM [34]. Coding each SIFT descriptor x_i with a soft-assignment strategy then becomes a standard sparse coding (SC) problem, which can be solved as Equation (3):

min_C Σ_{i=1}^{N} ||x_i − B c_i||² + λ ||c_i||_{l1}.  (3)

As suggested by J. Wang [33], sparsity is not as essential as locality, since locality must result in sparsity but not vice versa. The LLC coding can be solved as Equation (4):

min_C Σ_{i=1}^{N} ||x_i − B c_i||² + λ ||d_i ⊙ c_i||²,  s.t. 1^T c_i = 1,  (4)

where ⊙ denotes the multiplication of each element of d_i and c_i, and d_i ∈ R^M is the locality adaptor assigning a distinctive degree of freedom to each vocabulary in the codebook according to its similarity to the SIFT descriptor x_i, as in Equation (5):

d_i = exp(dist(x_i, B) / σ),  (5)

where dist(x_i, B) = [dist(x_i, b_1), ..., dist(x_i, b_M)]^T, dist(x_i, b_j) is the Euclidean distance between the SIFT descriptor x_i and vocabulary b_j, and σ is used for adjusting the weight decay speed of the locality adaptor. Then the max-pooling strategy is applied to the coding results C to get the final LLC coding.
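Because of the constraint 1^T c_i = 1, Equation (4) admits a closed-form solution: c_i ∝ (C_i + λ diag(d_i)²)^{-1} 1, normalized to sum to one, where C_i = (B − 1 x_i^T)(B − 1 x_i^T)^T. The sketch below implements this for a single descriptor (the toy codebook and regularization weight are illustrative, not values from the paper):

```python
import numpy as np

def llc_code(x, B, lam=1e-4, sigma=1.0):
    """Closed-form LLC coding sketch for one descriptor x (length D) and a
    codebook B with M rows, following Equations (4)-(5): a locality adaptor
    d weights the l2 penalty, and the code is normalised so 1^T c = 1."""
    dist = np.linalg.norm(B - x, axis=1)        # Euclidean distances to bases
    d = np.exp(dist / sigma)                    # locality adaptor, Eq. (5)
    Z = B - x                                   # shifted bases
    C = Z @ Z.T                                 # data "covariance", M x M
    c = np.linalg.solve(C + lam * np.diag(d ** 2), np.ones(len(B)))
    return c / c.sum()                          # enforce the sum-to-one constraint

# Toy codebook: x coincides with the first basis, so the code concentrates there.
B = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
c = llc_code(np.array([1.0, 0.0]), B)
print(np.argmax(c))  # 0
```

The locality adaptor makes distant codebook entries expensive, so most of the code's mass falls on the bases nearest to x, which is exactly the "locality implies sparsity" argument above.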
For each local patch extracted by SPM, we get the LLC coding. Then we fuse the LLC codes with the level weights of SPM to form a longer LLC-SPM code that represents the image, as in Equation (6):

LLCSPM = [w_1 LLC_1^1, w_2 LLC_2^1, ..., w_2 LLC_2^4, w_3 LLC_3^1, ..., w_3 LLC_3^16],  (6)
where LLC_i^j represents the LLC code of the j-th patch on the i-th level of SPM and w_i is the weight of the i-th level. LLC combined with SPM not only incorporates spatial information into the BOW model but also yields the lowest vector quantization error.
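A minimal sketch of Equation (6) follows; it assumes the usual 3-level pyramid with 1 + 4 + 16 = 21 patches, and the per-level weights are the common SPM choice rather than values stated in the text:

```python
import numpy as np

def llc_spm(patch_codes, weights=(0.25, 0.25, 0.5)):
    """Sketch of Equation (6): max-pool the LLC codes inside each SPM patch,
    scale each level's pooled code by its pyramid weight w_i, and concatenate.
    patch_codes is a list of three levels holding 1, 4 and 16 patches; each
    patch is an (n_descriptors, M) array of LLC codes."""
    pooled = []
    for level, patches in enumerate(patch_codes):
        for codes in patches:
            # Max pooling over the descriptors that fall in this patch.
            pooled.append(weights[level] * np.max(codes, axis=0))
    return np.concatenate(pooled)   # length 21 * M for a 3-level pyramid

# Toy example: M = 2 vocabularies, random codes in each of the 21 patches.
rng = np.random.default_rng(0)
levels = [[rng.random((5, 2)) for _ in range(n)] for n in (1, 4, 16)]
print(llc_spm(levels).shape)  # (42,)
```

The resulting 21 × M vector is exactly the representation the KCRC step later operates on.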

Generation of Class-Specific Codebook Using MI
The class-specific codebook is obtained through vocabulary selection from the universal codebook for each category, using the class-specific data in the training set. The class-specific codebook has two interesting properties [21]. It needs fewer training samples to estimate the parameters of a specific category, since during vocabulary selection we have made some assumptions on the a priori location of the parameters in the whole parameter space. Moreover, the class-specific codebook maintains some correspondence with the universal codebook, since it is derived from the universal codebook.
After obtaining the universal codebook and universal histogram with the above methods, we can obtain a class-specific codebook and class-specific histogram as shown in Figure 6. Assume that we need to generate a class-specific codebook for two categories, forest and river. As we can see in Figure 6, if the number of visual words is equal to 2, each vocabulary in the universal codebook is assigned to only one of the two categories according to its MI values. If the MI value of forest is above that of river, then the vocabulary marked with the green color will be assigned to forest, and vice versa. Each vocabulary can belong to only one specific category. Finally, we get a class-specific codebook, in which the red bars mark the vocabularies that belong to it; the red bars represent the class-specific histogram derived from the universal histogram.

As we can see, the universal histograms of river and forest are similar, which may result in misclassification. However, we choose the most representative vocabularies from the universal codebook for each category, and each vocabulary can exist in just one class-specific codebook. That is to say, the vocabularies that best represent a category will exist in its class-specific codebook. Thus, the class-specific codebook can better reflect the information of that category. Mapping the universal histogram to the class-specific codebook means that values exist only at vocabularies belonging to this class-specific codebook; values at the other vocabularies of this codebook will be 0. The dimension of the universal histogram is the same as that of the class-specific histogram, but the class-specific histogram is more discriminative, since it reflects information belonging to its own label rather than information about the whole image. Details of the generation of the class-specific codebook are illustrated as follows.
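The mapping step just described can be sketched as follows (a toy illustration; the vocabulary indices assigned to the hypothetical "forest" codebook are made up):

```python
import numpy as np

def class_specific_histogram(universal_hist, vocab_indices):
    """Map a universal histogram to a class-specific one: keep the values
    only at vocabularies belonging to that class's codebook, set every
    other entry to 0, and preserve the histogram's dimension."""
    hist = np.zeros_like(universal_hist)
    hist[vocab_indices] = universal_hist[vocab_indices]
    return hist

# Universal histogram over 6 vocabularies; vocabularies 1 and 4 belong to
# the (hypothetical) "forest" class-specific codebook.
u = np.array([3.0, 5.0, 1.0, 0.0, 2.0, 4.0])
print(class_specific_histogram(u, [1, 4]))  # [0. 5. 0. 0. 2. 0.]
```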
The MI value between each vocabulary and each category reflects contributions of each vocabulary to each category.MI value can be calculated as Equation (7).
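The histogram mapping described above can be sketched in a few lines. This is a minimal illustration with hypothetical names, not the authors' implementation; `class_vocab_indices` is assumed to hold the indices of the universal-codebook vocabularies assigned to one class:

```python
import numpy as np

def map_to_class_histogram(universal_hist, class_vocab_indices):
    """Keep only the bins whose vocabularies belong to the given
    class-specific codebook; every other bin becomes 0. Note that the
    dimension of the histogram is unchanged, as stated in the text."""
    class_hist = np.zeros_like(universal_hist)
    class_hist[class_vocab_indices] = universal_hist[class_vocab_indices]
    return class_hist

# Toy example: a 6-word universal histogram and a class-specific
# codebook containing vocabularies 1, 3 and 4 (hypothetical indices).
hist = np.array([2.0, 5.0, 0.0, 3.0, 1.0, 4.0])
print(map_to_class_histogram(hist, [1, 3, 4]))  # [0. 5. 0. 3. 1. 0.]
```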

MI(b_i | c_j) = log( P(b_i | c_j) / P(b_i) )    (7)

where b_i is the i-th vocabulary in the universal codebook and c_j is the j-th category in the scene labels. P(b_i | c_j) reflects the probability that b_i exists in the training samples of c_j, and P(b_i) is the probability of b_i existing in the whole training set.
As illustrated above, we calculate the MI value between each vocabulary and each category to obtain a matrix of MI values, as Equation (8) shows, whose (i, j)-th entry is MI(b_i | c_j). For each row i of this matrix, we find the maximum entry, say MI(b_i | c_j), and add b_i to the class-specific codebook of c_j. After traversing all vocabularies in the universal codebook, we finally obtain the class-specific codebook of each category.
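The vocabulary-assignment procedure above can be sketched as follows, assuming the probabilities P(b_i | c_j) and P(b_i) have already been estimated from the training set (the array and function names are ours, not the paper's):

```python
import numpy as np

def build_class_specific_codebooks(p_word_given_class, p_word):
    """p_word_given_class[i, j] = P(b_i | c_j); p_word[i] = P(b_i).
    MI(b_i | c_j) = log(P(b_i | c_j) / P(b_i)); each vocabulary b_i is
    added to the codebook of the single class with the largest MI."""
    mi = np.log(p_word_given_class / p_word[:, None])  # MI matrix of Eq. (8)
    best_class = np.argmax(mi, axis=1)                 # row-wise maximum
    return {j: np.where(best_class == j)[0]
            for j in range(p_word_given_class.shape[1])}

# Toy example with 2 vocabularies and 2 categories (all values assumed):
occ = np.array([[0.8, 0.1],   # b_0 occurs mostly in class 0
                [0.2, 0.6]])  # b_1 occurs mostly in class 1
codebooks = build_class_specific_codebooks(occ, np.array([0.45, 0.4]))
print(codebooks)  # {0: array([0]), 1: array([1])}
```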

KCRC Combined with SPM Predicting Method and Two-Step Classification
As shown in Figure 3, in methods related to the class-specific codebook, bag-of-features are usually mapped to every category to obtain one histogram per category for classification. If bag-of-features are mapped to inappropriate class-specific codebooks, the predicted results may be incorrect and the resulting class-specific histograms will be useless for classification. To overcome this limitation, we present a KCRC combined with SPM algorithm to obtain two accurate predicted labels for the testing sample. Details of the KCRC method are illustrated as follows.
Assume X_i = [x_{i,1}, x_{i,2}, ..., x_{i,n_i}] ∈ R^{21×M×n_i} contains the LLC codes combined with SPM of the training samples of the i-th class, where x_{i,j} (j = 1, 2, ..., n_i) is a vector of dimension 21 × M (i.e., a 3-level spatial pyramid) from the j-th training sample of the i-th class. y_0 ∈ R^{21×M} is the LLC code of a testing sample, and the training samples are X = [X_1, X_2, ..., X_C] with C categories. The testing sample can then be represented by a linear combination of the training samples, y_0 = Xw_0, where w_0 = [0, ..., 0, w_{i,1}, w_{i,2}, ..., w_{i,n_i}, 0, ..., 0]^T. The KCRC method can effectively discover nonlinear structures caused by changes of illumination, spectral noise [35] and large attitude variations by mapping the samples into a higher-dimensional space and applying the traditional CRC [36] method in that space. Denote Φ = [φ(x_{1,1}), φ(x_{1,2}), ..., φ(x_{C,n_C})] as the samples mapped from the original feature space to the high-dimensional space; we employ the Gaussian radial basis function (RBF) kernel k(x, y) = exp(−σ‖x − y‖²₂) to better fit the SVM RBF-kernel classifier. The KCRC combined with SPM algorithm proceeds as follows: LLC codes combined with SPM are calculated for each training sample and for the testing sample, where y_0 is the testing sample and X are the training samples.
The objective function of the KCRC algorithm, ŵ = argmin_w ‖Φ(y_0) − Φw‖²₂ + λ‖w‖²₂, can be solved directly in closed form as Equation (9):

ŵ = (K + λI)^{−1} κ(y_0)    (9)

where K is the kernel Gram matrix over all training samples, with entries K_{pq} = k(x_p, x_q), and κ(y_0) is the vector with entries k(x_p, y_0). The regularized residual r_i(y_0) of each category can then be calculated as Equation (12):

r_i(y_0) = ‖Φ(y_0) − Φ_i ŵ_i‖₂ / ‖ŵ_i‖₂    (12)

where Φ_i and ŵ_i are the mapped samples and the coefficients associated with the i-th class; the numerator is computed entirely from kernel values. The two predicted labels are those with the lowest and the second-lowest residuals.
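A compact sketch of this step under our assumptions (the closed-form ridge solution in the kernel space and the standard CRC regularized residual; all function and variable names are ours, not the paper's):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """k(x, y) = exp(-sigma * ||x - y||_2^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * d2)

def kcrc_two_labels(X, labels, y0, lam=1e-3, sigma=1.0):
    """Return the two class labels with the lowest and second-lowest
    regularized residuals. X: (n, d) training codes stacked row-wise,
    labels: (n,) class indices, y0: (d,) testing code."""
    labels = np.asarray(labels)
    K = rbf_kernel(X, X, sigma)                        # Gram matrix
    kappa = rbf_kernel(X, y0[None, :], sigma).ravel()  # entries k(x_p, y0)
    w = np.linalg.solve(K + lam * np.eye(len(X)), kappa)  # closed-form solution
    residuals = {}
    for c in np.unique(labels):
        idx = labels == c
        w_c = w[idx]
        # ||Phi(y0) - Phi_c w_c||^2 expanded with kernel values only;
        # k(y0, y0) = 1 for the RBF kernel.
        r2 = 1.0 - 2.0 * w_c @ kappa[idx] + w_c @ K[np.ix_(idx, idx)] @ w_c
        residuals[c] = np.sqrt(max(r2, 0.0)) / (np.linalg.norm(w_c) + 1e-12)
    first, second = sorted(residuals, key=residuals.get)[:2]
    return first, second
```

For two well-separated toy classes, a testing code near one class yields that class as the first label and the other as the second.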
Since the KCRC method may produce inaccurate predicted labels, we use two labels rather than one to avoid misclassification into the most similar category. Therefore, we classify the testing image in two steps. The procedure of the first classification step, using KCRC as above, is given by Equations (9) to (16).
Then, during the second classification step, the LLC codes of each testing sample are mapped to the class-specific codebooks of both predicted labels to generate two class-specific histograms. Each class-specific histogram is put into an SVM classifier to output the confidence of each label. The label with the largest sum of confidences is the final result.
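The decision-level fusion of this second step can be sketched as below. This is our reading of the "largest sum of confidence" rule, restricted to the two candidate labels; the names are hypothetical:

```python
import numpy as np

def two_step_decision(conf_a, conf_b, label_a, label_b):
    """conf_a / conf_b: per-class SVM confidence vectors produced from
    the histograms mapped to the class-specific codebooks of the two
    predicted labels. The candidate with the larger summed confidence
    across both histograms is the final result."""
    total = np.asarray(conf_a) + np.asarray(conf_b)
    return label_a if total[label_a] >= total[label_b] else label_b

# Toy confidences over 3 classes (assumed values): label 1 wins.
print(two_step_decision([0.2, 0.7, 0.1], [0.3, 0.5, 0.2], 1, 0))  # 1
```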

Experimental Data and Setup
The first dataset is a ground-truth dataset consisting of 21 scene categories [18], named the University of California, Merced (UC_MERCED) dataset. This dataset was manually extracted from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map. The 21 classes are agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium-density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Each category contains 100 images of size 256 × 256 pixels with a resolution of 30 cm in the RGB color space. Sample images in each category of this dataset are shown in Figure 7.
The second dataset used in our experiments is the High-resolution Satellite Scene Dataset of Wuhan University (WHU-RS) [37]. This new publicly available dataset consists of images collected from Google Earth (Google Inc., Mountain View, CA, USA). It contains high-resolution satellite scenes of 19 categories: airport, beach, bridge, commercial, desert, farmland, football field, forest, industrial, meadow, mountain, park, parking, pond, port, railway station, residential, river and viaduct. There are 50 images of size 600 × 600 pixels for each class. Sample images of each class in this dataset are shown in Figure 8.

In this paper, we randomly choose 20 images from each class for training and the rest for testing in the WHU-RS dataset, and 50 images per class for training in the UC_MERCED dataset. In order to measure the performance of the proposed algorithm, we use four comparison approaches in different experiments, namely, BOVW without keypoint selection [20], BOVW with the proposed keypoint selection method, existing methods incorporating the traditional class-specific codebook [21], and existing methods without the class-specific codebook. Dense SIFT features are extracted for each image and a three-level pyramid is applied for LLC-SPM. Contrast experiments were made on the threshold of response value, the distance threshold and the number of clusters to obtain the optimal keypoint selection parameter settings. We used the public LIBSVM package [38], and the classical radial basis function (RBF) kernel [39] was selected for multiclass classification with the same SVM parameters. For the RBF kernel, the penalty coefficient C and the kernel parameter γ were selected using a grid search by cross validation. The criterion for searching C and γ is as follows: C ∈ {2^−5, 2^−4, ..., 2^4, 2^5}, γ ∈ {2^−5, 2^−4, ..., 2^4, 2^5}. The optimal parameter settings finally obtained are C = 4, γ = 0.5. Each experiment is run 5 times and the final accuracy is the average classification accuracy. The computing environment is based on an Intel Core i7-3770 with 8 GB of RAM.
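The grid search above can be reproduced with a simple exhaustive loop; `score_fn` stands for the cross-validated SVM accuracy at a given (C, γ) pair and is a hypothetical callback, not part of LIBSVM:

```python
import itertools

def grid_search(score_fn):
    """Exhaustive search over C, gamma in {2^-5, ..., 2^5}, the grid
    described in the text; returns the (C, gamma) pair maximizing the
    cross-validated accuracy reported by score_fn."""
    grid = itertools.product([2.0 ** i for i in range(-5, 6)], repeat=2)
    return max(grid, key=lambda cg: score_fn(*cg))
```

With an accuracy surface peaking at C = 4, γ = 0.5 (the setting reported above), the search returns exactly that pair.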

Results of the Keypoint Selection Algorithm
It is well known that the state-of-the-art keypoint selection algorithms are IB3 [40], DROP3 [41], ICF [42] and Iterative Keypoint Selection (IKS) [27], as mentioned in Section 2.1. Since IKS is an efficient keypoint selection algorithm, it is used for comparison with the modified keypoint selection algorithm on both datasets to demonstrate the superiority of the modified method.

Table 1 shows the average number of remaining keypoints after the first step and after both steps of the modified keypoint selection, and for the baseline IKS method, along with their standard deviations. As we can see, a large number of keypoints must be selected from the training set: 1,513,532 and 939,657 keypoints for WHU-RS and UC_MERCED land-use classification, respectively. IKS obtains a slightly higher selection rate in UC_MERCED and WHU-RS classification. However, as can be seen in Table 2, much less time is needed to complete the vector quantization, and the highest classification accuracy is achieved on both datasets with the modified keypoint selection method, since the algorithm removes indiscriminative keypoints. The first step, filtering keypoints by response value, requires less computational time and yields slightly higher classification accuracy. Although a large number of keypoints can be selected by IKS, IKS requires far more computational time for only a small increase in classification accuracy.

Results of the Two-Step Classification Method
KCRC has demonstrated promising performance in classification, but it still needs to be improved. Figures 9 and 10 display the testing samples in the two datasets, respectively, that are predicted incorrectly by the KCRC combined with SPM method but classified correctly by our two-step classification approach.

As we can see in Figure 9, most testing samples that KCRC has misclassified are misclassified into the categories of their backgrounds. Therefore, our proposed algorithm is more robust for images with backgrounds of similar color or texture, such as a port surrounded by grass, a river surrounded by forest and a bridge over a river. Similarly, in Figure 10, we can see buildings with forests, fields covered with grass and also rivers surrounded by forests.

Table 3 shows the average classification accuracy on both datasets and their standard deviations under the KCRC and two-step classification conditions. As we can see, KCRC shows excellent classification performance with an accuracy of about 85%, but it still misclassifies about 15% of the testing samples. As shown in Table 3, two-step classification has a positive effect on the classification results. On the testing samples that KCRC misclassified, about 11% of all testing samples, our proposed algorithm demonstrates better performance, although nearly 1.9% of the testing samples that KCRC classified correctly are misclassified by our two-step classification method.

Figure 11 displays the classification accuracy in each category for the three contrast methods mentioned in Section 3.1 using the UC_MERCED land-use dataset. As can be seen, the proposed algorithm outperforms the other two methods in almost all categories by at least 3%; BOW incorporating the traditional class-specific codebook performs second best, followed by methods without a class-specific codebook. However, in some categories, such as forest and storage tanks, our proposed algorithm demonstrates slightly lower classification accuracy than the traditional BOW model incorporating the class-specific codebook.

Similarly, as we can see in Figure 12, the proposed approach yields the highest classification accuracy in most categories, followed by the BOW model incorporating the traditional class-specific codebook. However, in some categories, including desert, meadow, port and river, the two-step classification method demonstrates slightly lower classification accuracy.

Figure 13 further shows the confusion matrix for the two-step classification algorithm on the WHU-RS land-use dataset. As we can see, almost all categories perform well with an accuracy close to 1, except for desert, industrial and port, whose accuracies are below 0.85. Desert scenes are confused with farmland and meadow scenes, industrial scenes are confused with commercial, park and residential scenes, and port scenes are confused with beach, bridge and river scenes. These three categories are misclassified into more than one category.

Similarly, Figure 14 shows the confusion matrix for UC_MERCED land-use classification with the proposed method. As can be seen, only the classification accuracies of the tennis court, storage tanks and buildings categories are below 0.9. In this dataset, these three categories are also misclassified into several categories. For example, buildings are confused with dense and medium-density residential areas, and storage tanks are misclassified into airplanes, buildings and tennis courts.

Comparison with the State-of-the-Art
In order to prove the superiority of the proposed improved class-specific codebook, we compare its classification performance on both datasets with the state-of-the-art performance reported in the literature, such as LDA [11], the Improved Fisher Kernel [43], the Vector of Locally Aggregated Descriptors (VLAD) [44] and the promising GoogLeNet [17], under a similar experimental setup.
As shown in Table 4, the proposed method achieves about 1.2% higher classification accuracy compared with the best performance of GoogLeNet, a well-known deep learning method. Compared with the other state-of-the-art methods except the high-level GoogLeNet method, our proposed method achieves more than 12% higher classification accuracy.
The superior performance, as compared with the current state-of-the-art results on both datasets, demonstrates the effectiveness of the proposed method for HRIs scene-level land-use classification.

Influence of Parameters in Keypoint Selection Algorithm
Three parameter settings, namely the threshold of response value, the distance threshold and the number of clusters k in k-means, all affect the computational time and classification accuracy. Therefore, contrast experiments with different parameter settings were made to find the optimal settings. The main aim of the optimal parameter setting is to achieve higher classification accuracy with less computational time.
Figures 15-20 show the computational time and classification accuracy obtained by different parameter settings in the proposed keypoint selection algorithm.
The response value of keypoints ranges from 0.02 to 0.065, so we chose thresholds of the response value from 0.025 to 0.055 with an interval of 0.005. Figures 15 and 16 show that the computational time reaches its minimum when the threshold of the response value equals 0.025, while the best classification accuracy is obtained when the threshold is 0.04 on both datasets. However, as we can see in these two figures, the changes in classification accuracy are much more significant than those in computational time. Therefore, we only take classification accuracy into consideration. As can be seen in Figure 16, keypoints with a response value below 0.04 are unstable and not useful for image representation, so removing these keypoints helps to improve classification accuracy. In the later experiments, 0.04 is selected as the threshold of the response value since it yields the highest classification accuracy for SVM.

As can be seen in Figures 17 and 18, with the distance threshold ranging from 5 to 26 with an interval of 3 (26 being sufficient to remove redundant keypoints) and the number of clusters ranging from 2 to 8, we obtain the average computational time and classification accuracy. When the number of clusters k equals 4 and the distance threshold is 14, we obtain the highest classification accuracy from SVM, while the computational time is lowest when k equals 2 and the distance threshold is 26. As in Figures 15 and 16, the variation yields much more significant changes in classification accuracy than in computational time, and the setting corresponding to the lowest computational time performs poorly in terms of classification accuracy. Therefore, we choose 4 clusters and a distance threshold of 14, although they take about 30 more minutes, to achieve the highest classification accuracy for SVM.

Similarly, in Figures 19 and 20, we reach a conclusion similar to that of Figures 17 and 18 regarding the parameter setting achieving the highest classification accuracy, 4 clusters and a distance threshold of 14, although about 30 more minutes are spent for better accuracy.

Influence of the Size of Vocabulary
Different sizes of visual vocabulary were tested, from 100 to 600 at intervals of 100, since the number of visual words affects classification accuracy.

As can be seen in Figures 21 and 22, the classification accuracy of the different methods changes with the size of the visual vocabulary. As the number of visual words increases, classification accuracy improves gradually for all methods, since a larger class-specific codebook leads to a more detailed image representation. Our proposed algorithm demonstrates relatively high classification accuracy over all codebook sizes, since each category has its own unique class-specific codebook, leading to significant discrimination. The overall accuracy of our proposed algorithm is improved by at least 8% over the existing methods incorporating the traditional class-specific codebook mentioned in Section 3.1. As we can see, when the visual vocabulary size exceeds 300, the classification performance improves little, which means a visual vocabulary size of 300 is detailed enough for image representation. Therefore, we choose 400 as the optimal visual vocabulary size for our proposed method.


Influence of Two-Step Classification
As shown in Figures 9 and 10, the KCRC method may misclassify a test sample into its most similar category. Our proposed method outputs two labels from KCRC: one is the label with the minimum residual and the other is the label with the second minimum residual.
It is more accurate to map the universal histogram to the class-specific codebooks of these two categories than to map it to every category. The two class-specific histograms can be respectively put into the SVM classifier to obtain the confidence of each label. Then we perform a decision-level fusion to obtain the final classification result. The category achieving high confidence under both class-specific histograms is more likely to be the classification result.
For example, the forest and river categories may have similar backgrounds, such as trees, which occupy a comparatively large area in an image. Therefore, the residuals of forest and river are very close, which may easily result in misclassification. Our proposed method outputs forest and river as the two possible labels. Assuming the testing sample belongs to river, the confidence of river is high in the river class-specific histogram and relatively high in the forest class-specific histogram.
As can be seen in Figures 11-14, the two-step classification method demonstrates slightly lower accuracy in some categories. There may be two reasons for this. On one hand, due to insufficient SIFT descriptors extracted from images in these categories (approximately fewer than 100 descriptors), the number of visual vocabularies present in the training images of these categories is smaller than in other categories. Therefore, the number of visual words in the class-specific codebooks of these categories is limited and their descriptive ability is relatively low, leading to misclassification. On the other hand, these categories may be similar to at least two other categories, so neither predicted label is the true label. The confidence output by the SVM for both predicted labels is then relatively high while the confidence of the true label is relatively low, which may lead to inaccurate classification.

Strengths and Limitations
A two-step classification method based on a class-specific codebook is proposed in this study.This method has been successfully applied to two datasets of HRIs.The main advantage of the proposed approach is the improvement of computational efficiency in the vector quantization step and increased classification accuracy in the testing samples with similar backgrounds.Experimental results show that this method can achieve an overall classification accuracy of 93.7% and outperforms other state-of-the-art scene-level classification methods.
However, it is noted that some state-of-the-art methods outperform the proposed method in some categories.These categories are short of SIFT features or similar to at least two categories.In future works, we plan to fuse local and global features to decrease the effect of insufficient local descriptors and seek better decision-level fusion methods.

Influence of Two-Step Classification
As shown in Figures 9 and 10, the KCRC method may misclassify a test sample into its most similar category. Our proposed method instead outputs two candidate labels from KCRC: the label with the minimum residual and the label with the second minimum residual.
It is more accurate to map the universal histogram to the class-specific codebooks of these two categories than to map it to every category. The two class-specific histograms are each fed into the SVM classifier to obtain a confidence for each label, and a decision-level fusion then yields the final classification result: the category that achieves high confidence under both class-specific histograms is most likely to be the final label.
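The two-step decision rule described above can be sketched as follows. All names here are illustrative stand-ins, not the authors' implementation: the residuals are assumed to come from KCRC, and `svm_confidence` stands in for the SVM evaluated on a class-specific histogram.

```python
def two_step_classify(residuals, svm_confidence):
    """Sketch of the two-step decision rule (illustrative names).

    residuals: dict mapping label -> KCRC reconstruction residual.
    svm_confidence(codebook, label): SVM confidence for `label` when the
    test image is encoded with the class-specific codebook of `codebook`.
    """
    # Step 1: keep the two labels with the smallest KCRC residuals as
    # candidates instead of committing to the single minimum.
    c1, c2 = sorted(residuals, key=residuals.get)[:2]
    # Step 2: encode the image with both class-specific codebooks and fuse
    # the SVM confidences; the candidate with the larger summed confidence
    # over both histograms becomes the final label.
    score = {c: svm_confidence(c1, c) + svm_confidence(c2, c) for c in (c1, c2)}
    return max(score, key=score.get)
```

In the forest/river example below the residuals are nearly tied, but summing the confidences over both class-specific histograms resolves the ambiguity in favor of the true class.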
For example, forest and river scenes may share similar backgrounds, such as trees that occupy a comparatively large area of an image. The KCRC residuals of forest and river are therefore very close, which can easily lead to misclassification. Our proposed method outputs forest and river as the two candidate labels. Assuming the test sample belongs to river, the confidence of river is high under the river class-specific histogram and still relatively high under the forest class-specific histogram, so the fused decision favors river.
As can be seen in Figures 11-14, the two-step classification method shows slightly lower accuracy in some categories, for two possible reasons. On the one hand, too few SIFT descriptors are extracted from images in these categories (roughly below 100 per image), so their training images contain fewer visual vocabularies than those of other categories. The class-specific codebooks of these categories therefore contain a limited number of visual words and have relatively low descriptive ability, which leads to misclassification. On the other hand, a category may be similar to at least two other categories, so neither predicted label is the true label. In that case the SVM outputs relatively high confidence for both predicted labels while the confidence of the true label is relatively low, which can produce an inaccurate classification.

Strengths and Limitations
This study proposes a two-step classification method based on a class-specific codebook and applies it successfully to two datasets of HRIs. The main advantages of the proposed approach are improved computational efficiency in the vector-quantization step and higher classification accuracy on test samples with similar backgrounds. Experimental results show that the method achieves an overall classification accuracy of 93.7% and outperforms other state-of-the-art scene-level classification methods.
However, some state-of-the-art methods still outperform the proposed method in certain categories, namely those that are short of SIFT features or similar to at least two other categories. In future work, we plan to fuse local and global features to reduce the effect of insufficient local descriptors, and to seek better decision-level fusion methods.

Conclusions
Compared with existing BOW methods based on class-specific codebooks, our proposed method achieves higher classification accuracy than state-of-the-art methods and lower computational time than methods without keypoint selection. Unlike previous studies, which map a universal histogram to every class-specific codebook, our method classifies the test image in two steps: it first predicts two candidate labels for the test image, then maps the universal histogram only to the class-specific codebooks of these predicted categories. The final classification result is the candidate with the largest sum of confidences output by the SVM classifier.
The experiments showed the following: (1) The modified keypoint selection method is a useful and efficient way to select discriminative keypoints from the extracted descriptors, with lower computational cost and higher classification accuracy. (2) We proposed a method for generating class-specific codebooks using mutual information (MI); each vocabulary in the universal codebook is assigned to exactly one class-specific codebook, which therefore better reflects the information of a specific category. (3) Classifying the test image in two steps reduces the error caused by KCRC, and mapping universal histograms to the more plausible labels helps to enlarge the differences between categories. The proposed two-step classification method outperforms state-of-the-art methods in terms of classification accuracy.
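The MI-based codebook split described in point (2) can be sketched as follows, assuming binary word-occurrence statistics and using pointwise mutual information between word occurrence and class label as the association score; the paper's exact MI estimator may differ, and all names are illustrative.

```python
import math
from collections import defaultdict

def split_codebook(occurrence, labels, n_classes):
    """Assign each universal-codebook word to the single class with which
    its occurrence is most strongly associated, so every word ends up in
    exactly one class-specific codebook. Association is scored here with
    pointwise mutual information (a sketch, not the paper's estimator).

    occurrence[i][w] is 1 if word w occurs in training image i.
    """
    n = len(labels)
    n_words = len(occurrence[0])
    p_word = [sum(img[w] for img in occurrence) / n for w in range(n_words)]
    p_class = [labels.count(c) / n for c in range(n_classes)]
    codebooks = defaultdict(list)
    for w in range(n_words):
        def pmi(c):
            # joint probability of (word present, image of class c)
            joint = sum(occurrence[i][w] for i in range(n) if labels[i] == c) / n
            if joint == 0:
                return float("-inf")  # word never seen in this class
            return math.log(joint / (p_word[w] * p_class[c]))
        codebooks[max(range(n_classes), key=pmi)].append(w)
    return dict(codebooks)
```

A word that occurs only in one class's training images is assigned to that class's codebook, which is the behavior point (2) relies on.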
The following directions can be considered in the future. First, the descriptors extracted from some images are insufficient for generating a descriptive class-specific codebook; the number of visual vocabularies in the class-specific codebooks of these categories should therefore be increased to enhance their descriptive ability. Second, to better characterize both local fine details and global structures in images, experiments on fusing local and global features can be carried out. Last but not least, better decision-level fusion methods are needed to classify test samples that resemble several other categories.

Figure 1. (a) Original Scale-Invariant Feature Transform (SIFT) features extracted from images; (b) keypoint selection results using the modified keypoint selection method presented in this text.

Figure 2. Similar categories that a traditional bag-of-words (BOW) model may misclassify: (a) forest and river; (b) forest and chaparral; (c) freeway and airport.

Figure 3. Differences between the error terms of existing methods incorporating scene labels and our proposed method.

Figure 4. Overview of the improved class-specific BOW model for land-use scene classification.

Figure 5. Central idea of removing keypoints: (a) keypoints clustered with k-means in SIFT feature space; (b) original keypoints in one cluster; (c) selected keypoints with a distance threshold.

Figure 7. Examples of ground truth data in the UC_MERCED dataset.

Figure 8. Examples of ground truth data in the WHU-RS dataset.

Figure 9. Two example images of categories misclassified by the KCRC method but classified accurately by the two-step classification method in the WHU-RS dataset.

Figure 10. Two example images of categories misclassified by the KCRC method but classified accurately by the two-step classification method in the UC_MERCED dataset.

Figure 13 further shows the confusion matrix for the two-step classification algorithm on the WHU-RS land-use dataset. Almost all categories perform well, with accuracies close to 1, except for desert, industrial, and port, whose accuracies fall below 0.85. Desert scenes are confused with farmland and meadow scenes; industrial scenes are confused with commercial, park, and residential scenes; and port scenes are confused with beach, bridge, and river scenes. These three categories are each misclassified into more than one category.

Figure 13. Confusion matrix for the proposed algorithm on the WHU-RS land-use dataset.

Figure 14. Confusion matrix for the proposed algorithm on the UC_MERCED land-use dataset.

A proper threshold of the response value can help to improve classification accuracy. In the later experiments, 0.04 is selected as the threshold of the response value, since it performs best, yielding the highest classification accuracy for SVM.
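The response-value filtering described here amounts to discarding low-contrast keypoints before vector quantization, so that fewer descriptors enter the costly quantization step. A minimal sketch, where the `Keypoint` tuple is a stand-in for the actual SIFT detector output:

```python
from collections import namedtuple

# Minimal stand-in for a detected SIFT keypoint; only the response
# (contrast) value matters for this filtering step.
Keypoint = namedtuple("Keypoint", ["x", "y", "response"])

def filter_by_response(keypoints, threshold=0.04):
    """Keep only keypoints whose response meets the threshold, reducing
    the number of descriptors passed to vector quantization."""
    return [kp for kp in keypoints if kp.response >= threshold]
```

The 0.04 default mirrors the threshold chosen in the experiments above; other values trade quantization time against descriptive power, as Figures 15 and 16 illustrate.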

Figure 15. Computational time for vector quantization with different thresholds of the response value.

Figure 16. Classification accuracy with different thresholds of the response value.

Figure 17. Classification accuracy with different distance thresholds and numbers of clusters over the WHU-RS dataset.

Figure 18. Computational time for generating BOW features with different distance thresholds and numbers of clusters over the WHU-RS dataset.

Figure 19. Computational time for generating BOW features with different distance thresholds and numbers of clusters over the UC_MERCED dataset.

Figure 20. Classification accuracy with different distance thresholds and numbers of clusters over the UC_MERCED dataset.

Figure 21. Classification accuracy with different numbers of visual words in the WHU-RS dataset.

Figure 22. Classification accuracy with different numbers of visual words in the UC_MERCED land-use dataset.

Influence of Number of Training Samples
Different numbers of training samples, from 10% to 80% of the training set of one category at intervals of 10%, were tested, since the number of training samples affects classification accuracy. As can be seen in Figures 23 and 24, classification accuracy improves gradually as the number of training samples increases, because a larger number of training samples leads to a more accurately calculated MI value. A small number of training samples may not fully represent the characteristics of a category, leading to inaccurate assignment of some vocabularies and thus to a relatively inaccurate class-specific codebook for those partly represented categories. When the number of training samples exceeds 50, classification accuracy improves only a little while larger numbers cost considerably more time; therefore, we choose 50 training samples for the UC_MERCED dataset. Similarly, in Figure 24, classification accuracy increases gradually up to 20 training samples, so we choose 20 training samples for the WHU-RS dataset.
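The sweep over training-set fractions described above can be sketched as follows; the function name and the fixed seed are illustrative, and the codebook rebuilding and classification for each subset are omitted:

```python
import random

def sample_fractions(train_images,
                     fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8),
                     seed=0):
    """For each fraction from 10% to 80%, draw that share of one category's
    training images; the caller would rebuild the class-specific codebook
    and re-run the classifier on each subset (omitted in this sketch)."""
    rng = random.Random(seed)  # fixed seed for a reproducible sweep
    for f in fractions:
        k = max(1, int(round(f * len(train_images))))
        yield f, rng.sample(train_images, k)
```

Drawing each subset independently (rather than nesting them) matches the per-interval evaluation implied by Figures 23 and 24.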

Figure 23. Overall accuracies using different numbers of training samples in the UC_MERCED dataset.

Figure 24. Overall accuracies using different numbers of training samples in the WHU-RS dataset.

Table 1. Average number of remaining keypoints in our method and IKS, along with their standard deviations.

Table 2. Comparison of average computational time for vector quantization and classification accuracy, along with their standard deviations, for both datasets.

Table 3. Classification results on both datasets with KCRC and our proposed algorithm.

Table 4. Comparison of the classification accuracy of the proposed method with state-of-the-art methods.
