Hierarchical Coding Vectors for Scene Level Land-Use Classification

Abstract: Land-use classification from remote sensing images has become an important but challenging task. This paper proposes Hierarchical Coding Vectors (HCV), a novel representation based on hierarchically coding structures, for scene level land-use classification. We stack multiple Bag of Visual Words (BOVW) coding layers and one Fisher coding layer to develop the hierarchical feature learning structure. In the BOVW coding layers, we extract local descriptors from a geographical image with densely sampled interest points and encode them using soft assignment (SA). The Fisher coding layer encodes the resulting semi-local features with Fisher vectors (FV) and aggregates them into a final global representation. The graphical semantic information is refined by feeding the output of one layer into the next computation layer. Through this hierarchical coding structure, HCV describes geographical images with a high-level representation of richer semantic information. The experimental results on the 21-Class Land Use (LU) and RSSCN7 image databases indicate the effectiveness of the proposed HCV. Combined with the standard FV, our method (FV + HCV) achieves superior performance compared to the state-of-the-art methods on the two databases, obtaining an average classification accuracy of 91.5% on the LU database and 86.4% on the RSSCN7 database.


Introduction
Scene level land-use classification aims to assign a semantic label (e.g., building or river) to a remote sensing image according to its content. As remote sensing techniques continue to develop, overwhelming amounts of fine spatial resolution satellite images have become available. Effective and efficient scene classification methods are needed to annotate these massive image collections.
To date, the Bag of Visual Words (BOVW) [1,2] framework and its variants [3,4] based on spatial relations have become promising remote sensing image representations for land-use classification. The pipeline for the BOVW framework consists of five main steps: feature extraction, codebook generation, feature coding, pooling, and normalization. For BOVW, we usually extract local features from the geographical images, learn a codebook on the training set by K-means or a Gaussian mixture model (GMM), encode the local features and pool them into a vector, and normalize this vector as the final global representation. The representation is subsequently fed into a pre-trained classifier to obtain the annotation result for remote sensing images.
In a parallel development, deep learning methods have attracted continuous attention in the computer vision community in recent years. Deep neural networks (DNNs) [5] build and train deep architectures to capture graphical semantic information, achieving a large performance boost over previous hand-crafted systems based on mid-level features. Although these methods can describe geographical images, starting from low-level features, with the more abstract and semantic representation of deep structures, it is computationally expensive to directly train effective DNNs for scene level land-use classification. One important property of DNNs is their hierarchical organization in layers of increasing processing complexity. We adopt a similar idea and concentrate on a shallow but hierarchical layer framework based on off-the-shelf encoding methods [6,7].
Inspired by the success of DNNs in computer vision applications and of encoding methods in remote sensing applications, we propose Hierarchical Coding Vectors (HCV), a new representation based on hierarchically coding structures, for scene level land-use classification. We treat the traditional coding pipeline as corresponding to the layers of a standard DNN and stack multiple BOVW coding layers and one Fisher coding layer to develop the hierarchical feature learning structure. The complex graphical semantic information is refined by feeding the output of one layer into the next computation layer. Through hierarchical coding, the HCV contains richer semantic information and describes remote sensing images more powerfully. Our experimental results on the 21-Class Land Use (LU) and RSSCN7 geographical image databases demonstrate the excellent performance of our HCV for land-use classification. Furthermore, HCV provides complementary information to the traditional Fisher Vectors (FV). When combining the traditional FV with our HCV, we obtain superior classification performance compared to the current state-of-the-art results on the LU and RSSCN7 databases.
There are two main contributions of our work:

• We devise the Hierarchical Coding Vectors (HCV) by organizing off-the-shelf coding methods into a hierarchical architecture and evaluate the parameters of HCV for land-use classification on the LU database.
• The HCV achieves excellent performance for land-use classification. Further, combining HCV with the standard FV, our method (FV + HCV) outperforms the state-of-the-art performance reported on the LU and RSSCN7 databases.
The remainder of this paper is organized as follows. Section 2 discusses the related work in both computer vision and remote sensing applications. Section 3 describes the details of our proposed Hierarchical Coding Vectors (HCV). Section 4 presents the experimental results. Section 5 concludes the paper.

Related Work
In both the computer vision and remote sensing communities, recent efforts in scene classification can be divided into three directions: (1) the development of more elaborate hand-crafted features (e.g., Scale Invariant Feature Transformation (SIFT) [8], Histogram of Oriented Gradient (HOG) [9], GIST [10], Local Binary Pattern (LBP) [11]); (2) more sophisticated encoding methods (e.g., Hard Assignment (HA) [12], Soft Assignment (SA) [6], Local Coordinate Coding (LCC) [13], Locality-constrained Linear Coding (LLC) [14], Vector of Locally Aggregated Descriptors (VLAD) [15], FV [7]); and (3) more complex classifiers (e.g., Support Vector Machine (SVM) [16], Extreme Learning Machine (ELM) [17]). Recently, the second direction (i.e., encoding methods) has attracted more attention and has become an effective representation for scene level land-use classification. Typical encoding methods are based on the BOVW framework. The traditional BOVW methods, including HA, SA, LCC, and LLC, are designed from the perspective of the activation concept to obtain 0th-order statistics of the distribution in the descriptor space; the core issue is to decide which visual words in the 'visual vocabulary' are activated and to what extent. The Fisher kernel introduced by Jaakkola [18] has since been used to extend the BOVW framework. It describes the difference between the distribution of descriptors in an input image and that of the 'visual vocabulary', encoding multi-order information (0th, 1st, and 2nd order) from the descriptor space. Typical Fisher kernel methods include the Fisher Vector (FV) and the Vector of Locally Aggregated Descriptors (VLAD). VLAD can be viewed as a simplified non-probabilistic version of the FV.
Some researchers in the remote sensing community have also attempted to use multi-layer models to further improve classification performance. Chen [3] stacks two BOVW layers with the HA coding method to represent the spatial relationship among local features. A two-layer sparse coding method is used in [19]. The authors apply two different optimization formulations to guarantee image sparsity and category sparsity simultaneously, improving the discriminability of the output coding result. In the computer vision community, the hierarchical structure helps DNNs [5] achieve a large performance boost. However, DNNs are difficult to apply directly to scene level land-use classification because of their huge computational cost. Xiaojiang Peng et al. [20] stacked multiple Fisher coding layers to build a hierarchical network for action recognition in video. The Fisher coding method increases the dimension of each layer's output. Thus, the dimension of the final representation increases exponentially with the number of layers, and a dimensionality reduction method has to be used between calculation layers. Inspired by the success of DNNs in computer vision applications and of encoding methods in remote sensing applications, we use off-the-shelf encoding methods to construct the hierarchical structure and stack multiple BOVW coding layers with only one Fisher coding layer to solve the dilemma in [20]. The overall framework and the methods used in each layer of HCV are different from those in [20]. Generally speaking, our HCV develops the hierarchical feature learning structure by stacking N + 2 coding layers, which produces a much higher level representation of richer semantic information and achieves superior performance for scene level land-use classification.

Hierarchical Coding Vector
The conventional coding methods encode each local feature in an image into a high-dimensional space and aggregate these codes into a single vector by a pooling method over the entire image (followed by normalization). The resulting representation describes the geographical image in terms of local patch features, which cannot capture more global and complex structures. Deep neural networks [5] can model complex graphical semantic structures by passing the output of one feature computation layer as the input to the next and by hierarchically refining the semantic information. Following a similar idea, we devised a hierarchical structure by stacking multiple BOVW coding layers and one Fisher coding layer, which we call the Hierarchical Coding Vector. The architecture of the Hierarchical Coding Vector (HCV) is depicted in Figure 1.
We devised the HCV to describe the whole geographical image with a higher level representation of richer semantic information by means of a hierarchical coding structure. As shown in Figure 1, the HCV framework contains N + 2 coding layers (N + 1 BOVW coding layers and one Fisher coding layer). The coding result of one coding layer is fed into the next as the input, so that the coding layers are stacked into a hierarchical network. We used BOVW coding layers to describe the local patches. Stacking multiple BOVW coding layers does not cause a dimensionality explosion, because the output dimension of BOVW coding is fixed by the codebook size. The BOVW coding layers refine the local semantic information layer by layer and then feed the information into the Fisher coding layer to produce the global deep representation. The BOVW coding layers provide a better coding 'material' for the Fisher coding layer, giving the global representation (i.e., HCV) stronger discriminability for scene classification.
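The dimensionality argument above can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the sizes 1000 (K-means) and 8 (GMM) are the codebook sizes selected later in the experiments, the two location coordinates appended to each feature are ignored, and the stacked-Fisher case assumes no dimensionality reduction between layers.

```python
def fisher_output_dim(input_dim, n_gaussians):
    """Length of a Fisher vector built from mean and variance differences."""
    return 2 * input_dim * n_gaussians


def stacked_fisher_dim(input_dim, n_gaussians, n_layers):
    """Output length after stacking Fisher layers with no reduction in between."""
    dim = input_dim
    for _ in range(n_layers):
        dim = fisher_output_dim(dim, n_gaussians)
    return dim


def hcv_dim(bovw_codebook_sizes, n_gaussians):
    """Output length of several BOVW layers followed by one Fisher layer."""
    # Each BOVW layer outputs codes whose length equals its codebook size,
    # so only the last BOVW codebook determines the Fisher layer's input.
    return fisher_output_dim(bovw_codebook_sizes[-1], n_gaussians)


for layers in (1, 2, 3):
    print("stacked Fisher layers:", layers, stacked_fisher_dim(1000, 8, layers))
for sizes in ([1000], [1000, 1000, 1000]):
    print("BOVW layers:", len(sizes), "HCV length:", hcv_dim(sizes, 8))
```

Under these assumptions the Fisher-only stack grows by a factor of 2N per layer, while the HCV length stays at 2MN however many BOVW layers precede the Fisher layer.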
Theoretically, an HCV with more coding layers can learn more complicated abstract features, but this may significantly increase the complexity of the model. Balancing effectiveness and efficiency, in this paper we consider an HCV with two coding layers (i.e., one BOVW coding layer and one Fisher coding layer), because it already provides compelling quality. The HCV can be generalized to more layers without difficulty. The BOVW coding layer uses the Soft Assignment (SA) [6] coding method to map the low-level descriptors $X = (x_1, x_2, \ldots, x_k, \ldots, x_K) \in \mathbb{R}^{E \times K}$ from the geographical image to the coding space $D = (d_1, d_2, \ldots, d_k, \ldots, d_K) \in \mathbb{R}^{M \times K}$ using the K-means codebook. The final global representation, HCV, is produced by Fisher vector (FV) coding. Finally, HCV is input into a classifier such as a Support Vector Machine (SVM) for scene level land-use classification. The detailed description of each layer is as follows. The parameters used in this paper are summarized in Table 1.

The BOVW Coding Layer
The BOVW coding layer maps the input descriptors $X \in \mathbb{R}^{E \times K}$ to the semi-local features $F \in \mathbb{R}^{M \times T}$. The pipeline of the BOVW coding layer is shown in Figure 2. Let X be a set of E-dimensional local descriptors extracted from a geographical image with densely sampled interest points. Through clustering, a codebook with M entries, $B_1 \in \mathbb{R}^{E \times M}$, is formed. The codebook is used to express each descriptor and to develop the coding result $D \in \mathbb{R}^{M \times K}$. Then, pooling and normalization methods are used to produce the local patch coding representation (i.e., the semi-local features $F \in \mathbb{R}^{M \times T}$). Finally, the features F are fed into the next Fisher coding layer as the input.

BOVW Coding
The BOVW coding step is based on the idea of using overcomplete basis vectors to map the local descriptors $X \in \mathbb{R}^{E \times K}$ to the coding result $D \in \mathbb{R}^{M \times K}$.
Given a geographical image, we first extracted the E-dimensional local descriptors X with densely sampled interest points. The raw input local descriptors X are usually strongly correlated, which creates significant challenges in the subsequent codebook generation [12]. A feature pre-processing step, whitening, was used to decorrelate them. The overcomplete basis vectors (i.e., the codebook $B_1 \in \mathbb{R}^{E \times M}$) were computed on the training set using the K-means clustering method [21]. To retain spatial information, the dense local descriptors (e.g., Scale Invariant Feature Transformation (SIFT) [8]) were augmented with their normalized x, y location before codebook clustering.
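The pre-processing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes PCA whitening (the text does not say which whitening variant was used) and assumes that "normalized location" means dividing pixel coordinates by the image width and height. The function names and toy data are hypothetical.

```python
import numpy as np


def pca_whiten(X, eps=1e-5):
    """Decorrelate descriptors X (E x K): zero mean, near-identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector (column) by 1 / sqrt(eigenvalue).
    W = eigvecs / np.sqrt(eigvals + eps)
    return W.T @ Xc


def augment_with_location(X, xy, width, height):
    """Append normalized (x, y) positions as two extra rows of X."""
    loc = np.stack([xy[:, 0] / width, xy[:, 1] / height])  # 2 x K
    return np.vstack([X, loc])


rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.8, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = mix @ rng.normal(size=(3, 500))          # correlated toy descriptors
Xw = pca_whiten(X)
xy = rng.uniform(0, 256, size=(500, 2))      # toy sampling positions
X_aug = augment_with_location(Xw, xy, width=256, height=256)
print(np.round(Xw @ Xw.T / Xw.shape[1], 3))  # close to the identity matrix
```

The augmented descriptors would then be clustered (for example with K-means) to obtain the codebook.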
We chose the SA coding method rather than other BOVW coding methods such as HA [12], LCC [13], and LLC [14], which lead to strong sparsity in the semi-local features F. Strong sparsity causes great difficulty for the next Fisher coding layer. SA activates the entire codebook and uses a kernel function of distance as the coding representation:
$$d_k(m) = \frac{\exp\left(-\beta\,\hat{e}(x_k, b_m)\right)}{\sum_{m'=1}^{M} \exp\left(-\beta\,\hat{e}(x_k, b_{m'})\right)},$$
where $d_k(m)$ is the m-th element of the code $d_k$, $\beta$ is the smoothing factor that controls the softness of the assignment, and $\hat{e}(x_k, b_m)$ is the Euclidean distance between descriptor $x_k$ and codeword $b_m$. The smoothing factor $\beta$, the sole parameter in SA coding, determines the sensitivity of the likelihood to the distance $\hat{e}$ and is critical to the coding and classification performance.
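A minimal NumPy sketch of soft-assignment coding illustrates the role of β. It assumes the squared Euclidean distance for ê, as in the usual SA formulation; the function name and the toy data are hypothetical.

```python
import numpy as np


def soft_assignment(X, B, beta):
    """Soft-assignment codes for descriptors X (E x K) and codebook B (E x M).

    Returns D (M x K); every column is non-negative and sums to 1.
    """
    # Assumption: squared Euclidean distance between codewords and descriptors.
    dist = ((B[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # M x K
    # Shifting by the per-descriptor minimum avoids underflow and
    # cancels out in the normalization below.
    weights = np.exp(-beta * (dist - dist.min(axis=0, keepdims=True)))
    return weights / weights.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))   # 6 toy descriptors of dimension 4
B = rng.normal(size=(4, 3))   # 3 toy codewords
D_soft = soft_assignment(X, B, beta=0.01)    # nearly uniform activation
D_sharp = soft_assignment(X, B, beta=100.0)  # close to hard assignment
print(np.round(D_soft[:, 0], 3), np.round(D_sharp[:, 0], 3))
```

A very small β activates all codewords almost equally, while a very large β approaches hard assignment and therefore reintroduces sparsity.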

Spatial Local Pooling
Spatial local pooling aggregates the coding result $D \in \mathbb{R}^{M \times K}$ into the semi-local features $F \in \mathbb{R}^{M \times T}$, thus achieving greater invariance to image transformations and better robustness to noise and clutter.
Compared to the regions used in traditional global pooling, the pooling regions in our HCV framework are much smaller and sampled much more densely. With spatial local pooling, the semi-local feature representation captures more complex image statistics.
In the HCV, we performed the spatial local pooling over adjacent scales and spatial positions. The 2 × 2 pooling region is illustrated in Figure 2. The optimal spatial structure for local pooling is evaluated in the experiments below. We used the max-pooling method in this step, which prevents the semi-local features from being strongly influenced by frequent yet often uninformative descriptors [22]:
$$f_t = \max_{k \in P} d_k,$$
where the maximum is taken element-wise, $f_t$ is the t-th element of the semi-local features F, $d_k$ is the coding result, and P refers to the local pooling region. The max-pooling method has demonstrated its effectiveness in many studies [6,13,14,23].
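The pooling step can be sketched as below. This is a simplified illustration that pools over spatial neighbours only (the text also pools over adjacent scales), and it assumes that the codes are arranged on a regular H × W sampling grid and that regions overlap with a stride of one grid step. The function name and these layout choices are hypothetical.

```python
import numpy as np


def local_max_pool(codes, size=2, stride=1):
    """Max-pool an M x H x W grid of codes over size x size neighbourhoods.

    Returns the semi-local features as an M x T array.
    """
    M, H, W = codes.shape
    feats = []
    for i in range(0, H - size + 1, stride):
        for j in range(0, W - size + 1, stride):
            region = codes[:, i:i + size, j:j + size]
            # Element-wise maximum over all codes in the pooling region.
            feats.append(region.reshape(M, -1).max(axis=1))
    return np.stack(feats, axis=1)


codes = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
F_pooled = local_max_pool(codes)
print(F_pooled)
```

On the 3 × 3 toy grid, the four overlapping 2 × 2 regions yield four semi-local features per codeword.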

Normalization
Normalization is used to bring the semi-local features to the same scale. Unlike the traditional BOVW coding pipeline, we applied power normalization before the L2 normalization as a pre-processing step:
Power: $f_t \leftarrow \operatorname{sign}(f_t)\,|f_t|^{\alpha}$
L2: $f_t \leftarrow f_t / \lVert f_t \rVert_2$
where $0 \le \alpha \le 1$ is a smoothing factor of normalization (we set $\alpha = 0.5$, the same as [24]). Power normalization is usually used in the Fisher coding method to further improve the classification performance [7], whereas BOVW coding methods generally do not apply it because it has minimal effect on their performance. However, in our proposed HCV framework, the output of the BOVW coding layer is not used for classification but as the input to the Fisher coding layer. The Fisher vector captures the Gaussian mean and variance differences between the input features and the codebook, and it is very sensitive to the sparsity of the input features. Power normalization decreases the sparsity of the semi-local features F and makes their distribution smoother, improving the classification performance of HCV (in our experiments on the LU database, power normalization improved the classification accuracy by 3% to 5%).
To retain the spatial information, the semi-local features F were also augmented with their normalized x, y location before they were fed into the next layer.
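The effect of the power + L2 normalization on sparsity can be seen in a small sketch. The function name and the toy values are hypothetical; α = 0.5 follows the setting stated above.

```python
import numpy as np


def power_l2_normalize(F, alpha=0.5, eps=1e-12):
    """Apply signed power normalization, then L2-normalize each column of F."""
    F = np.sign(F) * np.abs(F) ** alpha
    norms = np.linalg.norm(F, axis=0, keepdims=True)
    return F / np.maximum(norms, eps)


# Two toy semi-local features (columns) with one dominant entry each.
F_raw = np.array([[0.81, 0.0],
                  [0.01, 1.0],
                  [0.00, 0.0]])
F_norm = power_l2_normalize(F_raw)
print(F_raw[1, 0] / F_raw[0, 0])    # small-to-large ratio before
print(F_norm[1, 0] / F_norm[0, 0])  # ratio after: weak activations are boosted
```

With α = 0.5 the ratio between the weak and the strong activation in the first column rises from about 0.012 to about 0.11, which is the smoothing effect the text relies on.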

The Fisher Coding Layer
The Fisher coding layer maps the input semi-local features $F \in \mathbb{R}^{M \times T}$ into the final global representation, the Hierarchical Coding Vector $\mathrm{HCV} \in \mathbb{R}^{M \times 2N}$, using the Fisher vector (FV) coding method. The pipeline of the Fisher coding layer is shown in Figure 3. All the semi-local features were decorrelated by whitening before being fed into the Fisher coding layer.
The FV coding method is based on fitting a parametric generative model (e.g., a GMM) to the input semi-local features F and then encoding the derivatives of the log-likelihood of the model with respect to its parameters [25]. GMMs with diagonal covariance are used in our HCV framework, leading to an HCV representation that captures the Gaussian mean (1st-order) and variance (2nd-order) differences between the input semi-local features F and each of the GMM centers:
$$G_{\mu,n} = \frac{1}{T\sqrt{w_n}} \sum_{t=1}^{T} \alpha_t(n)\,\frac{f_t - \mu_n}{\sigma_n}, \qquad G_{\sigma,n} = \frac{1}{T\sqrt{2 w_n}} \sum_{t=1}^{T} \alpha_t(n)\left[\frac{(f_t - \mu_n)^2}{\sigma_n^2} - 1\right],$$
where the division and squaring are element-wise, $\{w_n, \mu_n, \sigma_n\}_n$ are the respective mixture weights, means, and diagonal covariances of the GMM codebook $B_2 = (b_1, b_2, \ldots, b_n, \ldots, b_N) \in \mathbb{R}^{M \times N}$, $f_t$ is one semi-local feature fed into the Fisher coding layer, and T is the number of semi-local features. $\alpha_t(n)$ is the soft assignment weight of the t-th semi-local feature $f_t$ to the n-th Gaussian:
$$\alpha_t(n) = \frac{w_n\,\mathcal{N}(f_t; \mu_n, \sigma_n)}{\sum_{j=1}^{N} w_j\,\mathcal{N}(f_t; \mu_j, \sigma_j)},$$
where $\mathcal{N}(f_t; \mu_n, \sigma_n)$ is an M-dimensional Gaussian distribution and N is the size of the GMM codebook.
Finally, the global representation $\mathrm{HCV} \in \mathbb{R}^{M \times 2N}$ is obtained by stacking the first- and second-order differences:
$$\mathrm{HCV} = \left[G_{\mu,1}, G_{\sigma,1}, \ldots, G_{\mu,N}, G_{\sigma,N}\right].$$
The output vector is subsequently normalized using the power + L2 scheme and serves as the final scene representation of HCV.
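The Fisher coding step can be sketched in NumPy as follows. This is an illustrative implementation of the standard mean and variance Fisher vector for a diagonal GMM, not the authors' code. It assumes that `sigma` holds per-dimension standard deviations and that the GMM parameters have already been fitted; the function name and the toy parameters are hypothetical.

```python
import numpy as np


def fisher_vector(F, w, mu, sigma, alpha=0.5):
    """Encode features F (M x T) with a diagonal GMM of N components.

    w: (N,) mixture weights; mu, sigma: (M, N) means and standard deviations.
    Returns a power + L2 normalized vector of length 2 * M * N.
    """
    M, T = F.shape
    diff = (F[:, None, :] - mu[:, :, None]) / sigma[:, :, None]  # M x N x T
    # Log of w_n * N(f_t; mu_n, sigma_n) for diagonal Gaussians, N x T.
    log_p = (np.log(w)[:, None]
             - 0.5 * (diff ** 2).sum(axis=0)
             - np.log(sigma).sum(axis=0)[:, None]
             - 0.5 * M * np.log(2.0 * np.pi))
    log_p -= log_p.max(axis=0, keepdims=True)   # numerical stability
    post = np.exp(log_p)
    post /= post.sum(axis=0, keepdims=True)     # soft assignments alpha_t(n)
    g_mu = (post[None] * diff).sum(axis=2) / (T * np.sqrt(w))
    g_sigma = (post[None] * (diff ** 2 - 1.0)).sum(axis=2) / (T * np.sqrt(2.0 * w))
    # Column n of the stacked array is [G_mu_n; G_sigma_n].
    fv = np.concatenate([g_mu, g_sigma], axis=0).ravel(order="F")
    fv = np.sign(fv) * np.abs(fv) ** alpha      # power normalization
    return fv / max(np.linalg.norm(fv), 1e-12)  # L2 normalization


rng = np.random.default_rng(0)
F_toy = rng.normal(size=(3, 50))     # 50 toy semi-local features, M = 3
w_toy = np.array([0.4, 0.6])         # N = 2 mixture weights
mu_toy = rng.normal(size=(3, 2))
sigma_toy = np.ones((3, 2))
hcv_toy = fisher_vector(F_toy, w_toy, mu_toy, sigma_toy)
print(hcv_toy.shape)
```

The output length is 2MN, matching the dimension of the HCV stated above.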

Experiment
We now evaluate the effectiveness of the proposed HCV framework and traditional FV for remote sensing land-use scene classification using two standard public databases, the 21-class Land Use (LU) database and the RSSCN7 [26] database.The classification performances of the proposed method are compared with several state-of-the-art methods.

Experimental Data and Setup
The 21-Class Land Use (LU) database [1] is one of the first publicly available geographical image databases with ground truth (http://vision.ucmerced.edu/datasets.html); it was collected by the University of California at Merced Computer Vision Lab (UCMCVL). The database consists of 21 land-use classes, and each class contains 100 images of the same size (i.e., 256 × 256 pixels). The spatial resolution of all images is 30 cm per pixel. Sample images of each land-use class are shown in Figure 4. To be consistent with other researchers' experimental settings on the LU database [1,27-29], the database was randomly partitioned into five equal subsets. Each subset contained 20 images from each land-use category. Four subsets were used for training, and the remaining subset was used for testing.
The RSSCN7 database [26] is a public remote sensing database released in 2015 (https://sites.google.com/site/qinzoucn/documents). It contains 2800 remote sensing scene images from seven typical scene categories. There are 400 images of 400 × 400 pixels for each class. Each scene category is sampled at four different scales with 100 images per scale. Sample images from RSSCN7 are shown in Figure 5. The same experimental setup as in [26] is used: half of the images in each category were fixed for training and the rest for testing.
In this paper, we adopted the Scale Invariant Feature Transformation (SIFT) as the local feature. The SIFT features were extracted at interest points sampled every six pixels in both the x and y directions under four scales (16, 24, 32, 48). A one-vs.-rest linear SVM classifier was employed in our experiments. The experiments were repeated ten times by randomly selecting the training and testing data with the experimental settings above. The average classification accuracy was used as the evaluation index.
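The LU partition described above (five equal subsets with 20 images per class each) can be sketched as follows. The function name, the seed handling, and the choice of which subset is held out are illustrative assumptions.

```python
import numpy as np


def five_fold_assignment(n_classes=21, per_class=100, n_folds=5, seed=0):
    """Assign every image of every class to one of n_folds equal subsets."""
    rng = np.random.default_rng(seed)
    folds = np.empty((n_classes, per_class), dtype=int)
    for c in range(n_classes):
        # A random permutation taken modulo n_folds gives equal-sized subsets.
        folds[c] = rng.permutation(per_class) % n_folds
    return folds


folds = five_fold_assignment()
test_fold = 0                     # hypothetical choice of the held-out subset
train_mask = folds != test_fold
print(train_mask.sum(), (~train_mask).sum())  # 1680 training, 420 test images
```

Repeating the assignment with ten different seeds and averaging the accuracy would mirror the ten repetitions described in the setup.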

Experimental Results
We evaluated the classification performance with the default parameters on the two databases. On the LU database, the classification accuracy of our proposed HCV was 90.5%. We also evaluated the traditional FV [7] with the same GMM codebook size as in HCV. The classification accuracy of the traditional FV was 88.2%. On the RSSCN7 database, the results were similar (i.e., HCV: 84.7% and FV: 82.6%). On both databases, the HCV achieved better performance than the traditional FV, which has shown great success in computer vision [7,20,24,25,30].
Furthermore, the proposed HCV also provided complementary information to the traditional FV. We used the multiple kernel learning [31] method with the average kernel to combine HCV with FV. When combining FV and HCV, we achieved a mean classification accuracy of 91.8% on the LU database and 86.4% on the RSSCN7 database.
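The average-kernel combination can be sketched as below. The sketch assumes linear kernels on L2-normalized representations, which is consistent with the linear SVM used in the experiments but is not stated explicitly for the kernel combination. The function names and the random toy features are hypothetical.

```python
import numpy as np


def l2_normalize_rows(A):
    """Scale every row of A to unit L2 norm."""
    return A / np.linalg.norm(A, axis=1, keepdims=True)


def average_linear_kernel(fv_a, fv_b, hcv_a, hcv_b):
    """Average of the linear kernels computed on FV and on HCV features."""
    return 0.5 * (fv_a @ fv_b.T + hcv_a @ hcv_b.T)


rng = np.random.default_rng(0)
fv = l2_normalize_rows(rng.normal(size=(5, 8)))    # 5 toy images, FV features
hcv = l2_normalize_rows(rng.normal(size=(5, 6)))   # same images, HCV features
K = average_linear_kernel(fv, fv, hcv, hcv)

# For linear kernels, averaging is equivalent to concatenating the two
# representations after scaling each by 1 / sqrt(2).
stacked = np.hstack([fv, hcv]) / np.sqrt(2.0)
print(np.allclose(K, stacked @ stacked.T))
```

The resulting Gram matrix could be passed to any SVM solver that accepts precomputed kernels.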
To further investigate the performance of HCV, FV, and the combination of the two, we illustrate the per-class accuracies on the LU database in Figure 6. From Figure 6, we observe that the proposed HCV is effective for almost all geographical classes on the LU database. Except for the intersection, overpass, and sparse residential categories, the HCV has better or comparable performance to FV in all categories. The improvement is especially pronounced for the tennis courts category, at approximately 30%, as shown in Figure 6.
Figure 7 shows some geographical images from three categories of the LU database that were predicted correctly by HCV but not by the traditional FV. The traditional FV misclassified the two images in Figure 7a as buildings and the two images in Figure 7b as runways. The rivers in Figure 7b do not have any curves and can easily be mistaken for runways, even by a human observer. The two images in Figure 7a are similar to buildings, and the storage tanks are not in a conspicuous position. The four images in Figure 7c were misclassified as other classes (e.g., parking lot, river, and sparse residential) by the traditional FV. Those images contain visually deceptive information, which makes recognition challenging. Correct classification requires sufficient semantic information. HCV described those geographical images correctly through its higher level representation of richer semantic information, obtained by the hierarchical coding structure.
Moreover, the classification performance was improved by the combination for almost all geographical classes, as shown in Figure 6, due to the complementarity between FV and HCV. By using HCV to capture the deep visual semantic information and combining FV with HCV, our method (FV + HCV) achieved very good classification performance.

The Key Parameter β in the SA Coding Method
To show the effect of β on the HCV more clearly, we selected five images from five different land-use classes and visualized their coding results under different values of β. The visualization result is illustrated in Figure 8. Each vertical column represents the coding result with a different value of β for the same image. Each horizontal row represents the coding result with the same value of β for the different images. The left-most column is the visualization of the semi-local feature f_t ∈ ℝ^M output by the BOVW coding layer, and the remaining part is the visualization of HCV. The visualizations of the semi-local feature f_t for the five different images are quite similar, so we display only one representative of f_t for each value of β in Figure 8.
When β is too small (e.g., β = 10⁻⁵), SA coding is not sensitive to the distance ê between descriptor x_k and codeword b_m, and all codewords are activated with almost the same intensity. The BOVW coding layer cannot capture enough discriminative image information, and the HCV is not able to represent the complex semantic structure. As shown in Figure 8, the BOVW layer output appears meaningless and the HCVs of the five images are very similar in this situation, which easily causes misclassification.
As β increases, the SA coding method expresses the distance information ê appropriately and the BOVW layer output becomes undulating. The HCVs output by the Fisher coding layer for different images show obvious differences, and increasing classification performance is expected.
When β becomes too large, the SA coding response decreases rapidly with increasing distance ê. Figure 8 shows that the sparsity of the BOVW layer output increases and the HCVs of the five images become similar again. The increasing sparsity is a challenge for the Fisher vector coding and weakens the discriminability of the HCV.
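The qualitative behaviour described above (near-uniform activation for very small β, near one-hot activation for very large β) can be reproduced with a small soft-assignment sketch. It assumes the common SA form in which the response to codeword b_m is proportional to exp(−β ê(x_k, b_m)), with ê taken here as the squared Euclidean distance; the toy codebook, the descriptor, and the function name `soft_assign` are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np


def soft_assign(x, codebook, beta):
    """Soft-assignment code of descriptor x over the rows of `codebook`."""
    dist = np.sum((codebook - x) ** 2, axis=1)  # assumed: squared Euclidean distance
    logits = -beta * dist
    logits = logits - logits.max()              # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()              # responses sum to 1


# Toy 2-D codebook with M = 4 codewords and one descriptor near the first codeword.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.1, 0.0])

for beta in (1e-5, 1.0, 1e3):
    code = soft_assign(x, codebook, beta)
    print(f"beta={beta:g}  code={np.round(code, 4)}")
```

With the smallest β every codeword receives almost the same weight, so the code carries little information about the descriptor; with the largest β essentially all of the weight falls on the nearest codeword and the code becomes sparse.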
The visualization shows that the value of the parameter β is critical to the classification performance of HCV. We therefore evaluated the effect of different values of β on the classification performance of HCV and determined the optimal value. Sizes of 1000 and 8 were used for the K-means codebook and the GMM codebook, respectively, and the spatial structure was 2 × 2. The classification accuracy of HCV for different values of β on the LU database is shown in Figure 9. The experimental results confirm our previous analysis: β is a key factor for HCV, and a value that is too small or too large weakens the classification performance by a large margin. Based on the results in Figure 9, we chose β = 0.01.

The Effect of Different Spatial Structures in Local Pooling
Local pooling aggregates the coding results of SIFT features under four scales inside the spatial structure. In this section, we evaluated the effect of different spatial structures on the classification performance of HCV. Five spatial structures (1 × 1, 2 × 2, 3 × 3, 4 × 4, and 5 × 5) were evaluated on the LU database using max-pooling. We set β = 0.01 and the K-means/GMM codebook sizes to 1000/8. The classification performance of the different spatial structures for HCV is illustrated in Figure 10.

As seen from Figure 10, the classification performance of HCV gradually decreases with the larger spatial structure, which can be explained by two factors: (1) the increasing spatial structure leads to the repeated expression of some mutation points, creating a new challenge in the FV coding; and (2) the number of the input points of the Fisher coding layer proportionately decreases with the larger spatial structure, weakening the discriminability of the HCV.
Based on the experimental results, the 1 × 1 spatial structure was applied in our HCV framework. Inside the 1 × 1 spatial structure, the coding results d_k ∈ ℝ^M of the SIFT features x_k ∈ ℝ^D under four scales were aggregated into the semi-local feature f_t ∈ ℝ^M using max-pooling.
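For the 1 × 1 case, the aggregation step amounts to an element-wise maximum over the four per-scale coding results at one location. The sketch below illustrates only that step; the array values and the function name `max_pool` are made up for illustration, and M = 3 is used instead of a realistic codebook size.

```python
import numpy as np


def max_pool(codes):
    """Element-wise max over coding results d_k (one row per scale)."""
    return np.max(codes, axis=0)


# Four coding results d_k (one per SIFT scale) over M = 3 codewords; toy values.
codes = np.array([
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.4],
    [0.6, 0.1, 0.3],
    [0.2, 0.2, 0.6],
])
f_t = max_pool(codes)  # semi-local feature with M entries
```

Each entry of `f_t` keeps the strongest response to a codeword across scales, so the pooled feature has the same dimension M as a single coding result.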
From Figure 11, we can observe that the performance improved significantly from one layer (88.2%) to two layers (90.5%) owing to the hierarchical structure. However, as the number of layers continued to increase, there was no further substantial improvement in the classification performance, which we attribute to parameter tuning. With an increasing number of layers, the number of parameters to tune grows exponentially. The lack of good parameter tuning for the larger models (i.e., three layers and four layers) prevented HCV from reaching its optimal performance. This problem remains to be solved in future work.
For a good tradeoff between effectiveness and efficiency, we only used two coding layers (i.e., one BOVW coding layer and one Fisher coding layer) to perform scene level land-use classification in this paper.

Comparison with the State-of-the-Art Methods
To demonstrate the effectiveness of our proposed method, we compared its performance with the state-of-the-art performance reported in the literature on the two public databases under the same experimental setup. The comparison results for the LU database are reported in Table 3.
Although the MS-CLBP described in [27] achieves performance comparable to HCV, the Extreme Learning Machine (ELM) and a Radial Basis Function (RBF) nonlinear kernel were used in their approach. A nonlinear classifier incurs additional complexity and scales poorly, which matters in real applications. Our proposed method relies on a one-vs.-rest linear SVM classifier. The linear classifier makes the framework simpler and more conducive to practical application. The classification performance of our method could likely be improved further with a more sophisticated classifier.
As shown in Table 3, our method (FV + HCV) outperformed the current state-of-the-art results on the LU database, which demonstrates its effectiveness for remotely sensed land-use classification. Furthermore, a statistical z-test was used to test whether the performance improvement is meaningful. The z-test is a hypothesis test based on the Z-statistic, which follows the standard normal distribution under the null hypothesis [32]. It is often used to determine whether the difference between two means is significant. When Z ≥ 1.96, the difference is significant (p ≤ 0.05); when Z < 1.96, the difference is not significant (p > 0.05). A comparison of our method with the other methods is provided in Table 3; p ≤ 0.05 for our method (FV + HCV). The minimum value of Z is 1.99, obtained in the comparison with MS-CLBP, and p is still less than 0.05. The performance boost of our method is therefore statistically significant.
Table 3. Comparison of our approach (FV + HCV) with the state-of-the-art performance reported in the literature on the LU database under the same experimental setup: 80% of images from each class are used for training and the remaining images are used for testing. The average classification accuracy (mean ± SD) is set as the evaluation index.
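The decision rule of the z-test described above (significant when Z ≥ 1.96, i.e., p ≤ 0.05) can be illustrated with a short sketch. The text does not give the exact form of the statistic used, so the two-sample formula for a difference of means below is an assumption, as are the function name `two_sample_z` and the input numbers, which are illustrative and not taken from Table 3.

```python
import math


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z-statistic and two-sided p-value for the difference of two means."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))         # two-sided tail of N(0, 1)
    return z, p


# Illustrative accuracies (mean, SD, number of runs); NOT values from Table 3.
z, p = two_sample_z(90.0, 1.0, 5, 88.0, 1.5, 5)
significant = abs(z) >= 1.96
print(f"Z = {z:.2f}, p = {p:.4f}, significant = {significant}")
```

The threshold 1.96 is the point at which the two-sided tail probability of the standard normal distribution equals approximately 0.05, which is why the two conditions in the text are equivalent.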

Computational Complexity
Many approaches with a nonlinear classifier incur a computational complexity of O(n²) or O(n³) in the training phase and O(n) in the testing phase, where n is the training size. This implies poor scalability in real applications. Our method, using a simple linear SVM, reduces the training complexity to O(n) and has constant complexity in testing, while still achieving superior performance. Finally, we evaluated the computational cost of our method (HCV + FV), using the 21-class land-use (LU) database to obtain the processing time. Our code is implemented in MATLAB 2014a and was run on a computer with an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.1 GHz and 32 GB RAM under a 64-bit Windows 7 operating system. In our experiment, the training phase took about 27 min and the average processing time for a test remote sensing image (256 × 256 pixels) was 0.55 ± 0.02 s (including dense local descriptor extraction and HCV and FV coding to obtain the final representation).
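The difference in test-time cost can be made concrete with a small sketch that contrasts a one-vs.-rest linear decision with an RBF kernel decision. Both functions are generic textbook forms written for illustration; the names `W`, `b`, `support_vectors`, `alphas`, and `gamma` are assumptions and do not refer to the authors' implementation.

```python
import numpy as np


def linear_ovr_predict(W, b, x):
    """One-vs.-rest linear prediction.

    W has one row per class, so the cost depends on the number of classes
    and the feature dimension, not on the training size n.
    """
    return int(np.argmax(W @ x + b))


def rbf_decision(support_vectors, alphas, bias, x, gamma):
    """Kernel decision value: one kernel evaluation per support vector,
    so the cost grows with the number of retained training samples."""
    k = np.exp(-gamma * np.sum((support_vectors - x) ** 2, axis=1))
    return float(alphas @ k + bias)


# Toy example with 3 classes and 3-D features.
W = np.eye(3)
b = np.zeros(3)
label = linear_ovr_predict(W, b, np.array([0.1, 0.9, 0.2]))

# Toy kernel machine with two support vectors.
sv = np.array([[0.0, 0.0], [1.0, 1.0]])
value = rbf_decision(sv, np.array([1.0, -1.0]), 0.0, np.array([0.0, 0.0]), gamma=1.0)
```

In the linear case the learned model is a fixed-size weight matrix, whereas the kernel machine must keep and visit its support vectors at test time, which is the source of the O(n) testing cost mentioned above.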

Conclusions
In this paper, we proposed Hierarchical Coding Vectors (HCV), a novel representation based on hierarchical coding structures, for scene level land-use classification. We have shown that the traditional coding pipelines are amenable to stacking in multiple layers. Building a hierarchical coding structure is sufficient to significantly boost the performance of these shallow encoding methods. The experimental results on the LU and RSSCN7 databases demonstrate the effectiveness of our HCV representation. By combining HCV with the traditional Fisher vectors, our method (FV + HCV) outperforms the current state-of-the-art methods on the LU and RSSCN7 databases.

Figure 1. The architecture of the proposed Hierarchical Coding Vector (HCV). By constructing a hierarchical coding structure, the HCV representation is deeper and carries richer semantic information. SVMs, Support Vector Machines; FV, Fisher Vectors; BOVW, Bag of Visual Words; SIFT, Scale-Invariant Feature Transform.


Figure 2. The pipeline of the BOVW coding layer.


Figure 3. The pipeline of the Fisher coding layer.


Figure 6. Comparison of the per-class accuracies of the Hierarchical Coding Vector (HCV), the Fisher Vector (FV), and the combination of the two on the LU database.


Figure 7. Some images on the LU database that are predicted correctly by the HCV but not by the FV: (a) storage tank images; (b) river images; (c) tennis court images.



Figure 8. Visual coding result of the Hierarchical Coding Vector (HCV) for different parameters on the LU database. Each vertical column represents the coding result of a different β for the same image. Each horizontal row represents the coding result of the same β for different images.


Figure 9. Effect of the parameter β on the classification accuracy of HCV on the LU database.


Figure 10. Classification performance of different spatial structures for HCV on the LU database.


The Effect of the Number of Coding Layers
We also evaluated the effect of the number of coding layers. The classification accuracy over different numbers of coding layers in the HCV framework is shown in Figure 11. One coding layer means that only the Fisher coding layer is used in the HCV. Two coding layers contain one BOVW coding layer and one Fisher coding layer. Similarly, three coding layers consist of two BOVW coding layers and one Fisher coding layer.

Remote Sens. 2016, 8, 436

Figure 11. Effect of the number of coding layers on the classification accuracy.


Table 2. Classification accuracy (%) of HCV with varying K-means/GMM codebook size on the LU database.


Table 4. Comparison of our approach (FV + HCV) with the state-of-the-art performance reported in the literature on the RSSCN7 database under the same experimental setup: half of the images from each class are used for training and the rest are used for testing. The average classification accuracy (mean ± SD) is set as the evaluation index. DBN: Deep Belief Networks.
* Our own implementation.